As our (or our clients') infrastructure grows and runs for longer periods, I have noticed that certain parts of it are known only to certain people, and only to a certain extent. Due to the nature of IT operations, most engineers stay in firefighting mode, and the fix for a problem (be it a stability, security, or performance issue) is applied as a manual hotfix. Over time these pieces of infrastructure (or infrastructure services) accumulate features or functionality that are neither automated nor documented. Slowly the server reaches a state where, if you kill it, it is difficult to recreate, not only because you don't know the exact steps needed to bring it back to its original state, but also because of its dependencies on other integration points. In the community we call them 'Work of Art'.
There are many ways to fix them, but this post is about how to catch them.
Prevention is better than cure.
I prefer to kill the whole environment (staging, preprod, UAT) every weekend, or to have non-functional releases where I simply recreate the production infrastructure at regular intervals. This does not stop manual fixes from accumulating, but it does reveal whether any exist that are critical for the services to run, and by doing it frequently I try to break the risk into smaller, affordable pieces. To me this is the litmus test, or gold standard, for automated infrastructure.
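The rebuild test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real tool: `rebuild_and_verify`, `RebuildReport`, and the `destroy`, `provision`, and `checks` arguments are hypothetical names, standing in for whatever your own automation and health checks provide. The idea is that anything that only worked because of a manual hotfix shows up as a failed check after the rebuild.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RebuildReport:
    environment: str
    failed_checks: List[str]

    @property
    def reproducible(self) -> bool:
        # The environment passes only if every check succeeds after the rebuild.
        return not self.failed_checks


def rebuild_and_verify(
    environment: str,
    destroy: Callable[[str], None],      # hypothetical: tears the environment down
    provision: Callable[[str], None],    # hypothetical: recreates it from automation only
    checks: Dict[str, Callable[[str], bool]],  # hypothetical: named health checks
) -> RebuildReport:
    """Destroy an environment, recreate it, and report which checks fail."""
    destroy(environment)
    provision(environment)
    failed = sorted(name for name, check in checks.items() if not check(environment))
    return RebuildReport(environment, failed)


if __name__ == "__main__":
    # Fake environment: "tuned-sysctl" was applied by hand and is not in the automation.
    state = {"staging": {"app", "tuned-sysctl"}}

    def fake_destroy(env: str) -> None:
        state[env] = set()

    def fake_provision(env: str) -> None:
        state[env] = {"app"}  # automation only knows about the app

    report = rebuild_and_verify(
        "staging",
        fake_destroy,
        fake_provision,
        {
            "app is running": lambda env: "app" in state[env],
            "sysctl tuning present": lambda env: "tuned-sysctl" in state[env],
        },
    )
    print(report.reproducible, report.failed_checks)
```

In practice the fakes would be replaced by calls into your provisioning tool and monitoring, and a failed check becomes a ticket to automate or document the missing piece.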