There's an old Internet fable about a plumber charging an obscene amount of money for tapping a pipe with a hammer. When asked to justify the cost, the plumber replies that hitting the pipe was cheap but knowing where to hit was the true reason for the cost. The story actually predates the Internet by several decades but found new life as justification for, among other things, high-priced consultancy fees. I make no judgment on this – deep knowledge is invaluable.
One of the drawbacks of amassing this knowledge, this Lore, is the tendency to apply obscure remediations to pedestrian problems. After all, this is how we learn. When a known solution exists for a given set of symptoms, we recognize those symptoms and use our knowledge to resolve the issue. Problems arise when we focus on the known solution instead
of the set of possible solutions. There's even a name for this - the Einstellung Effect.
The Einstellung effect is the counterintuitive finding that prior experience or domain-specific knowledge can under some circumstances interfere with problem solving performance.
(Jessica J. Ellis and Eyal M. Reingold)
Late last year this effect bit me. During an extended period of working from home due to the pandemic, my car had sat unmoved for several months. When I did start it to run a quick errand, it went into a "limp home" mode where it would not leave first gear and the speed stayed under 5 MPH. After watching several YouTube videos I was able to solve the problem by running a sequence of operations including pressing the accelerator at specific intervals, turning the key switch, and possibly dribbling goat blood on the steering wheel. To my elation, it worked.
Fast forward a few months and the similar symptoms presented themselves. So sure was I that the ritual would solve it that I tried for two hours to replicate my earlier success. No joy. It wasn't until a knowledgeable person ran through proper diagnostics that the culprit – a failed accelerator position sensor – revealed itself.
I was not immune to making this same mistake in IT. While troubleshooting a particularly irksome issue with a cluster deployment, I found myself looking at DNS, at NTP, at I/O latency on underlying volumes, even kernel tunings to increase timeout thresholds. Ultimately all these were red herrings. And after literal days of troubleshooting the culprit emerged to be a mismatch in a network configuration in this particular environment. How was this mismatch discovered? By automating a set of tests for everything.
To be clear, infrastructure tests already existed. We checked everything from storage space to firewall access to availability of required commands.These tests run prior to deployment and throughout troubleshooting.
However, there were many troubleshooting related commands that we ran throughout the effort. These included IP address validation, ping latencies, process status, SSL checks, and a host of other CLI issued commands. We ran these as needed. And to our considerable bewilderment, the results from these tests were almost always as expected. All endpoints were accessible, all processes started as needed, all resource utilization metrics were nominal. Yet the issues persisted.
By running our ad hoc commands we were in essence using a spotlight to locate an elusive
squirrel that popped up randomly, chewed through a wire or two, then disappeared. Now, of course, modern computers are not random but chaotic. A combination of variables – VMs appearing on least-busy hosts, a downstream timeout issue caused by a previous error, a pod failure and restart that mostly worked — all contributed to this appearance of randomness.
So we turned on the floodlights.
Instead of ad hoc commands, we looked at everything continuously. For example, should latency from one host to another be useful, we logged it to a file for all hosts. We captured the IPs as they were allocated in case some were somehow being blocked. We captured SSL certificates from each endpoint in case the infrastructure was intercepting and rewriting them. By scripting the troubleshooting suite and running across multiple nodes from multiple endpoints, we gained a holistic view of the environment.
As the floating IP appeared on machines on a specific switch, they disappeared from an external TCP/443 query, even as they were accessible within the network and even externally via SSH. After correlating the apparently random error with the IPs and underlying physical hosts, the network mismatch revealed itself.
We had defeated the Einstellung Effect by forcing ourselves to look everywhere rather than chasing squirrels.
Among the lessons learned was that if something was worth checking, it was worth adding to a test harness. We used a combination of RobotFramework and ad hoc scripts for this harness, some initiated from scripts and some kicked off manually. The salient point is that instead of looking at specific metrics we looked at the entire set.