Schrödinger's Outage

A couple of months ago we had an incident in which a legacy recovery mechanism proved inadequate for our current scale. In our internal post-incident review, we asked whether we should improve this seldom-used capability. I decided not to, because the plan is to completely replace the part of the platform that it serves. My judgment was that we were unlikely to need it again, and that improving it would take a lot of time and effort.

Shortly thereafter, we did need it again, and again experienced the same pains. Was the decision wrong?

This story should be familiar if you work in any domain where the future return on an investment (or the risk of not investing) is uncertain. And really, who doesn’t? It should be relevant to most of us!

These decisions are rarely wrong or right at the time they’re made. As with Schrödinger’s cat, you won’t know whether they live or die until subsequent events occur: events whose occurrence and outcomes are unknowable at the time of the decision.

And yet, it’s common to impose right/wrong judgments in hindsight. Assuming the decision hasn’t been obsoleted at the point of hindsight judgment, there’s only one “good judgment” box in the decision matrix:

[Figure: Hindsight 1 (decision matrix)]

And if the system is decommissioned and the decision is obsolete, the best case is a 50/50 split:

[Figure: Hindsight 2 (decision matrix)]

Rarely are there clear errors of judgment when the future is uncertain. Equating outcomes with soundness of judgment is a fallacy: “good” decisions don’t necessarily produce good outcomes, and “bad” decisions aren’t necessarily punished by subsequent events. The decisions themselves aren’t good or bad; they are simply a necessary factor in an outcome that had many necessary, but only jointly sufficient, conditions.

Be vigilant against the tendency to ask “why didn’t you.” It’s dangerous: it’s a counterfactual, a question about decisions not taken in a hypothetical past that never existed. This puts people on the defensive against the indefensible, leads to finger-pointing, and tempts leadership to “solve” the problem by punishing those to “blame.” Worst of all, it is a serious obstacle to learning. And the goal of any post-incident review should be for the organization to learn so it can become more resilient.

Instead of counterfactuals, ask what the person knew, what pressures they were under, what they thought might happen, what types of risk-reward and cost-benefit estimates they were making. And rather than judging those, seek instead to understand why those seemed to be wise at the time. Because no one got up that morning and said, “today I’m going to be lazy and foolish, not caring about the obvious negative consequences of my action or inaction.” That’s another past that didn’t happen. Don’t act as if it did.

PS: nobody blamed me for my decision and I’m not defending myself; I’m sharing this story because I think it can be helpful.

I'm Baron Schwartz, the founder and CEO of VividCortex. I am the author of High Performance MySQL and lots of open-source software for performance analysis, monitoring, and system administration. I contribute to various database communities such as Oracle, PostgreSQL, Redis and MongoDB.