Is automated failover the root of all evil?
GitHub’s recent post-mortem is well worth reading. They had a series of interrelated failures that caused their MySQL servers to become unavailable. The money quote:
The automated failover of our main production database could be described as the root cause of both of these downtime events. In each situation in which that occurred, if any member of our operations team had been asked if the failover should have been performed, the answer would have been a resounding no. There are many situations in which automated failover is an excellent strategy for ensuring the availability of a service. After careful consideration, we’ve determined that ensuring the availability of our primary production database is not one of these situations.
Most automated failover tools receive a lot of engineering effort to answer questions like these: Is the system really dead, or just unreachable? Do I have a quorum, or is there a split brain? Is failover really the right thing to do?
The question I rarely see targeted, and which I consider far more important, is this: Is the system in a situation that I [the system] know I am capable of resolving correctly? Is anything in a state that I haven’t been programmed to understand and assess?
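To make that question concrete, here is a minimal Python sketch of a tool that acts only in situations it has been explicitly programmed to understand and escalates everything else. The state names, the `Observation` fields, and the contents of `KNOWN_SITUATIONS` are all hypothetical; they are not taken from Pacemaker, MHA, or any other real tool.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FAIL_OVER = "fail over"
    DO_NOTHING = "do nothing"
    CALL_HUMAN = "call a human"


@dataclass(frozen=True)
class Observation:
    """What the tool believes it sees (hypothetical fields)."""
    primary: str   # e.g. "healthy", "dead", "unreachable"
    standby: str   # e.g. "healthy", "lagging"
    quorum: bool   # True if a majority of monitors agree


# Hypothetical allowlist: the only situations this tool claims to understand.
KNOWN_SITUATIONS = {
    Observation("healthy", "healthy", True): Decision.DO_NOTHING,
    Observation("dead", "healthy", True): Decision.FAIL_OVER,
}


def decide(obs: Observation) -> Decision:
    # Anything not explicitly listed is, by definition, a state the tool
    # has not been programmed to assess, so the answer is to escalate.
    return KNOWN_SITUATIONS.get(obs, Decision.CALL_HUMAN)
```

The design choice is that the default is escalation, not action: an "unreachable" primary or a "lagging" standby falls through to `CALL_HUMAN` simply because nobody listed it.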
I haven’t looked recently at the source code of the systems that GitHub identified as making wrong decisions, and I don’t know Pacemaker. But I know I have spent a lot of time and effort trying to write a specification for a system that could detect whether automated failover would be safe to attempt, and it’s hard. One thing that’s not all that hard, though, is making sure that only one failover attempt is ever made. One of the best ways to create a nightmare situation is to fail to a standby, then fail away from it. If I ever create an HA tool such as the ones I’ve designed-but-not-implemented, there will be a hard stop after one attempt. If the standby doesn’t look healthy, someone should call a human, end of story.
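The "hard stop after one attempt" rule could look something like the Python sketch below. It is an illustration under stated assumptions, not the design of any existing HA tool: `FailoverGuard` and the three callables passed to it are hypothetical names, and a real tool would have to persist the flag somewhere durable so that a restart of the tool itself does not reset it.

```python
class FailoverGuard:
    """Allows at most one failover attempt; everything else goes to a human."""

    def __init__(self):
        # Assumption: kept in memory for brevity. A real tool would need
        # to store this durably so it survives a restart.
        self.attempted = False

    def try_failover(self, standby_is_healthy, promote, page_human):
        """Return True only if a promotion was actually started."""
        if self.attempted:
            page_human("failover already attempted once; refusing to act again")
            return False
        if not standby_is_healthy():
            page_human("standby does not look healthy; not failing over")
            return False
        # Set the flag before promoting, so a promotion that crashes
        # halfway through still counts as the one allowed attempt.
        self.attempted = True
        promote()
        return True
```

With this in place, a second call never promotes again, which rules out failing to a standby and then failing away from it; the only outcome after the first attempt is a page to a human.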



Nice story in Gödel, Escher, Bach about a system trying to diagnose itself. There’s always a META system watching over, and a META system watching over that one. Eventually you reach G.O.D.
Shlomi Noach
17 Sep 12 at 1:42 pm
Turtles all the way down. There, I said it.
Xaprb
17 Sep 12 at 3:17 pm
This reminds me that I should write up a post on the approach we use (which is human intervention required for all master swaps and such)…
Jeremy Zawodny
17 Sep 12 at 5:12 pm
Would MHA have done any better?
http://code.google.com/p/mysql-master-ha/
Jeremy, that’s what you taught me a decade ago; Yahoo still lives by it.
Rick James
17 Sep 12 at 7:42 pm
Hi Baron, nice article. I was going to post a comment but just wrote an article of my own (http://scale-out-blog.blogspot.com/2012/09/automated-database-failover-is-weird.html). Failover automation works for a lot of applications. I’m still studying up on the turtles, though.
Robert Hodges
18 Sep 12 at 1:32 am
“Failover automation works for a lot of applications.”
No, I think the statement is
“Failover automation usually works.”
But GitHub’s disaster is an example of how really badly it can go wrong.
Rick James
18 Sep 12 at 12:03 pm
I agree with Rick’s summary, and I disagree with a lot of the points in Robert’s blog, such as failure to accept a TCP/IP connection being a reliable indicator that a server is down. Failover needs to include strong fencing in most cases, too. At a minimum, if you fail over to a replica, the replica should not keep replicating (or trying to) from the failed master. A failed server needs to be completely excluded IMO.
I have some significant experience with a lot of failovers, too. Probably over the last 4 years I’ve helped clean up more than 100 situations caused by failover gone wrong. I’ve seen it go wrong many more times than right. Some of the cases took months to put right again.
In one case I remember, in early 2012 I was working on data that had been wrong since 2009, if I recall, because a failover script (triggered manually) was coded wrong (a synchronization mistake I’ve seen probably a dozen times). Depending on which server was primary, users would see their order history appear and vanish, or see other users’ order history. Fixing this required reconciling one customer at a time, one row at a time, one invoice and one line item at a time, one user account and one subscription at a time — and in many cases there just wasn’t a right answer. The man with two watches doesn’t know the time.
Other systems are significantly easier in some cases, but MySQL is rarely a joy to fail over or fix after a botched failover.
Xaprb
18 Sep 12 at 7:39 pm
I loved this tweet from @DEVOPS_BORAT:
In devops is turtle all way down but at bottom is perl script.
Xaprb
20 Sep 12 at 10:11 am