Why high-availability is hard with databases
A lot of systems are relatively easy to make HA (highly available). You just slap them into a well-known HA framework such as Linux-HA and you’re done. But databases are different, especially replicated databases, especially replicated MySQL.
The reason has to do with some properties that hold for many systems, but not for most databases. Most systems that you want to make HA are relatively lightweight and interchangeable, with little to zero statefulness, easy to start, easy to stop, don’t care a lot about storage (or at least don’t write a lot of data; that’s usually delegated to the database), and there’s little or no harm done if you ruthlessly behead them. The classic example is a web server or even most application servers. Most of the time these things are all about CPU power and network bandwidth. If I were to compare them to a car, I’d say they are like matchbox cars: there are many of them, and they are cheap and easy to replace.
Databases are different. With or without replication, you’re looking at a system that is complex, stateful, heavyweight, and cares a lot about storage. It runs on bigger hardware with fast disks and a lot of memory. It’s usually disk-bound, and it does a lot of writes. It’s hard to start — it takes a long time to warm up and really get ready to serve production workloads (many minutes, hours, or even days). It tends to run with a lot of data in memory in a dirty state, so shutdown is slow, because a clean shutdown requires flushing a bunch of data to disk. If you yank its power plug or kill-dash-nine it, it’ll have to perform recovery on startup, which slows the startup process even more. If I were to compare a database server to a car, I wouldn’t even use a car as the analogy: I’d use one of those big-ass mining trucks. If your mining truck breaks down, you don’t just toss it in the trash and pull another off the shelf.
The problem with a lot of HA solutions is that they want to deal with inconsistencies or irregularities by killing the resource and replacing it in another location. This works fine with web servers, but not with database servers. Doing that will cause serious pain and downtime, defeating the point of HA. And when you add replication into the mix, it gets even worse. A system that wants to manage replication needs to deal with very complex conditions. A lot of replication failures are delicate matters that require skilled human intervention to solve. The HA solution must insulate the application from the misbehaving resource, but leave it running so the human can handle things.
This is not the way most applications are made HA. It’s different with databases, and it’s much harder.
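As a rough illustration of that difference, here is a minimal Python sketch of the two policies. Everything in it is hypothetical and not taken from Linux-HA or any other real framework: the `Resource` and `Pool` classes, the `stateful` flag, and the action names are made up for this example. It only shows the shape of the decision: an unhealthy stateless node is killed and replaced, while an unhealthy database is taken out of rotation but left running for a human to inspect.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    KILL_AND_REPLACE = "kill_and_replace"    # fine for cheap, stateless nodes
    ISOLATE_AND_ALERT = "isolate_and_alert"  # stop routing to it, leave it running


@dataclass
class Resource:
    name: str
    stateful: bool  # hypothetical flag: True for database servers
    healthy: bool


def decide(resource: Resource) -> Action:
    """Pick a policy for one resource based on its health and statefulness."""
    if resource.healthy:
        return Action.NONE
    if resource.stateful:
        return Action.ISOLATE_AND_ALERT
    return Action.KILL_AND_REPLACE


class Pool:
    """Hypothetical routing pool: tracks what is running and what gets traffic."""

    def __init__(self, resources):
        self.running = {r.name for r in resources}
        self.in_rotation = {r.name for r in resources}
        self.alerts = []

    def handle(self, resource: Resource) -> Action:
        action = decide(resource)
        if action is Action.NONE:
            return action
        # Either way, the application must stop talking to the bad node.
        self.in_rotation.discard(resource.name)
        if action is Action.KILL_AND_REPLACE:
            self.running.discard(resource.name)
        else:
            # Leave the process alone and ask a human to look at it.
            self.alerts.append(f"{resource.name} needs human attention")
        return action


web = Resource("web1", stateful=False, healthy=False)
db = Resource("db1", stateful=True, healthy=False)
pool = Pool([web, db])
web_action = pool.handle(web)
db_action = pool.handle(db)
```

After both calls, neither node receives traffic, but only `db1` is still in `pool.running`, with an alert queued for an operator. A real system would also need health checks, fencing, and replication awareness, none of which this sketch attempts.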

What?!? You mean multi-master circular replication won’t make all the db problems go away?
Rob Wultsch
27 Apr 10 at 1:50 am
I’m trying to think of a suitably funny response, but I can’t find one that someone won’t take seriously and use as their basis for architecting a system.
Xaprb
27 Apr 10 at 6:18 am
amen. i’m fresh out of HA pixie dust.
sarah novotny
27 Apr 10 at 5:33 pm
In your web server example, you are creating an HA solution out of essentially non-HA hardware through the application. If one of your servers goes toes up, the application works around it. In the database example, that’s really not an option. As you’ve described, if a server fails, you’ve got problems. The obvious solution is to use highly reliable, fault-tolerant hardware to ensure that the server DOESN’T fail. That’s the approach that we take at Stratus Technologies. Instead of worrying about recovering from a failure, we put our effort into avoiding the failure in the first place (our servers have 99.999% uptime).
GCM
28 Apr 10 at 1:41 pm
That doesn’t scale very far and gets extremely expensive (in the general case; in your specific case it sounds like you have crafted a great solution). A lot of Percona’s customers are also hosted on hardware that simply can’t be made reliable.
Related link I just saw, which has premises I basically agree with: http://highscalability.com/blog/2010/4/28/elasticity-for-the-enterprise-ensuring-continuous-high-avail.html
Xaprb
28 Apr 10 at 3:10 pm
> A lot of Percona’s customers are also hosted on hardware that simply can’t be made reliable.
Well… the servers we sell are Intel-based, running Windows, Linux, or VMware. What sort of HW are you using? As for the cost… compared to what? Downtime costs a bunch. For that matter, replicated servers and the related infrastructure aren’t exactly cheap either.
GCM
28 Apr 10 at 4:37 pm
As one example, a lot of people are hosted in various clouds or high-volume resellers where there simply aren’t any guarantees.
Cost… it all depends on your point of view. I have a blog post drafted on the illusion that “the cloud is low cost.” Many businesses subsist on illusions. It is sometimes easier to create a convincing illusion than a convincing real thing.
Xaprb
28 Apr 10 at 4:46 pm