Xaprb

Why high-availability is hard with databases

A lot of systems are relatively easy to make HA (highly available). You just slap them into a well-known HA framework such as Linux-HA and you’re done. But databases are different, especially replicated databases, especially replicated MySQL.

The reason has to do with some properties that hold for many systems, but not for most databases. Most systems that you want to make HA are relatively lightweight and interchangeable. They hold little or no state, are easy to start and stop, don’t care a lot about storage (or at least don’t write a lot of data; that’s usually delegated to the database), and suffer little or no harm if you ruthlessly behead them. The classic example is a web server, or even most application servers. Most of the time these things are all about CPU power and network bandwidth. If I were to compare them to a car, I’d say they are like matchbox cars: there are many of them, and they are cheap and easy to replace.

Databases are different. With or without replication, you’re looking at a system that is complex, stateful, heavyweight, and cares a lot about storage. It runs on bigger hardware with fast disks and a lot of memory. It’s usually disk-bound, and it does a lot of writes. It’s hard to start: it takes a long time to warm up and really get ready to serve production workloads (many minutes, hours, or even days). It tends to run with a lot of data in memory in a dirty state, so shutdown is slow, because a clean shutdown requires flushing a bunch of data to disk. If you yank its power plug or kill-dash-nine it, it’ll have to perform recovery on startup, which slows the startup process even more. If I were to compare a database server to a car, I wouldn’t even use a car as the analogy: I’d use one of those big-ass mining trucks. If your mining truck breaks down, you don’t just toss it in the trash and pull another off the shelf.

The problem with a lot of HA solutions is that they want to deal with inconsistencies or irregularities by killing the resource and replacing it in another location. This works fine with web servers, but not with database servers. Doing that will cause serious pain and downtime, defeating the point of HA. And when you add replication into the mix, it gets even worse. A system that wants to manage replication needs to deal with very complex conditions. A lot of replication failures are delicate matters that require skilled human intervention to solve. The HA solution must insulate the application from the misbehaving resource, but leave it running so the human can handle things.
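
To make that last point concrete, here is a minimal sketch of the two policies side by side. It is only an illustration under my own assumptions, not how Linux-HA or any real framework works: `DbNode`, `Pool`, `handle_unhealthy`, and the `stateless` flag are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DbNode:
    """Hypothetical stand-in for one managed server."""
    name: str
    running: bool = True       # is the server process alive?
    in_rotation: bool = True   # does the application send it traffic?


@dataclass
class Pool:
    """Hypothetical routing layer the application asks for a server."""
    nodes: list = field(default_factory=list)
    needs_human: list = field(default_factory=list)

    def routable(self):
        # Only servers that are alive AND in rotation receive traffic.
        return [n.name for n in self.nodes if n.running and n.in_rotation]

    def handle_unhealthy(self, node, stateless):
        if stateless:
            # Matchbox car: kill it; a replacement is cheap.
            node.running = False
            node.in_rotation = False
        else:
            # Mining truck: hide it from the application, but leave the
            # process running so a human can inspect and repair it.
            node.in_rotation = False
            self.needs_human.append(node.name)


pool = Pool([DbNode("db1"), DbNode("db2")])
pool.handle_unhealthy(pool.nodes[0], stateless=False)
print(pool.routable())     # ['db2']
print(pool.needs_human)    # ['db1']
```

The only thing the sketch is meant to show is that "remove from rotation" and "stop the process" are separate decisions. For a database you usually want the first without the second.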

This is not the way most applications are made HA. It’s different with databases, and it’s much harder.

Written by Xaprb

April 26th, 2010 at 7:53 am

Posted in High Availability,SQL

7 Responses to 'Why high-availability is hard with databases'

  1. What?!? You mean multi-master circular replication won’t make all the db problems go away?

    Rob Wultsch

    27 Apr 10 at 1:50 am

  2. I’m trying to think of a suitably funny response, but I can’t find one that someone won’t take seriously and use as their basis for architecting a system.

    Xaprb

    27 Apr 10 at 6:18 am

  3. amen. i’m fresh out of HA pixie dust.

    sarah novotny

    27 Apr 10 at 5:33 pm

  4. In your web server example, you are creating an HA solution out of essentially non-HA hardware through the application. If one of your servers goes toes up, the application works around it. In the database example, that’s really not an option. As you’ve described, if a server fails, you’ve got problems. The obvious solution is to use highly reliable, fault-tolerant hardware to ensure that the server DOESN’T fail. That’s the approach that we take at Stratus Technologies. Instead of worrying about recovering from a failure, we put our effort into avoiding the failure in the first place (our servers have 99.999% uptime).

    GCM

    28 Apr 10 at 1:41 pm

  5. That doesn’t scale very far and gets extremely expensive (in the general case; in your specific case it sounds like you have crafted a great solution). A lot of Percona’s customers are also hosted on hardware that simply can’t be made reliable.

    Related link I just saw, which has premises I basically agree with: http://highscalability.com/blog/2010/4/28/elasticity-for-the-enterprise-ensuring-continuous-high-avail.html

    Xaprb

    28 Apr 10 at 3:10 pm

  6. > A lot of Percona’s customers are also hosted on hardware that simply can’t be made reliable.

    Well... the servers we sell are Intel-based, running Windows, Linux, or VMware. What sort of HW are you using? As for the cost... compared to what? Downtime costs a bunch. For that matter, replicated servers and the related infrastructure aren’t exactly cheap either.

    GCM

    28 Apr 10 at 4:37 pm

  7. As one example, a lot of people are hosted in various clouds or high-volume resellers where there simply aren’t any guarantees.

    Cost… it all depends on your point of view. I have a blog post drafted on the illusion that “the cloud is low cost.” Many businesses subsist on illusions. It is sometimes easier to create a convincing illusion than a convincing real thing.

    Xaprb

    28 Apr 10 at 4:46 pm
