Xaprb

Stay curious!

MySQL disaster recovery by promoting a slave

with 5 comments

I was just talking to someone who backs up their MySQL servers once a day with mysqldump, and I said in a catastrophe, you’re going to have to reload from a backup; that’s some amount of downtime, plus up to a day of lost data.

And they said “We can just promote a slave, we’ve done it before. It works fine.”

Granted, in some/many cases, this is fine. There are all sorts of caveats — for example, you either know that your slave has the same data as the master or you don’t care. But it’s fine for some things.

So then I said “what about DROP TABLE?”

And there was a pause. I assume they were realizing that the chance of accidental or malicious destruction of data is much higher than the chance of multiple servers dying at once. This is why slave != backup.

How about you?

Granted, you can use a delayed slave to protect against this particular scenario. But you still need “real” backups, and you still have to think about the worst case — restoring that backup.

Written by Baron Schwartz

January 20th, 2009 at 7:33 pm

5 Responses to 'MySQL disaster recovery by promoting a slave'

Subscribe to comments with RSS

  1. Yep, right on about a slave not being your backup. It is one layer of many layers one should have. What we have includes:

    1) Local slaves in the data center
    2) Remote slaves in a remote data center (offsite)
    3) Nightly pause a slave, make snapshot and tgz each database/table to its individual tgz. This allows us to quickly pull back a dropped or corrupted table.
    4) Make tape backup of the tgz files.
    5) Rotate the backups.

    We have thought of using a delayed slave as well just so we have a more recent copy of a table that was dropped or corrupted.

    One last thing. We assume our backups are not good, so we test quite often to ensure they work ;-). Believe it or not this has uncovered some gaps that we just didn’t realize – so test your backups regularly.

    Good topic,

    Dale

    Dale

    20 Jan 09 at 9:12 pm

  2. Is there a general solution that everyone uses for rapidly-available backup restoration? We have a separate DB that we’ve duplicated our data operations on (not my idea, but we think it works), which wouldn’t work in a lot of cases. Is it just an inevitable bucket of suck?

    Chris D

    20 Jan 09 at 9:13 pm

  3. It really depends on your situation. There are a surprising number of nuances. Setting up backups isn’t simple. For example, Dale’s solution wouldn’t work for some scenarios. As an example, it wouldn’t work for InnoDB tables — you can’t copy individual tables around, even with innodb-file-per-table. So you could back up this way, but you can’t just restore a single table at will. You need some advice from someone who knows your application and your requirements in depth.

    That said, most of the backups I set up are LVM snapshots with a binary copy to a remote destination. There are details here — binlog positions, etc — but that’s the gist of it.

    If you want to sponsor the work, a Percona consultant has a draft of some differential InnoDB backup tools, and there are features we could build into InnoDB/XtraDB to solve the problems with things like importing or exporting files at will. Many of these types of things are surprisingly easy in concept, and just need someone to step up and say “I want it and I’m willing to pay to make it happen.”

    Xaprb

    20 Jan 09 at 9:30 pm

  4. For our database, which:

    1. Isn’t too big (< 5 GB for a compressed dump)
    2. Doesn’t have super-high update rates.

    we do:

    1. mysqldump of a (remote) slave once every night after stopping the slave replication.
    2. We do nightly compares with the master, to make sure that things haven’t gotten out of sync.
    3. We flush the binary logs every 30 mins, and copy them offsite continuously.
    4. We have a local slave for failover purposes.

    So, if there’s a “normal” failure, we can fail over to the local slave.

    If someone drops a table, or the database gets wiped and the slaves are impacted, we will:
    1. Restore from last night’s mysqldump.
    2. Replay the binlogs up to the last point before the offending transaction (or as far as we have binlogs for).

    We will probably have to eventually move to a model where we shut the slave and do a file-based backup. This will allow for a quicker restore, and then we drop the binlog replay on top of that.

    Jon Gardner

    20 Jan 09 at 10:07 pm

  5. The problem with “disaster recovery” is that really there isn’t one disaster.

    Also, I’ve found that many folks say “we’ll just promote our slave” even though the slave is already doing some read offloads of the master. When we ask “What happens to the read-only queries when the slave is promoted?” the dead silence indicates they have not thought everything through.

    None of these situations take into account “what if the president of the company deletes their ‘friends list’”?

    One thing I recommend some companies do is to have a mysqldump of each table with –skip-extended-insert. Each table has its own file. In that way, they can easily grep for information, like

    grep INSERT.INTO.`friend_list`.(123456 friend_list.sql

    to find the friend list for user 123456 from yesterday.

    This is in addition to a full backup, usually a physical cold backup or physical hot backup with innobackup.pl. We recommend 2 or 3 different backup strategies to deal with different failure scenarios.

    Sheeri Cabral

    22 Jan 09 at 7:57 pm

Leave a Reply