Xaprb

Stay curious!

How to scale writes with master-master replication in MySQL

with 40 comments

This post is SEO bait for people trying to scale MySQL’s write capacity by writing to both servers in master-master replication. The short answer: you can’t do it. It’s impossible.

I keep hearing this line of reasoning: “if I make a MySQL replication ‘cluster’ and move half the writes to machine A and half of them to machine B, I can increase my overall write capacity.” It’s a fallacy. All writes are repeated on both machines: the writes you do on machine A are repeated via replication on machine B, and vice versa. You don’t shield either machine from any of the load.

In addition, doing this introduces a very dangerous side effect: in case of a problem, neither machine has the authoritative data. Neither machine’s data can be trusted, but neither machine’s data can be discarded either. This is a very difficult situation to recover from. Save yourself grief, work, and money. Never write to both masters.

Written by Xaprb

August 6th, 2008 at 8:17 am

Posted in SQL

Tagged with , , ,

40 Responses to 'How to scale writes with master-master replication in MySQL'

Subscribe to comments with RSS or TrackBack to 'How to scale writes with master-master replication in MySQL'.

  1. Yes. Data inconsistent data is a major headache if you going down the path of master-master replication or “multi master” as some put it. And there is no performance gain either by distributing the write because both servers have to do the write operations regardless.

    There is a use though for master-master replication. That is as a hot backup solution. Here’s the example.
    Both servers run the database in master mode. But the application keeps track of which one is the current master and only send write request to that particular server. In case of the failure on the master part, the application, either automatically or with admin intervention, failover to the other server as the “current” master. When the old master server is back online, it now becomes the slave as the second master takes over its job as the current master server until something happens that turns the switch back to the first server.

    It does requires that the application be somewhat aware of the “curent” master server. But you know which one have the authoritative data based on which server is now the master.

    paan

    6 Aug 08 at 8:31 am

  2. I’ve never understood why people suggest MM replication will increase performance for writes on MySQL. I suppose partitioning your data on separate physical disks will help to some degree but surely sharding or connecting via MySQL Proxy is the only realistic way forward?

    More complicated I know but there seems no easy answer.

    James

    6 Aug 08 at 9:02 am

  3. Why even play with Master-Master? I’ve started using a combination of Heartbeat, DRBD and Mysql for the Master, and slaves working off the virtual ip used by heartbeat. The instruction on setup are here.

    Niall

    6 Aug 08 at 10:09 am

  4. paan, Baron never implied that master-master replication has no uses, his only point was that it doesn’t scale writes.

  5. paan, master-master for high availability is extremely useful. But I don’t like the application keeping track of which is the active master. There needs to be global agreement about whether the server is active or passive. That is what MMM is for (http://code.google.com/p/mysql-master-master/).

    James, you are right. There are special cases where NDB Cluster is the right solution, but in general sharding is the solution.

    Niall, for some uses DRBD is better; for some master-master is better. It depends on the requirements. DRBD adds latency to every write. How many writes do you have? How fast do they have to be? I am currently setting up a new server for a client with high I/O load. It’s so high that we created a special patched version of tar, because GNU tar causes unacceptable extra I/O load during backups. DRBD’s added latency would not be acceptable for this client — not even close. But master/master is a fine solution.

    Xaprb

    6 Aug 08 at 10:55 am

  6. Xaprb:

    Fair point, however where I deployed DRBD, High Availability was a higher priority than latency, and it’s doing a brilliant job so far. Saying that, even when abusing the system, the latency is negligible. Horses for courses and all that.

    Niall.

    Niall

    6 Aug 08 at 11:07 am

  7. Hi Baron,

    Your patched version of tar sounds interesting. Would you consider releasing this to the general public?

    Mike Montano

    6 Aug 08 at 11:17 am

  8. Sure. If you can’t afford to lose a transaction, then DRBD gives a stronger guarantee than master-master. But your original comment said why even play with master-master, and I felt obligated to point out that it is damn useful sometimes :-) For example, you can use it to run ALTER TABLE on a huge table on the passive server, switch passive/active and repeat. Maybe the ALTER TABLE takes 5 hours — but you can still avoid downtime with master-master this way.

    Xaprb

    6 Aug 08 at 11:22 am

  9. > DRBD adds latency to every write.

    DRBD (as of 8.0 iirc) has multiple levels of synchronization, and full async/semi-sync (Protocols A and B) are rather fast; the latter assuming a very low latency interconnect.

    http://www.drbd.org/users-guide/s-replication-protocols.html

    It is extremely useful in environments where mysql need to be HA’ed, or if virtualization is being used, and minimal data loss is acceptable but consistency of the failover server is necessary.

    Having said that, in a dedicated mysql setup I’d almost definitely stick with MMM. Strike the almost if extra slaves might show up to the party :)

    daniel

    6 Aug 08 at 11:59 am

  10. looks like I accidentally tossed in some markup

    > It is extremely useful in environments where mysql need to be HA’ed

    should be:

    > It is extremely useful in environments where mysql plus(anything else) need to be HA’ed

    e.g. a server running cpanel or plesk needs to be able to fail over for disaster recovery

    daniel

    6 Aug 08 at 12:04 pm

  11. Daniel, sure; we measured DRBD in the real world: http://www.mysqlperformanceblog.com/2008/06/02/how-much-overhead-drdb-could-cause/

    The point is that “fast” is relative. Relative to the speed of a good-quality RAID card with a battery-backed write cache, DRBD’s overhead is noticeable. And our scenario really is real-world, with InnoDB and a transactional workload — you may be able to escape the costs of some of the fsync operations if you’re using MyISAM, which doesn’t even try to actually write your data to disk; but that’s not HA. “I have a copy of my disk” doesn’t help if the data was only in the memory of the machine that just failed :)

    DRBD is really kick-ass technology, don’t get me wrong, I just don’t think it’s very useful to compare with master-master replication. It’s not really even apples and oranges; there are so many big, big differences. For example, the DRBD standby machine isn’t warmed up. Let’s suppose DRBD’s latency wasn’t an issue for the client I mentioned before, and I was using it there. In a failover, the standby would take some time to fsck (not sure; I have not timed but it’s a very large disk), about 30-45 minutes to run InnoDB recovery, and another 5-10 hours to warm up the caches fully. In a master-master replication setup, the standby machine is ready to go.

    This is why I actually like using DRBD for just the binary logs, and not all the data. If you have a slave, you can probably count on it being approximately close to the master. It maybe is missing the last few transactions. But those are contained in the binary logs, which are on your DRBD device; fine. Just replay the last few from there and you got all your data back, and a warmed-up server to boot.

    Xaprb

    6 Aug 08 at 12:38 pm

  12. Oh, actually I meant to say: I think DRBD is for situations where NO data loss is acceptable. If minimal data loss is acceptable, plain old replication gets you there with no further ado.

    Xaprb

    6 Aug 08 at 12:39 pm

  13. Master-Master Replication is for HA only, and watch out for split brain when doing active writes on both sides.

    To scale writes, partition your data, i.e. shard it. If you need to see an example of what sharding is and how to use it.

    Then on top of that scaling writes requires good I/O, memory and few range updates / deletes.

    Cluster is really good for Hash (big-O(1)) lookups, this makes sense to use if your using a lookup method for what an object shard is.

  14. I’m trying to decide if I am splitting hairs, but how exactly is a system “master-master” if you can only write to one of the servers at a time? What you guys keep referring to sounds an awful lot like master-slave with node promotion, which is certainly simpler to implement than a real master-master, and should generally have less overhead. Are you guys confused, or am I? :-)

    xzilla

    7 Aug 08 at 11:11 am

  15. xzilla, you’re right: if you take master/slave and on the master run CHANGE MASTER TO [slave], then you have master/master. But there’s one crucial difference: both servers are configured symmetrically. Suddenly, just because of the symmetry, all sorts of neat things are possible.

    Xaprb

    7 Aug 08 at 10:21 pm

  16. Oh — I would also say, master/master is no more complex than master/slave, and has no more overhead.

    Xaprb

    7 Aug 08 at 10:22 pm

  17. Hey, Baron… Excellent discussion. You mentioned creating a special patched version of GNU tar, because the regular version caused unacceptable I/O load. Completely unrelated to MySQL, we have a couple of Linux servers that also have some performance impact when they are being backed up, and I think the backup script uses some form of tar. What about tar did you patch, and to what performance effect? Is this a public patch, or one Percona developed in-house? I’m very curious.

    Thanks!

    Jeff Saxe

    8 Aug 08 at 12:05 am

  18. I didn’t patch it, so I can’t brag about it (I don’t even know where the patch is…) but I have sent an email on our internal list to see if I can get someone to step forward. Seems it would be pretty useful.

    Xaprb

    8 Aug 08 at 12:56 am

  19. Hi Baron,

    Nice article as usual. There is a corner case in which writes scale very well if you have multiple masters. If you are replicating rows *and* the write request is an update or delete that performs a large read in order to find rows on which to operate, then you get a benefit from spreading them over multiple servers.

    Certification-based clustering approaches like Postgres-R apply this result in a general way. These work by allowing transactions to execute locally and then replicating changed tuples very efficiently to other nodes. In this case you take away most work other than applying changes to the log, so you can actually see write scaling as more nodes are added to the cluster. Among others, there was a well-known paper by Bettina Kemme that demonstrated the effect. (See http://www.cs.mcgill.ca/~kemme/papers/vldb00.html.) Unfortunately, there are still no systems that really work with this approach yet–they are hard to build and require a lot of expertise in group communication.

    Robert Hodges

    8 Aug 08 at 1:37 am

  20. Robert, that’s a good point, both about the effect (I wrote several pages about this general approach to reducing the workload on slaves in High Performance MySQL, Second Edition beginning on page 399) and why it isn’t out in the wild :-)

    Xaprb

    8 Aug 08 at 8:12 am

  21. Watch mysqlperformanceblog.com for more about the patch to tar, as well as a crazy inventive trick to I/O limiting even an unpatched tar. (hint: it involves signals.)

    Xaprb

    8 Aug 08 at 10:16 am

  22. @xzilla – this confused me too and it still scares me. By master-master they mean that both nodes can accept writes and replicate to the peer master. This can produce inconsistent data. You prevent that by trusting your apps to do the right thing (guaranteeing inconsistent data) or by putting one of the servers in readonly mode.

    Using readonly mode can be tricky. First, any account with SUPER can still change rows on a server in readonly mode. Second, you want failover to be fast, which means you also want to avoid a restart on failover, which means that you want to use ’set global read_only=1′ or ’set global read_only=0′ to change this value. Alas, if the master in readonly mode restarts (things happen), then the value of read_only is reset to the default, or whatever is set in my.cnf. So you probably want to run with read_only=1 in my.cnf and have whatever software promotes a node to be the master responsible for running ’set global read_only=1′.

    Mark Callaghan

    8 Aug 08 at 12:14 pm

  23. Mark nailed it. Our MMM toolkit assumes that you have set read_only=1 in your my.cnf and that you have not promiscuously granted SUPER.

    Xaprb

    8 Aug 08 at 2:06 pm

  24. There is one scenario where writing to both masters makes sense: when the write load is just a little too much for the (singlethreaded) slave.

    With massive concurrent writes the work is done with N threads on the master, but only 1 thread on the slave. Evenly distributed to both instances we keep N/2 1 threads busy on both machines.

    Yes, this scenario is obscure. Just wanted to mention it.

    XL

    8 Aug 08 at 6:14 pm

  25. Ah HAH! *Baron pounces* I was waiting for someone to suggest this.

    It’s also a fallacy. Let’s say your server can do 1000 writes/sec in 10 threads (we’ll pretend that this is massive concurrent writes, because it makes the math easier). So each thread is doing 100 writes/sec.

    Now say your slave can do 120 writes/sec. Because there is only one thread and it isn’t battling 9 others, it can do more than 100 writes/sec.

    So in the model where you write to only one master, your whole system is limited to 120 writes/sec. If you go over this, the slave falls behind.

    Now you decide to share the load, as it were. Put half the writers on one node and half on the other. Look at node A: all the writes on it get serialized and replayed on node B, at a maximum of 120 writes per second. Ditto for the vice-versa. So your total write capacity is… assuming the slave thread can somehow manage to keep up its pace of 120 writes/sec while battling 5 other writers and all is perfect in every way… 240 writes/sec.

    What’s the capacity of a single server again?

    1000 writes/sec.

    You shared the multi-threaded penalty evenly among the two servers. But you didn’t scale writes. In MySQL replication, the single-threadedness *always* means replication reduces your overall write capacity below the capacity of a single server. Stick a slave onto a server, and you drop its write capacity. And that is the secret: replication can help scale reads, but not writes. For that, you need sharding.

    Xaprb

    8 Aug 08 at 6:35 pm

  26. Baron,

    you must not compare the 240 distributed writes with the 1000 for single server. It’s a completely different setup.

    In many cases there is replication at work. And sometimes the slave is lagging. People don’t like lagging slaves because they are of no use at all. In this situation distributing writes might push the system over the edge. Not for long, that’s sure. Didn’t I say the scenario is obscure?

    XL

    8 Aug 08 at 7:16 pm

  27. I have to disagree pretty vigorously. This whole blog post is about the fallacy of using replication to scale writes, so I think it is appropriate to point out that replication of any kind decreases write capacity. The fact that a single server is different from master-master is exactly the point. Your suggestion doesn’t scale writes — it just helps distribute the reduction in write capacity so all the links are just as weak as the others instead of one being weaker than the others. There are many ways to reduce the performance penalty of replication, most of them safer than writing to both nodes.

    Xaprb

    8 Aug 08 at 10:12 pm

  28. Thanks Baron. I’ll keep an eye out for it.

    Mike Montano

    10 Aug 08 at 2:52 pm

  29. In short, replication doesn’t magically increase the capacity for writes on the system. If your system tops out at, say, 2,000 writes per second, replication isn’t going to make your system able to do 3,000 writes per second. (neither will scaling strategies like partitioning; they just remove the need for 2-3,000 writes per second by moving the writes elsewhere — the only thing that will is to change the nature of the write queries, ie making them faster to do with a query or schema change or perhaps server tuning).

    So, master/master replication doesn’t increase the throughput of writes. However, simply because replication “spaces out” writes, it may actually help during spikes — you just need a tolerance to be somewhat behind. If you have servers that can do max 2,000 writes per second each, with 1 server you can do 2,000 writes per second. With 2 servers using master/master replication, you can actually direct 1500 writes per second to each server — 3,000 writes will happen in that one second, but they won’t *replicate* in that one second.

    If what the organization needs is to handle a 2-3 minute load spike, replication may be the key to spreading out the load. However, master/master replication doesn’t scale (circular replication is even more dangerous than master/master replication), so it’s not really an architecture design decision, more like an “uh oh, what now?!” solution.

    I do agree that replication is a strategy for HA, not for scaling. But I also feel that it is an acceptable short-to-medium term solution for a growing company that needs to spread load a bit (not reduce the load, just spread it).

    There’s a difference between the *database* being able to handle 3,000 concurrent writes, and the *application* being able to handle 3,000 concurrent writes. No architectural solution can make the database able to handle more writes; only query/schema optimizations and server tuning will do that. However, many architectural solutions can make the *application* able to handle more writes.

    Sheeri K. Cabral

    11 Aug 08 at 9:27 pm

  30. Sheeri,

    Your master-master pair of machines capable of 2000 writes per second *cannot* handle 3,000 queries per second. It cannot even handle 2,000 queries per second, because 1/2 the workload on each server is serialized, not run in parallel, and because the total write load on each server is equal to the total of the writes on the pair — the writes are not “spread” or “shared,” they are duplicated, plain and simple. I would love for someone to attempt to do this and show it happening. I maintain that it cannot be done, period. Have you actually done this in real life?

    Xaprb

    12 Aug 08 at 1:13 am

  31. Baron, you are correct, but you are not disagreeing with me. The *databases* cannot handle more than 2,000 writes per second. But the *application* can.

    When I talked about “spreading” I referred to time — I specifically said that “you just need a tolerance to be somewhat behind” — as in, it will lag behind.

    I *started* by saying that replication cannot make a database server have more than 2,000 writes per second. Why would I end by saying I could?

    The master-master pair *can* handle “3,000 concurrent writes” — that’s exactly how I stated it — after 1 second, there will be lag, but it can handle the concurrent writes. This is why it can smooth out spikes — if an organization has 3,000 writes per second at peak time and their db server can only do 2,000 writes per second, splitting those writes across 2 masters will allow the application to work — *assuming you can tolerate the replication lag*.

    Does that make more sense?

    (again, it’s not a long-term solution, but certainly a short-to-medium term one).

    Sheeri K. Cabral

    12 Aug 08 at 3:33 am

  32. OK, I see your point about spreading the writes “over time” — full speed ahead and damn the slave lag, they’ll catch up later. I’m not sure about the math. I suppose it matters how many threads you have writing on each master; I would have to test it but I think if the pair of servers has only 2 connections (one per server) then 1500 writes/sec would be too much, really – you could only get 2000 writes/sec on each server, so all things being equal that would be 1000 from the “real” connection and 1000 from the slave thread. But this also assumes that the slave thread will effectively yield to the client connections to permit the client connections to write more than it — I can see this in theory if there are more client connections than slave connections, but I’d still want to test it in real life to see how much you could actually force into the servers to get this effect for a short while. It shouldn’t be hard to set up and test.

    Xaprb

    12 Aug 08 at 8:20 am

  33. Who better at tests than Percona? ;)

  34. Indeed, indeed. I think I have time for it between 3 and 4AM on August 18th. 2009. ;-)

    Xaprb

    12 Aug 08 at 11:21 am

  35. Duh… you just have to have four masters!

    :-P

    Kevin Burton

    20 Aug 08 at 8:58 pm

  36. The I/O tricks for slow-tar and other things is here: http://www.mysqlperformanceblog.com/2008/11/06/living-with-backups/

    Xaprb

    6 Nov 08 at 8:53 pm

  37. I am very new to the mysql replication and I apologize if the question is stupid.

    Is it possible to have the following topology?: one active master with two other passive masters.

    Dimitri

    4 Dec 08 at 9:09 pm

  38. Dmitri, it’s not stupid, but I’m not sure I understand, anyway. You can have any number of slaves, but I don’t recommend more than 2 masters.

    Xaprb

    5 Dec 08 at 12:39 pm

  39. @Xaprb: thank you for the reply. In this case I think I will be interested in implementing two masters with one slave. My idea is to have one active master for all writes, a passive master for all reads and another slave for backups and cron jobs that involve big table scans and put additional load to mysql. I found brief description of this scheme at:

    http://code.google.com/p/mysql-master-master/wiki/TypicalUseCases

    However one thing that I am concerned of is that the description states: “IT could not be implemented without using some third party software or MySQL Cluster solutions (with well known set of limitations on database size)”.

    I don’t want to use Mysql Cluster. What kind of 3-rd party software is needed for this scheme?

    Dmitri

    5 Dec 08 at 2:33 pm

  40. Hi Dmitri,

    I’m sorry, but that is a very long conversation. It is more advice than I want to give on my blog, since I am a consultant and give this advice via percona.com for a living :) You can also learn a lot about this via the MySQL manual, or by reading our book, but I don’t want to repeat it here.

    Xaprb

    5 Dec 08 at 3:18 pm

Leave a Reply