Xaprb

Stay curious!

Failure scenarios and solutions in master-master replication

with 27 comments

I’ve been thinking recently about the failure scenarios of MySQL replication clusters, such as master-master pairs or master-master-with-slaves. There are a few tools that are designed to help manage failover and load balancing in such clusters, by moving virtual IP addresses around. The ones I’m familiar with don’t always do the right thing when an irregularity is detected. I’ve been debating what the best way to do replication clustering with automatic failover really is.

I’d like to hear your thoughts on the following question: what types of scenarios require what kind of response from such a tool?

I can think of a number of failures. Let me give just a few simple examples in a master-master pair:

Problem: Query overload on the writable master makes mysqld unresponsive
Do nothing. Moving the queries to another server will cause cascading failures.
Problem: The writable master is completely unreachable
Fence the writable master and promote the standby master.
Problem: The writable master is reachable but unresponsive due to overload-induced swapping
Do nothing. Moving the load to another server will cause cascading failures.
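
The three examples above amount to a small lookup from observed symptom to response. Here is a minimal Python sketch of that lookup; the names `Symptom`, `Action`, and `decide` are hypothetical and not taken from any existing tool.

```python
from enum import Enum


class Symptom(Enum):
    QUERY_OVERLOAD = "mysqld unresponsive due to query overload"
    UNREACHABLE = "writable master completely unreachable"
    SWAPPING = "reachable but unresponsive due to overload-induced swapping"


class Action(Enum):
    DO_NOTHING = "do nothing"
    FENCE_AND_PROMOTE = "fence the writable master and promote the standby"


# Responses proposed in the post for a master-master pair.
RESPONSES = {
    Symptom.QUERY_OVERLOAD: Action.DO_NOTHING,
    Symptom.UNREACHABLE: Action.FENCE_AND_PROMOTE,
    Symptom.SWAPPING: Action.DO_NOTHING,
}


def decide(symptom: Symptom) -> Action:
    """Return the proposed response for a symptom seen on the writable master."""
    return RESPONSES[symptom]
```

The sketch leaves open the harder part: how a tool would tell these symptoms apart from the outside.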

I don’t want to bias the jury, so I’ll stop there and ask you to contribute your failure scenarios and what you think the correct action should be.

Written by Xaprb

August 30th, 2009 at 3:08 pm

Posted in SQL


27 Responses to 'Failure scenarios and solutions in master-master replication'


  1. This is slightly off-topic but we had the same problem with Tungsten. It’s really a couple of meta-problems:

    1.) There are a lot of conditions to cover.
    2.) Conditions tend to overlap both in terms of symptoms as well as actions.
    3.) Not everyone wants to take the same action for each case.

    We solved this by adding a production rules engine (JBoss DROOLS) layered on top of monitoring. It allows us to have a broad set of rules that are individually easy to verify and work together to take action (or not) depending on the situation. You can also tweak them quickly to change behavior, for example aggressive fencing of suspect servers versus simple notifications to a sysadmin or monitoring system.

    To get back to your question, one interesting class of problems is what we call “time-boxed failures” where we have not received status from a database for some period of time. In this case you often want to failover/fence (master) or fence (slave) after a configurable time period expires. On the other hand you might just want to notify somebody.

    Robert Hodges

    30 Aug 09 at 5:28 pm
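
The “time-boxed failure” rule described in the comment above can be sketched in a few lines of Python. This is an illustration only, not Tungsten’s rule set: the function name, the role strings, and the `notify_only` switch are all assumptions made for the example.

```python
from typing import Optional


def timeboxed_action(role: str, last_status_at: float, now: float,
                     timeout_s: float, notify_only: bool = False) -> Optional[str]:
    """Decide what to do when a database has been silent for a while.

    role           -- "master" or "slave" (hypothetical labels)
    last_status_at -- time of the last status report, in seconds
    now            -- current time, in seconds
    timeout_s      -- the configurable time box
    notify_only    -- if True, only tell a human instead of acting
    """
    if now - last_status_at < timeout_s:
        return None  # still inside the time box: take no action
    if notify_only:
        return "notify"
    # Time box expired: fail over and fence a master, just fence a slave.
    return "failover_and_fence" if role == "master" else "fence"
```

For example, `timeboxed_action("slave", 0, 61, 60)` returns `"fence"`, while the same call with `notify_only=True` returns `"notify"`.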

  2. For every failure scenario one must have several sub-scenarios for either IO or SQL threads lagging. Everything would be much less complicated with synchronous replication. With asynchronous I don’t see good alternatives to having a human actually look at the box and make a decision.

    Consider what happens if the SQL thread is lagging and one fails over to the slave. A connection inserts into a table with an auto-increment column and takes values that the SQL thread was expecting to take. There now exists a terrible mess to clean up without data loss.

    Even if we had synchronous replication I imagine in most cases a human should look at the boxes and make a decision about the right course of action. Is the problem one of hardware, networking, application, MySQL, etc.? Will a failover actually help resolve the problem? Also, a small amount of downtime would allow IO threads to catch up on slaves that were slightly lagging.

    Recently I have become interested in DRBD as it could somewhat emulate synchronous replication. It seems like this approach, while attractive on its face, is probably not ideal for MySQL because one doesn’t really get a hot slave, MyISAM corruption is still very problematic, etc.

    I really wish and hope I am wrong in my view of such things. Perhaps I am overly afraid of split brain or automatic failovers switching back and forth many times a second between servers.

    * I have no idea if the semi-synchronous patches from google have a prayer of going into core.

    Rob Wultsch

    30 Aug 09 at 8:34 pm
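
The lagging-SQL-thread hazard described in the comment above suggests one check a tool could run before promoting a slave: confirm that the SQL thread has applied everything the IO thread has fetched. Below is a minimal Python sketch, assuming `status` is one row of `SHOW SLAVE STATUS` already fetched into a dict; the function name is hypothetical.

```python
def sql_thread_caught_up(status: dict) -> bool:
    """True if the SQL thread has executed up to the IO thread's read position.

    Compares the binlog file and position the IO thread has read
    (Master_Log_File, Read_Master_Log_Pos) with the file and position
    the SQL thread has executed (Relay_Master_Log_File, Exec_Master_Log_Pos).
    """
    same_file = status["Relay_Master_Log_File"] == status["Master_Log_File"]
    same_pos = int(status["Exec_Master_Log_Pos"]) == int(status["Read_Master_Log_Pos"])
    return same_file and same_pos
```

This only covers events that reached the slave. It says nothing about events the IO thread never received from the failed master, which is the part asynchronous replication cannot guarantee.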

  3. This was raised earlier on the MMM list, why not discuss there?

    Arjen Lentz

    30 Aug 09 at 9:19 pm

  4. 1) a wider audience.

    2) I didn’t explain it clearly, but I’m looking for a different kind of discussion.

    The meta-question underlying all of this is “is it better to do all this with heartbeat + a load balancer.” Some people tell me it is. They say that anything of MMM’s ilk is reinventing the wheel badly. So the question is, when do you really want to fail over, and when do you want to do load balancing? And are there cases where you want a soft failover instead of a hard failover such as what you’ll get with heartbeat? If there are none, then it would seem that heartbeat + load balancing is the right approach.

    Xaprb

    30 Aug 09 at 9:56 pm

  5. baron: righty. OQ has been assessing that as well.
    For VirtualIP flipping there are indeed more mature solutions.
    There are other aspects in both the monitoring and the failover that are MySQL specific though, and not covered by other tools.
    Using MMM for loadbalancing seems silly (we never use it that way); it could assist, where a slave that’s too far behind gets taken out of a pool, but otherwise no.

    Arjen Lentz

    30 Aug 09 at 10:05 pm
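
The one load-balancing assist mentioned in the comment above, taking a slave that is too far behind out of the pool, might look like the following Python sketch. The function name and the host names in the example are hypothetical; the lag values are assumed to come from `Seconds_Behind_Master`, which is NULL (here `None`) when replication is not running.

```python
from typing import Dict, List, Optional


def healthy_slaves(lag_by_host: Dict[str, Optional[int]], max_lag_s: int) -> List[str]:
    """Return the slaves whose replication lag is known and within max_lag_s."""
    return sorted(
        host
        for host, lag in lag_by_host.items()
        # A lag of None means replication is stopped or broken: exclude it.
        if lag is not None and lag <= max_lag_s
    )
```

With a 30-second threshold, a pool of `{"db2": 3, "db3": 120, "db4": None}` is reduced to `["db2"]`.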

  6. @Rob

    DRBD is an excellent solution for MySQL.

    “if your CIO hates the idea of idle hardware, run two cluster services on two servers, with Primary and Secondary roles defined reversely. Just stick an additional NIC in for another dedicated DRBD link, and use separate physical disks for the data used by your services. In that case you still take a performance hit in case of failover, but only in case of failover”

    http://fghaas.wordpress.com/2007/06/22/performance-tuning-drbd-setups/

    With the availability of 3rd party fulltext engines I can’t think of any good reason except for GIS types that require using MyISAM. Most people should just use InnoDB or some other transactional engine.

    You can use dopd or some other stonith/fencing config to avoid split-brain.

    http://www.drbd.org/users-guide/s-heartbeat-dopd.html

    The Google semi-sync patches have already been reimplemented and were merged into MySQL 6.0 some time ago.

    http://dev.mysql.com/doc/refman/6.0/en/replication-semisync.html

    Enjoy!

    Matthew Montgomery

    31 Aug 09 at 12:02 am

  7. @Matthew

    DRBD is wonderful for a lot of things, but for MySQL I think it is more a case of “we’ve got nothing better.” First, DRBD “wastes” one of your (probably very powerful) servers, which does nothing but wait for failures. And of course DRBD slows down your writes even when it is working well (the DRBD device doesn’t report a write as successful until the other machine has received it), so you lose more money or performance by using it, because the O_DIRECT flush method loses a lot and your expensive RAID cards are worth “almost nothing” for performance. So for me (after two years of experience with DRBD + MySQL) it’s “we’ve got nothing better.”

    And yes, you can use dopd to avoid split brain, but screwing up DRBD is pretty easy. In most cases, when you have a bigger system, you are the primary admin/engineer, but maybe you have a team or people on duty 24/7. Can you be sure that they won’t do anything nasty when it stops in the middle of the night and your phone’s battery is empty? Sh1t happens.

    One more reason DRBD can end up worth nothing, and it happened to me. Say you have a large database and one of your tables gets overloaded, or something happens to your server (DRBD could cause this earlier too), and you have to kill mysqld or reset the machine. The other node invalidates the primary node and your passive master takes over, sure, but have you checked the mysql init.d script? A lot of people (as I did at first) forget about the timeout and InnoDB recovery. In Debian it is basically 15 seconds. When it times out, heartbeat shuts mysqld down (it’s doing recovery, man, don’t do that! you would say), and when it sees the other node is still offline, it tries again: another failed recovery, another kill, and so on. How do you fix it? By hand, and once you fix it (meaning you bring up the HB cluster by hand, set DRBD primary, mount, bring up the IP and start mysql), there you go: you have lost your HA cluster and replaced it with a manually configured one. To restore it, you have to shut it down, roll back your actions and let it fall back to the previous primary master. Yep, that’s downtime. And of course, you (usually) won’t care about this when your site is down. So I really DON’T think that DRBD is good for MySQL.

    So every time I meet people who _understand_ replication, I recommend MMM.

    Istvan Podor

    31 Aug 09 at 4:42 am
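
The restart loop described in the comment above comes from a start timeout that is shorter than InnoDB recovery. The Python sketch below shows a readiness wait with a configurable deadline. It is not the Debian init script or a heartbeat resource agent; `wait_until_ready` and its `probe` callable are hypothetical names, and the clock and sleep functions are injectable only so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float,
                     interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s has elapsed.

    probe     -- hypothetical check, e.g. "does mysqld accept connections?"
    timeout_s -- must be longer than the worst-case recovery time,
                 otherwise the caller will give up mid-recovery
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False  # caller decides what to do; nothing is killed here
        sleep(interval_s)
```

With a recovery that needs 40 seconds, a 15-second deadline reports failure every time, while a deadline sized for recovery succeeds on the first attempt.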

  8. @Baron

    I have to argue with you about the possibilities you mentioned in your examples.

    Having used MMM for a few months now, I think that in those cases (The writable master is reachable but unresponsive …, Query overload on the writable master…) the cascading depends heavily on your application.

    I had both cases in the last few years.
    Take the third case you mention, query overload: I had that just a few days ago. And it cascaded, and let me say thank you for maatkit ;). But I still don’t know why, because I didn’t move the IP to the other master, so I think it was not because of MMM.

    Anyhow, what I want to say is that for me cascading can only mean duplicates. Let’s say that while a server is overloaded you move the role, and some queries were written to the binlog on the previous active master but not replicated to the new master. Your application will execute those queries again after the master changes. But in an MMM-managed cluster all your former slave hosts will remain consistent. (I tested the cases I could imagine with a fork bomb under Linux, as I mentioned on the MMM mailing list.) The worst case was that replication between the masters broke and I had to fix it by hand, skipping some inserts that should be there. And of course MMM is almost useless without maatkit. :)

    In this case I have to agree with Rob, but not about semi-sync replication. As far as I know it only makes sure that ONE of the slaves has replicated the change. This could slow down your application a lot, and of course, even if you wanted to, you couldn’t (or could only with difficulty) do something like an ALTER TABLE on your cluster using MMM’s advantages to avoid downtime (never tried, just thinking). I think patches like that can only work well in the environment they were made for; in this case, Google’s.

    So, from my experience over the last few years:
    - The only solution that is not enterprise (like NDB) and that I find pretty useful for MySQL is MMM.
    - I have realized that I can’t find a good automatic failover solution for MySQL (not even DRBD).
    - With MySQL I think we can set up failover only for well-known, already experienced possibilities, like network downtime, but not for every case. It’s much better to wake up and fix things than to spend days hunting for your important rows, reading binlogs for hours to trace the processes back and cleaning things up, or worst of all, losing data.

    Actually, I think MySQL itself should have a solution to support master-master replication, because “we” can’t make it work even with the best scripts we can imagine.

    Using MMM and maatkit works for me anyhow. I’m taking a risk, but what I’m risking is getting duplicate errors and having to fix the issues by hand. But with MMM we put less load on the servers than DRBD does and minimize the possibility of downtime.

    I hope it was useful.

    Istvan Podor

    31 Aug 09 at 5:24 am

  9. I guess by now everybody understands what DRBD can and can’t do, and that you can mess up a DRBD-based cluster by mis-configuring either DRBD or your cluster manager is pretty clear too. Sorry Istvan, complaining about a cluster manager and at the same time conceding you “forgot” about InnoDB recovery times is hardly a compelling argument against the cluster manager as such. And there’s no reason for a “manual” recovery in that situation; instead fix the timeouts, do “crm resource cleanup “, and you’re good to go.

    So for a large portion of those setups where people fiddle extensively with multi-master replication, they would probably be better off with just using DRBD with a decent cluster manager, and then investing a bit of effort into tuning and testing that cluster setup properly.

    That being said, however, there may well be situations where maximum uptime is more of a concern than maximum data integrity, where performance is so crucial that one can do without a guarantee against transaction loss, etc. In such cases, even though it solves a different problem than DRBD does, MySQL replication may be a perfectly valid alternative. Still, it does not make sense to build a whole cluster manager around it if there are perfectly usable pieces of software out there (such as Pacemaker, the direct successor of the Heartbeat CRM). Instead, it’s a much better idea to seek integration with Pacemaker, and give users a central point of administration where they get MySQL replication-based HA, and also get DRBD-based HA, Oracle HA, PostgreSQL HA, IP address failover, load-balanced cluster IP addresses etc. etc. for free.

    Florian Haas

    31 Aug 09 at 7:35 am

  10. I’ve tried to learn Pacemaker. I’m fairly technical, and my impression is that you’re not going to be able to build and administer it correctly without a very smart person. Make that “people,” since having a single person who knows that system is a really bad idea if you care about HA. You don’t want the only knowledgeable person to go on vacation.

    Is there some less expensive solution for those who don’t need to hire two $120k/year admins and spend a lot of money and time learning, setting up, testing, administering a Pacemaker cluster?

    Xaprb

    31 Aug 09 at 7:42 am

  11. @Xaprb
    Certainly the barrier for entry was quite high in the early days and all that XML played a big role in scaring people off. That said, I think with Dejan’s ‘crm’ shell in Pacemaker 1.0, we’ve finally got something that’s both powerful and easy to use.

    The biggest problem has been documentation, but that’s slowly being addressed ( http://clusterlabs.org/wiki/Documentation ). The latest Configuration Explained PDF is quite comprehensive but is intended to be more of a reference than an entry-level HOWTO. To address that need, I’ve recently started on a “Clusters from Scratch” series that aims to demonstrate how to use Pacemaker with start-to-finish worked examples.

    So although I’m clearly biased as hell, I’d encourage you to take another look at Pacemaker and if you have concerns, feel free to voice them on the project mailing list.

    Beekhof

    31 Aug 09 at 7:58 am

  12. @Florian
    Thanks, you are actually right. Forgetting about the timeout, and then restoring the cluster by hand when things go wrong, is a mistake. Each of them. But to save my @ss: all of us have to learn somehow :)

    @everybody:

    So basically, everybody thinks something like heartbeat || DRBD || Pacemaker is a good and reliable solution, instead of using some multi-master environment?

    To answer Baron’s questions: after considering the comments, I think the answer is something like “we don’t have a solution yet.” I think the situations in your examples mean high risk for those of us who use multi-master.
    If, even considering the risk, we choose multi-master, I think we must be clear about the requirements and plan our application to handle these issues (and keep working on the solution), right?

    Istvan Podor

    31 Aug 09 at 8:16 am

  13. @Baron: “Is there some less expensive solution for those who don’t need to hire two $120k/year admins and spend a lot of money and time learning, setting up, testing, administering a Pacemaker cluster?”

    Funny. I often hear almost exactly the same thing, just replace “a Pacemaker cluster”, with “MySQL replication”. I take both notions to be incorrect, and see them as examples of FUD. Both approaches have their valid usage scenarios, and both require some investment to set up correctly.

    Florian Haas

    31 Aug 09 at 1:31 pm

  14. [...] expect my own view to be fairly well defined, and it is. But make up your own [...]

  15. @Istvan and others
    Please add Tungsten clustering to your list of solutions. We have implemented full clustering based on master/slave–it does everything that multi-master does (failover, provisioning, simple install, routing) but uses master/slave, which is a very robust solution. We are also improving it *very* rapidly, so if you find things that are missing we will fix them ASAP.

    Check us out at http://www.continuent.com. As a previous poster said, I may be biased but you should definitely look at Tungsten. Incidentally the full clustering is commercial but we are going to open up basic HA solutions to open source during September.

    Robert Hodges

    31 Aug 09 at 1:51 pm

  16. Just to try (lost cause? :-) to get this back on topic, let me restate the original question a little differently: are there, in fact, any cases in which you want to fail over, and you can tolerate wishy-washy failover that might mess up your data because it doesn’t fence, etc etc etc?

    In other words, if you want failover, do you want potentially violent failover a la Pacemaker/heartbeat? Or are there cases where you want a best-effort, without Shooting The Other Node In The Head? If so, what good does it do you to “try” to fail over?

    Perhaps that will crystallize the choices a bit.

    Xaprb

    31 Aug 09 at 2:02 pm

  17. PS, I have my own answers to the above.

    Xaprb

    31 Aug 09 at 2:03 pm

  18. @Xaprb

    I think there is no global answer for your question. For example, let me share my opinion:

    I can tolerate the “wishy-washy” failover :) It’s because we at ustream have limited resources and every single machine matters. And the data we are working with can tolerate “some outages” like a longer restore procedure after a master failure.

    As it’s an “if so,” my choice was MMM. MMM is just like that. You take a risk when you let MMM decide whether or not to fail over to another node, but you know you can fix it, and the advantages it comes with are worth more *in our environment*.

    I hope it is an answer because :)

    Istvan Podor

    31 Aug 09 at 2:37 pm

  19. In my case availability is more important than data consistency. I can fix my data later on, but I can’t turn back time and prevent the downtime.

    If my active master is overloaded my application is not available. If I switch to another master and it gets overloaded, nothing changes. My application will be unavailable. Nothing won, nothing lost. :) So MMM does not help me here, but it helps in other cases.

    IMHO application overload is a problem where no HA solution can help. Buy more hardware and/or optimize your app.

    OT: I recently had an idea on how to make MMM balance the load better between multiple slaves, and will blog about it soon (it’s stupid and simple, but should work).

    Pascal Hofmann

    31 Aug 09 at 3:56 pm

  20. @Xaprb
    “do you want potentially violent failover a la Pacemaker/heartbeat”

    What you talkin’ ’bout Willis?
    Pacemaker will only resort to “violent” failover under certain conditions, the most important pre-condition being that stonith is actually enabled. Don’t care about data consistency? Fine, just turn off stonith.

    But anyway the answer to your question is “yes”.
    “Wishy-washy” failover has, and will remain, good enough for some people in some scenarios – even though I’d never use it myself.
    It’s just a matter of how much risk the admin/company is prepared to tolerate and how badly they’re impacted by data corruption.

    Beekhof

    31 Aug 09 at 4:50 pm

  21. I actually covered this question using a custom solution that involves load-balancing the DB connections with a software load-balancer (HAProxy).

    It doesn’t perform any sort of consistency checks, but it works pretty much the same as MMM in regards to deciding which Master and Slaves should be active (for writes/reads).

    You can view the entire article w/ comments here: http://www.alexwilliams.ca/2009/08/10/using-haproxy-for-mysql-failover-and-redundancy/

    Alex Williams

    5 Sep 09 at 7:32 pm

  22. Though not directly related to the topic itself, I’ve been wondering whether the MMM monitor server presents a single point of failure (thinking of the worst case, where first the monitor fails, followed by the monitored servers themselves). I’m curious as to whether it’s possible to have multiple instances of the monitor running (on different servers) in case one fails (outside of using heartbeat et al).

    imran

    7 Sep 09 at 12:50 am

  23. @Imran

    You can use heartbeat :)
    But the monitor is really not a single point of failure. Yes, it’s stand-alone, but the MMM agents will keep the last state if something goes wrong. Regarding the topic, this is another thing that, let’s say, I can afford. If mmmd_mon has failed and the active master has failed, something very nasty must have happened. And in that case (based on my experience) the MySQL master’s automatic failover status was the smallest of our problems :)

    But there is another scenario: what if mmm_mon and the current active master are on the same uplink and a network outage comes along? This is more of a tip: use a different network and power supply for each of your MySQL masters and mmm_mon :)

    Istvan Podor

    7 Sep 09 at 5:01 am

  24. @Istvan

    Thanks for the response and for the tips! I’d need to dig into the source a little bit, but I’m also wondering what the consequence would be of running two mmm_mon instances on two separate servers with identical configurations. Would this cause a conflict with respect to the agents/nodes being monitored, or can that be considered a potential workaround for mmm_mon failing along with the active master?

    Thanks!

    imran

    7 Sep 09 at 5:34 am

  25. @imran

    I think this isn’t the best place to discuss this. Let me invite you to #mmm on freenode and to our mailing list, which you can find here: http://groups.google.com/group/mmm-devel

    Istvan Podor

    7 Sep 09 at 6:24 am

  26. [...] Future of Database Clustering Baron Schwartz started a good discussion about MMM use cases that quickly veered into an argument about clustering in general. As Florian Haas put it on his [...]
