Xaprb

Stay curious!

What’s wrong with MMM?


I am not a fan of the MMM tool for managing MySQL replication. This is a topic of vigorous debate among different people, and even within Percona not everyone feels the same way, which is why I’m posting it here instead of on an official Percona blog. There is room for legitimate differences of opinion, and my opinion is just my opinion. Nonetheless, I think it’s important to share, because a lot of people think of MMM as a high availability tool, and that’s not a decision to take lightly. At some point I just have to step off the treadmill and write a blog post to create awareness of what I see as a really bad situation that needs to be stopped.

I like software that is well documented and formally tested. A lot of software is usable even if it isn’t created by perfectionists. But there are two major things in the MySQL world for which I think we can all agree we need strong guarantees of correctness. One is backups. The other is High Availability (HA) tools. And this leads me to my position on MMM.

MMM is 1) fundamentally broken and unsuitable for use as an HA tool, and 2) absolutely cannot be fixed. I’ll take that in two parts.

First, it’s broken and untrustworthy. I could go into the technical details of why MMM is broken at the architectural and implementation level. I could talk about the way that it uses a distributed set of agents, which do not have a reliable communications channel, all maintain their own state which is not communicated or agreed upon across nodes, and don’t even share configuration. I could talk about the fact that MMM itself can’t be made HA or redundant — you can only have a single instance of it.

I could talk about lots of things, but you can argue with every one of those assertions. You can’t argue with the list of failures I’ve personally seen. It fails over for no reason when nothing is wrong, and botches it, causing the entire replication cluster to get out of sync and break. It tries to fail over when something actually is wrong with the cluster, but it does things out of order and with no synchronization amongst the agents, leading to chaos. It can’t handle anything unexpected, such as the ordinary network, disk, and other failures you’d expect in systems that have something wrong (which is exactly when an HA tool is supposed to function). It doesn’t protect itself against the human doing something wrong, such as mixing up the agent configuration on different hosts. There are many bizarre ways MMM can fail, but these are all theoretical until you witness them. I’ve witnessed them, and new customer cases on MMM failures are filed on a regular basis. Here’s one:

In the recent past, we have had a couple of bad experiences with mmm-monitor tool which broke replication and brought our website down for a few hours.

And another:

We have recently started testing MMM for MySQL and when using it under write load we have been experiencing ‘Duplicate entry’ (1062) errors.

In short, MMM causes more downtime than it prevents. It’s a Low-Availability tool, not a High-Availability tool. It only takes one really good serious system-wide mess to take you down for a couple of days, working 24×7 trying to scrape your data off the walls and put it back into the server. MMM brings new meaning to the term “cluster-f__k”.

Now, why isn’t it possible to fix it? One simple reason: MMM is completely untested and untestable. Change one line of code in Agent.pm’s master control flow and tell me that you’re confident that you know what it has just done to the whole system? You can’t do it. If you don’t have tests, you can’t change the code with confidence, period. And as I said before, HA and backup tools are where we need a zero-tolerance policy. “I think this fixed the bug” or “I think it’s safe to change this code” are not acceptable. I have seen a lot of bug fixes that cause new and interesting bugs. I appreciate the variety — life is boring if all we’re doing is seeing the same old bugs — but this isn’t what we need in an HA tool.

In order to fix MMM, it has to be completely rewritten from scratch. Among other things, decisions and actions need to be completely separated. Then the decisions can be verified with a test suite, and the actions can be verified independently. But if you do that, you don’t have MMM anymore, you have a new tool. Therefore MMM can’t be fixed, it can only be thrown out and reimplemented.
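
To make “decisions and actions completely separated” concrete, here is a minimal sketch in Python. It is not MMM code and not a proposal for a real tool: the names (`NodeStatus`, `Action`, `decide`, `execute`), the fields, and the toy failover rule are all hypothetical, chosen only to show the shape. The decision is a pure function from an observed cluster state to a plan, so it can be tested exhaustively without touching a server; the part that actually changes anything is a thin, separate executor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class NodeStatus:
    """Observed state of one MySQL server (hypothetical fields)."""
    reachable: bool
    replication_ok: bool
    seconds_behind: Optional[int]  # None when the lag is unknown


@dataclass(frozen=True)
class Action:
    """A description of something to do; creating one has no side effects."""
    kind: str    # "demote_writer", "promote_writer" or "alert_human"
    target: str  # hostname


def decide(writer: str, nodes: Dict[str, NodeStatus], max_lag: int = 0) -> List[Action]:
    """Pure decision: the same observed state always yields the same plan."""
    current = nodes.get(writer)
    if current is not None and current.reachable:
        return []  # the writer looks healthy, so do nothing
    # Toy rule (an assumption): only a reachable, caught-up replica may take over.
    candidates = sorted(
        name for name, status in nodes.items()
        if name != writer
        and status.reachable
        and status.replication_ok
        and status.seconds_behind is not None
        and status.seconds_behind <= max_lag
    )
    if not candidates:
        return [Action("alert_human", writer)]
    return [Action("demote_writer", writer), Action("promote_writer", candidates[0])]


def execute(plan: List[Action], handlers: Dict[str, Callable[[str], None]]) -> None:
    """Side effects live here, and only here, one handler per action kind."""
    for action in plan:
        handlers[action.kind](action.target)


# The decision logic can be checked without any running server.
cluster = {
    "db1": NodeStatus(reachable=False, replication_ok=False, seconds_behind=None),
    "db2": NodeStatus(reachable=True, replication_ok=True, seconds_behind=0),
}
assert decide("db1", cluster) == [
    Action("demote_writer", "db1"),
    Action("promote_writer", "db2"),
]
```

With this split, every decision table entry becomes an ordinary unit test, and the handlers passed to `execute` can be verified on their own against a real or fake server.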

Note that I’m not claiming that MMM was developed by bad programmers or that it is bad quality. I am only claiming that a) it demonstrably doesn’t work correctly, and b) it can’t be fixed without a rigorous test suite, which can’t be added to it without a complete reimplementation.

I will go further and claim that the architecture of MMM is fundamentally unreliable, and it isn’t a good idea to reimplement it (it’s already been done once!). This we could argue for a long time, but I know of so many better architectures that I wouldn’t entertain the notion of building a new tool with the same architecture.

I have seen a number of people reach the same conclusions and then implement new systems in the same general vein as MMM, with a limited set of functionality to avoid some of the problems. For instance, Flipper is a single tool with no agents, so that’s an improvement. Unfortunately, these tools all suffer from the same problem: they aren’t formally tested. I simply can’t accept that in an HA tool.

If I’m such a perfectionist, why haven’t I built a tool that solves this problem? I have a limited amount of time, and at some point, I don’t do things for free. I’ve had multiple conversations that go like this: “My last replication downtime incident cost me $75k. I can’t let that happen again. What will it cost to build a correct solution? No way — I can’t pay $20k for a high availability tool that really works.”

There is active development on something related that I can’t talk much about now. But if you want, you can come to Percona Live and be among the first to find out.

Written by Xaprb

May 4th, 2011 at 3:43 pm

29 Responses to 'What’s wrong with MMM?'


  1. Even though I love that piece of software, I agree with you :)

    Istvan

    4 May 11 at 3:58 pm

  2. Hi Baron,

    Did you attend the Google talk about automatic failover at MySQL Con?

    Do you think that floater ip based HA solutions are sane?

    Robert Wultsch

    4 May 11 at 4:06 pm

  3. I’m the original author of MMM and I completely agree with you. Every time I try to add HA to my clusters I remember MMM and want to stab myself because I simply could not trust my data to the tool and there is nothing else available on the market to do the job reliably.

    Alexey Kovyrin

    4 May 11 at 4:11 pm

  4. What do you think about the failover solutions from:

    1) SeveralNines
    http://johanandersson.blogspot.com/2011/04/setting-up-multi-master-and-read-slaves.html

    2) Yoshinori Matsunobu
    http://www.slideshare.net/matsunobu/automated-master-failover

    Are they any better than MMM?

    Andy

    4 May 11 at 4:55 pm

  5. Tungsten Enterprise solves the MySQL failover problem with group communication and automated rules. It’s currently available only as commercial software. Our experience has been that if you solve this problem you need to take a theoretically sound distributed systems approach. The problem is not easy to solve well.

    @Robert Wultsch I have had some experiences with floating IP addresses that were very unpleasant. Partly as a result I did a full write up on some of the problems a while back. (http://scale-out-blog.blogspot.com/2011/01/virtual-ip-addresses-and-their.html)

    Robert Hodges

    4 May 11 at 5:49 pm

  6. You can learn more about Tungsten Enterprise at http://www.continuent.com/solutions/tungsten-enterprise.

    Commercial often equals a four-letter curse word within the open source realm, but it also equals ‘tested and supported’, both of which are very important for HA solutions, as Baron rightly points out.

    Commercial does not necessarily mean high license cost. You can get started for just a couple hundred dollars per month. And if you sponsor Tungsten development or engage in a consulting project with Continuent, Tungsten software could be free.

    Eero Teerikorpi

    4 May 11 at 6:10 pm

  7. @Robert Hodges:
    Thank you for the link.

    I have had (different) unpleasant experiences with floating IPs. A personal favorite was an incorrectly configured network where Gratuitous ARP did not work at all.

    My conclusion is that DSNs must be abstracted in some manner that is easy to reconfigure.

    Robert Wultsch

    4 May 11 at 6:13 pm

  8. What is your opinion of Tungsten Cluster?

    I think the Continuent folks would probably tell you that they are often considered along with MMM. And on the face of it, Tungsten seems to address the failings of MMM.

    With the recent hire of Giuseppe Maxia one gets a warm fuzzy that the testing will be extremely robust.

    Cheers

    Jason

    4 May 11 at 6:19 pm

  9. Eero, I could cite lots of examples of untested commercial software. I’m not opposed to commercial software, but I like to be able to inspect the software’s test suite. It’s easy to claim that something is tested, but most people don’t even know what the word means. I used to work at a company where the buzzword was “unit testing” because someone started slinging the phrase around. Unfortunately nobody there actually knew what it meant — literally. “Unit testing” was using a fake credit card to browse the website and order a product, then opening the order processing app and pushing it through invoicing and such. So I’m always on alert when people say their software is tested. It’s like “linear scalability.” Best to be sure you can back the claim up if you say it in my vicinity.

    Xaprb

    4 May 11 at 7:42 pm

  10. @Baron, Eero: Just about all software is tested. The question is how much, by whom, and under what sort of conditions. Distributed failover is definitely in a unique category of applications that are very hard to test well. It tends to have what I call “this can’t be happening” bugs from corner cases that are hard to create outside of production environments, or even to imagine before you actually see them in a real system.

    Robert Hodges

    4 May 11 at 7:53 pm

  11. Robert, true, but software that is written to be testable can isolate the code that deals with the distributed failover to a very small portion of the code, the rest of which can be exhaustively tested with suites of unit and integration tests. The problem is, developers who know how to do this (and have the patience to tolerate it) are one in a thousand, at least. I put myself in the category of “has the patience” because I’m actually a very bad coder. Thus, I’m forced to write tests before code, and design everything to be testable, or I wouldn’t be able to write a single line of correct code. I spent last weekend writing about 200 lines of code, for example.

    I know Giuseppe has the tester’s mindset, but I haven’t looked at a line of the Tungsten source code or test suite, so I have no opinion on it.

    Xaprb

    4 May 11 at 8:11 pm

  12. @Baron, I completely agree on unit testing. I also do it a lot because my code is really awful without a lot of unit tests. I don’t see how you can build a big system without them.

    Robert Hodges

    4 May 11 at 8:15 pm

  13. What timing. I was about to write a lengthy post myself on the woes of MMM.
    In these past three days I’ve had MMM cause 4 downtimes on two different deployments, and wreak havoc in my replication environment.

    I’ve decided to finally dump it.
    Guess you’ve just saved me a lot of typing.

    Shlomi Noach

    4 May 11 at 11:04 pm

  14. Cédric

    5 May 11 at 6:19 am

  15. I agree with this whole-heartedly. I too have seen MMM fail more than I’ve seen it work correctly. It was originally written on behalf of a Proven Scaling customer, but we passed the development off to Percona largely because the design that the customer wanted wasn’t something we were comfortable with implementing or interested in furthering as a sanctioned tool. We ended up writing Flipper as a compromise: it takes the actions, but it leaves the decision-making entirely up to the humans. The decision-making part in the face of complex failure scenarios is the really hard (and dangerous) part.

    Jeremy Cole

    5 May 11 at 2:13 pm

  16. I have deployed a number of MMM installations which worked pretty well. However, I only set it up in “Manual” mode, which makes a human make the decision. There have been some bugs in this mode, but in general it worked pretty well, helping to automate switching for maintenance operations. Most of my experience, though, comes from MMM version 1. MMM version 2 started out without even having the manual mode, which I think was a poor design choice.

    Peter Zaitsev

    5 May 11 at 3:16 pm

  17. Wait, why isn’t everyone just using MySQL Cluster for these things?!?!

    Harrison

    5 May 11 at 3:16 pm

  18. More to the point, why isn’t everyone just using the cloud instead?

    Xaprb

    5 May 11 at 4:00 pm

  19. Like… Mongodb? It is web scale!

    Alexey Kovyrin

    5 May 11 at 4:36 pm

  20. If only I could run NDB as the backend storage for mongoDB, and host it in the cloud. Hey, I’m 2/3 serious. If NDB were the storage engine for mongoDB, that would actually be interesting.

    Xaprb

    5 May 11 at 9:02 pm

  21. MMM is definitely a part of the MySQL HA big picture, and despite it being the best solution for some kinds of applications, I have to totally agree with you. To demonstrate this, one only has to issue an echo c > /proc/sysrq-trigger, and watch the monitor node get into an endless loop, and the cluster not failing over.

    The real problem is that currently, one can write a “what’s wrong” article about almost all of the high availability solutions for MySQL today. The problem is that we don’t have a silver bullet (at least in the open source world). Still, it is technically possible to have a solution which is better than all the current ones in virtually every aspect. Considering what we have now, it could be some kind of silver bullet.

    I can’t wait to read about percona’s secret HA stuff:).

    Peter Boros

    6 May 11 at 5:17 am

  22. Although I agree with some (most?) of the bashing, I also feel like an important point is not mentioned: the fact that it does give us the ability to work on one server without disrupting service. Occasional reboots, database changes, server upgrades etc. are all no problem thanks to MMM. Whether or not we really need MMM for that is another discussion. It’s the best we have right now.

    Walter heck

    7 May 11 at 2:03 am

  23. @walter: They are a problem: when you perform planned maintenance and switch writer roles between masters, MMM does not even try to move the slaves properly to maintain consistency.
    (https://code.launchpad.net/~gryp/mysql-mmm/mysql-mmm-MoveSlavesMoreConsistent)

    The ‘Duplicate entry’ (1062) errors Baron talked about can even occur when doing ‘set_offline’!

    Kenny Gryp

    10 May 11 at 8:12 am

  24. It is easy to criticize any software, but sometimes you have to use it because it might be the best solution. MMM may have limitations, but it works almost every time. Does anyone have a better solution which is easy to configure and free to use?

    I thought everyone in this group has used MySQL, which comes with its own share of issues.

    Vipul

    10 Jun 11 at 4:07 pm

  25. Folks, MMM works in most scenarios: edit the configuration so that it does not make automatic decisions. It’s free and open source, so please contribute enhancements to it.

    I downloaded Percona Server and xtrabackup. I did not see any extensive tests. Why so much MMM bashing?

    Tom Penn

    14 Jun 11 at 1:35 am

  26. @Baron: “There is active development on something related that I can’t talk much about now. But if you want, you can come to Percona Live and be among the first to find out.”

    Which was the session at Percona Live about this?

    Ricardo Santos

    14 Jun 11 at 11:27 am

  27. Ricardo Santos

    14 Jun 11 at 1:27 pm

  28. Ricardo,

    That’s right, PRM is what we’re working on. There is no documentation or automated test suite yet, and the code is not done, so we are not promoting it as a real solution at this time. The last thing I want to do is throw Percona’s reputation into a toilet by suggesting that people try out an unproven High Availability tool.

    Xaprb

    15 Jun 11 at 6:48 pm

  29. What about Corosync/Pacemaker (linux-ha) managing MySQL clusters?

    coredump

    24 Feb 12 at 11:55 am
