Archive for the ‘High Availability’ Category
When can I have a big server in the cloud?
I was at a conference recently talking with a Major Cloud Hosting Provider and mentioned that for database servers, I really want large instances, quite a bit larger than the largest I can get now. The lack of cloud servers with lots of memory, many fast cores, and fast I/O and network performance leads to premature sharding, which is costly. A large number of applications can currently run on a single real server, but would require sharding to run in any of the popular cloud providers’ environments. And many of those applications aren’t growing rapidly, so by the time they outgrow today’s hardware we can pretty much count on simply upgrading and staying on a single machine.
The person I was talking to actually seemed to become angry at me, and basically called me an idiot. This person’s opinion is that no one should be running on anything larger than 4GB of memory, and anyone who doesn’t build their system to be sharded and massively horizontally scaled is clueless.
I’ve received similar push-back from a lot of cloud hosting providers. When I work through the math with clients, a lot of them don’t like the ultimate price/performance ratio offered by cloud hosting. Hype doesn’t drive everyone’s business decisions, so a lot of people are wisely staying far away from cloud hosting for their applications, or even moving whole applications out of cloud hosting into real hardware to consolidate machines and save a lot of money. Some of them are using flash storage devices such as Fusion-io to further lower their TCO (this isn’t the right answer for every app, though).
Why do cloud hosting providers work so hard to make everyone buy lots of anemic machines and shard their applications an order of magnitude more than is required? Why aren’t they jumping to offer really beefy instances? I think there are a couple of simple reasons.
First, they want to colocate virtual machines and over-provision, just as airlines sell more tickets than there are seats in the plane. It’s a numbers game: sell more capacity than you really have, and bet on some of the instances not using all resources allocated to them. Win! Of course, this is only possible with lots of small instances; the law of large numbers doesn’t work without lots of instances, and large instances can’t be colocated. Cloud providers tend to dislike dedicated instances, which leads to the second reason. They don’t want to make strong claims about the availability of any particular machine. This is where the cloud paradigm of “you must build to recover from machines vanishing without warning” comes from. A dedicated beefy instance wouldn’t let the hosting provider push that responsibility onto the application.
There are lots more reasons — all of them combining into one big overall “cloud application architecture best practice” — but I think those are two of the showstoppers.
I really think this is a wrong paradigm. People talk about the cloud being the technology of the future, but in many ways it’s pretty stone-age compared to what smart system architects can achieve with high-quality hardware and networking at a much lower cost, with very strong guarantees of performance, consistency, and availability.
Cloud computing is new enough that we don’t understand, in a collective sense, how to think about it. (I know that lots of individuals do, but as a whole, there isn’t much of a shared understanding.) The real value proposition that I want to see emerge from cloud computing is pretty much orthogonal to what everyone’s raving about these days. I want to see the DevOps engineering discipline build momentum around the idea that systems should be treated as services, with architectural components provisioned and controlled through APIs. That can be done completely independently of many of the characteristics of current cloud computing platforms (virtualization, ephemerality, horizontally scaled architectures…)
And like most people, I’ve got an ego and I don’t appreciate repeatedly being called a moron by cloud computing providers’ sales people, who don’t know anything about running database servers. I can do math and understand price/performance, and I know the cost and difficulty of building a sharded application. I look forward to the day when I don’t have to just bite my tongue and walk on to the next booth. I look forward to cloud hosting providers advancing to the year 2005 or so. I’m sure it will happen as we figure this all out.
Feel free to comment, but don’t expect me to approve your comment if you’re from a cloud provider and you’re plugging your platform :)
What’s wrong with MMM?
I am not a fan of the MMM tool for managing MySQL replication. This is a topic of vigorous debate among different people, and even within Percona not everyone feels the same way, which is why I’m posting it here instead of on an official Percona blog. There is room for legitimate differences of opinion, and my opinion is just my opinion. Nonetheless, I think it’s important to share, because a lot of people think of MMM as a high availability tool, and that’s not a decision to take lightly. At some point I just have to step off the treadmill and write a blog post to create awareness of what I see as a really bad situation that needs to be stopped.
I like software that is well documented and formally tested. A lot of software is usable even if it isn’t created by perfectionists. But there are two major things in the MySQL world for which I think we can all agree we need strong guarantees of correctness. One is backups. The other is High Availability (HA) tools. And this leads me to my position on MMM.
MMM is 1) fundamentally broken and unsuitable for use as a HA tool, and 2) absolutely cannot be fixed. I’ll take that in two parts.
First, it’s broken and untrustworthy. I could go into the technical details of why MMM is broken at the architectural and implementation level. I could talk about the way that it uses a distributed set of agents, which do not have a reliable communications channel, all maintain their own state which is not communicated or agreed upon across nodes, and don’t even share configuration. I could talk about the fact that MMM itself can’t be made HA or redundant — you can only have a single instance of it.
I could talk about lots of things, but you can argue with every one of those assertions. You can’t argue with the list of failures I’ve personally seen. It fails over with no reason when nothing is wrong — and botches it up, causing the entire replication cluster to get out of sync and break. It tries to fail over when something actually is wrong with the cluster, but it does things out of order and with no synchronization amongst the agents, leading to chaos. It can’t handle anything unexpected, such as the ordinary kinds of network, disk, etc failures you’d expect in systems that have something wrong (which is exactly when an HA tool is supposed to function). It doesn’t protect itself against the human doing something wrong, such as mixing up the agent configuration on different hosts. There are many bizarre ways MMM can fail, but these are all theoretical — until you witness them. I’ve witnessed them, and new customer cases on MMM failures are filed on a regular basis. Here’s one:
In the recent past, we have had a couple of bad experiences with mmm-monitor tool which broke replication and brought our website down for a few hours.
And another:
We have recently started testing MMM for MySQL and when using it under write load we have been experiencing ‘Duplicate entry’ (1062) errors.
In short, MMM causes more downtime than it prevents. It’s a Low-Availability tool, not a High-Availability tool. It only takes one really good serious system-wide mess to take you down for a couple of days, working 24×7 trying to scrape your data off the walls and put it back into the server. MMM brings new meaning to the term “cluster-f__k”.
Now, why isn’t it possible to fix it? One simple reason: MMM is completely untested and untestable. Change one line of code in Agent.pm’s master control flow and tell me that you’re confident that you know what it has just done to the whole system? You can’t do it. If you don’t have tests, you can’t change the code with confidence, period. And as I said before, HA and backup tools are where we need a zero-tolerance policy. “I think this fixed the bug” or “I think it’s safe to change this code” are not acceptable. I have seen a lot of bug fixes that cause new and interesting bugs. I appreciate the variety — life is boring if all we’re doing is seeing the same old bugs — but this isn’t what we need in an HA tool.
In order to fix MMM, it has to be completely rewritten from scratch. Among other things, decisions and actions need to be completely separated. Then the decisions can be verified with a test suite, and the actions can be verified independently. But if you do that, you don’t have MMM anymore, you have a new tool. Therefore MMM can’t be fixed, it can only be thrown out and reimplemented.
Note that I’m not claiming that MMM was developed by bad programmers or that it is bad quality. I am only claiming that a) it demonstrably doesn’t work correctly, and b) it can’t be fixed without a rigorous test suite, which can’t be added to it without a complete reimplementation.
I will go further and claim that the architecture of MMM is fundamentally unreliable, and it isn’t a good idea to reimplement it (it’s already been done once!). This we could argue for a long time, but I know of so many better architectures that I wouldn’t entertain the notion of building a new tool with the same architecture.
I have seen a number of people reach the same conclusions and then implement new systems in the same general vein as MMM, with a limited set of functionality to avoid some of the problems. For instance, Flipper is a single tool with no agents, so that’s an improvement. Unfortunately, these tools all suffer from the same problem: they aren’t formally tested. I simply can’t accept that in an HA tool.
If I’m such a perfectionist, why haven’t I built a tool that solves this problem? I have a limited amount of time, and at some point, I don’t do things for free. I’ve had multiple conversations that go like this: “My last replication downtime incident cost me $75k. I can’t let that happen again. What will it cost to build a correct solution? No way — I can’t pay $20k for a high availability tool that really works.”
There is active development on something related that I can’t talk much about now. But if you want, you can come to Percona Live and be among the first to find out.
Why high-availability is hard with databases
A lot of systems are relatively easy to make HA (highly available). You just slap them into a well-known HA framework such as Linux-HA and you’re done. But databases are different, especially replicated databases, especially replicated MySQL.
The reason has to do with some properties that hold for many systems, but not for most databases. Most systems that you want to make HA are relatively lightweight and interchangeable, with little to zero statefulness, easy to start, easy to stop, don’t care a lot about storage (or at least don’t write a lot of data; that’s usually delegated to the database), and there’s little or no harm done if you ruthlessly behead them. The classic example is a web server or even most application servers. Most of the time these things are all about CPU power and network bandwidth. If I were to compare them to a car, I’d say they are like matchbox cars: there are many of them, and they are cheap and easy to replace.
Databases are different. With or without replication, you’re looking at a system that is complex, stateful, heavyweight, and cares a lot about storage. It runs on bigger hardware with fast disks and a lot of memory. It’s usually disk-bound, and it does a lot of writes. It’s hard to start — it takes a long time to warm up and really get ready to serve production workloads (many minutes, hours, or even days). It tends to run with a lot of data in memory in a dirty state, so shutdown is slow, because a clean shutdown requires flushing a bunch of data to disk. If you yank its power plug or kill-dash-nine it, it’ll have to perform recovery on startup, which slows the startup process even more. If I were to compare a database server to a car, I wouldn’t even use a car as the analogy: I’d use one of those big-ass mining trucks. If your mining truck breaks down, you don’t just toss it in the trash and pull another off the shelf.
The problem with a lot of HA solutions is that they want to deal with inconsistencies or irregularities by killing the resource and replacing it in another location. This works fine with web servers, but not with database servers. Doing that will cause serious pain and downtime, defeating the point of HA. And when you add replication into the mix, it gets even worse. A system that wants to manage replication needs to deal with very complex conditions. A lot of replication failures are delicate matters that require skilled human intervention to solve. The HA solution must insulate the application from the misbehaving resource, but leave it running so the human can handle things.
This is not the way most applications are made HA. It’s different with databases, and it’s much harder.


