Xaprb


What kind of High Availability do you need?


Henrik just wrote a good article on different ways of achieving high availability with MySQL. I was going to respond in the comments, but decided it is better not to post such a long comment there.

One of the questions I think is useful to ask is what kind of high availability is desired. It is quite possible for a group of several people to stand in a hallway and talk about high availability, all of them apparently discussing the same thing but really talking about very different things.

Henrik says “At MySQL/Sun we recommended against asynchronous replication as a HA solution so that was the end of it as far as MMM was concerned. Instead we recommended DRBD, shared disk or MySQL Cluster based solutions.” Notice that all of those are synchronous technologies (at least, the way MySQL recommended them to be configured), generally employed to ensure a specific desirable property: no loss of data. But “I must not lose any committed transaction” and “my database must be available” are actually orthogonal requirements. One is about availability; the other is about durability.

A lot of people who say they want High Availability actually want High Durability.

There are a great many MySQL users for whom writes are much less valuable than reads. I would point to an advertising-supported website as a canonical example. If the system isn’t available — that is, available to serve read queries — then a lot of money is lost. If someone’s latest comment on a blog post is lost — who cares? Money continues to flow.

This is why a lot of people want a system that keeps the database online, even if some writes are lost. Note that losing writes is not the same thing as losing consistency; consistency and durability are also orthogonal for most users’ purposes. So we aren’t talking about eventual consistency or any of the other buzzwords, but simply “the system must respond to read queries.”

Asynchronous replication is well suited to many such users’ availability requirements, as long as replication does not fail (halt) through a write conflict or some other failure mode. (It is often perfectly acceptable for it to fail in other ways, as long as it does not halt.) That’s why a lot of users are interested in the specific type of “high availability” that a system such as MMM is intended to provide (but, as I mentioned, actually doesn’t provide). In other words, MMM would be great for a lot of people, if it worked correctly.
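To make the availability-versus-durability distinction concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions: the `AsyncPair` class and its methods are hypothetical names invented for illustration, and nothing here models real MySQL replication internals. It shows only that a write acknowledged by the primary can be lost on failover while reads keep working.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    rows: list = field(default_factory=list)


class AsyncPair:
    """Toy primary/replica pair with asynchronous replication (hypothetical)."""

    def __init__(self):
        self.primary = Server("primary")
        self.replica = Server("replica")
        self.pending = []  # committed on the primary, not yet on the replica

    def write(self, row):
        # The commit returns as soon as the primary has the row;
        # the replica receives it later, whenever replicate() runs.
        self.primary.rows.append(row)
        self.pending.append(row)

    def replicate(self):
        # Apply everything the replica has not seen yet.
        self.replica.rows.extend(self.pending)
        self.pending = []

    def fail_over(self):
        """Primary dies; promote the replica. Returns the writes that were lost."""
        lost, self.pending = self.pending, []
        self.primary = self.replica
        self.replica = Server("new-replica")
        return lost

    def read(self):
        # Reads are always served by whichever server is currently primary.
        return list(self.primary.rows)


pair = AsyncPair()
pair.write("post 1")
pair.write("post 2")
pair.replicate()
pair.write("comment 3")   # acknowledged, but never replicated
lost = pair.fail_over()
print(pair.read())        # ['post 1', 'post 2']
print(lost)               # ['comment 3']
```

After the failover the system still answers reads (available), but the last comment is gone (not durable). Whether that trade is acceptable is exactly the requirements question discussed here.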

I have also been exposed to applications for which this kind of availability-trumps-durability paradigm is absolutely unacceptable. The advertising system upon which the advertising-supported website relies for its income is a good example. Users know they can build sites that only need to be available for reads, precisely because they are trusting that Google AdSense is highly available for writes! Delegating writes to someone else is the easiest way to build systems.

There is a place for DRBD and MySQL Cluster, and there are also many situations that are served by neither the DRBD nor the MMM type of solution.

Josh Berkus wrote a while back about three types of cluster users, as opposed to three types of clusters. I think it’s helpful to approach the conversation from that angle sometimes too. As a consultant, I almost always do that when I enter a discussion with a customer who wants a “cluster” or “high availability.” Those are basically code phrases that tell me I need to start at the beginning and ensure we are all talking about the same requirements!

I also agree with Henrik about the need to turn off automatic failover. In many, many situations this is by far the best approach. Sometimes people state requirements that, if one steps back and looks at them afresh, quite obviously indicate that an automatic failover is the last thing that’s desirable. For example, if someone tells me that he expects failover to be required less than once a year, this is almost guaranteed not to be a good case for automatic failover. A system that’s tested so infrequently is almost certainly not going to work right when it’s needed. In such cases, it’s far better to leave everything alone until an expert human can resolve the problem, rather than have a stupid machine destroy what would otherwise be a fixable system.
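As a rough illustration of leaving the decision to a human, the sketch below models a monitor that counts consecutive failed health checks and, by default, pages an operator instead of promoting a replica. Everything in it is an assumption made for the example: the `FailoverPolicy` class, the `Action` values, and the threshold of three failures are hypothetical and do not come from MMM or any other tool.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAGE_HUMAN = "page_human"
    AUTO_FAILOVER = "auto_failover"


class FailoverPolicy:
    """Decide what to do after each health check (hypothetical sketch)."""

    def __init__(self, failures_before_action=3, allow_automatic=False):
        self.failures_before_action = failures_before_action
        self.allow_automatic = allow_automatic
        self.consecutive_failures = 0

    def observe(self, check_ok):
        # A successful check resets the counter.
        if check_ok:
            self.consecutive_failures = 0
            return Action.NONE
        self.consecutive_failures += 1
        # Tolerate short blips before reacting at all.
        if self.consecutive_failures < self.failures_before_action:
            return Action.NONE
        # Automatic promotion is opt-in; the default is to wake a person up.
        if self.allow_automatic:
            return Action.AUTO_FAILOVER
        return Action.PAGE_HUMAN


policy = FailoverPolicy()
for ok in [True, False, False, False]:
    action = policy.observe(ok)
print(action)  # Action.PAGE_HUMAN
```

Making automatic promotion an explicit opt-in keeps the rarely exercised path out of the default configuration, which matches the argument above about systems that are tested less than once a year.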

Written by Xaprb

May 15th, 2011 at 10:37 pm

Posted in SQL


13 Responses to 'What kind of High Availability do you need?'


  1. Hi Baron

    Thanks for continuing discussion on this topic. You are right, of course, that there is a business justification for being happy with async replication and accepting data loss. Kind of like using MyISAM in the 90’s…

    *But*,

    1) if there was a better solution available, then clearly even your discussion forum with advertising would use it too. (This is analogous to xtrabackup: for so many years people accepted using mysqldump or some lvm based hacks for backup, yet now that xtrabackup is available, everyone will use it since it is the right way to solve that problem.)

    2) At some scale, even for a forum (or other app you may have cited), losing data simply isn’t an option anymore. If you are a globally known brand with thousands of writes (“forum posts”, “tweets”, whatever) per second, then you can’t lose those writes even if they themselves don’t contain a financial transaction. Because if you do, thousands of users will be affected, many will notice, and they will talk about it on Twitter and Facebook and blogs. So losing a few comments is an acceptable business risk for a small startup that only runs in one Amazon zone, but only for small startups.

    Henrik Ingo

    16 May 11 at 3:09 am

  2. But at what cost? Leaving aside MMM, replication itself is a very cheap solution for high availability with loose consistency and durability guarantees, and failover is practically instantaneous. The other “right ways to do it” often result in costs such as days of slow performance until the standby server warms up, a lot of extra hardware sitting on standby, hiring someone to get a specialized product installed correctly and running well, and the list goes on. Nothing is as cheap as just setting up a couple of servers and making them replicate the old-fashioned way.

    And there are a lot of businesses that still really don’t need the writes; many of them don’t even keep their own data, but rather just fetch data from users’ Facebook accounts, for example. Their own unique data is very small and infrequently written, e.g. let’s say you are building a new Facebook app that just shows who your friends are. The writes are essentially like this:

    - new account creation
    - meta-info such as sessions and last-visited
    - fetching data from Facebook and caching it locally

    The only thing of value there is new account creation, and if you switch replication carelessly and lose a few seconds of writes, who cares.

    This isn’t everyone, but there’s a big need for this kind of high availability.

    Xaprb

    16 May 11 at 7:36 am

  3. At what cost? What I’m proposing would of course need major re-architecture of both replication and connectors, even table format. (I think the global trx id probably has to be part of each record…)

    So yes, I understand why we don’t have the solution yet. I’m just trying to understand how we even should do it, if we had the resources. (As opposed to having resources and not understanding what to do with them. Been there, seen that.)

    By the way, “Is a lion a cat?” is a trick question. I know I have a 50% chance of answering it correctly, but it doesn’t really make for a good Turing test.

    Henrik Ingo

    17 May 11 at 4:29 am

  4. “…and lose a few seconds of writes, who cares. This isn’t everyone, but there’s a big need for this kind of high availability.”

    Losing writes doesn’t sound like the definition of High Availability to me. Manual failover of services doesn’t fit the definition of High Availability either. Why exactly are you on a rampage against H/A lately? Is RAC useless next? Should we all abandon any hope of self-recovering services just because MMM doesn’t work and Heartbeat is outdated? I think not. Setting up a bunch of replication slaves and then waking up at 3am to manually fail over a master server to a slave, then reconfiguring all of the other slaves to point to the new master is not something that takes a couple of minutes – it’s a process that wastes time and is not fun in the middle of the night. Active/Passive master servers have their place in H/A, right along with Active/Active. I don’t see how you can define an H/A setup without them.

    Perhaps you could tell us what method you think does work, of course it should fit the definition of H/A and not require manual processes or >1min downtime.

    Matt Reid

    17 May 11 at 7:02 pm

  5. Matt, given that your definition of Availability is “not losing writes” I’m not sure we’re going to see eye to eye. I call that Durability.

    Xaprb

    17 May 11 at 7:05 pm

  6. Durable only applies to writes sticking around. The DB has to be available to receive writes in order to have any sort of durability (or lack thereof). If the DB is down there are no ACID properties at all. But if you want to argue semantics, that’s sort of getting off point. What do you consider a proper solution to H/A, since the proposed replication and manual failover process doesn’t provide H/A?

    Matt Reid

    18 May 11 at 12:10 am

  7. I’m not proposing any solution. I’m pointing out that there are different kinds of HA and that we need to be careful about the names, lest we talk about different things. Have I offended you somehow?

    Xaprb

    18 May 11 at 8:09 am

  8. No, no offense at all. I suppose we had a misunderstanding. Usually when I read blogs that discuss something that isn’t working or solutions that are invalid (MMM) the author proposes a solution or preference to a better method to solve the issue at hand. Given the number of posts lately regarding H/A systems that don’t work I was hoping that you would offer a discussion of a system that does work. One that has worked rather well for my clusters over the years is RHCS+SAN (assuming the SAN is not oversubscribed and has good throughput): it has none of the issues of DRBD but all of the good aspects of LVS/Pacemaker yet also includes a nice ILO/iLOM capable fencing agent that solves a lot of the splitbrain issues of LVS. Have you come across RHCS+SAN much, and if so do you consider it to have any issues?

    Matt Reid

    18 May 11 at 1:38 pm

  9. Oh good, I was worried that I’d come across as being on too much of a rampage :)

    There are a number of types of architectures for HA systems. There are good solutions for sync replication, block level replication, proxies, etc. But there isn’t a good tool for moving virtual IP addresses and promoting a slave to replace a failed master. Yves is trying to solve that right now. It’s too early to say anything definitive but I’m really optimistic. This model of HA tool is the only really useful one that isn’t really provided well, IMO, among the major types of HA solutions that people want/need.

    Personally I haven’t seen any RHCS+SAN deployments. A lot of our customers are running in Rackspace/Softlayer/etc and that’s why they want plain old replication and VIPs. That’s why I’ve been exposed to that more.

    Xaprb

    18 May 11 at 2:04 pm

  10. Very true, there isn’t (to my knowledge) a well tested and public app that allows auto slave failover. Would sure be nice to have one for environments that don’t have SAN. I would assume that cost is the major reason why RHCS+SAN isn’t seen more often. I can personally attest to its stability for several of the large Acrobat.com environments that I built while at OpSource along with some other exciting projects, but they all had budgets that included 15K SAN. A lot of the issues the MySQL community voices also seem to be solved via SAN: cloning environments without LVM, offsite DR, snapshot scheduling, etc. I’m currently testing EqualLogic, HP LeftHand, and Sun 7240+6540 SANs for dedicated MySQL 5.5 use and will be writing load-testing result articles this summer – perhaps that will help people see how RHCS+SAN can fulfill the H/A requirement. Always nice to have more options.

    Matt Reid

    18 May 11 at 2:31 pm

  11. Matt, I’d just like to echo what Baron kind of already says in his latest comment. In the complaints I receive (and really, to me this is second hand information as I’m not an ops guy) DRBD is not the problem, the monitoring/failover solution is. Sure, DRBD adds latency to disk writes, but if that was the only problem I could live with that. (The biggest issue with DRBD is the InnoDB recovery time, which is the same also for a SAN.)

    As far as solutions, Baron seems to be on the hunt for something better than MMM/Heartbeat/RHCS/… whereas I have brought up the concept of not really needing to have a decision between master vs slave. Systems that provide such a feature seem to be either synchronous replication (MySQL Cluster, Galera) or NoSQL CAP theorem based solutions (Voldemort, Dynamo, etc…).

    Henrik Ingo

    18 May 11 at 3:07 pm

  12. Henrik, very true about InnoDB crash recovery being a limiting issue with heartbeat based systems. Every time I install a new cluster I test the recovery time for a generic InnoDB crash (kill -9 while sysbench is hitting with 32 threads) and record the average recovery time (minus outliers) for that cluster. I’ve been using the following settings to keep the recovery time to a minimum. Perhaps there are other settings to help improve crash recovery time as well; let me know if you know.

    1. keeping ib_logfileN files on the fastest storage available for the budget and on a separate array (not just partition) when possible. In regard to filesystems for best recovery timings on this specific process I’ve found the following to be in order of fast->slow time: Reiser > XFS > Ext2 > Ext3. Haven’t seen anyone else posting test results for this issue, perhaps I’ll put some graphs up on my site later.
    2. setting innodb_log_files_in_group to a value > 2 where applicable to allow faster I/O processing of the crash related log data.
    3. setting innodb_log_file_size to a minimum sane value (as required by the size of the buffer pool and workload type)
    4. setting innodb_support_xa=1 on master servers to ensure binlog contents are in proper order (more of an assurance than speed thing though as I have not tested the speed difference on/off)
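
The measurement step mentioned above, averaging recovery times while discarding outliers, could be sketched in Python as a trimmed mean. This is an illustration only: the function name `trimmed_mean`, the 10% trim fraction, and the sample timings are assumptions made for the example, not figures from these tests.

```python
def trimmed_mean(samples, trim=0.1):
    """Average the samples after dropping the lowest and highest `trim` fraction."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(samples)
    cut = int(len(ordered) * trim)          # how many to drop from each end
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("need at least one sample")
    return sum(kept) / len(kept)


# Hypothetical recovery times in seconds from ten crash tests;
# the 300 s run stands in for a single outlier.
recovery_seconds = [40, 42, 41, 39, 43, 38, 44, 41, 40, 300]
print(trimmed_mean(recovery_seconds))  # 41.25
```

A plain mean of the same ten runs would be 66.8 seconds, so a single slow run would dominate the figure recorded for the cluster.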

    Of course LVS, Pacemaker, RHCS are not perfect. In an ideal world InnoDB would just work in an active/active/active/etc role like NDB. Then we wouldn’t have to worry about the current issues with active node failover and slave (or passive node) promotion. Likewise, replication as it is currently implemented is not perfect either. Semi-Synchronous is a nice addition but still does not solve the performance requirement of statement processing in heavy traffic environments.

    Back to the Active/Passive setups: I have noticed fewer false positive failovers and fewer split brain scenarios when using RHCS vs LVS. I suppose it has to do with the fencing method to prevent split brain and the method in which RHCS monitors the MySQL process. In regard to missed heartbeats causing failovers when not needed, I find that direct (no switch layer) crossover cables utilizing bonded interfaces prevent that issue almost entirely, to the point that I have never had false failovers when using this method of heartbeat setup. I’ve seen a lot of people asking for help with their H/A pair where the issue can be traced back to not using bonded interfaces for heartbeat (and their VIP on a separate set of interfaces as well). My minimum requirement for all clusters I admin is bond0 for the DB traffic VIP and bond1 for the heartbeat interfaces. Works well.

    Matt Reid

    18 May 11 at 4:18 pm

  13. Matt,

    Thanks for sharing your experiences regarding RHCS and crossover cables. Very useful to know, you seem to have at least some experience in testing various systems.

    Henrik Ingo

    18 May 11 at 4:22 pm
