Comments on: What kind of High Availability do you need?
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/
Stay curious!
Thu, 02 May 2013 12:36:53 +0000

By: Henrik Ingo
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/#comment-19383
Wed, 18 May 2011 20:22:39 +0000

Matt,

Thanks for sharing your experiences regarding RHCS and crossover cables. Very useful to know. You seem to have at least some experience in testing various systems.

By: Matt Reid
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/#comment-19382
Wed, 18 May 2011 20:18:32 +0000

Henrik, very true about InnoDB crash recovery being a limiting issue with heartbeat-based systems. Every time I install a new cluster I test the recovery time for a generic InnoDB crash (kill -9 while sysbench is hitting it with 32 threads) and record the average recovery time (minus outliers) for that cluster. I’ve been using the following settings to keep the recovery time to a minimum. There may be other settings that improve crash recovery time as well; let me know if you know of any.

1. Keeping ib_logfileN files on the fastest storage available for the budget and on a separate array (not just a separate partition) when possible. For recovery timings on this specific process, I’ve found the filesystems rank as follows, from fastest to slowest: Reiser, XFS, Ext2, Ext3. I haven’t seen anyone else posting test results for this issue; perhaps I’ll put some graphs up on my site later.
2. Setting innodb_log_files_in_group to a value > 2 where applicable to allow faster I/O processing of the crash-related log data.
3. Setting innodb_log_file_size to a minimum sane value (as required by the size of the buffer pool and workload type).
4. Setting innodb_support_xa=1 on master servers to ensure binlog contents are in proper order (more of an assurance than a speed thing, though; I have not tested the speed difference with it on versus off).
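
The "average recovery time (minus outliers)" measurement mentioned above could be computed along these lines. This is only a sketch: the 1.5 × IQR outlier rule, the function name, and the sample timings are assumptions for illustration, since the comment does not say how outliers were identified.

```python
from statistics import mean, quantiles


def average_without_outliers(samples, k=1.5):
    """Mean of samples after dropping values outside the IQR fences.

    The k * IQR rule is an assumption; the original comment does not
    say how outliers were chosen.
    """
    if len(samples) < 4:
        # Too few runs to estimate quartiles; fall back to the plain mean.
        return mean(samples)
    q1, _, q3 = quantiles(samples, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [s for s in samples if low <= s <= high]
    return mean(kept)


# Hypothetical recovery times in seconds from repeated kill -9 test runs.
recovery_seconds = [41.0, 39.5, 40.2, 42.1, 38.9, 95.0, 40.8]
trimmed_average = average_without_outliers(recovery_seconds)
print(f"average recovery time without outliers: {trimmed_average:.1f}s")
```

With these made-up numbers the 95-second run falls outside the fences and is excluded, so one slow run does not distort the figure recorded for the cluster.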

Of course LVS, Pacemaker, RHCS are not perfect. In an ideal world InnoDB would just work in an active/active/active/etc role like NDB. Then we wouldn’t have to worry about the current issues with active node failover and slave (or passive node) promotion. Likewise, replication as it is currently implemented is not perfect either. Semi-synchronous replication is a nice addition but still does not meet the performance requirements of statement processing in heavy-traffic environments.

Back to the Active/Passive setups: I have noticed fewer false-positive failovers and fewer split-brain scenarios when using RHCS vs LVS. I suppose it has to do with the fencing method used to prevent split brain and the way RHCS monitors the MySQL process. As for missed heartbeats causing failovers when not needed, I find that direct crossover cables (no switch layer) on bonded interfaces prevent that issue almost entirely, to the point that I have never had a false failover when using this method of heartbeat setup. I’ve seen a lot of people asking for help with their H/A pair where the issue can be traced back to not using bonded interfaces for heartbeat (and their VIP on a separate set of interfaces as well). My minimum requirement for all clusters I admin is bond0 for the DB traffic VIP and bond1 for the heartbeat interfaces. Works well.
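
One way to check that a heartbeat bond such as bond1 still has redundancy is to read the Linux bonding driver's status text. The sketch below is illustrative only: the function names, the abbreviated sample text, and the interface names eth2/eth3 are assumptions, not part of the setup described above.

```python
def parse_bond_slaves(text):
    """Return {slave_interface: mii_status} from bonding driver status text."""
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave section
            # and is skipped because no slave has been seen yet.
            slaves[current] = value
    return slaves


def heartbeat_bond_is_redundant(text):
    """True only if at least two slaves exist and every one reports 'up'."""
    slaves = parse_bond_slaves(text)
    return len(slaves) >= 2 and all(s == "up" for s in slaves.values())


# Abbreviated, hypothetical status text in the bonding driver's format.
SAMPLE_STATUS = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 0

Slave Interface: eth3
MII Status: down
Link Failure Count: 1
"""

print(parse_bond_slaves(SAMPLE_STATUS))
print(heartbeat_bond_is_redundant(SAMPLE_STATUS))
```

On a Linux host the text would come from /proc/net/bonding/bond1. A bond that has silently dropped to one working link still carries heartbeats, so a check like this is meant to catch lost redundancy before the remaining link fails.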

By: Henrik Ingo
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/#comment-19381
Wed, 18 May 2011 19:07:15 +0000

Matt, I’d just like to echo what Baron kind of already says in his latest comment. In the complaints I receive (and really, to me this is second-hand information as I’m not an ops guy) DRBD is not the problem, the monitoring/failover solution is. Sure, DRBD adds latency to disk writes, but if that was the only problem I could live with that. (The biggest issue with DRBD is the InnoDB recovery time, which is the same for a SAN.)

As far as solutions go, Baron seems to be on the hunt for something better than MMM/Heartbeat/RHCS/…, whereas I have brought up the concept of not really needing to decide between master and slave at all. Systems that provide such a feature seem to be either synchronous replication (MySQL Cluster, Galera) or NoSQL solutions based on the CAP theorem (Voldemort, Dynamo, etc…).

By: Matt Reid
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/#comment-19380
Wed, 18 May 2011 18:31:03 +0000

Very true, there isn’t (to my knowledge) a well-tested and public app that allows automatic slave failover. It would sure be nice to have one for environments that don’t have a SAN. I would assume that cost is the major reason why RHCS+SAN isn’t seen more often. I can personally attest to its stability for several of the large Acrobat.com environments that I built while at OpSource, along with some other exciting projects, but they all had budgets that included 15K SAN. A lot of the needs the MySQL community voices are also solved via SAN: cloning environments without LVM, offsite DR, snapshot scheduling, etc. I’m currently testing EqualLogic, HP LeftHand, and Sun 7240+6540 SANs for dedicated MySQL 5.5 use and will be writing articles on the load-testing results this summer. Perhaps that will help people see how RHCS+SAN can fulfill the H/A requirement. Always nice to have more options.

By: Xaprb
http://www.xaprb.com/blog/2011/05/15/what-kind-of-high-availability-do-you-need/#comment-19379
Wed, 18 May 2011 18:04:56 +0000

Oh good, I was worried that I’d come across as being on too much of a rampage :)

There are a number of types of architectures for HA systems. There are good solutions for sync replication, block-level replication, proxies, etc. But there isn’t a good tool for moving virtual IP addresses and promoting a slave to replace a failed master. Yves is trying to solve that right now. It’s too early to say anything definitive but I’m really optimistic. IMO, among the major types of HA solutions that people want/need, this is the only really useful one that isn’t well served by existing tools.
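
To make the "promoting a slave" step concrete, here is a minimal sketch of one piece of it: choosing which slave to promote. It is not the tool mentioned above. It assumes each slave's state is available as a dict holding the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields from `SHOW SLAVE STATUS`; the host names, values, and function names are hypothetical.

```python
def binlog_position(status):
    """Sort key: (binlog sequence number, executed position) for one slave."""
    log_file = status["Relay_Master_Log_File"]
    # Binlog names end in a numeric sequence, e.g. mysql-bin.000042.
    sequence = int(log_file.rsplit(".", 1)[1])
    return (sequence, int(status["Exec_Master_Log_Pos"]))


def pick_promotion_candidate(slaves):
    """Return the host that has executed furthest into the master's binlog."""
    if not slaves:
        raise ValueError("no slaves available for promotion")
    return max(slaves, key=lambda host: binlog_position(slaves[host]))


# Hypothetical slave states, keyed by host name.
candidates = {
    "db2": {"Relay_Master_Log_File": "mysql-bin.000041", "Exec_Master_Log_Pos": 98765},
    "db3": {"Relay_Master_Log_File": "mysql-bin.000042", "Exec_Master_Log_Pos": 120},
}
print(pick_promotion_candidate(candidates))
```

Picking the candidate is the easy part. A usable tool would also have to confirm that the old master is really down, fence it, let the chosen slave apply its remaining relay logs, and move the virtual IP, none of which is shown here.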

Personally I haven’t seen any RHCS+SAN deployments. A lot of our customers are running in Rackspace/Softlayer/etc and that’s why they want plain old replication and VIPs. That’s why I’ve been exposed to that more.
