How to measure MySQL replica lag accurately

Kevin Burton wrote recently about why SHOW SLAVE STATUS is not a good way to monitor how far behind your replica servers are, and how replica network timeouts can distort the lag it reports. I’d like to chime in and say this is exactly why I thought Jeremy Cole’s MySQL Heartbeat script was such a natural fit for the MySQL Toolkit. It measures replica lag in a “show me the money” way: it looks for the effects of up-to-date replication, rather than asking the replica how far behind it thinks it is.

Replication doesn’t even need to be running on the replica. In fact, the tool doesn’t use SHOW SLAVE STATUS at all. This has lots of advantages: for example, it tells you how far the replica lags behind the ultimate master, no matter how deep in the replication daisy-chain it is. In other words, unlike SHOW SLAVE STATUS, it won’t tell you a replica is up-to-date just because it has caught up to its own master. If a replica’s master is an hour behind, it will report that the replica is an hour behind, too, because it is.
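To make the idea concrete, here is a minimal sketch in Python of the heartbeat approach, not the tool's actual code. It assumes a single-row table that something on the ultimate master stamps with the current time; the replica's lag is then just "now" minus whatever timestamp has replicated down to it. The table name `heartbeat`, its columns, and the function names are hypothetical, and SQLite stands in for MySQL only so the sketch runs on its own.

```python
import sqlite3
import time


def create_heartbeat_table(conn):
    # Hypothetical schema: one row whose timestamp is refreshed on the master.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS heartbeat ("
        "id INTEGER PRIMARY KEY, ts REAL NOT NULL)"
    )
    conn.commit()


def update_heartbeat(master_conn, now=None):
    """Run against the ultimate master: stamp the heartbeat row with the current time."""
    now = time.time() if now is None else now
    # INSERT OR REPLACE is SQLite syntax; it is used here only for the stand-in database.
    master_conn.execute(
        "INSERT OR REPLACE INTO heartbeat (id, ts) VALUES (1, ?)", (now,)
    )
    master_conn.commit()


def measure_lag(replica_conn, now=None):
    """Run against any replica: seconds since the newest heartbeat that reached it.

    Returns None if no heartbeat row has arrived yet.
    """
    now = time.time() if now is None else now
    row = replica_conn.execute("SELECT ts FROM heartbeat WHERE id = 1").fetchone()
    if row is None:
        return None
    # Clamp at zero so small clock differences never produce a negative lag.
    return max(0.0, now - row[0])


if __name__ == "__main__":
    replica = sqlite3.connect(":memory:")
    create_heartbeat_table(replica)
    # Simulate a heartbeat written on the master at t=940 that has replicated down.
    update_heartbeat(replica, now=940.0)
    print(measure_lag(replica, now=1000.0))
```

Because the replica only reads a row that originated on the ultimate master, the number comes out the same whether the replica is one hop or five hops down the chain. The sketch assumes the master's and replica's clocks are synchronized; any clock skew shows up directly in the measured lag.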

It’s a really smart approach. You can also daemonize it, and it will keep a file up-to-date with running averages (by default it averages over the last one, five and fifteen minutes, but of course you can choose other windows). Now your monitoring scripts can be as simple as “cat /var/log/replica-delay” or some such.
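The running averages can be sketched roughly as follows, again in Python and again as an illustration under assumptions rather than the tool's real implementation. The class name `LagAverager`, the helper functions, and the one-line output format are all hypothetical; only the default windows of one, five and fifteen minutes come from the description above.

```python
import os
from collections import deque


class LagAverager:
    """Keeps recent lag samples and averages them over several time windows."""

    def __init__(self, windows=(60, 300, 900)):
        self.windows = tuple(windows)  # window lengths in seconds
        self.samples = deque()         # (timestamp, lag_seconds) pairs, oldest first

    def add(self, timestamp, lag):
        self.samples.append((timestamp, lag))
        # Drop samples that are older than the longest window.
        cutoff = timestamp - max(self.windows)
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def averages(self, now):
        """Return one average per window, or None for a window with no samples."""
        result = []
        for window in self.windows:
            values = [lag for ts, lag in self.samples if now - window < ts <= now]
            result.append(sum(values) / len(values) if values else None)
        return result


def format_line(averages):
    # Hypothetical output format: space-separated averages, "-" when unknown.
    return " ".join("-" if a is None else f"{a:.2f}" for a in averages)


def write_status(path, averages):
    # Write to a temporary file and rename it, so a reader never sees a partial line.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(format_line(averages) + "\n")
    os.replace(tmp_path, path)
```

A daemon loop would take a lag measurement every second or so, call `add()`, and then `write_status()` with the latest `averages()`; the monitoring side only ever has to read one short line from the file.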

It’s not a hard tool to write, and I suspect lots of people have done it, but I bet that between Jeremy, whoever worked on it at Six Apart, and me, we’ve produced a pretty good version of the tool. It’s part of the MySQL Toolkit, and the full manual is online.

I'm Baron Schwartz, the founder and CEO of VividCortex. I am the author of High Performance MySQL and lots of open-source software for performance analysis, monitoring, and system administration. I contribute to various database communities such as Oracle, PostgreSQL, Redis and MongoDB.