Binary log checksums in MySQL 5.6
MySQL 5.6 will have “checksums in the binary log.” The feature gets described in various ways, but one phrase I’ve heard a few times is, loosely, that it helps ensure replication integrity. That isn’t specific enough to make clear what it does. When I’ve talked about pt-table-checksum and its purpose (for example, on webinars), people often ask whether pt-table-checksum will be obsoleted by replication checksums in MySQL 5.6. The answer is no: they do completely different things. But it’s kind of confusing, a bit like semi-synchronous replication in that regard.
pt-table-checksum ensures that your replicas have the same logical dataset as their masters. They can drift for any number of reasons — someone changes data directly on the replica, there is an error in replication, a nondeterministic change is made on the master in STATEMENT binlog format — the list goes on. MySQL 5.6 will add many safeguards to help prevent or avoid some of these, but they are still possible. You need a tool like pt-table-checksum to verify data integrity on replicas. The server has no built-in way to do that for you.
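To make the idea of comparing logical datasets concrete, here is a minimal sketch in Python. It is not how pt-table-checksum is implemented; the function name `table_digest` and the in-memory rows are hypothetical, chosen only to show that two copies of a table can be compared by hashing their contents.

```python
import hashlib


def table_digest(rows):
    """Hash a table's rows in a fixed order (hypothetical helper).

    Sorting first makes the digest independent of the order in which
    the rows happen to be returned.
    """
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Made-up rows standing in for the same table on two servers.
master_rows = [(1, "alice", 10), (2, "bob", 20)]
replica_rows = [(1, "alice", 10), (2, "bob", 21)]  # one value has drifted

if table_digest(master_rows) != table_digest(replica_rows):
    print("replica differs from master")
```

A real tool has to do much more, such as working in chunks and coping with writes that happen while it runs. The point here is only that the comparison is about the data itself, not about how the data got there.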
Binary log event checksums let the server verify that binary log events are transmitted without corruption when replicas connect to the master and retrieve its binary log. This guards against problems such as bit-flips in memory, bugs in the I/O thread when it reads the log events and writes them to the relay log, network corruption, and so forth. It does not verify that the data changed on the replica by a binary log event will match the changes on the master.
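Here is a rough sketch of the principle in Python. MySQL 5.6 uses CRC32 for these checksums, but the framing below (the event bytes followed by a 4-byte trailer) is a simplification for illustration, not the server's actual event layout, and the function names are hypothetical.

```python
import struct
import zlib


def add_checksum(event: bytes) -> bytes:
    """Append a 4-byte CRC32 of the event bytes (illustrative framing)."""
    return event + struct.pack("<I", zlib.crc32(event) & 0xFFFFFFFF)


def verify_checksum(framed: bytes) -> bytes:
    """Return the event bytes if the trailing CRC32 matches, else raise."""
    if len(framed) < 4:
        raise ValueError("framed event is too short to hold a checksum")
    event, trailer = framed[:-4], framed[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(event) & 0xFFFFFFFF != expected:
        raise ValueError("event checksum mismatch")
    return event


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption in transit by flipping a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


framed = add_checksum(b"UPDATE t SET c = 1")
verify_checksum(framed)  # passes: the bytes arrived intact

try:
    verify_checksum(flip_bit(framed, 3))
except ValueError as err:
    print("corruption detected:", err)
```

Note what the check covers: it tells you that the bytes received are the bytes that were sent. It says nothing about whether applying the event leaves the replica with the same data as the master.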
I’m really happy with the binary log checksum feature, and glad that it’s enabled by default. I have fixed many replication problems caused by binary logs being transmitted to a relay log incorrectly. Preventing them from happening in the first place, or detecting when they do and halting replication, is a great enhancement. By the way, I requested this feature, so, thanks Oracle!



For those of us who were around before 2009 and remember seeing fairly common relay log corruption, the critical fix was http://bugs.mysql.com/bug.php?id=26489 in 5.0.56 and 5.1.24. The problem turned out to be the master resending partial events after a timeout. That and two smaller fixes hugely reduced the incidence of relay log corruption reports. They are now rare for people using versions of MySQL released in 2009 or later.
The checksums are good additional protection because errors do still occasionally happen on the wire or on disk.
Views are my own; for an official Oracle view, seek a PR person.
James Day, MySQL Senior Principal Support Engineer, Oracle.
James Day
30 Sep 12 at 11:58 pm
The checksums should also help catch and avoid similar bugs in the future, as long as they don’t originate on the master side before the checksum is computed.
Xaprb
1 Oct 12 at 7:23 pm
Every once in a while we have seen replication glitches at customer sites. The suspect was a faulty network driver where the (offloaded) checksumming for Ethernet/IP/TCP was broken. This theory was backed by the fact that enabling SSL for the replication channel fixed the problem (SSL uses its own, stronger checksum).
With SBR, such glitches typically manifested in broken SQL syntax and forced the SQL thread to stop. So they were at least detectable.
With RBR, however, even a garbled binlog event looks valid. This increased the pressure to come up with an integrated strong checksum. It’s nice to see it happen.
XL
2 Oct 12 at 5:56 am
I keep running up against this confusion as well, where people mix up the purpose of each. I have therefore started to refer to checksums used in the way pt-table-checksum uses them as a “digest” (e.g., as in “Message Digest 5”) or “hash” (e.g., as in “Secure Hash Algorithm 1”). The term “checksum” has a very clear meaning in data communication literature, where it is used to check for the absence of errors that have occurred during transfer or storage. This might help avoid some confusion for you as well.
Mats Kindahl
3 Oct 12 at 7:51 am
Baron, yes, they will help with that. Partly by increasing the confidence that it’s really a bug and not bad luck and partly because verifying the checksum can be enabled at different places to narrow down the location of the problem.
James Day, MySQL Senior Principal Support Engineer, Oracle
James Day
4 Oct 12 at 5:57 am