Ironically, the Stream algorithm I wrote as the simplest possible syncing algorithm does what the much more efficient algorithm I wrote some time ago can’t do: sync a table without a primary key, as long as there are no duplicate rows. In fact, it’s so dumb, it will happily sync any table, even if there are no indexes.
The flash of inspiration I had on Friday has turned out to be good insight. It immediately showed me how I can re-use a lot of the hard work I’ve already done for other tools. Chunking and nibbling are the names I’ve given to two algorithms I’ve developed for processing tables a little at a time. Chunking looks for a suitable index and generates an array of WHERE clauses that will divide the table into pieces of approximately a given size (number of rows or number of bytes). I use this on mk-table-checksum and mk-parallel-dump. It requires an indexed column I can treat as a number one way or another. That includes temporal values.
Nibbling is related. It’s an efficient way to do something like LIMIT X, Y without scanning through the first X rows. It also requires a suitable index, but the code I wrote to make it work with mk-archiver is really generous about what “suitable” means. It’ll basically work with any index. It selects some rows with LIMIT Y and uses the last-fetched row to start the next select.
Both algorithms will adapt well to finding and resolving differences in rows, a chunk at a time. The general procedure is to create the WHERE clauses that define boundaries around the chunk of rows, then checksum the whole chunk. The result is a tiny little hash value. If this differs between the source and destination tables, the next step is to checksum the rows individually and fetch their primary or unique key columns. This uses a little more network bandwidth, but it’s still not bad unless the key (or the chunk) is huge. Any rows whose checksums differ can then be fetched by the key and synced.
The more complex algorithms, which were in the original table-sync tool, will come later. They can be potentially much more efficient in terms of network traffic, but they have drawbacks too, such as the granularity of locking. The mk-table-sync tool will soon be able to choose the best algorithm that causes the least locking and just do it without any fuss. For example, if it sees a nice primary key it can use for chunking, and it sees that the table is InnoDB, it’ll just chunk and use SELECT FOR UPDATE. Voila, no table locks, and not much of the table will be locked at a time (it’ll commit between chunks).
Right now I’ve gotten a simple interface for code that finds differences in rows, a plugin-like interface for implementing each of the algorithms uniformly, an interface for resolving differences, and a few other things. I’m about to embark on the Chunk algorithm for syncing.
I don’t think most people will consider this a big deal, but don’t expect the final product to correctly sync tables without a primary key and with duplicate rows. Comparing tables with duplicates is absolutely meaningless. If you can’t write a WHERE clause that uniquely identifies a row, you’re done.

Hi Baron,
I’ve received strange results using mk-table-checksum.
mk-table-checksum -u sugarcrm –askpass -d Stats server1 server2
It found one table that had different checksums. I set out to find out the difference. I used mysqldump of the offending table on the master and copied it to and loaded it into the test database on the slave.
I did a query to find any differences and it came up empty.
I then ran CHECKSUM TABLE from within the MySQL client on both tables on the slave and the checksums were equal.
I did the same back on the master and it was still different.
So we have a case where MySQL is returning different checksums for the same table on the same version of MySQL.
I’m running another mk-table-checksum as above, but this time with –algorithm=BIT_XOR. It’s already processed the previously offending table, but the checksums now are matching.
I recall you mentioning MySQL’s CHECKSUM TABLE has some weaknesses, but do you know why the same table would return different checksums?
Thank you and keep up the great work!!
-Bruce
I understand what you’re saying, but it is possible to compare 2 tables even if there are duplicates. For instance, compare the sets:
{a,a,b,c}
{a,b,b,c}
In order to sync them, we’d need to add “b” to set 1 and “a” to set 2 so both sets are:
{a,a,b,b,c}
It’s not meaningless, it just means that you need a better way to get chunks.
Doesn’t MySQL already use pages? Why not just use the same thing MySQL does for “chunking”?
Hi Bruce,
you were probably using the ACCUM checksums, which are order-dependent.
Sheeri, I need to meditate on the edge cases. Maybe you’re right.
Hi Baron,
I in fact was using the default - CHECKSUM. I’ve not yet used ACCUM.
-Bruce
Hi Bruce,
Check and make sure that a) it’s really doing a CHECKSUM TABLE (the –explain option should show you, but I’m writing this from memory) and b) your tables aren’t being changed.
Hi Baron,
It was doing a CHECKSUM as I was watching it on the servers as it was running.
I also just did the –explain as you asked and it shows the same -using CHECKSUM.
I re-read your earlier comments to be sure I understand. Are the storage engines different on the master and slave?
(generally that doesn’t make any difference either).
It sounds like you might have found a bug in MySQL.
Yes, table formats are both ENGINE=MyISAM DEFAULT CHARSET=utf8.
Hi Baron,
I just did another full test using ACCUM and it all came back fine. So it certainly does seem there’s some sort of bug with MySQL’s CHECKSUM routine.
I’m curious as to what it is…
You should report it :) http://bugs.mysql.com/
Done!
http://bugs.mysql.com/33015
Hi Baron,
Just to follow-up, MySQL said my bug report was a duplicate of bug #30041.
Seems tables -can- be different if they contain doubles or floats.
Anyway, looks like it’s patched for MySQL 5.2 and 6.0, but not 5.0 or 5.1.
-Bruce