I’ve been working on how to efficiently compare and synchronize data between two tables on different MySQL servers. I’ve also been working on a tool, sort of like rsync for database tables, which implements both algorithms. I profiled it to see how well the comparison algorithms work on real data. This article is about the results.
The data and profiling results
I used a sample of real data from a production database. It’s fairly simple—just a bunch of numbers and timestamps, 13 columns in all. The primary key is an integer, and there is a secondary index on another two columns, which makes it a good candidate for a three-stage drilldown via the top-down algorithm. There are exactly 50,000 rows in the sample I used, and the indexes and data come to just over 8MB in an InnoDB table.
For the tests, I created two copies of the data on the same server, and then changed one of the tables in four different ways. I deleted five rows randomly, 500 rows randomly, and where col2=60, which is 11,424 rows. Finally, I randomly incremented col6 in one row. These are the kinds of data corruptions I see on this table in production.
I ran these tests on a Dell Inspiron 5000 laptop with a 500MHz processor from 1999. Don’t pay attention to the absolute numbers, as I’m sure you will not be serving data from a laptop whose circuitry starts to screech when it displays an animated cursor (yes, that does happen…).
You can download the test data and the profiling results (1.4MB) if you wish.
The following table shows some key statistics from the profiling. ‘td’ stands for top-down and ‘bu’ stands for bottom-up, which are the two algorithms I’m comparing here.
|Test||Delete 5||Delete 500||Delete chunk||Change 1|
In most cases, the top-down algorithm outperforms the bottom-up. The case where it doesn’t is if there’s a lot of corruption scattered randomly through the table. It is especially good at detecting the large chunk of missing rows in the third test—it terminates in just a few queries instead of eleven thousand. This is as I predicted several weeks ago.
The top-down approach is a fair amount faster than the bottom-up on this data, and as long as only a small number of rows are bad, ought to issue a comparable number of queries. It ends up causing fewer reads than bottom-up also. Surprisingly, it’s actually more network-efficient when the number of corrupt rows increases; in the case where 500 rows are bad, it moves about half as much data over the network as bottom-up. I attribute this to it being able to use the drill-down and tree-pruning strategies to good advantage. Or, if you look at it another way, the bottom-up algorithm tends to have fairly predictable costs, while the top-down costs vary depending on the nature of the corruption.
There is one caveat. I have benchmarked the “find differences” aspect of the tool here, but the tool is written to fetch the bad rows over the network in anticipation of fixing them. That’s why the top-down algorithm fetches 1.5 MB over the network in the third test. It actually only needs to fetch a few kB over the network to find the differences; the 1.5 MB is it retrieving the rows it will need to insert to sync the second table. The same note applies to the number of rows read.
These results are encouraging. They tell me that I’ve designed a good algorithm for fixing replicas that drift, and they reinforced my belief that the two algorithms are highly effective for different scenarios.