Archive for the ‘OOP’ tag
Ironically, the Stream algorithm I wrote as the simplest possible syncing method does what the much more efficient algorithm I wrote some time ago can’t: sync a table without a primary key, as long as there are no duplicate rows. In fact, it’s so dumb it will happily sync any table, even one with no indexes at all.
The flash of inspiration I had on Friday has turned out to be a good insight. It immediately showed me how I can re-use a lot of the hard work I’ve already done for other tools. Chunking and nibbling are the names I’ve given to two algorithms I’ve developed for processing tables a little at a time. Chunking looks for a suitable index and generates an array of WHERE clauses that will divide the table into pieces of approximately a given size (number of rows or number of bytes). I use this in mk-table-checksum and mk-parallel-dump. It requires an indexed column I can treat as a number one way or another; that includes temporal values.
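Here’s a bare-bones sketch of the chunking idea in Perl, just to make it concrete. The table, column, and connection details are made up, and the real code is much more careful about data types and uneven value distributions:

    use strict;
    use warnings;
    use DBI;
    use POSIX qw(ceil);

    my $dbh = DBI->connect('DBI:mysql:database=test', 'user', 'pass',
        { RaiseError => 1 });

    # Carve $table into chunks of roughly $rows_per_chunk rows, assuming
    # $col is numeric, indexed, and roughly uniformly distributed.
    sub chunk_where_clauses {
        my ( $dbh, $table, $col, $rows_per_chunk ) = @_;
        my ( $min, $max, $cnt ) = $dbh->selectrow_array(
            "SELECT MIN($col), MAX($col), COUNT(*) FROM $table");
        return () unless $cnt;
        my $n_chunks = ceil($cnt / $rows_per_chunk);
        my $span     = ceil(($max - $min + 1) / $n_chunks);
        my @wheres;
        for my $i ( 0 .. $n_chunks - 1 ) {
            my $lo = $min + $i * $span;
            push @wheres, $i < $n_chunks - 1
                ? "$col >= $lo AND $col < " . ($lo + $span)
                : "$col >= $lo";
        }
        return @wheres;
    }

    my @wheres = chunk_where_clauses($dbh, 'tbl', 'id', 1000);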
Nibbling is related. It’s an efficient way to do something like LIMIT X, Y without scanning through the first X rows. It also requires a suitable index, but the code I wrote to make it work with mk-archiver is really generous about what “suitable” means. It’ll basically work with any index. It selects some rows with LIMIT Y and uses the last-fetched row to start the next select.
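And a sketch of nibbling, under the same caveats (a single-column index is assumed here; the mk-archiver code generalizes the “greater than the last row” logic to multi-column indexes):

    # Walk $table in $limit-row bites, seeking on the indexed column
    # instead of paging with LIMIT X, Y.
    sub nibble {
        my ( $dbh, $table, $col, $limit, $do_row ) = @_;
        my $last;
        while ( 1 ) {
            my $rows = defined $last
                ? $dbh->selectall_arrayref(
                    "SELECT $col FROM $table WHERE $col > ? ORDER BY $col LIMIT $limit",
                    undef, $last )
                : $dbh->selectall_arrayref(
                    "SELECT $col FROM $table ORDER BY $col LIMIT $limit" );
            last unless @$rows;
            $do_row->($_) for @$rows;
            $last = $rows->[-1][0];    # last-fetched row starts the next select
        }
    }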
Both algorithms will adapt well to finding and resolving differences in rows, a chunk at a time. The general procedure is to create the WHERE clauses that define boundaries around the chunk of rows, then checksum the whole chunk. The result is a tiny little hash value. If this differs between the source and destination tables, the next step is to checksum the rows individually and fetch their primary or unique key columns. This uses a little more network bandwidth, but it’s still not bad unless the key (or the chunk) is huge. Any rows whose checksums differ can then be fetched by the key and synced.
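In MySQL terms, the comparison might look roughly like this. The column list, table name, and the $src_dbh/$dst_dbh handles are assumptions, and the real queries get generated from the table definition:

    # Checksum one chunk on each server and compare the results.
    my $where     = 'id >= 1000 AND id < 2000';    # one chunk's boundaries
    my $chunk_sql = "SELECT COUNT(*), BIT_XOR(CRC32(CONCAT_WS('#', id, col1, col2)))"
                  . " FROM tbl WHERE $where";
    my ( $src_cnt, $src_crc ) = $src_dbh->selectrow_array($chunk_sql);
    my ( $dst_cnt, $dst_crc ) = $dst_dbh->selectrow_array($chunk_sql);

    if ( $src_cnt != $dst_cnt || $src_crc != $dst_crc ) {
        # The chunk differs, so checksum rows individually, fetching the
        # key column along with each row's checksum.
        my $row_sql = "SELECT id, CRC32(CONCAT_WS('#', id, col1, col2))"
                    . " FROM tbl WHERE $where";
        my %src = map { $_->[0] => $_->[1] }
            @{ $src_dbh->selectall_arrayref($row_sql) };
        my %dst = map { $_->[0] => $_->[1] }
            @{ $dst_dbh->selectall_arrayref($row_sql) };
        # Keys whose checksums differ or are missing on one side identify
        # the rows to fetch and sync.
    }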
The more complex algorithms, which were in the original table-sync tool, will come later. They can potentially be much more efficient in terms of network traffic, but they have drawbacks too, such as the granularity of locking. The mk-table-sync tool will soon be able to choose the algorithm that causes the least locking and just do it without any fuss. For example, if it sees a nice primary key it can use for chunking, and it sees that the table is InnoDB, it’ll just chunk and use SELECT FOR UPDATE. Voila, no table locks, and not much of the table will be locked at a time (it’ll commit between chunks).
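Sketched with DBI, the pattern is something like this (sync_chunk() is a hypothetical helper, and @wheres is the list of chunk boundaries from earlier):

    # Lock only one chunk's rows on the source, sync them, and commit
    # between chunks so the locks are short-lived. Assumes InnoDB.
    for my $where ( @wheres ) {
        $src_dbh->begin_work;
        $dst_dbh->begin_work;
        my $rows = $src_dbh->selectall_arrayref(
            "SELECT * FROM tbl WHERE $where FOR UPDATE");
        sync_chunk($dst_dbh, $rows, $where);    # hypothetical helper
        $dst_dbh->commit;
        $src_dbh->commit;                       # releases the row locks
    }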
Right now I’ve got a simple interface for code that finds differences in rows, a plugin-like interface for implementing each of the algorithms uniformly, an interface for resolving differences, and a few other things. I’m about to embark on the Chunk algorithm for syncing.
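To give a feel for the shape of it, the plugin-like interface might look something along these lines. The package and method names are illustrative, not the tool’s final API:

    package TableSyncAlgorithm;
    # The "interface": stubs die unless a subclass overrides them,
    # which is about as close as plain Perl gets to enforcing one.
    sub new          { my $class = shift; return bless {@_}, $class }
    sub get_sql      { die "get_sql() not implemented" }
    sub same_row     { die "same_row() not implemented" }
    sub not_in_left  { die "not_in_left() not implemented" }
    sub not_in_right { die "not_in_right() not implemented" }

    package TableSyncStream;
    our @ISA = ('TableSyncAlgorithm');
    # The dumbest possible implementation: select everything and let
    # the diff code call back into a resolver object.
    sub get_sql {
        my ( $self, %args ) = @_;
        return "SELECT * FROM $args{table}";
    }
    sub same_row     { return }    # nothing to do
    sub not_in_left  { my ( $self, $row ) = @_; $self->{resolver}->delete_row($row) }
    sub not_in_right { my ( $self, $row ) = @_; $self->{resolver}->insert_row($row) }

    1;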
I don’t think most people will consider this a big deal, but don’t expect the final product to correctly sync tables that have no primary key and contain duplicate rows. Comparing tables with duplicates is absolutely meaningless: if you can’t write a WHERE clause that uniquely identifies a row, you can’t fetch, change, or delete just that row, so you’re done.
My initial plans got waylaid! I didn’t pull out the checksumming code first, because the code wasn’t at all as I remembered it. Instead, I began writing code to handle the more abstract problem of accepting two sets of rows, finding the differences, and doing something with them. I’m ending up with a somewhat more complicated system than I expected. However, it’s also significantly simpler in some ways: instead of just passing references to subroutines to use as callbacks, I’m object-ifying the entire synchronization concept.
What’s the advantage of doing this? Well, as some of you may know, there are two fairly complex algorithms in the tool at present, which handle synchronization hierarchically, zooming in on the rows that need to be changed. There are a lot of complexities in them. If I wrap all that up into modules and give them a uniform interface (real OO interfaces would be delightful here, but Perl doesn’t support them), I can simplify the project significantly by…
…throwing them out the window! That’s right, I’m tossing out the ‘top-down’ and ‘bottom-up’ algorithms. What I want to develop, first and foremost, is the code that does the synchronization, not the really twisted code that does bitwise XORs on groupwise slices of checksums and has recursion and all that stuff. So I decided on a generic data-syncing interface, and wrote the simplest possible implementation of that, which I’m going to use to help me deal with complexity. This algorithm is called ‘stream’ (for lack of a better word). It has no hierarchical drill-down or any other complexities. It amounts to “select * from source, select * from dest, diff and resolve.”
It’s not a very efficient algorithm for comparing and syncing data, at least not by my standards. (It amounts to a FULL OUTER JOIN implemented in Perl.) But boy, does it make it easier to start cleaning up the nasty spaghetti code that handles locking, waiting for a slave to catch up, actually changing the data that turns out to be different, and so on.
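In outline, the diff-and-resolve part is just a merge of two sorted streams. This sketch assumes both row lists arrive in identical sort order and compares whole rows as joined strings, which stands in (imperfectly) for the server’s collation; the resolver is the resolving-differences interface mentioned above:

    # Merge-diff two sorted lists of rows client-side: the FULL OUTER
    # JOIN in Perl.
    sub stream_diff {
        my ( $src_rows, $dst_rows, $resolver ) = @_;
        my ( $i, $j ) = ( 0, 0 );
        while ( $i < @$src_rows || $j < @$dst_rows ) {
            my $l = $i < @$src_rows ? join( "\0", @{ $src_rows->[$i] } ) : undef;
            my $r = $j < @$dst_rows ? join( "\0", @{ $dst_rows->[$j] } ) : undef;
            if ( defined $l && defined $r && $l eq $r ) {
                $i++; $j++;                                    # same row on both sides
            }
            elsif ( defined $l && ( !defined $r || $l lt $r ) ) {
                $resolver->insert_row( $src_rows->[ $i++ ] );  # missing on dest
            }
            else {
                $resolver->delete_row( $dst_rows->[ $j++ ] );  # extra on dest
            }
        }
    }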
Of course, I’ll add back the top-down and bottom-up algorithms later, as well as some others. They should turn out to be pretty simple to implement, since they won’t have, for example, locking code intertwined with them. When done, the tool will examine the table and figure out the best algorithm to use. This will go a good way towards another of my goals, which is that you should be able to just point it at two tables and tell it to sync them, and it should do it in the most efficient way possible, without needing lots of command-line options.