A challenge: partition a character set in MySQL

How good are your SQL and/or general coding skills? I have a specific challenge I’d like your help solving. I am not sure it’s possible, but I’d love to be proven wrong.

I’ll explain some background for the problem first, and then pose the challenge at the end of the article.

The problem

Several of the algorithms I’ve been implementing require data to be partitioned for a divide-and-conquer approach. This is easy enough with numeric and even with temporal data, but character data is more difficult, and I don’t have a good strategy yet.

The problem is how to work on only part of the data at a time, yet not miss any of it. In geeky terms, I want to partition the set, which means to divide it into disjoint subsets that complete a cover over the set.

To give a little more context, I need to use this in several ways. In one application, I want to checksum tables a little at a time. The idea is to reduce the impact on the server, with a sleep between the checksums. Another application, which finds and resolves differences between MySQL tables, needs to partition coarsely at first, then further subdivide partitions that are “interesting.”

At the moment, I think doing this efficiently requires finding the partition endpoints. This is easy with numeric data. It’s very efficient to select the MIN() and MAX() from an indexed column, EXPLAIN a query to see how many rows the query optimizer thinks are in it, and do a little subtraction and division to find the endpoint of each range. These can then go into a BETWEEN clause to implement the partitioning.

I can treat temporal data the same way. Instead of subtracting numbers, I can use DATEDIFF() or similar to find out how large the range is, and then use other date math functions to generate successive endpoints.

Character data isn’t so simple. Character sets and collations seem to make it harder to find endpoints and be assured of a cover over the set. If the characters were restricted to 0-9 and A-F, of course I could just treat them as hexadecimal and do ordinary math. Even generating a cover isn’t all that hard, but making sure they are disjoint strikes me as hard.

One idea is a binary search, or trial and error, to find endpoints. I could use EXPLAIN to ask how many rows are greater than some value, and vary the value to discover the approximate distribution of the data, choosing endpoints along the way. I think this path is fraught with difficulties, though.

A challenge

The above paragraphs should give you an idea what I’ve been considering for this problem. It might make it easier for people to help me if I give a specific example and ask for a solution. Here we go:

Can you think of any solutions?


Comments