A challenge: partition a character set in MySQL
How good are your SQL and/or general coding skills? I have a specific challenge I’d like your help solving. I am not sure it’s possible, but I’d love to be proven wrong.
I’ll explain some background for the problem first, and then pose the challenge at the end of the article.
The problem
Several of the algorithms I’ve been implementing require data to be partitioned for a divide-and-conquer approach. This is easy enough with numeric and even with temporal data, but character data is more difficult, and I don’t have a good strategy yet.
The problem is how to work on only part of the data at a time, yet not miss any of it. In geeky terms, I want to partition the set, which means to divide it into disjoint subsets that complete a cover over the set.
To give a little more context, I need to use this in several ways. In one application, I want to checksum tables a little at a time. The idea is to reduce the impact on the server, with a sleep between the checksums. Another application, which finds and resolves differences between MySQL tables, needs to partition coarsely at first, then further subdivide partitions that are “interesting.”
At the moment, I think doing this efficiently requires finding the partition endpoints. This is easy with numeric data. It’s very efficient to select the MIN() and MAX() from an indexed column, EXPLAIN a query to see how many rows the query optimizer thinks are in it, and do a little subtraction and division to find the endpoint of each range. These can then go into a BETWEEN clause to implement the partitioning.
I can treat temporal data the same way. Instead of subtracting numbers, I can use DATEDIFF() or similar to find out how large the range is, and then use other date math functions to generate successive endpoints.
Character data isn’t so simple. Character sets and collations seem to make it harder to find endpoints and be assured of a cover over the set. If the characters were restricted to 0-9 and A-F, of course I could just treat them as hexadecimal and do ordinary math. Even generating a cover isn’t all that hard, but making sure they are disjoint strikes me as hard.
One idea is a binary search, or trial and error, to find endpoints. I could use EXPLAIN to ask how many rows are greater than some value, and vary the value to discover the approximate distribution of the data, choosing endpoints along the way. I think this path is fraught with difficulties, though.
A challenge
The above paragraphs should give you an idea what I’ve been considering for this problem. It might make it easier for people to help me if I give a specific example and ask for a solution. Here we go:
- You have a single table in MySQL with a
varchar(50)primary key. - The minimum value in the primary key is ‘aardvark’ and the maximum is ‘БългарÑки.’ I intentionally won’t say any more about the data; your solution must be able to figure out what to do based on limited knowledge, such as using EXPLAIN to access index statistics.
- The charset is utf8 and the collation is case-insensitive.
- There are 18 million rows, more or less, in the table.
- You want to select the data in chunks of approximately a million rows, as explained above.
- The solution must generate about 18 separate queries of around a million rows each, and each query must not examine and discard, or scan through, any rows it does not include in the result. I think you want to do this with successive BETWEEN clauses, but if there is another way that is efficient and correct, that’s good too.
- LIMIT with OFFSET is out of the question, because it actually selects and discards part of its results.
- A single GROUP BY is no good either, because it will run in one query, not 18.
- You do not have to do this in pure SQL. You can bring any scripting or programming language to your aid, to help generate queries or do some math or whatever.
- You may not create temporary tables, triggers, views, or stored procedures. Your solution must result in SQL queries beginning with the word “SELECT” that a user can run without altering anything about the server; the server is read-only for purposes of this challenge. Likewise, you may not alter the table in any way.
Can you think of any solutions?



This might be a little silly…
Select 1601 rows completely randomly and sort them by the primary key.
Use rows 1, 101, 201 … 1501, 1601 (17 rows) from that query, the first row(‘aardvark’) and the last row(‘БългарÑки’) as your range limits. That gives you 19 points or 18 ranges..
This assumes that the 1601 rows you retrieved are representative of the entire set, which is impossible to guarantee. The number of results returned by each range queries could be wildly different than our goal (1million). I’m also not sure it’s possible to efficiently select those 1601 rows.
/me shrugs ;)
This is what I would try first!
Olivier
Olivier
11 Jun 07 at 6:58 pm
I suppose it is too late to do this incrementally, partitioning new records as they are created… :-P
With arbitrarily clustered data, I think binary partitioning is going to be the best method. A full binary search takes 24 iterations on a set of 18 million, so I think that is reasonable.
Do you want pseudocode?
Tim McCormack
11 Jun 07 at 9:48 pm
Hi Olivier,
Yes, efficiently selecting the random rows is a difficulty I think. I wish MySQL exposed its index stats in a queryable form. That would make this easy, no?
Hi Tim,
Yep, too late :-) Binary search seems promising to me too, but what’s an algorithm to go, for example, halfway between ‘aardvark’ and ‘БългарÑки?’ Can you split the middle? What if the character set is, oh, say Shift-JIS Japanese collated sjis_japanese_ci? That part seems hard to me. I wouldn’t need pseudocode for binary search, but if you have code for choosing midpoints, I’d love to see that!
Xaprb
11 Jun 07 at 10:21 pm
The text strings comprise a well-ordered set, yes? And the first few characters of each string are the important ones — short strings can be considered right-padded with NULs. To take the midpoint, walk into the string until you find the first differing character. Split the difference of the Unicode numeric code, and find the next valid character. E.g.: ELEPHANT and ELEGANT, ELE (P-G)/2 -> ELEK. ELEK is the Unicode midpoint.
Tim McCormack
11 Jun 07 at 11:29 pm
Okay, that should be (P G)/2, obviously.
Tim McCormack
11 Jun 07 at 11:31 pm
It is well-ordered, yes. Collations are more complex than this, though. Even ASCII collated latin1_ci is more complex than simply looking at the numeric code and doing math on it. For example, suppose MySQL tells me the minimum value is ‘a’ and the maximum is ‘B’.
select ord('a'), ord('B'); ---------- ---------- | ord('a') | ord('B') | ---------- ---------- | 97 | 66 | ---------- ----------As far as I know, you have to find the midpoint based on the collation, not the numeric code. Other character sets and collations make things even more interesting; for example, some collations consider multiple “letters” to be a single “character.”
When the strings share a common prefix it makes certain things easier, but what is the midpoint between ‘a’ and ‘Б’ — I have a feeling the answer will depend on the collation.
Whatever the answer, it must never disagree with MySQL’s view of the world; to get the database to partition the set, we have to consider the database right, even if it deviates from standards or is otherwise “wrong” according to the external world. That tells me that I should ask the database for the answer somehow. This has been a successful tactic in other problems; for example, using STR_CMP() to ask how two strings should be sorted in the database’s view. But I can’t figure out how to apply it here.
Xaprb
12 Jun 07 at 9:12 am
If you know what the collation is, then you can just apply what has been described above (binary search using a radix method). Are you asking for an API to get the total ordering of the collation (allowing you to compute midpoints), making your solution collation independent?
Alternatively, does the solution have to partition based on your collation order? Can you just apply, e.g. a hash or simple cast to binary of the PK in one pass, build the index afterwards, and then simply partition on that? This would be a one-time hit. In the future you can compute the cast or hash in app logic before the insert.
And if you don’t need exact up-to-date consistency (and it sounds like you don’t), you can do this off-line, as in data warehousing, which is not my area of expertise :)
Have you checked out http://mysql.openmirrors.org/doc/refman/5.1/en/partitioning.html
or
http://forums.mysql.com/list.php?106
Maybe you’re not looking to partition your database for storage, but they might be able to shed some light on partitioning data in general.
Nick Gerner
12 Jun 07 at 1:16 pm
Got it. One would need either A) a deep knowledge of the collation and character set, or B) a way to ask the DB for “midpoints”.
Bleh.
Tim McCormack
12 Jun 07 at 2:30 pm
Tim: yes, that’s what I am afraid of. I was hoping someone can show me a clever way to either do it without needing to know about collations or some other method entirely. Knowing the collation wouldn’t be enough either, if MySQL’s implementation of the collation differed from whatever else you might use. That’s why asking MySQL for the truth (in its view) would be the only thing I’d really feel safe doing.
Xaprb
12 Jun 07 at 2:52 pm
Sounds like you’ve got a good point about truth and different implementations. There is a standard for utf8 collations: http://www.unicode.org/reports/tr10/
Unfortunately it appears MySQL doesn’t support it by default (for valid perf reasons). See http://bugs.mysql.com/bug.php?id=15913
If you don’t need to be too precise (in terms of size equality of partitions) you might still be able to use the standard Unicode collation ordering to do the binary search with EXPLAIN as Tim suggests above.
You’re right that it would be nice if MySQL would expose its index distribution estimates.
Nick Gerner
12 Jun 07 at 3:20 pm
Nick: sorry, I had a response started but got pulled away. It would have to be done on-line. One important application of this is checksumming tables in chunks, to avoid locking a huge table for a long time while checksumming it.
Alas, I was hoping someone would show me something very clever… not that the binary search thing isn’t very clever, but I think this is just a hard problem to solve, and I’d love it if there were some magic wand I could wave at it.
Xaprb
12 Jun 07 at 4:22 pm
Assuming you know the number of rows per partition – N, you can do
something like:
– fetch the first chunk
chunk = SELECT name from table ORDER BY name LIMIT N
– while there are other rows to be fetched
do
last_name = chunk[N-1]
chunk = SELECT name from table WHERE name > last_name
ORDER BY name LIMIT N
done
hth,
C
Anonymous
12 Jun 07 at 7:02 pm
That’s it! Blindingly obvious, yet I couldn’t see it. It will not work on non-unique keys, but it’ll work perfectly on primary keys and unique keys.
Even more amazing, I have been using similar algorithms on MySQL Archiver, which I’ve been working on for weeks. How did I miss this?
Thanks!
Xaprb
12 Jun 07 at 8:40 pm
Ah, blinded by my own love of fancy algorithms. Yes, that looks like a winner.
Tim McCormack
13 Jun 07 at 9:19 am
Hello
I’m glad to see you have liked my solution – yes it works only on
unique indexes (on a non-unique index the where clause name > last_name
may skip some rows), but it works for any data types.
I’ve found your articles very useful and instructive, keep up the good work!
C
Anonymous
13 Jun 07 at 4:18 pm
Sorry if I am asking a stupid question, but doesn’t “SELECT foo FROM table WHERE foo > bar ORDER BY foo LIMIT n” scan through the table, sort the rows and discard part of the results? How is it different from “LIMIT n OFFSET m” in terms of performance?
Newbie
18 Jun 07 at 8:53 pm
No, no, it’s a good question. It is efficient for several reasons:
1) WHERE foo > bar makes MySQL seek into the index and start the scan there. It doesn’t have to scan the rows before that in the index.
2) The rows are already sorted because I’ll use the primary key.
3) LIMIT N makes MySQL stop after it finds enough rows.
By contrast, OFFSET makes MySQL throw away OFFSET number of rows.
That said, I actually realized this won’t work in the particular situation I need it to — because the real query I’m using is grouped implicitly into a single row with MAX(), and LIMIT operates after grouping, not before. This is why I was thinking it needed to be a BETWEEN clause — I was thinking so hard about the problem I forgot the context.
There are other options, but still nothing magical is presenting itself. I’m going to work on some other things, where I have work lined up for myself, and see if something clever comes to me. It has happened in the past.
Xaprb
19 Jun 07 at 9:16 pm
Expanding on Anonymous’ solution, couldn’t you do the following?
- fetch the first chunk
select min(name) as min_name, max(name) as max_name from ( SELECT name from table ORDER BY name LIMIT N) temp
- while there are other rows to be fetched
do
last_name = max_name from chunk[N-1]
select min(name) as min_name, max(name) as max_name from ( SELECT name from table WHERE name > last_name
ORDER BY name LIMIT N) temp
I haven’t tested this on all versions of MySQL but it seems to work on 5.0 where it leverages the index although it does end up using a derived table to temporarily hold the data from the index.
-Dipin
Dipin
21 Jun 07 at 3:03 pm
Thanks for the explanation.
Newbie
21 Jun 07 at 6:21 pm