Tag Archive for 'checksum'

Progress on Maatkit bounty, part 3

This is the last day I’m taking off work to hack on mk-table-sync, and I thought it was time for (yet another) progress report. Here’s what I have done so far:

  • All the code, except for a tiny bit of “glue” and “setup” code, is in modules.
  • Lots more tests for the modules.
  • A new sync algorithm (I still haven’t rewritten the top-down and bottom-up, which are designed for network efficiency more than MySQL efficiency, and are very complicated). This algorithm is called “Chunk” and is based on the chunking module I’m re-using from two of the other tools. This allows syncing the table a bit at a time to avoid locking it so much.
  • The tool chooses its own parameters, including choosing the sync algorithm automatically by examining indexes.
  • Proper exit codes, as well as several other smaller issues requested via bug reports.
  • The tool now syncs entire servers. That is, you don’t have to specify a table. It’ll find all the tables and just sync them.
  • The tool can sync many servers. You give it five servers, it will treat the first as the source, and sync every table in the source to each of the four remaining servers in turn.
  • It can work via replication. It can discover a master’s slaves via SHOW SLAVE HOSTS and sync each slave to the master. You can also point it at a slave and it’ll discover the master, connect to it, and sync the slave to the master.
  • It integrates with mk-table-checksum’s results. If you’ve given the –replicate option to mk-table-checksum, the slave’s results are stored in a table. It can read that table and sync anything marked as different. This can be combined with sync-to-master and auto-discover-slaves functionality.
  • Lots of other bugs and problems are gone simply because I’m using the modules I wrote for other tools. This includes issues with table parsing, identifier quoting, etc etc. As an aside, I have to roll my own for almost everything, because I can’t rely on things like DBI’s quote_identifier() function — it does not work in earlier versions, which are amazingly common in the real world.

Whew! So what isn’t done yet?

  • Bi-directional syncing.
  • The Nibble sync algorithm. It will be preferred over Chunk and can be used in more cases.
  • Documentation.
  • Full support for wide characters. (This is non-trivial in Perl. I need to research it. A partial solution might not be hard, but I’m worried about the versions included in, for example, RHEL 3, which is very widely used.)
  • Updating other tools to work right with the changes to shared code.
  • Locking and transaction code. The tool will ultimately use FOR UPDATE/LOCK IN SHARE MODE automatically on InnoDB tables instead of locking them, for example.

Here’s a sample of what it can do, using a replication sandbox I set up with Giuseppe’s MySQL Sandbox. The sandbox contains a copy of the Sakila sample database. I’ll just mangle a few films on the slaves:

baron@kanga:~$ cd rsandbox_5_0_45/
baron@kanga:~/rsandbox_5_0_45$ ./s1
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.0.45-log MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

slave1 [localhost] {msandbox} ((none)) > update sakila.film set title='academy dinosaur2' limit 12;
Query OK, 12 rows affected, 12 warnings (0.07 sec)
Rows matched: 12  Changed: 12  Warnings: 0

slave1 [localhost] {msandbox} ((none)) > Bye
baron@kanga:~/rsandbox_5_0_45$ ./s2
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.0.45-log MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

slave2 [localhost] {msandbox} ((none)) > update sakila.film set title='academy dinosaur2' limit 1;
Query OK, 1 row affected, 1 warning (0.05 sec)
Rows matched: 1  Changed: 1  Warnings: 0

slave2 [localhost] {msandbox} ((none)) > Bye

OK, now I’ve messed up the first 12 films on one slave, and the first 1 on another. I could just go ahead and sync them right away, but first I’ll do a table checksum to demonstrate that functionality:

baron@kanga:~/rsandbox_5_0_45$ mk-table-checksum --replicate=test.checksum --port=16045 127.0.0.1 -q

And now I’ll tell the sync tool to go fix the differences the checksum revealed:

baron@kanga:~/rsandbox_5_0_45$ mk-table-sync  --replicate=test.checksum h=127.0.0.1,P=16045 -vx
# Syncing P=16046,h=127.0.0.1
# DELETE INSERT UPDATE ALGORITHM DATABASE.TABLE
#      0      0     12 Chunk     sakila.film
#      0      0      0 Chunk     sakila.film_text
# Syncing P=16047,h=127.0.0.1
# DELETE INSERT UPDATE ALGORITHM DATABASE.TABLE
#      0      0      0 Chunk     sakila.film
#      0      0      0 Chunk     sakila.film_text
baron@kanga:~/rsandbox_5_0_45$ 

Pretty easy, huh? Take a look at the output: the first thing it did was fix the 12 films I changed. sakila.film has a trigger that updates sakila.film_text, so that table got changed too. The checksum tool caught this difference, but the differences were gone by the time the sync tool examined them, again due to the trigger. On the second slave, no differences were found at all, because the changes to the first slave were made on the master, automatically fixing the second slave. (This won’t always be the case, but it worked in this example).

While I’d love to continue building the perfect beast, I’m going to have to call it quits around noon today and start cleaning up, writing the documentation, and getting ready to release the code. I’m not sure how much I’ll finish in that time.

By the way, anyone who wants to is welcome to get the code from the Maatkit SVN repository! I never make a big deal out of that because I generally assume people want to run released code, but SVN is there if you want it…

Technorati Tags:, , , , , ,

You might also like:

  1. How to sync tables in master-master MySQL replication
  2. Maatkit version 1579 released
  3. Progress on Maatkit bounty, part 2
  4. Get Maatkit fast from the command line
  5. MySQL Table Checksum 1.1.0 released

Maatkit version 1297 released

Download Maatkit

Maatkit (formerly MySQL Toolkit) version 1297 contains a significant update to MySQL Table Checksum (which will be renamed soon to avoid trademark violations). The changelog follows. What you don’t see in the changelog is the unit test suite! I got a lot more of the code into modules that are tested and re-usable.

2007-11-18: version 1.1.19 

* Check for needed privileges on --replicate table before beginning. 
* Made some error messages more informative. 
* Fixed child process exit status with 8-bit right-shift. 
* Improved checksumming code auto-detects best algorithm and function. 
* Added --ignoreengine option; ignores federated and merge by default. 
* Added --columns and --checksum options. 
* Removed --chunkcol, --chunksize-exact, --index options. 
* --chunksize can be specified as a data size now. 
* Improved chunking algorithm handles more cases and uses fewer chunks. 
* Do not print --replcheck results for servers that are not slaves. 
* Create only one DB connection for each host, not one per host/tbl/chunk. 
* Code assumed backtick quoting, broke on SQL_MODE=ANSI (bug #1813030). 
* There were many potential bugs with database and table name quoting. 
* Child exit status errors could be masked by subsequent successes.
Technorati Tags:, ,

You might also like:

  1. Maatkit version 1753 released
  2. Maatkit version 1674 released
  3. Maatkit version 1877 released
  4. Maatkit version 1579 released
  5. Maatkit version 1314 released

How to select the first or last row per group in SQL

There is no “first” or “last” aggregate function in SQL. Sometimes you can use MIN() or MAX(), but often that won’t work either. There are a couple of ways to solve this vexing non-relational problem. Read on to find out how.

First, let’s be clear: I am posing a very non-relational problem. This is not about the minimum, maximum, top, most, least or any other relationally valid extreme in the group. It’s the first or last, in whatever order the rows happen to come. And we all know rows aren’t ordered — in theory. But in practice they are, and sometimes you need the first or last row in a group.

If you have a question this article doesn’t answer, you might like to read how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

A MySQL user-variable solution

I’ll show a MySQL-specific solution with one of the queries I developed for MySQL Table Checksum.

Here’s the idea: crush an entire table down to a single checksum value by checksumming each row, mushing it together with the previous row’s checksum, and then checksumming the result again. It’s fairly easy to do this, but it’s hard to get the final result in one statement. This is necessary to use the statement in an INSERT .. SELECT, which I needed to do.

An example might clarify:

select * from fruit;
+---------+
| variety |
+---------+
| apple   | 
| orange  | 
| lemon   | 
| pear    | 
+---------+

set @crc := '';

select variety, @crc := md5(concat(@crc, md5(variety))) from fruit;
+---------+-----------------------------------------+
| variety | @crc := md5(concat(@crc, md5(variety))) |
+---------+-----------------------------------------+
| apple   | ae6d32585ecc4d33cb8cd68a047d8434        | 
| orange  | 7ec613c796f44ef5ccb0e24e94323e38        | 
| lemon   | a2475f37be12cebf733ebfc7ee2ee473        | 
| pear    | ec98fe57833bbd91790ebc7ccf84c7e9        | 
+---------+-----------------------------------------+

I want the “last” value of @crc after the statement is done processing. How can I do this? The solution I found is to use a counter variable. I’ll demonstrate:

set @crc := '', @cnt := 0;

select variety,
   @cnt := @cnt + 1 as cnt,
   @crc := md5(concat(@crc, md5(variety))) as crc
from fruit;
+---------+------+----------------------------------+
| variety | cnt  | crc                              |
+---------+------+----------------------------------+
| apple   |    1 | ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  |    2 | 7ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   |    3 | a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    |    4 | ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+------+----------------------------------+

The counter variable might make you want to write something like HAVING cnt = MAX(cnt), but that won’t work (try it!). Instead, I prefixed the checksum with the count so the last row is the stringwise maximum:

select variety,
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety)))) as crc
from fruit;
+---------+--------------------------------------------+
| variety | crc                                        |
+---------+--------------------------------------------+
| apple   | 0000000001ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  | 00000000027ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   | 0000000003a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    | 0000000004ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+--------------------------------------------+

You can see I also left-padded the count so a lexical sort will agree with a numeric sort, and so I can predict how many extra characters I’ll need to remove to get back the original value. Now I can use the MAX() function to select the last row, and simply lop off the leftmost ten digits (I use the RIGHT() function for convenience, but generally you want to use SUBSTRING()):

select right(max(
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety))))
   ), 32) as crc
from fruit;
+----------------------------------+
| crc                              |
+----------------------------------+
| ec98fe57833bbd91790ebc7ccf84c7e9 | 
+----------------------------------+

Et voila, I got the last value in the group. By the way, this will work with ONLY_FULL_GROUP_BY in the server’s SQL mode.

Other methods

My solution relies on a MySQL user variable to do the counting, but there are many ways to number rows in SQL: you could simulate the ROW_NUMBER() function, for instance, or use techniques mentioned in the comments on how to number rows in MySQL (one of the comments shows a particularly clever solution with subqueries, but I didn’t want to use it because MySQL doesn’t support subqueries in older versions). Any of these should work one way or another. Of course, if you are using a product such as Microsoft SQL server 2005, which actually has the ROW_NUMBER() function, you can use that!

Conclusion

Finding the first or last row is a bit unintuitive, and it’s definitely non-relational, but sometimes it’s what you need. The technique I demonstrated in this article is easily adaptable to many kinds of queries. I hope it helped you!

If this article didn’t solve your problem, please read these before posting questions to the comments: how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

Technorati Tags:, , , , ,

You might also like:

  1. How to simulate the SQL ROW_NUMBER function
  2. How to number rows in MySQL
  3. How to simulate the GROUP_CONCAT function
  4. How to select the first/least/max row per group in SQL
  5. Advanced MySQL user variable techniques

MySQL Toolkit version 675 released

Download MySQL Toolkit

I’ve just released changes to two of the tools in MySQL Toolkit. MySQL Table Checksum got some convenient functionality to help you recursively check slaves for bad replicated checksum chunks. MySQL Archiver got statistics-gathering functionality to help you optimize your archiving and purging jobs, plus a few important bug fixes.

Changes in MySQL Archiver:

  • Made –time suffix optional.
  • Added –statistics option to gather and print timing statistics.
  • Added signal handling so mysql-archiver exits cleanly when it can.
  • Changed exit status to 0 when –help is given.
  • Out-of-column-order primary keys were not ascended correctly.

Changes in MySQL Table Checksum:

  • Added –replcheck option to check –replicate results on slaves.
  • Added –recursecheck option to do –replcheck recursively.
Technorati Tags:, , , , , ,

You might also like:

  1. MySQL Toolkit version 848 released
  2. MySQL Toolkit released as one package
  3. MySQL Toolkit version 815 released
  4. MySQL Toolkit distribution 620 released
  5. Maatkit version 1314 released

MySQL Toolkit distribution 620 released

Download MySQL Toolkit

MySQL Toolkit distribution 620 updates documentation and test suites, includes some major bug fixes and functionality changes, and adds one new tool to the toolkit. This article is mostly a changelog, with some added notes.

Many of the tools have matured and I just needed to make the documentation top-notch, but there’s still a lot to be done on the crucial checksumming and syncing tools. Time is in short supply for me right now, though. In fact, I actually finished this release on June 22, but wasn’t able to release it till just tonight!

Documentation is now maintained online at the MySQL Toolkit website, by the way.

mysql-archiver

Version 0.9.3

Changes

  • Added more hooks for plugins before and after archiving.
  • Documentation.
  • Made –time suffix optional.

Bugs fixed

  • MySQL Archiver could crash on a lock wait timeout when –txnsize was not set

mysql-deadlock-logger

Version 1.0.2

Incompatible changes

  • Changed the format of the –source and –dest options.

Changes

  • Documentation.

mysql-duplicate-key-checker

Version 1.0.4

  • Documentation.

mysql-find

Version 0.9.3

  • Documentation.

mysql-query-profiler and mysql-profile-compact

Version 1.1.2

  • Documentation

mysql-show-grants

Version 1.0.2

  • Documentation.

mysql-slave-restart

Version 0.9.2. This is an initial release of a new tool. I found myself in a situation where I needed to do some complex error-skipping on a slave (its relay logs got into an infinite loop). I have written throwaway scripts to skip, restart, check, skip several times in the past, but this situation called for something more complex. Again I realized I was three-quarters of the way to a more flexible, powerful tool many people might find useful, so I went ahead and put the extra effort into it.

It ended up helping me avoid re-snapshotting a slave with a ton of data, so it was worth it.

mysql-table-checksum and mysql-checksum-filter

This version fixes some badly optimized chunking queries. As I have mentioned in the past, the chunking behavior is preliminary and subject to change. This is still true, but this release is much smarter than the previous release! I have also fleshed out some methods of doing chunking on real-valued columns (float, decimal, and even character). I don’t know when I’ll get a chance to code, test, and release that.

Even though much remains to be done, MySQL Table Checksum is still a great way to check that your slaves have the same data as the master. (In fact, it’s the only way I know of — and MySQL employees themselves recommend MySQL Table Checksum).

Version 1.1.8

Changes

  • Documentation.
  • Support complex host definitions.
  • Added –explainhosts option to debug host definitions.
  • Added –explain option.
  • When exact chunking is impossible, mysql-table-checksum will use approximate.

Incompatible changes

  • Added required ‘boundaries’ column to checksum table for –replicate.

Bugs fixed

  • Chunking on temporal types defeated indexes.

mysql-table-sync

Version 0.9.5

  • Documentation.
Technorati Tags:, , , ,

You might also like:

  1. MySQL Table Checksum 1.1.6 released
  2. MySQL Table Checksum 1.1.9 released
  3. MySQL Toolkit version 896 released
  4. Maatkit version 1877 released
  5. MySQL Toolkit version 815 released

A challenge: partition a character set in MySQL

How good are your SQL and/or general coding skills? I have a specific challenge I’d like your help solving. I am not sure it’s possible, but I’d love to be proven wrong.

I’ll explain some background for the problem first, and then pose the challenge at the end of the article.

The problem

Several of the algorithms I’ve been implementing require data to be partitioned for a divide-and-conquer approach. This is easy enough with numeric and even with temporal data, but character data is more difficult, and I don’t have a good strategy yet.

The problem is how to work on only part of the data at a time, yet not miss any of it. In geeky terms, I want to partition the set, which means to divide it into disjoint subsets that complete a cover over the set.

To give a little more context, I need to use this in several ways. In one application, I want to checksum tables a little at a time. The idea is to reduce the impact on the server, with a sleep between the checksums. Another application, which finds and resolves differences between MySQL tables, needs to partition coarsely at first, then further subdivide partitions that are “interesting.”

At the moment, I think doing this efficiently requires finding the partition endpoints. This is easy with numeric data. It’s very efficient to select the MIN() and MAX() from an indexed column, EXPLAIN a query to see how many rows the query optimizer thinks are in it, and do a little subtraction and division to find the endpoint of each range. These can then go into a BETWEEN clause to implement the partitioning.

I can treat temporal data the same way. Instead of subtracting numbers, I can use DATEDIFF() or similar to find out how large the range is, and then use other date math functions to generate successive endpoints.

Character data isn’t so simple. Character sets and collations seem to make it harder to find endpoints and be assured of a cover over the set. If the characters were restricted to 0-9 and A-F, of course I could just treat them as hexadecimal and do ordinary math. Even generating a cover isn’t all that hard, but making sure they are disjoint strikes me as hard.

One idea is a binary search, or trial and error, to find endpoints. I could use EXPLAIN to ask how many rows are greater than some value, and vary the value to discover the approximate distribution of the data, choosing endpoints along the way. I think this path is fraught with difficulties, though.

A challenge

The above paragraphs should give you an idea what I’ve been considering for this problem. It might make it easier for people to help me if I give a specific example and ask for a solution. Here we go:

  • You have a single table in MySQL with a varchar(50) primary key.
  • The minimum value in the primary key is ‘aardvark’ and the maximum is ‘Български.’ I intentionally won’t say any more about the data; your solution must be able to figure out what to do based on limited knowledge, such as using EXPLAIN to access index statistics.
  • The charset is utf8 and the collation is case-insensitive.
  • There are 18 million rows, more or less, in the table.
  • You want to select the data in chunks of approximately a million rows, as explained above.
  • The solution must generate about 18 separate queries of around a million rows each, and each query must not examine and discard, or scan through, any rows it does not include in the result. I think you want to do this with successive BETWEEN clauses, but if there is another way that is efficient and correct, that’s good too.
    • LIMIT with OFFSET is out of the question, because it actually selects and discards part of its results.
    • A single GROUP BY is no good either, because it will run in one query, not 18.
  • You do not have to do this in pure SQL. You can bring any scripting or programming language to your aid, to help generate queries or do some math or whatever.
  • You may not create temporary tables, triggers, views, or stored procedures. Your solution must result in SQL queries beginning with the word “SELECT” that a user can run without altering anything about the server; the server is read-only for purposes of this challenge. Likewise, you may not alter the table in any way.

Can you think of any solutions?

Technorati Tags:, , , , , , , , ,

You might also like:

  1. How to simulate the SQL ROW_NUMBER function
  2. MySQL challenge: LIMIT rows accessed, not rows returned