Xaprb

Stay curious!

How to calculate table checksums in MySQL

with 16 comments

MySQL has no built-in functionality to calculate a table’s checksum in any storage engine but MyISAM (this is not true; I read the manual wrong, but it doesn’t eliminate the usefulness of the technique in this article). Table checksums can confirm that two tables are identical — useful to verify a slave server is in sync with its master (see my article on reliable MySQL replication for more). Fortunately, it’s easy to calculate a checksum on non-MyISAM tables with user variables. This technique works on any storage engine with any version of MySQL, doesn’t require the BLACKHOLE storage engine, and avoids locks caused by INSERT... SELECT on InnoDB tables.

Update I’ve coded this method into a Perl script for you to use. See MySQL Table Checksum for more details.

MyISAM checksums

If you want to verify that two MyISAM tables are identical, you can use CHECKSUM TABLE. If you need to do this frequently, you might want to make sure your MyISAM tables have the CHECKSUM = 1 option on so checksums are maintained continuously and don’t have to be calculated.

That was easy! Now on to the hard part: how to compare tables when one or both uses a storage engine other than MyISAM. That’s what the rest of this article is about.

Foundation

I derived this technique from Giuseppe Maxia’s article on how to use the BLACKHOLE storage engine to avoid cursors. Giuseppe uses a two-step process:

  1. Concatenate the values of all columns in the first row, and get a checksum of it with a cryptographic hash function.
  2. Concatenate this checksum with a global checksum, and take the checksum of the result, then move to the next row and repeat.

After processing each row in the table this way, you have a checksum of the entire table’s contents.

Two ways to iterate over rows

Giuseppe showed two ways to iterate over every row in the table: one with cursors, and another with user variables. Cursors in MySQL can’t be used with dynamically prepared statements, so they are much less practical for a generic method of checksumming any table; that would require creating a different routine for each table. So cursors are out, and user-variables are in. Here’s the essential code:

set @crc := '', @cnt := 0;
select @crc := sha1(concat(@crc,
    sha1(concat_ws('#', col1, col2... colN)))),
    @cnt := @cnt + 1 from tbl order by ID;
select @crc, @cnt;

(This code both CRCs the table and counts the number of rows in it).

The problem with this method, as Giuseppe points out, is the result set. If you run this on a table with a million rows, you’re going to get a million rows of CRCs and counts back from the first SELECT. This is not good; you really want the server to throw the intermediate values away so they don’t consume resources needlessly.

Giuseppe’s solution to this is to change the SELECT to an INSERT... SELECT. The INSERT goes into a BLACKHOLE table, which is like dumping it into /dev/null. It is simply discarded. This avoids some overhead, and Giuseppe found that it’s faster than using a cursor.

Limitations and performance concerns

There are still some undesirable effects of this approach. One is the BLACKHOLE storage engine. It’s not included in MySQL’s official binaries, so you’ll have to compile your own server to use it. It’s also not available before version 4.1.11. Those limitations rule this method out for many users.

The second is InnoDB. If you use InnoDB (odds are you do, or you wouldn’t be reading this) the INSERT... SELECT will place shared locks on every row in the table. If you’re doing this on an otherwise quiet server, it may not matter, but if you’re taking live checksums on a production machine, it’s much better to avoid those locks.

I’ll show you how to avoid both problems.

Wrap the user-variables up and discard the results

If you’ve read my articles about advanced user-variable techniques in MySQL, you might see where I’m headed. You don’t have to generate a large result set from that SELECT statement. In fact, you can hide the variable assignments, which you care about, inside a function whose result you don’t care about, and use an aggregate function to eliminate all but one row. Study the following code, which uses the same fruits table as in the user-variables article:

set @crc := '';

select min(
      length(@crc := sha1(concat(
         @crc,
         sha1(concat_ws('#', type, variety, price)))))
   ) as discard
from fruits use index(PRIMARY);
+---------+
| discard |
+---------+
|      40 |
+---------+

select @crc;
+------------------------------------------+
| @crc                                     |
+------------------------------------------+
| 3be9117fff37bcdd3f422e6ce4d24ee2a6642566 |
+------------------------------------------+

Notice a couple of things: there’s only one row, and the MIN() calculation that collapses all those rows into one is very efficient. Maybe a MySQL developer can comment on exactly how much memory this will take, but I think it should be really cheap, since it processes a row at a time and then throws it away.

I omitted the row count calculation for clarity. If you want to count rows as well, the following code will do both in one statement:

set @crc := '', @cnt := 0;

select min(least(
      length(@crc := sha1(concat(
         @crc,
         sha1(concat_ws('#', type, variety, price))))),
      @cnt := @cnt + 1
   )) as discard
from fruits use index(PRIMARY);

select @crc, @cnt;

You should always reset @crc and @cnt between runs so you get repeatable results.

Further considerations

It’s important to order the SELECT by something predictable, or the results will be non-deterministic. However, an ORDER BY clause won’t do it — that orders the final result, not the table scan. Forcing a certain index to be used will do the trick, hence the USE INDEX clause above. If you don’t have a primary key on your table, use a UNIQUE key if that’s available; otherwise, you’re probably not going to be able to get a dependable checksum.

This method is easy to use inside a stored procedure or routine on MySQL 5.0. You can easily build a column list from INFORMATION_SCHEMA and generate the dynamic SQL to checksum a table.

This technique is CPU-bound on our servers. I used the BENCHMARK() function to time different hash functions to try to improve the speed. In my tests, SHA1() took about 85% as long as MD5(). This surprised me; based on some cryptographic function benchmarks I found on the web, I thought MD5() would be faster. It may be system-dependent.

I’m also a little worried about using CONCAT_WS() to stringify each row for the hash function. This function skips NULL values, so it’s easy to come up with an edge case where two rows with different columns hash to the same thing. Although this is a very unlikely problem, I’d still rather not have it. If you know of a way to fix this, please let me know.

Finally, just a comment on doing this on running servers: if you’re comparing a master and slave, you can use LOCK TABLES on the master, find the checksum there, and then find the checksum on the slave before releasing the lock on the master. If your slave isn’t far behind the master, it ought to have plenty of time to catch up while the checksum is running on the master, at which point that table should be identical to the master (because you have the table locked on the master, no changes to the table will be replicating to the slave). This makes it practical to verify a slave is in sync without stopping the whole server.

Perl snippet to generate a SQL statement

For the Perl programmers out there, the following subroutine accepts a database handle, database name and table name, and returns the table’s storage engine and a ready-to-run query.

sub checksum_query {
   my ( $dbh, $db, $tbl ) = @_;
   my $ddl = ($dbh->selectrow_array("show create table $db.$tbl"))[1];
   my ( $type ) = $ddl =~ m/^\) (?:ENGINE|TYPE)=(\S+)/m;
   my $cols  = join(', ', $ddl =~ m/^\s+(`[^`]+`)/gm);
   my $index = $ddl =~ m/PRIMARY KEY/ ? ' USE INDEX(PRIMARY)' : '';
   my $query = 'SELECT MIN(LEAST(0, LENGTH(@crc := SHA1('
      . 'CONCAT_WS("#", @crc, SHA1(CONCAT_WS("#", ' . $cols . '))))),'
      . '@cnt := @cnt + 1)) AS len FROM ' . "`$db`.`$tbl` $index";
   return ( $type, $query );
}

I prefer to use SHOW CREATE TABLE instead of DESCRIBE TABLE, because it gives me all information about the table, such as the storage engine and index types. I have also found it to be faster, and of course it works on pre-5.0 versions.

Summary

In this article I showed you how to calculate a checksum for an entire table’s contents in MySQL. I used my favorite technique of hiding user-variable assignments inside a function to reduce a table scan’s result set to a single integer. This avoids a lot of overhead, and eliminates the need to insert the result set into a BLACKHOLE table, which means you can use the technique on a standard MySQL-supplied server. It also avoids InnoDB row locks. And it works on all storage engines, with all versions of MySQL that support user variables (3.23.6 and up).

I also touched a bit on the finer points of NULL values and ORDER BY for consistent results. And I gave you some ready-to-use Perl code to generate the SQL you need to execute against a given table.

This is the fastest, easiest, most efficient way I know to compare the contents of two or more tables, which is necessary to verify that a replicated master and slave server are still in sync.

Written by Xaprb

January 25th, 2007 at 4:01 pm

Posted in Uncategorized

Tagged with , ,

16 Responses to 'How to calculate table checksums in MySQL'

Subscribe to comments with RSS

  1. Hi,

    Nice approach.

    For completeness sake, please see this article which used quite a similar method, back when stored procedures were not available.

    Cheers

    Giuseppe

    Giuseppe Maxia

    25 Jan 07 at 5:29 pm

  2. That is an interesting twist — to get a checksum of each row, then convert it from base sixteen to four base ten numbers, then SUM() those and checksum the result. Thanks for providing the link. That is a useful technique.

    Xaprb

    25 Jan 07 at 6:10 pm

  3. Hello,

    thanks for your article, it is very interesting.

    I have a suggestion on how to check for NULL values as well. How about including an artifically created value in the sha1 checksum, which is basically a bitfield signaling where the row contains a NULL and a non-NULL value. This should be the same for matching rows, so the checksum should still be unchanged.

    sha1(concat_ws([columns of table], concat(ISNULL(column1), ISNULL(column2),...])).

    There are probably other ways to use the ISNULL() function for the checksumming, but I think this is a fairly simple method.

    It would be easy to add this to your dynamic perl script, and I believe (I haven’t tested) it would not be a drag on the performance either (after all it’s just a few comparisons).

    Cheers,
    Adam

    Adam Burg

    12 Feb 07 at 8:44 pm

  4. Adam: brilliant! It took me a minute to understand what you mean, but I got it!

    Xaprb

    13 Feb 07 at 8:24 am

  5. So efficient it is! Thanks for your blog, i’ve checksumed some million-level tables in several minutes with user-variable based tech.

    For online product DB, the checksum calculation running during several minutes maybe a big pressure to the server. So I add “WHERE ID BETWEEN XX AND YY”, in which “XX,YY are pre-calculated PRIMARY ID range, into the SQL and separate those ranges with sleeping several seconds in order to relax the server a little. It’s like:

    calculate checksum, ID from 1 to 50000
    sleep(3)
    calculate checksum, ID from 50001 to 100000
    sleep(3)
    ….

    Another suggestion is about how to calculating the checksum to a table that do not have an auto-increase PRIMARY(index) ID but an UNIQUE KEY. I’m not sure if it can work okay without “USE INDEX()” on both master and slave(the order maybe different?), maybe we can use “USE INDEX(CERTAIN_UNIQUE_KEY_NAME)” to force the check order.

    I also want to solve a problem which combine these two issue above, to calculate the checksum for a table has only an UNIQUE-KEY (e.g: ID1-ID2, a two-columns unique key) with separated ranges one by one, can you help me?

    qinyf

    23 Apr 07 at 1:35 am

  6. Hi qinyf,

    I have limited time right now since I’m at the MySQL conference, but these sound like issues I would like to address on the mysqltoolkit mailing list or bug tracking system, because I think these would be great features for mysql-table-checksum. Would you please add feature requests there? When I get back from the conference I’ll try to address it. I think the issue of which index to use is solved in that tool (but let me know if it doesn’t meet your needs), but maybe the tool can figure out a smart way of dividing the checksum into ranges. It can already do a WHERE clause but that doesn’t sound like enough for you.

    Xaprb

    23 Apr 07 at 10:12 am

  7. hi Xaprb,

    Because I’m in vacation these days, very sorry for my late reply.

    It can be solved by WHERE if the PRIMARY KEY is auto-increase(it works well in my script), dividing the checksum into ranges. However, if the key isn’t numeric or sortable, for example, an PRIMARY or UNIQUE KEY (ID1, ID2), a counter maybe needed to get checksum ranges.

    qinyf

    10 May 07 at 10:35 pm

  8. Hi qinyf, I have added this feature to MySQL Table Checksum. I have not released it, but you can get the code from Subversion.

    Xaprb

    15 May 07 at 11:16 am

  9. Hi,

    I think I have a very similar topic.
    Have you ever tryed to sign a full table in order to certify it has not been changed ?

    I work on Mysql 5.0. If you have suggestion to do so…

    Thx in advance.

    Jok

    Jok

    14 Jun 07 at 12:59 pm

  10. Yes, and I have now developed this technique into a very powerful tool to do so. Look on Sourceforge for MySQL Table Checksum.

    Xaprb

    14 Jun 07 at 4:29 pm

  11. Hi Xaprb,

    Thank you for your quick answer.
    Could you be more specific about the tool (mysql-table-checksum) I should use in my case and the way to manipulate it ?

    Jok

    15 Jun 07 at 4:13 am

  12. The best thing to do is read the embedded Perl documentation, which becomes a man page if you install the tool, so ‘man mysql-table-checksum’. If there is anything missing from the documentation please submit a bug report, as I want to make it complete and useful.

    Xaprb

    15 Jun 07 at 9:39 am

  13. I have a useful trick to add. You can INSERT the checksum into a table so that the checksum statement gets replicated. The trick is to then additionally OUTFILE and INFILE the checksum. This way the masters checksum gets replicated to the slaves so it can be easily compared to the local value.

    Joachim Zobel

    8 May 08 at 7:41 am

  14. Useful indeed. See Maatkit — its mk-table-checksum tool does this :)

    Xaprb

    8 May 08 at 7:57 am

  15. hello
    “I have a confusion regarding innodb calculating checksum, like myisam it calculates the checksum on its on and result can be seen in information_schema table. But innodb don’t . does it calculate checksum on its own also and save it some where and it calculate checksum on table wise and show the result or row wise and shows a seperate result. i need to have an urgent reply..
    thank u “

    adeela

    adeela

    18 Nov 09 at 4:32 am

  16. I’m a little late to the party, but using 5.1+ statement based replication, the following should be safe to work around the identical nulls issue (if you didn’t already work around it in pt-table-checksum):

    select concat_ws(‘#’, ifnull(a, rand()*now()), ifnull(…

    Justin Swanhart

    16 Dec 12 at 6:13 pm

Leave a Reply