Xaprb

Stay curious!

Archive for the ‘checksum’ tag

How to select the first or last row per group in SQL

with 6 comments

There is no “first” or “last” aggregate function in SQL. Sometimes you can use MIN() or MAX(), but often that won’t work either. There are a couple of ways to solve this vexing non-relational problem. Read on to find out how.

First, let’s be clear: I am posing a very non-relational problem. This is not about the minimum, maximum, top, most, least or any other relationally valid extreme in the group. It’s the first or last, in whatever order the rows happen to come. And we all know rows aren’t ordered — in theory. But in practice they are, and sometimes you need the first or last row in a group.

If you have a question this article doesn’t answer, you might like to read how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

A MySQL user-variable solution

I’ll show a MySQL-specific solution with one of the queries I developed for MySQL Table Checksum.

Here’s the idea: crush an entire table down to a single checksum value by checksumming each row, mushing it together with the previous row’s checksum, and then checksumming the result again. It’s fairly easy to do this, but it’s hard to get the final result in one statement. This is necessary to use the statement in an INSERT .. SELECT, which I needed to do.

An example might clarify:

select * from fruit;
+---------+
| variety |
+---------+
| apple   | 
| orange  | 
| lemon   | 
| pear    | 
+---------+

set @crc := '';

select variety, @crc := md5(concat(@crc, md5(variety))) from fruit;
+---------+-----------------------------------------+
| variety | @crc := md5(concat(@crc, md5(variety))) |
+---------+-----------------------------------------+
| apple   | ae6d32585ecc4d33cb8cd68a047d8434        | 
| orange  | 7ec613c796f44ef5ccb0e24e94323e38        | 
| lemon   | a2475f37be12cebf733ebfc7ee2ee473        | 
| pear    | ec98fe57833bbd91790ebc7ccf84c7e9        | 
+---------+-----------------------------------------+

I want the “last” value of @crc after the statement is done processing. How can I do this? The solution I found is to use a counter variable. I’ll demonstrate:

set @crc := '', @cnt := 0;

select variety,
   @cnt := @cnt + 1 as cnt,
   @crc := md5(concat(@crc, md5(variety))) as crc
from fruit;
+---------+------+----------------------------------+
| variety | cnt  | crc                              |
+---------+------+----------------------------------+
| apple   |    1 | ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  |    2 | 7ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   |    3 | a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    |    4 | ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+------+----------------------------------+

The counter variable might make you want to write something like HAVING cnt = MAX(cnt), but that won’t work (try it!). Instead, I prefixed the checksum with the count so the last row is the stringwise maximum:

select variety,
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety)))) as crc
from fruit;
+---------+--------------------------------------------+
| variety | crc                                        |
+---------+--------------------------------------------+
| apple   | 0000000001ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  | 00000000027ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   | 0000000003a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    | 0000000004ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+--------------------------------------------+

You can see I also left-padded the count so a lexical sort will agree with a numeric sort, and so I can predict how many extra characters I’ll need to remove to get back the original value. Now I can use the MAX() function to select the last row, and simply lop off the leftmost ten digits (I use the RIGHT() function for convenience, but generally you want to use SUBSTRING()):

select right(max(
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety))))
   ), 32) as crc
from fruit;
+----------------------------------+
| crc                              |
+----------------------------------+
| ec98fe57833bbd91790ebc7ccf84c7e9 | 
+----------------------------------+

Et voila, I got the last value in the group. By the way, this will work with ONLY_FULL_GROUP_BY in the server’s SQL mode.

Other methods

My solution relies on a MySQL user variable to do the counting, but there are many ways to number rows in SQL: you could simulate the ROW_NUMBER() function, for instance, or use techniques mentioned in the comments on how to number rows in MySQL (one of the comments shows a particularly clever solution with subqueries, but I didn’t want to use it because MySQL doesn’t support subqueries in older versions). Any of these should work one way or another. Of course, if you are using a product such as Microsoft SQL server 2005, which actually has the ROW_NUMBER() function, you can use that!

Conclusion

Finding the first or last row is a bit unintuitive, and it’s definitely non-relational, but sometimes it’s what you need. The technique I demonstrated in this article is easily adaptable to many kinds of queries. I hope it helped you!

If this article didn’t solve your problem, please read these before posting questions to the comments: how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

Written by Xaprb

August 21st, 2007 at 4:13 pm

Posted in Uncategorized

Tagged with , , , , , ,

MySQL Toolkit version 675 released

with one comment

Download MySQL Toolkit

I’ve just released changes to two of the tools in MySQL Toolkit. MySQL Table Checksum got some convenient functionality to help you recursively check slaves for bad replicated checksum chunks. MySQL Archiver got statistics-gathering functionality to help you optimize your archiving and purging jobs, plus a few important bug fixes.

Changes in MySQL Archiver:

  • Made –time suffix optional.
  • Added –statistics option to gather and print timing statistics.
  • Added signal handling so mysql-archiver exits cleanly when it can.
  • Changed exit status to 0 when –help is given.
  • Out-of-column-order primary keys were not ascended correctly.

Changes in MySQL Table Checksum:

  • Added –replcheck option to check –replicate results on slaves.
  • Added –recursecheck option to do –replcheck recursively.

Written by Xaprb

July 20th, 2007 at 8:32 am

Posted in Uncategorized

Tagged with , , , , , ,

MySQL Toolkit distribution 620 released

without comments

Download MySQL Toolkit

MySQL Toolkit distribution 620 updates documentation and test suites, includes some major bug fixes and functionality changes, and adds one new tool to the toolkit. This article is mostly a changelog, with some added notes.

Many of the tools have matured and I just needed to make the documentation top-notch, but there’s still a lot to be done on the crucial checksumming and syncing tools. Time is in short supply for me right now, though. In fact, I actually finished this release on June 22, but wasn’t able to release it till just tonight!

Documentation is now maintained online at the MySQL Toolkit website, by the way.

mysql-archiver

Version 0.9.3

Changes

  • Added more hooks for plugins before and after archiving.
  • Documentation.
  • Made –time suffix optional.

Bugs fixed

  • MySQL Archiver could crash on a lock wait timeout when –txnsize was not set

mysql-deadlock-logger

Version 1.0.2

Incompatible changes

  • Changed the format of the –source and –dest options.

Changes

  • Documentation.

mysql-duplicate-key-checker

Version 1.0.4

  • Documentation.

mysql-find

Version 0.9.3

  • Documentation.

mysql-query-profiler and mysql-profile-compact

Version 1.1.2

  • Documentation

mysql-show-grants

Version 1.0.2

  • Documentation.

mysql-slave-restart

Version 0.9.2. This is an initial release of a new tool. I found myself in a situation where I needed to do some complex error-skipping on a slave (its relay logs got into an infinite loop). I have written throwaway scripts to skip, restart, check, skip several times in the past, but this situation called for something more complex. Again I realized I was three-quarters of the way to a more flexible, powerful tool many people might find useful, so I went ahead and put the extra effort into it.

It ended up helping me avoid re-snapshotting a slave with a ton of data, so it was worth it.

mysql-table-checksum and mysql-checksum-filter

This version fixes some badly optimized chunking queries. As I have mentioned in the past, the chunking behavior is preliminary and subject to change. This is still true, but this release is much smarter than the previous release! I have also fleshed out some methods of doing chunking on real-valued columns (float, decimal, and even character). I don’t know when I’ll get a chance to code, test, and release that.

Even though much remains to be done, MySQL Table Checksum is still a great way to check that your slaves have the same data as the master. (In fact, it’s the only way I know of — and MySQL employees themselves recommend MySQL Table Checksum).

Version 1.1.8

Changes

  • Documentation.
  • Support complex host definitions.
  • Added –explainhosts option to debug host definitions.
  • Added –explain option.
  • When exact chunking is impossible, mysql-table-checksum will use approximate.

Incompatible changes

  • Added required ‘boundaries’ column to checksum table for –replicate.

Bugs fixed

  • Chunking on temporal types defeated indexes.

mysql-table-sync

Version 0.9.5

  • Documentation.

Written by Xaprb

July 5th, 2007 at 10:41 pm

Posted in Uncategorized

Tagged with , , , ,