Tag Archive for 'user-defined-variables'

How to select the first or last row per group in SQL

There is no “first” or “last” aggregate function in SQL. Sometimes you can use MIN() or MAX(), but often that won’t work either. There are a couple of ways to solve this vexing non-relational problem. Read on to find out how.

First, let’s be clear: I am posing a very non-relational problem. This is not about the minimum, maximum, top, most, least or any other relationally valid extreme in the group. It’s the first or last, in whatever order the rows happen to come. And we all know rows aren’t ordered — in theory. But in practice they are, and sometimes you need the first or last row in a group.

If you have a question this article doesn’t answer, you might like to read how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

A MySQL user-variable solution

I’ll show a MySQL-specific solution with one of the queries I developed for MySQL Table Checksum.

Here’s the idea: crush an entire table down to a single checksum value by checksumming each row, mushing it together with the previous row’s checksum, and then checksumming the result again. It’s fairly easy to do this, but it’s hard to get the final result in one statement. This is necessary to use the statement in an INSERT .. SELECT, which I needed to do.

An example might clarify:

select * from fruit;
+---------+
| variety |
+---------+
| apple   | 
| orange  | 
| lemon   | 
| pear    | 
+---------+

set @crc := '';

select variety, @crc := md5(concat(@crc, md5(variety))) from fruit;
+---------+-----------------------------------------+
| variety | @crc := md5(concat(@crc, md5(variety))) |
+---------+-----------------------------------------+
| apple   | ae6d32585ecc4d33cb8cd68a047d8434        | 
| orange  | 7ec613c796f44ef5ccb0e24e94323e38        | 
| lemon   | a2475f37be12cebf733ebfc7ee2ee473        | 
| pear    | ec98fe57833bbd91790ebc7ccf84c7e9        | 
+---------+-----------------------------------------+

I want the “last” value of @crc after the statement is done processing. How can I do this? The solution I found is to use a counter variable. I’ll demonstrate:

set @crc := '', @cnt := 0;

select variety,
   @cnt := @cnt + 1 as cnt,
   @crc := md5(concat(@crc, md5(variety))) as crc
from fruit;
+---------+------+----------------------------------+
| variety | cnt  | crc                              |
+---------+------+----------------------------------+
| apple   |    1 | ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  |    2 | 7ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   |    3 | a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    |    4 | ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+------+----------------------------------+

The counter variable might make you want to write something like HAVING cnt = MAX(cnt), but that won’t work (try it!). Instead, I prefixed the checksum with the count so the last row is the stringwise maximum:

select variety,
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety)))) as crc
from fruit;
+---------+--------------------------------------------+
| variety | crc                                        |
+---------+--------------------------------------------+
| apple   | 0000000001ae6d32585ecc4d33cb8cd68a047d8434 | 
| orange  | 00000000027ec613c796f44ef5ccb0e24e94323e38 | 
| lemon   | 0000000003a2475f37be12cebf733ebfc7ee2ee473 | 
| pear    | 0000000004ec98fe57833bbd91790ebc7ccf84c7e9 | 
+---------+--------------------------------------------+

You can see I also left-padded the count so a lexical sort will agree with a numeric sort, and so I can predict how many extra characters I’ll need to remove to get back the original value. Now I can use the MAX() function to select the last row, and simply lop off the leftmost ten digits (I use the RIGHT() function for convenience, but generally you want to use SUBSTRING()):

select right(max(
   @crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
      md5(concat(right(@crc, 32), md5(variety))))
   ), 32) as crc
from fruit;
+----------------------------------+
| crc                              |
+----------------------------------+
| ec98fe57833bbd91790ebc7ccf84c7e9 | 
+----------------------------------+

Et voila, I got the last value in the group. By the way, this will work with ONLY_FULL_GROUP_BY in the server’s SQL mode.

Other methods

My solution relies on a MySQL user variable to do the counting, but there are many ways to number rows in SQL: you could simulate the ROW_NUMBER() function, for instance, or use techniques mentioned in the comments on how to number rows in MySQL (one of the comments shows a particularly clever solution with subqueries, but I didn’t want to use it because MySQL doesn’t support subqueries in older versions). Any of these should work one way or another. Of course, if you are using a product such as Microsoft SQL server 2005, which actually has the ROW_NUMBER() function, you can use that!

Conclusion

Finding the first or last row is a bit unintuitive, and it’s definitely non-relational, but sometimes it’s what you need. The technique I demonstrated in this article is easily adaptable to many kinds of queries. I hope it helped you!

If this article didn’t solve your problem, please read these before posting questions to the comments: how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.

Technorati Tags:, , , , ,

You might also like:

  1. How to simulate the SQL ROW_NUMBER function
  2. How to number rows in MySQL
  3. How to simulate the GROUP_CONCAT function
  4. How to select the first/least/max row per group in SQL
  5. Advanced MySQL user variable techniques

How to subtract in SQL over samples that wrap

This article explains how to do subtraction in SQL over samples that wrap back to zero when they exceed a boundary.

A reader wrote in with a question about how to find how much traffic has passed through a network interface. The reader has a script that samples the interface’s statistics and stores them in a database. The statistics wrap back around when they exceed the maximum size of an integer, so it’s not a strictly increasing sequence. The question, paraphrased, is “how can I find out how much traffic has gone through the interface in any given time period?”

A key assumption is that the counter never wraps back to zero more than once between samples. If it does, all hope is lost.

Setup

To simplify the math, pretend the counter wraps at 1,000 and you have the following table:

create table samples(
   num int not null auto_increment primary key,
   bytes int not null
);

insert into samples(bytes) values
   (100), (900),
   (230), (700), (982),
   (163), (600);

select * from samples;
+-----+-------+
| num | bytes |
+-----+-------+
|   1 |   100 | 
|   2 |   900 | 
|   3 |   230 | 
|   4 |   700 | 
|   5 |   982 | 
|   6 |   163 | 
|   7 |   600 | 
+-----+-------+

How much traffic?

A manual calculation is easier than it looks, and solving this by hand is the key to solving it in SQL. You don’t have to do a bunch of nasty math, like subtracting 982 from 163 (that’s already too hard for me). You just have to notice where the counter wraps. You can find these places by seeing where the number decreases from one sample to the next. In the example, the counter wraps twice: from 900 to 230, and from 982 to 163. Here’s the data, graphed with wraps “unrolled.”

Graph of sample data

There are several ways to proceed from here. One way is to calculate the traffic as 1,000 times the number of wraps. Then you just do a little math to “clean up the edges:” subtract the first number in the sequence, and then add the last number. This gives (2 * 1000) - 100 + 600, which is 2500.

Another approach is to go row by row, summing the differences from the previous row and the last row. When the counter wraps, you add 1000 before taking the difference. This math gives the same answer. This is a lot harder to do by hand.

Either technique works given an arbitrary start and end point in the sequence. Now let’s see how to do these in SQL.

Problem: find the “previous” row

While these methods seem easy to humans, they resist many relational solutions because of the notion of “previous row.” SQL is set-oriented, and doesn’t do iterative row-by-row data manipulation. If you try to do this by grouping each strictly increasing set of data together and using aggregate functions like SUM, you’ll have trouble. You need the values from the “previous set” to do that, and that doesn’t work like you might want it to.

However, it’s not that hard to get the current and last row matched up side-by-side so you can operate upon them within the context of a single row:

select cur.num, cur.bytes, prev.bytes as prev_bytes
from samples as cur
   left outer join samples as prev on cur.num = prev.num + 1;
+-----+-------+------------+
| num | bytes | prev_bytes |
+-----+-------+------------+
|   1 |   100 |       NULL | 
|   2 |   900 |        100 | 
|   3 |   230 |        900 | 
|   4 |   700 |        230 | 
|   5 |   982 |        700 | 
|   6 |   163 |        982 | 
|   7 |   600 |        163 | 
+-----+-------+------------+

Once you think of “previous” in SQL terms, it becomes easy. Armed with this tool, we are ready to take on the queries.

Technique 1: count wraps and clean up the edges

Now that we’ve figured out how to find the “previous row,” how can we express the “count wraps and clean up edges” algorithm in SQL? Brace yourself:

select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples order by num limit 1
   ) as t2
where o.num = 7
group by o.num

Anything I say about that query will probably make it harder to understand, so I’ll just count on you reading it carefully. It may help to remove some of it so you can see the intermediate results:

select sum(t1.wraps) as wraps, t2.start, o.bytes
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples order by num limit 1
   ) as t2
group by o.num;
+-------+-------+-------+
| wraps | start | bytes |
+-------+-------+-------+
|     0 |   100 |   100 | 
|     0 |   100 |   900 | 
|     1 |   100 |   230 | 
|     1 |   100 |   700 | 
|     1 |   100 |   982 | 
|     2 |   100 |   163 | 
|     2 |   100 |   600 | 
+-------+-------+-------+

You can also write it as a correlated subquery, instead of a subquery in the FROM clause:

select 1000 * (
      select count(*) from samples as cur
         inner join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      where cur.num <= samples.num
   )
   - (select bytes from samples order by num limit 1)
   + samples.bytes as total
from samples
where num = 7

Both queries need WHERE clauses in multiple places to make them behave if you want anything other than the full range of data summed up. For example, if you want to sum over rows 3 through 6, the first query becomes

select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      where cur.num > 3
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples where num >= 3 order by num limit 1
   ) as t2
where o.num = 6
group by o.num

The problem with both queries is the <= predicate, which turns them into O(n2) algorithms. They’re essentially a cross-join. Plus, they’re hard to understand. It turns out that the simplest method by hand is complicated in SQL.

Method 2: Adjust when there’s a wrap

The second method I showed above is more complex for humans, but it’s actually simpler to do in SQL:

select sum(
   if (cur.bytes >= prev.bytes,
      cur.bytes - prev.bytes,
      cur.bytes - prev.bytes + 1000)) as total
from samples as cur
   inner join samples as prev on cur.num = prev.num + 1
-- optional WHERE clause for choosing start/end:
-- where cur.num > 3 and cur.num <= 6

A slightly more compact way to write this is

select sum(
   cur.bytes - prev.bytes + if(cur.bytes >= prev.bytes, 0, 1000)) as total
from samples as cur
   inner join samples as prev on cur.num = prev.num + 1
-- where cur.num > 3 and cur.num <= 6

This query is both simpler and more efficient than the first method I showed. If your platform doesn’t support IF(), use a CASE statement.

Method 3: do it with user-variables in MySQL

It’s possible to do even better than the simple join technique on MySQL. Using some MySQL-specific tricks, you can make this query a once-through, low-cost algorithm, much the way you might do it by hand or in a programming language that supports iteration. If you want to know how this works, and why the query has to be written in such a contorted way, read my article on advanced user-variable techniques in MySQL.

set @last_bytes := null;

select sum(greatest(
      if(bytes >= @last_bytes,
         bytes - @last_bytes,
         coalesce(bytes + 1000 - @last_bytes, 0)),
      least(0, @last_bytes := bytes)
   )) as bytes
from samples order by num;

This is a bit trickier to write than some of the other user-variable examples I’ve shown, because you can’t use @last_bytes is null in any IF() or CASE statement. If you do, the query optimizer will look at @last_bytes at compile time, see that the statement can be optimized out and replaced with a constant, and your query will not work as you expect it to.

Summary

In this article I’ve shown you three methods to do SQL math on a counter that wraps around to zero when it reaches a limit. I showed them to you in order of increasing efficiency, but the second method is both the simplest and the most platform-independent (and probably the most sane).

Can you think of any other ways to do this, or any other uses for these kinds of techniques? Write into the comments!

Technorati Tags:, ,

You might also like:

  1. How to find next and previous records in SQL
  2. Why I use explicit date math in SQL
  3. How to find contiguous ranges with SQL

How to delete duplicate rows with SQL, Part 2

By reader request, this article explains ways to remove duplicate rows when they are completely identical, and you don’t have a primary key or other criterion to identify which rows to “save.”

This is a special case of deleting duplicates. I’ve written another article about the more general case, so I assume you have the background it gives. If not, you should probably go read my article about how to delete duplicate rows in SQL.

Introduction

In general, this is a hard problem. Suppose you have the following data, and you want to delete everything but the first row of its type (you don’t care which, because all duplicate rows are completely identical).

When you’re done, you want just two rows in the table:

Why this is hard

This is hard because there is no way to do this in standard SQL (correct me if I’m wrong). SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set. That’s why SQL doesn’t give you tools to solve this problem.

No database product is truly relational, so in real life it’s possible for duplicates to occur. When it happens, you will have to resort to platform-specific methods to solve it. There should always be a way to do it, because there is always a difference between apparently identical rows. It might be an internal row ID, for example (as in Oracle). If nothing else, the rows have different memory and disk locations in the computer.

The easy way

The easiest thing to do is add a column with a unique number. This is called something different on every platform: it’s an IDENTITY column in SQL Server, an AUTO_INCREMENT column in MySQL, a SERIAL in PostgreSQL, and so on. Look at your platform’s documentation for instructions how to do it.

Once you’ve done that, you’re on easy street. Now go read my previous article to do the actual deleting.

If that won’t do…

Build a new table with distinct values from the old table, then drop and rename:

CREATE TABLE new_fruits ...;

INSERT INTO new_fruits(fruit)
   SELECT DISTINCT fruit FROM fruits;

DROP TABLE fruits;

RENAME TABLE new_fruits fruits;

If you can’t do that…

Perhaps you simply can’t do either of the above. Maybe your table is too large, for example. In that case you’re going to have to use some sort of iterative technique to do it; loop through the rows one at a time and delete every row you see more than once. This is also going to be a platform-specific solution; you may need to use a WHILE loop or server-side cursor. Consult your platform’s documentation for more; I can’t possibly cover all the bases here.

Two examples for MySQL

Here’s a quick technique that uses advanced user-variable techniques on MySQL to delete the rows. MySQL’s server-side cursors are read-only, so some other technique has to be used. User-variables can do the trick, if you write the statement just right — it’s very touchy.

set @num := 0, @type := '';

delete from fruits
where greatest(0,
   @num := if(type = @type, @num + 1, 0),
   least(0, length(@type := type))) > 1
order by type;

If you don’t understand that, go read the article :-) This can be very efficient because it doesn’t require any GROUP BY clause. If your rows are “naturally ordered” with all the duplicates adjacent to each other, you can even omit the ORDER BY clause (if your rows aren’t “sorted naturally,” you will miss some duplicate rows).

The other obvious option is to repeatedly identify a duplicated row, find how many times it’s duplicated, and delete one less than that many rows. You will need to either do this in a stored routine, or get help from some programming language. For example, in pseudo-code:

set @num := 0;

select @type := type, @num := count(*)
   from fruits
   group by type
   having count(*) > 1
   limit 1;

while @num > 0

   delete from fruits where type = 'type'
      limit @num - 1;

   set @num := 0;
   select @type := type, @num := count(*)
      from fruits
      group by type
      having count(*) > 1
      limit 1;

end while

That is pseudo-code, by the way; if you’re doing this in a stored procedure, you’re going to have to concatenate strings together to make an executable statement and execute it. If you’re using an external programming language, you’ll need to fetch the values that are duplicated and dynamically build a statement that deletes all but one row.

Summary

In this article I explained how to solve the special-case problem of removing duplicate rows with no distinguishing columns at all. It’s a harder case of the general problem, and SQL has no built-in way to solve it, so you have to learn your platform’s tricks to solve it. I showed you how to add a unique column so you can use the “easy” techniques I explained in an earlier article. You might also be able to put the rows into another table and drop the original table. Failing that, you have to use something like cursors. As a bonus, I explained two ways to do this in MySQL, one of them sneaky and the other not.

Technorati Tags:, ,

You might also like:

  1. How to implement a queue in SQL
  2. How to write multi-table, cross-database deletes with aliases in MySQL
  3. How to avoid many-to-one problems in SQL
  4. How to simulate FULL OUTER JOIN in MySQL
  5. What is a SQL blind insert?

How to calculate table checksums in MySQL

MySQL has no built-in functionality to calculate a table’s checksum in any storage engine but MyISAM (this is not true; I read the manual wrong, but it doesn’t eliminate the usefulness of the technique in this article). Table checksums can confirm that two tables are identical — useful to verify a slave server is in sync with its master (see my article on reliable MySQL replication for more). Fortunately, it’s easy to calculate a checksum on non-MyISAM tables with user variables. This technique works on any storage engine with any version of MySQL, doesn’t require the BLACKHOLE storage engine, and avoids locks caused by INSERT... SELECT on InnoDB tables.

Update I’ve coded this method into a Perl script for you to use. See MySQL Table Checksum for more details.

MyISAM checksums

If you want to verify that two MyISAM tables are identical, you can use CHECKSUM TABLE. If you need to do this frequently, you might want to make sure your MyISAM tables have the CHECKSUM = 1 option on so checksums are maintained continuously and don’t have to be calculated.

That was easy! Now on to the hard part: how to compare tables when one or both uses a storage engine other than MyISAM. That’s what the rest of this article is about.

Foundation

I derived this technique from Giuseppe Maxia’s article on how to use the BLACKHOLE storage engine to avoid cursors. Giuseppe uses a two-step process:

  1. Concatenate the values of all columns in the first row, and get a checksum of it with a cryptographic hash function.
  2. Concatenate this checksum with a global checksum, and take the checksum of the result, then move to the next row and repeat.

After processing each row in the table this way, you have a checksum of the entire table’s contents.

Two ways to iterate over rows

Giuseppe showed two ways to iterate over every row in the table: one with cursors, and another with user variables. Cursors in MySQL can’t be used with dynamically prepared statements, so they are much less practical for a generic method of checksumming any table; that would require creating a different routine for each table. So cursors are out, and user-variables are in. Here’s the essential code:

set @crc := '', @cnt := 0;
select @crc := sha1(concat(@crc,
    sha1(concat_ws('#', col1, col2... colN)))),
    @cnt := @cnt + 1 from tbl order by ID;
select @crc, @cnt;

(This code both CRCs the table and counts the number of rows in it).

The problem with this method, as Giuseppe points out, is the result set. If you run this on a table with a million rows, you’re going to get a million rows of CRCs and counts back from the first SELECT. This is not good; you really want the server to throw the intermediate values away so they don’t consume resources needlessly.

Giuseppe’s solution to this is to change the SELECT to an INSERT... SELECT. The INSERT goes into a BLACKHOLE table, which is like dumping it into /dev/null. It is simply discarded. This avoids some overhead, and Giuseppe found that it’s faster than using a cursor.

Limitations and performance concerns

There are still some undesirable effects of this approach. One is the BLACKHOLE storage engine. It’s not included in MySQL’s official binaries, so you’ll have to compile your own server to use it. It’s also not available before version 4.1.11. Those limitations rule this method out for many users.

The second is InnoDB. If you use InnoDB (odds are you do, or you wouldn’t be reading this) the INSERT... SELECT will place shared locks on every row in the table. If you’re doing this on an otherwise quiet server, it may not matter, but if you’re taking live checksums on a production machine, it’s much better to avoid those locks.

I’ll show you how to avoid both problems.

Wrap the user-variables up and discard the results

If you’ve read my articles about advanced user-variable techniques in MySQL, you might see where I’m headed. You don’t have to generate a large result set from that SELECT statement. In fact, you can hide the variable assignments, which you care about, inside a function whose result you don’t care about, and use an aggregate function to eliminate all but one row. Study the following code, which uses the same fruits table as in the user-variables article:

set @crc := '';

select min(
      length(@crc := sha1(concat(
         @crc,
         sha1(concat_ws('#', type, variety, price)))))
   ) as discard
from fruits use index(PRIMARY);
+---------+
| discard |
+---------+
|      40 |
+---------+

select @crc;
+------------------------------------------+
| @crc                                     |
+------------------------------------------+
| 3be9117fff37bcdd3f422e6ce4d24ee2a6642566 |
+------------------------------------------+

Notice a couple of things: there’s only one row, and the MIN() calculation that collapses all those rows into one is very efficient. Maybe a MySQL developer can comment on exactly how much memory this will take, but I think it should be really cheap, since it processes a row at a time and then throws it away.

I omitted the row count calculation for clarity. If you want to count rows as well, the following code will do both in one statement:

set @crc := '', @cnt := 0;

select min(least(
      length(@crc := sha1(concat(
         @crc,
         sha1(concat_ws('#', type, variety, price))))),
      @cnt := @cnt + 1
   )) as discard
from fruits use index(PRIMARY);

select @crc, @cnt;

You should always reset @crc and @cnt between runs so you get repeatable results.

Further considerations

It’s important to order the SELECT by something predictable, or the results will be non-deterministic. However, an ORDER BY clause won’t do it — that orders the final result, not the table scan. Forcing a certain index to be used will do the trick, hence the USE INDEX clause above. If you don’t have a primary key on your table, use a UNIQUE key if that’s available; otherwise, you’re probably not going to be able to get a dependable checksum.

This method is easy to use inside a stored procedure or routine on MySQL 5.0. You can easily build a column list from INFORMATION_SCHEMA and generate the dynamic SQL to checksum a table.

This technique is CPU-bound on our servers. I used the BENCHMARK() function to time different hash functions to try to improve the speed. In my tests, SHA1() took about 85% as long as MD5(). This surprised me; based on some cryptographic function benchmarks I found on the web, I thought MD5() would be faster. It may be system-dependent.

I’m also a little worried about using CONCAT_WS() to stringify each row for the hash function. This function skips NULL values, so it’s easy to come up with an edge case where two rows with different columns hash to the same thing. Although this is a very unlikely problem, I’d still rather not have it. If you know of a way to fix this, please let me know.

Finally, just a comment on doing this on running servers: if you’re comparing a master and slave, you can use LOCK TABLES on the master, find the checksum there, and then find the checksum on the slave before releasing the lock on the master. If your slave isn’t far behind the master, it ought to have plenty of time to catch up while the checksum is running on the master, at which point that table should be identical to the master (because you have the table locked on the master, no changes to the table will be replicating to the slave). This makes it practical to verify a slave is in sync without stopping the whole server.

Perl snippet to generate a SQL statement

For the Perl programmers out there, the following subroutine accepts a database handle, database name and table name, and returns the table’s storage engine and a ready-to-run query.

sub checksum_query {
   my ( $dbh, $db, $tbl ) = @_;
   my $ddl = ($dbh->selectrow_array("show create table $db.$tbl"))[1];
   my ( $type ) = $ddl =~ m/^\) (?:ENGINE|TYPE)=(\S+)/m;
   my $cols  = join(', ', $ddl =~ m/^\s+(`[^`]+`)/gm);
   my $index = $ddl =~ m/PRIMARY KEY/ ? ' USE INDEX(PRIMARY)' : '';
   my $query = 'SELECT MIN(LEAST(0, LENGTH(@crc := SHA1('
      . 'CONCAT_WS("#", @crc, SHA1(CONCAT_WS("#", ' . $cols . '))))),'
      . '@cnt := @cnt + 1)) AS len FROM ' . "`$db`.`$tbl` $index";
   return ( $type, $query );
}

I prefer to use SHOW CREATE TABLE instead of DESCRIBE TABLE, because it gives me all information about the table, such as the storage engine and index types. I have also found it to be faster, and of course it works on pre-5.0 versions.

Summary

In this article I showed you how to calculate a checksum for an entire table’s contents in MySQL. I used my favorite technique of hiding user-variable assignments inside a function to reduce a table scan’s result set to a single integer. This avoids a lot of overhead, and eliminates the need to insert the result set into a BLACKHOLE table, which means you can use the technique on a standard MySQL-supplied server. It also avoids InnoDB row locks. And it works on all storage engines, with all versions of MySQL that support user variables (3.23.6 and up).

I also touched a bit on the finer points of NULL values and ORDER BY for consistent results. And I gave you some ready-to-use Perl code to generate the SQL you need to execute against a given table.

This is the fastest, easiest, most efficient way I know to compare the contents of two or more tables, which is necessary to verify that a replicated master and slave server are still in sync.

Technorati Tags:, ,

You might also like:

  1. How to know if a MySQL slave is identical to its master
  2. Introducing MySQL Table Checksum
  3. MySQL Table Checksum 1.1.0 released
  4. Why MySQL says the server is not configured as a slave
  5. How to sync tables in master-master MySQL replication

How to implement a queue in SQL

This article explains how to create a fixed-size FIFO (first-in, first-out) queue in SQL, where rows added after a threshold will cause the oldest row to be deleted. There are several ways to do this, but MERGE on Oracle and DB2, and MySQL’s non-standard extensions to SQL, make an elegant solution easy.

Update a PostgreSQL blogger pointed out the obvious method I missed: triggers! There’s also a really neat PG-specific feature that allows it to work even more elegantly on that platform. Well worth a read. I sometimes wish I worked at a PostgreSQL shop so I could have time to learn as much about it as I’ve learned about MySQL.

Method one: do it with a single query in MySQL

Since I’m most familiar with MySQL, I’ll explain it in detail for MySQL. I’m sure a competent Oracle or DB2 developer can translate it to those platforms.

It is not possible to simultaneously INSERT and DELETE in standard SQL. However, in MySQL it is possible to simultaneously INSERT and UPDATE, with the ON DUPLICATE KEY UPDATE syntax. Another way to do it is with REPLACE, which actually works as a DELETE and INSERT. I’ve written about these before (flexible INSERT and UPDATE statements in MySQL — one of my most popular articles). You may not like them because they’re non-standard, but they’re available. I believe in using what my tools give me.

For these queries to work, you need to let inserts proceed as normal until the limit (say, 5) is reached. After that, new inserts need to create a unique index violation, and then the two-things-at-once functionality of the statement kicks in.

There are probably many ways to do this, but for this article, I’m going to imagine the table as a fixed-size, fixed-order queue. Once the list is full, new inserts “wrap around” and begin again at the bottom, travelling up through the rows one at a time. Each new insert then over-writes an existing row until it reaches the top and wraps around again.

For simplicity’s sake, I’m also going to imagine that nothing ever deletes any rows from this table. That way there’ll be no gaps in the sequence I’ll use to make the wrap-around work.

Here’s the table definition:

CREATE TABLE q (
   id int NOT NULL,
   modulo int NOT NULL,
   fruit varchar(10) NOT NULL,
   PRIMARY KEY(id),
   UNIQUE KEY(modulo)
)

This table has two unique keys: one to serve as a monotonically increasing “row number,” and one to cause the wrap-around effect to work. The only real data is the fruit column. Here’s a query to insert “apples” into the queue:

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'apples'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit)

Here’s what the query does: it finds the maximum value of id in the table, which ought to be efficient since it’s indexed. If there are no rows, the result will be NULL, which COALESCE() converts into -1. Then it adds one to that value, which will become the next largest value in the id sequence. What I’m really doing here is rolling my own AUTO_INCREMENT, with a slight twist: the sequence starts at zero, not one.

The sequence needs to start at zero to make the modulo arithmetic easy to understand and work with. At the same time I’m inserting that value into id, I’m also dividing it by the desired size of the table, and inserting the remainder into the modulo column. When the table gets “full,” that column will already contain the calculated value, and there’ll be a unique index violation. Then the ON DUPLICATE KEY UPDATE clause will fire and update the existing row instead of inserting a new one.

Here’s what the table contains after the above insert:

select * from q;
+----+--------+--------+
| id | modulo | fruit  |
+----+--------+--------+
|  0 |      0 | apples |
+----+--------+--------+

Let me now insert four more rows for oranges, peaches, cherries and pears, so the queue is full:

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'oranges'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit);

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'peaches'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit);

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'cherries'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit);

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'pears'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit);

Each row I inserted caused MySQL to print the following back to the command prompt:

Query OK, 1 row affected (0.05 sec)
Records: 1  Duplicates: 0  Warnings: 0

And now, the contents of the table:

select * from q;
+----+--------+----------+
| id | modulo | fruit    |
+----+--------+----------+
|  0 |      0 | apples   |
|  1 |      1 | oranges  |
|  2 |      2 | peaches  |
|  3 |      3 | cherries |
|  4 |      4 | pears    |
+----+--------+----------+

Now I’ll insert another row for bananas:

insert into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'bananas'
   from q
      on duplicate key update
         id    = values(id),
         fruit = values(fruit);

That query should have wrapped around to the beginning of the queue and triggered the unique index violation. As a result, MySQL should have overwritten the ‘apples’ row. In fact, the messages at the command prompt indicate something did happen:

Query OK, 2 rows affected (0.03 sec)
Records: 1  Duplicates: 1  Warnings: 0

Two rows were “affected” because of the duplicate key (you can read the MySQL manual for more on what “rows affected” really means). And there was indeed a duplicate row. Here’s what’s in the table now:

select * from q order by modulo;
+----+--------+----------+
| id | modulo | fruit    |
+----+--------+----------+
|  5 |      0 | bananas  |
|  1 |      1 | oranges  |
|  2 |      2 | peaches  |
|  3 |      3 | cherries |
|  4 |      4 | pears    |
+----+--------+----------+

Notice I ordered that query by modulo to show the entries in the same order as before. The “oldest” row, which is at the “front” of the queue, is now the one with the smallest value of id, so to see them “in queue order,” you can order by id:

select * from q order by id;
+----+--------+----------+
| id | modulo | fruit    |
+----+--------+----------+
|  1 |      1 | oranges  |
|  2 |      2 | peaches  |
|  3 |      3 | cherries |
|  4 |      4 | pears    |
|  5 |      0 | bananas  |
+----+--------+----------+

Method two: use REPLACE on MySQL

If it’s easier to write your query this way, or you need support on older versions of MySQL, you can use REPLACE instead of INSERT... ON DUPLICATE KEY UPDATE. Here’s an example:

replace into q(id, modulo, fruit)
   select
      (coalesce(max(id), -1) + 1),
      (coalesce(max(id), -1) + 1) mod 5,
      'bananas'
   from q;

The query may be more or less efficient, depending on your MySQL version, the storage engine you chose, and so forth. If I were doing this in production, I’d test it, probably with my MySQL Query Profiler tool. Hint, hint!

Methods three and four: on other platforms

Another option, which will allow the same easy single-query solution, is to use MERGE on Oracle and DB2. MERGE, REPLACE and friends are what we database folks call upsert queries, because they insert or update. If this functionality is available on other platforms, let me know. It looks like it’s still on the TODO list for PostgreSQL, and I’m fairly certain it’s not in SQL Server 2005. Perhaps a future release of one of these products will offer it.

Till then, I think the best option on these platforms would be a transaction with a couple statements to check the table and either insert or update (or delete and then insert, depending on how you want to do it). This fourth method can be made completely portable across platforms, if that’s important for your use case.

Things to think about

If you implement a system like this, consider the edge cases. Are you ever going to delete rows from the queue? If so, does that mess with the desired behavior of new inserts? Are there any ways you can get a hole in the sequence? If so, what happens — do you get too few rows in the queue, overwrite something other than the oldest row, or something else? If you need to “process” items in the queue, maybe you can just mark them as processed rather than deleting them.

What if you want to insert multiple rows at once? If you need to go that route on MySQL, my past articles might help. You could use advanced user variable techniques to number several rows at once. I also talked about related techniques in how to write INSERT IF NOT EXISTS queries.

Conclusion

In this article I showed you several ways to let a table grow to a fixed size, after which new rows replace old rows. I haven’t personally used this in production; this article grew out of a reader’s question (thanks for the stimulating topic!). If this article helped you, you should consider subscribing for future updates via email or feeds. It’s free and convenient.

Technorati Tags:, , , , ,

You might also like:

  1. SQL Server 2000 date and time puzzler
  2. How to avoid unique index violations on updates in MySQL
  3. How to write a UNION in SQL without using UNION
  4. How to notify event listeners in MySQL
  5. How to delete duplicate rows with SQL, Part 2

Advanced MySQL user variable techniques

MySQL’s user variables have interesting properties that enable the useful techniques I wrote about in recent articles. One property is that you can read from and assign to a user variable simultaneously, because an assignment can be an r-value (the result of the assignment is the final value of the variable). Another property, which sometimes causes confusing behavior, is un-intuitive evaluation time. In this post I’ll show you how to make sure your variables get updated at the time they’re used, instead of potentially reading and updating them at different stages of query execution. This technique enables a whole new range of applications for user variables. As a bonus, it also avoids extra columns of output created by variable manipulations.

I will cover several things in this article: assignments as r-values and its side effects, lazy evaluation and its side effects, and finally a technique that lets you have non-lazy evaluation and avoid some side effects.

Setup

I’ll use the same data as in recent articles:

CREATE TABLE fruits (
  type varchar(10) NOT NULL,
  variety varchar(20) NOT NULL,
  price decimal(5,2) NOT NULL default 0,
  PRIMARY KEY  (type,variety)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

insert into fruits(type, variety, price) values
('apple',  'gala',       2.79),
('apple',  'fuji',       0.24),
('apple',  'limbertwig', 2.87),
('orange', 'valencia',   3.59),
('orange', 'navel',      9.36),
('pear',   'bradford',   6.05),
('pear',   'bartlett',   2.14),
('cherry', 'bing',       2.55),
('cherry', 'chelan',     6.33);

Simultaneous assignment and reading

MySQL lets you assign and read a variable at the same time. This is familiar in many programming languages where an assignment can be an r-value. For example,

set @test1 := 0;
set @test2 := @test1 := 5;
select @test1, @test2;
+--------+--------+
| @test1 | @test2 |
+--------+--------+
| 5      | 5      | 
+--------+--------+

The second statement sets @test1 to 5, and then sets @test2 to the result of that assignment, which is the current value of @test1. My previous articles have shown you how to exploit this to number rows in a result set, among other things. For example, you can keep a running count as MySQL processes rows, updating and returning the count at the same time.

Side effects

Unfortunately, it got a bit messy sometimes. For example, the following batch, which restarts the numbering every time type changes, spews an extra dummy column into the output, because that column is where the calculations are taking place:

set @type := '', @num := 1;

select type, variety,
   @num := if(@type = type, @num + 1, 1) as row_number,
   @type := type as dummy
from fruits
order by type, variety;

+--------+------------+------------+--------+
| type   | variety    | row_number | dummy  |
+--------+------------+------------+--------+
| apple  | fuji       |          1 | apple  | 
| apple  | gala       |          2 | apple  | 
| apple  | limbertwig |          3 | apple  | 
| cherry | bing       |          1 | cherry | 
| cherry | chelan     |          2 | cherry | 
| orange | navel      |          1 | orange | 
| orange | valencia   |          2 | orange | 
| pear   | bartlett   |          1 | pear   | 
| pear   | bradford   |          2 | pear   | 
+--------+------------+------------+--------+

In previous articles I suggested wrapping that query in a subquery so you can pick which columns you want in the output. That is a bit inefficient (it creates a temporary table internally) and feels kind of kludgey.

Lazy evaluation

MySQL doesn’t evaluate expressions containing user variables until they are sent to the client, so some expressions don’t work as expected. Setting a variable in one place (such as the SELECT list) and reading it another (such as the HAVING clause) might give weird results, like as those I demonstrated in my last article where every row was numbered 1 instead of getting incremented as expected.

Here’s further clarification from the manual:

In a SELECT statement, each expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, you cannot refer to an expression that involves variables that are set in the SELECT list. For example, the following statement does not work as expected:

mysql> SELECT (@aa:=id) AS a, (@aa+3) AS b FROM tbl_name HAVING b=5;

The reference to b in the HAVING clause refers to an alias for an expression in the SELECT list that uses @aa. This does not work as expected: @aa contains the value of id from the previous selected row, not from the current row.

In other words, the “alias” in the HAVING clause is probably a pointer to a memory location, whose content is not determined for the current row until the current row is output to the client — at which point it’s too late to apply any HAVING criteria to the row.

Side effects of lazy evaluation

In my last article I showed you how to select the top N rows from each group with user variables. To make that work right, I had to group the query, use a HAVING clause, and force a certain index order for that query — because of lazy evaluation. Otherwise, I might have been able to just use the variable in a WHERE clause, right? Lazy evaluation is why this doesn’t work:

set @type := '', @num := 1;

select type, variety, price,
       @num := if(@type = type, @num + 1, 1) as row_number,
       @type := type as dummy
from fruits
where @num <= 2;

+-------+------------+-------+------------+-------+
| type  | variety    | price | row_number | dummy |
+-------+------------+-------+------------+-------+
| apple | gala       |  2.79 |          1 | apple | 
| apple | fuji       |  0.24 |          2 | apple | 
| apple | limbertwig |  2.87 |          3 | apple | 
+-------+------------+-------+------------+-------+

That last row gets output even though it seems @num should have the value 3, eliminating it from the results. However, you can infer from this behavior that @num really had the value 2 at the time the WHERE clause was evaluated, and was only incremented to 3 after the row was sent to the client.

This aspect of user variable behavior makes user variables significantly harder to understand. Sometimes the results are non-deterministic and/or hard to predict. It would be great if there were a way to update those variables in the context in which they’re declared, so they get assigned and read at the same time, instead of having to wait for rows to be sent to the client — a different step in the query execution plan.

Forcing variable evaluation with multi-staged queries

If you understand the order of the steps MySQL uses to execute a query, you can see there are opportunities to make MySQL “finish up” variable assignments before sending the query to the next step. In fact, perhaps it’s a bit misleading to say assignments in the SELECT are done when rows are sent. I think it’s more accurate to say they’re done when rows are generated for each stage in query execution.

You can see this in a subquery in the FROM clause, which is internally stored as an intermediate temporary table. Variable assignments are done before or as the rows are stored in the temporary table, so when results are read from the temporary table, there are no funny side effects.

Let me show you the previous query slightly rewritten, and you’ll see what I mean:

set @type := '', @num := 1;

select type, variety, price, row_number
from (
   select type, variety, price,
       @num := if(@type = type, @num + 1, 1) as row_number,
       @type := type as dummy
   from fruits
) as x
where row_number <= 2;

+--------+----------+-------+------------+
| type   | variety  | price | row_number |
+--------+----------+-------+------------+
| apple  | gala     |  2.79 |          1 | 
| apple  | fuji     |  0.24 |          2 | 
| orange | valencia |  3.59 |          1 | 
| orange | navel    |  9.36 |          2 | 
| pear   | bradford |  6.05 |          1 | 
| pear   | bartlett |  2.14 |          2 | 
| cherry | bing     |  2.55 |          1 | 
| cherry | chelan   |  6.33 |          2 | 
+--------+----------+-------+------------+

Just by introducing an intermediate step in the query, I forced the variables to be evaluated so the results, when they get to the outer WHERE clause, are deterministic. But as I mentioned before, this is kind of kludgey, and depending on the data, it might not be very efficient to create an intermediate temporary table for the results.

Are there better ways? You bet!

Try 1: Use functions to force immediate evaluation

Here’s an idea: what if certain functions evaluate their arguments immediately? You could exploit that to create a context that has to be evaluated first, sort of like parenthesizing an expression in an equation. You know, a = (a + b) * (b + c) means “do the additions first,” which wouldn’t be the case without the parentheses — normally multiplication comes before addition.

For this to work, you’d need a function that guarantees the expression is evaluated. For example, COALESCE() might be a good choice as long as you put the expression first in the argument list, since COALESCE() shortcuts and doesn’t evaluate any more arguments as soon as it find a non-NULL argument.

Theoretically, then you could write something like the following and get the desired results:

set @type := '', @num := 1;

select type, variety, price,
   coalesce(@num := if(@type = type, @num + 1, 1)) as row_number
...

It doesn’t work. Why not? Because the COALESCE itself isn’t evaluated until the rows are generated. So much for that idea.

What about a scalar subquery, then?

set @num := 0, @type := '';

select type, variety, price,
   (select(@num := if(@type = type, @num + 1, 1))) as row_number,
...

Sorry, no dice. This gives exactly the same results.

This idea will not work, period. Each and every expression in the SELECT list is evaluated as the rows are generated. Functions are expressions, scalar subqueries are expressions… the only things that will work are operations that result in rows being evaluated for a final value.

Try 2: Force my will on the query

One thing I do know: subqueries in the FROM clause are materialized to a temp table, so this will definitely result in rows being generated. This might do what I want, at the expense of generating temporary tables willy-nilly:

set @type := '', @num := 1;

select type, variety, price,
   (select n from (select @num := if(@type = type, @num + 1, 1) as n) as x) as row_number,
   (select t from (select @type := type as t) as x) as dummy
from fruits
where @num <= 2;

That won’t work either, as it turns out. The subqueries are correlated — they refer to columns from the outer table. That isn’t allowed because of the intermediate step, which insulates the inner queries from the outer. This is a limitation of correlated subqueries: you can’t nest a subquery in the FROM clause inside them.

This is really getting silly. It’s time to stop trying to force this to work.

Try 3: Work with me, son

What if I stop trying to get the SELECT clause to be evaluated at the same time as the WHERE clause? What if I work with the server’s order of operations, and do all the evaluating and updating in the WHERE clause instead of in two places? Maybe it looks like this:

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where
   2 >= @num := if(@type = type, @num + 1, 1)
   and @type := type;

+--------+------------+-------+------+
| type   | variety    | price | @num |
+--------+------------+-------+------+
| apple  | gala       |  2.79 | 0    | 
| apple  | fuji       |  0.24 | 0    | 
| apple  | limbertwig |  2.87 | 0    | 
| orange | valencia   |  3.59 | 0    | 
| orange | navel      |  9.36 | 0    | 
| pear   | bradford   |  6.05 | 0    | 
| pear   | bartlett   |  2.14 | 0    | 
| cherry | bing       |  2.55 | 0    | 
| cherry | chelan     |  6.33 | 0    | 
+--------+------------+-------+------+

Hmm, that was not really what I wanted. It looks like the variable is never getting updated at all! I’m not sure why not. Maybe if I ‘parenthesize’ the variable expression like I tried before? I’ll use the GREATEST() function, which I know will evaluate all its arguments instead of short-cutting like COALESCE():

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where
   2 >= @num := greatest(0, if(@type = type, @num + 1, 1))
   and @type := type;

No, that gives the same result. I feel like I’m getting close, though. What if I separate out the assignment and comparison?

set @num := 0, @type := '';

select * from fruits
where @num := if(type = @type, @num + 1, 1)
   and @type := type
   and @num <= 2;
Empty set (0.00 sec)

select @num, @type;
+------+-------+
| @num | @type |
+------+-------+
| 0    | 0     | 
+------+-------+

That didn’t work either. How did @type get assigned an integer? It should be a string. It turns out the := operator has the lowest possible operator precedence, so that WHERE clause is actually equivalent to

where @num := (
   if(type = @type, @num + 1, 1)
      and (@type := (
         type and @num <= 2)));

If I use parentheses right, maybe I can get it to do what I want:

select * from fruits
where (@num := if(type = @type, @num + 1, 1))
      and (@type := type)
      and (@num <= 2);
Empty set (0.00 sec)

select @num, @type;
+------+--------+
| @num | @type  |
+------+--------+
| 9    | cherry | 
+------+--------+

Now I’ve gotten the variables to be assigned, but the WHERE clause is still eliminating all the rows. This feels so close to being right. What’s missing?

Pay dirt: do the assignment inside the function

In fact, I was very close. All I need to do is move the entire assignment and the evaluation inside the function. It seems the variable expressions need to be sealed away from the comparison operator. In the example below, I’ve put everything inside the GREATEST() function, but the expression that updates @type has an incompatible type (string), so I convert it to a number with LENGTH() and mask its value with LEAST().

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where 2 >= greatest(
   @num := if(@type = type, @num + 1, 1),
   least(0, length(@type := type)));

The entire GREATEST() expression evaluates to the resulting value of @num, which is what I want on the right-hand side of the comparison. And guess what? This works:

+--------+----------+-------+------+
| type   | variety  | price | @num |
+--------+----------+-------+------+
| apple  | gala     |  2.79 | 1    | 
| apple  | fuji     |  0.24 | 2    | 
| orange | valencia |  3.59 | 1    | 
| orange | navel    |  9.36 | 2    | 
| pear   | bradford |  6.05 | 1    | 
| pear   | bartlett |  2.14 | 2    | 
| cherry | bing     |  2.55 | 1    | 
| cherry | chelan   |  6.33 | 2    | 
+--------+----------+-------+------+

After playing with more and more combinations, I found another way that works too:

select *, @num
from fruits
where
   (@num := if(type = @type, @num + 1, 1)) is not null
   and (@type := type) is not null
   and (@num <= 2);

I confess, I don’t fully understand this. I figured it out through trial and error. If the user manual explains it well enough for me to have gotten there by reason, I don’t know where. Can someone make it make sense please? I don’t want to have to read the source…

What’s so great about this?

Two words: one pass. One pass through the table — no quadratic-time algorithms, no grouping or sorting. This is highly efficient. I showed you another technique with UNION in my last article, which might be more efficient in some cases. But if you have lots of types of fruit, each of which has just a few varieties, you will be hard-pressed to find a more efficient algorithm to output the first two rows from each group. In fact, I doubt it can be done.

Spurious columns are gone

Putting the variable assignments inside functions not only let me put everything into the WHERE clause, it also got rid of the extra columns in the output — without kludges like subqueries. You can use this technique to clean up your output whenever you’re doing row-by-row calculations.

Notice the order of rows!

As in previous articles, rows are processed and numbered in order. I never really stated what I was trying to accomplish in the example above. The query I showed you will just output a maximum of two consecutive rows of the same type, in the order they’re read from the table (actually, I guess that’s the order they pass through the WHERE filter, which might not be the same). If I want to do something specific, such as get the two cheapest varieties from each type of fruit, I need to add an explicit ORDER BY to get the rows in order of price:

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where 2 >= greatest(
   @num := if(@type = type, @num + 1, 1),
   least(0, length(@type := type)))
order by type, price;

Exercise for the reader: run this query without an index that can be used for ordering. What’s in the @num column? Why? Add an index on (type, price) and try again. How does it change? Why? EXPLAIN the queries to find out.

Is that all?

Nope. If you can put user-variable evaluations inside a function, you can put them anywhere you can put a function. That means you could, for example, put them in the ORDER BY clause, in the JOIN clause, in the HAVING clause… anywhere. Now that you know you can do this, you can manipulate variables in lots of places you couldn’t do otherwise.

Conclusion

In this article I showed you how two properties of MySQL’s user variables (assignment is an r-value, and lazy evaluation) simultaneously cause side effects and give you great power. I showed you why you simply can’t get around the fact that the WHERE clause and the SELECT list are evaluated at different times (I proved it by figuratively banging my head against a wall). I then showed you how you can tuck variable manipulations inside functions, masking out the manipulations and just getting the result, which can be used in a WHERE clause or anywhere else. You now have the tools you need to avoid the side effects of those properties I mentioned.

Finally, I showed you one place you might want to use such a technique to get the first N rows from each group. In certain cases, I think this is the most efficient algorithm possible, requiring just one pass through the table.

I don’t know about you, but this opens up a lot of interesting possibilities. I have one particular use in mind that I’ll write about next — another way to linearize a query that’s normally extremely expensive.

What do you think? Leave a comment and let me know!

Note: I’m taking a break from computers. This is pre-recorded. I’ll moderate your comments shortly.

Technorati Tags:, ,

You might also like:

  1. How to number rows in MySQL
  2. How to simulate the SQL ROW_NUMBER function
  3. How to select the first/least/max row per group in SQL
  4. How to select the first or last row per group in SQL
  5. What is a SQL blind insert?

How to number rows in MySQL

I wrote before about a generic, cross-platform way to simulate the SQL ROW_NUMBER() function in any RDBMS. There is a much more efficient way to do this on MySQL with user variables.

Background

Please see my previous article on how to simulate the ROW_NUMBER() function for the background. I’ll use the same table structure and data in this article.

Unfortunately, that’s a quadratic algorithm, so it’s not something I’d do much (though I once did it over small sets of data in SQL Server 2000 at a jobsite).

A more efficient method

In MySQL you can simultaneously assign to and read from user variables in a SELECT statement. This allows the following method of numbering rows:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as row_number,
   @type := type as dummy
from fruits;

+--------+------------+------------+--------+
| type   | variety    | row_number | dummy  |
+--------+------------+------------+--------+
| apple  | fuji       |          1 | apple  | 
| apple  | gala       |          2 | apple  | 
| apple  | limbertwig |          3 | apple  | 
| cherry | bing       |          1 | cherry | 
| cherry | chelan     |          2 | cherry | 
| orange | navel      |          1 | orange | 
| orange | valencia   |          2 | orange | 
| pear   | bartlett   |          1 | pear   | 
| pear   | bradford   |          2 | pear   | 
+--------+------------+------------+--------+

How does that work? I’m restarting the row number each time the type column changes, by keeping track of the value it had in the last row. And I’m simultaneously incrementing and selecting the row number in each row.

The spurious dummy column has to be there, but if your version of MySQL supports it, you can use a subquery in the FROM clause to eliminate columns you don’t want in the results.

Efficiency

All I’m doing is maintaining a bit of extra memory and performing a few small comparisons and assignments for each row, so this technique is very efficient.

Playing with fire

You can refer to the generated row_number column in a HAVING or GROUP BY clause, but don’t burn your fingers. This technique is very much like playing with fire. The result of assigning to a variable and using it in the same statement (in the HAVING, for example) depends on the query plan the server chooses, the phase of the moon, and probably other things too. Before you use this technique, you should read and understand the section on user-defined variables in the MySQL Manual, and decide whether it’s safe for your query.

Now that you’ve read that section of the manual, particularly the part about the aliased expression, you should understand why the following query might be a safer paradigm when using the result in the HAVING clause, even though it produces another dummy column:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as dummy_1,
   @type := type as dummy_2,
   @num as row_number
from fruits
group by type, variety
having row_number = 1;

+--------+----------+---------+---------+------------+
| type   | variety  | dummy_1 | dummy_2 | row_number |
+--------+----------+---------+---------+------------+
| apple  | fuji     |       1 | apple   | 1          | 
| cherry | bing     |       1 | cherry  | 1          | 
| orange | navel    |       1 | orange  | 1          | 
| pear   | bartlett |       1 | pear    | 1          | 
+--------+----------+---------+---------+------------+

(If I’m wrong about that, somebody please correct me).

A safer technique is to use a subquery in the FROM clause. This will cause the results to be materialized in a temporary table behind the scenes. It might be less efficient for some uses, though:

select type, variety
from (
   select
      type,
      variety,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
   from fruits
) as x
where row_number = 1;

+--------+----------+
| type   | variety  |
+--------+----------+
| apple  | fuji     | 
| cherry | bing     | 
| orange | navel    | 
| pear   | bartlett | 
+--------+----------+

Conclusion

This is an efficient, flexible way to generate and use row numbers in MySQL. I’ll leave it to you to find uses for it for right now, but I’m going to show you at least one application for this in an upcoming article.

Technorati Tags:, ,

You might also like:

  1. How to simulate the SQL ROW_NUMBER function
  2. How to select the first/least/max row per group in SQL
  3. Advanced MySQL user variable techniques
  4. How to select the first or last row per group in SQL
  5. How to simulate the GROUP_CONCAT function