Xaprb

Stay curious!

Archive for the ‘Ranked-data’ tag

How to select the first/least/max row per group in SQL

with 58 comments

Here are some common SQL problems, all of which have related solutions: how do I find the most recent log entry for each program? How do I find the most popular item from each category? How do I find the top score for each player? In general, these types of “select the extreme from each group” queries can be solved with the same techniques. I’ll explain how to do that in this article, including the harder problem of selecting the top N entries, not just the top 1.

This topic is related to numbering rows, which I just wrote about (see my articles about MySQL-specific and generic techniques to assign a number to each row in a group). Therefore I’ll use nearly the same table and data as I used in those articles, with the addition of a price column:

+--------+------------+-------+
| type   | variety    | price |
+--------+------------+-------+
| apple  | gala       |  2.79 | 
| apple  | fuji       |  0.24 | 
| apple  | limbertwig |  2.87 | 
| orange | valencia   |  3.59 | 
| orange | navel      |  9.36 | 
| pear   | bradford   |  6.05 | 
| pear   | bartlett   |  2.14 | 
| cherry | bing       |  2.55 | 
| cherry | chelan     |  6.33 | 
+--------+------------+-------+

Selecting the one maximum row from each group

Let’s say I want to select the most recent log entry for each program, or the most recent changes in an audit table, or something of the sort. This question comes up over and over on IRC channels and mailing lists. I’ll re-phrase the question in terms of fruits. I want to select the cheapest fruit from each type. Here’s the desired result:

+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+

There are a few common solutions to this problem. All involve two steps: finding the desired value of price, and then selecting the rest of the row based on that.

One common solution is a so-called self-join. Step one is to group the fruits by type (apple, cherry etc) and choose the minimum price:

select type, min(price) as minprice
from fruits
group by type;
+--------+----------+
| type   | minprice |
+--------+----------+
| apple  |     0.24 | 
| cherry |     2.55 | 
| orange |     3.59 | 
| pear   |     2.14 | 
+--------+----------+

Step two is to select the rest of the row by joining these results back to the same table. Since the first query is grouped, it needs to be put into a subquery so it can be joined against the non-grouped table:

select f.type, f.variety, f.price
from (
   select type, min(price) as minprice
   from fruits group by type
) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| cherry | bing     |  2.55 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
+--------+----------+-------+

Another common way to do this is with a correlated subquery. This can be much less efficient, depending on how good your system’s query optimizer is. You might find it clearer, though.

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type);
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+

Both queries are logically equivalent, though they may not perform the same.

Select the top N rows from each group

This is a slightly harder problem to solve. Finding a single row from each group is easy with SQL’s aggregate functions (MIN(), MAX(), and so on). Finding the first several from each group is not possible with that method because aggregate functions only return a single value. Still, it’s possible to do.

Let’s say I want to select the two cheapest fruits from each type. Here’s a first try:

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type)
   or price = (select min(price) from fruits as f where f.type = fruits.type
      and price > (select min(price) from fruits as f2 where f2.type = fruits.type));
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | gala     |  2.79 | 
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| orange | navel    |  9.36 | 
| pear   | bradford |  6.05 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
| cherry | chelan   |  6.33 | 
+--------+----------+-------+

Yuck! That can be written as a self-join, but it’s just as bad (I leave it as an exercise for the reader). This gets worse as you go to higher numbers (top 3, top 4…). There are other ways to phrase the statement, but they all boil down to the same thing, and they’re all pretty unwieldy and inefficient.

There’s a better way: select the variety from each type where the variety is no more than the second-cheapest of that type.

select type, variety, price
from fruits
where (
   select count(*) from fruits as f
   where f.type = fruits.type and f.price < fruits.price
) <= 2;

This is elegant, and lets you vary N without rewriting your query (a very good thing!), but it’s functionally the same as the previous query. Both are essentially a quadratic algorithm relative to the number of varieties in each type. And again, some query optimizers may not do well with this and make it quadratic with respect to the number of rows in the table overall (especially if no useful index is defined), and the server might get clobbered. Are there better ways? Can it be done with one pass through the data, instead of the many passes required by a correlated subquery? You know it can, or I wouldn’t be writing this, now would I?

Use UNION

If there’s an index on (type, price), and there are many more records to eliminate than to include in each group, a more efficient single-pass method (especially for MySQL, but also for some other RDBMSs) is to break the queries out separately and put a limit on each, then UNION them all back together. Here’s the syntax you need for MySQL:

(select * from fruits where type = 'apple' order by price limit 2)
union all
(select * from fruits where type = 'orange' order by price limit 2)
union all
(select * from fruits where type = 'pear' order by price limit 2)
union all
(select * from fruits where type = 'cherry' order by price limit 2)

Peter Zaitsev has written in detail about this technique, so I won’t go into it too much more here. If it suits your purposes, it can be a very good solution.

One note: use UNION ALL, not just UNION. It prevents the server sorting the results to eliminate duplicates before returning them. In this case there will be no duplicates, so I’m telling the server to skip that (useless, expensive) step.

Do it with user variables on MySQL

The UNION trick is an especially good idea when the results are a small fraction of the rows in the table and there is an index that can be used for sorting the rows. Another linear-time technique, which might be a good option in cases where you are selecting most of the rows from the table anyway, is user variables. This is MySQL-specific. Please refer to my previous post on how to number rows in MySQL for the gory details of why this works:

set @num := 0, @type := '';

select type, variety, price
from (
   select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
  from fruits
  order by type, price
) as x where x.row_number <= 2;

This isn’t one pass through the table, by the way. The subquery is implemented as a temporary table behind the scenes, so filling it with data is one pass; then selecting every row from it and applying the WHERE clause is another. However, twice through is still O(n) with respect to the table size. That’s a lot better than correlated subqueries, which are O(n2) with respect to the group size — even moderate group sizes cause bad performance (say there are five varieties of each fruit. That’s on the order of 25 passes through the table, all told).

One-pass technique on MySQL… maybe?

If you want to leave your queries up the the query optimizer’s whims, you can try this technique, which builds no temporary tables and makes just one pass through:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits
group by type, price, variety
having row_number <= 2;

This theoretically ought to work if MySQL orders by the GROUP BY criteria, which it sometimes does for efficiency and to produce the expected results. Does it work? Here’s what it returns on MySQL 5.0.27 on Windows:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | gala     |  2.79 |          1 | apple  |
| apple  | fuji     |  0.24 |          3 | apple  |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          3 | orange |
| pear   | bradford |  6.05 |          1 | pear   |
| pear   | bartlett |  2.14 |          3 | pear   |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          3 | cherry |
+--------+----------+-------+------------+--------+

Look closely… it’s returning rows one and three from each group, and they’re not numbered in order of increasing price? Huh? But the HAVING clause says the row_number should be no greater than 2! Here’s what it returns on version 5.0.24a on Ubuntu:

+--------+------------+-------+------------+--------+
| type   | variety    | price | row_number | dummy  |
+--------+------------+-------+------------+--------+
| apple  | fuji       |  0.24 |          1 | apple  |
| apple  | gala       |  2.79 |          1 | apple  |
| apple  | limbertwig |  2.87 |          1 | apple  |
| cherry | bing       |  2.55 |          1 | cherry |
| cherry | chelan     |  6.33 |          1 | cherry |
| orange | valencia   |  3.59 |          1 | orange |
| orange | navel      |  9.36 |          1 | orange |
| pear   | bartlett   |  2.14 |          1 | pear   |
| pear   | bradford   |  6.05 |          1 | pear   |
+--------+------------+-------+------------+--------+

Look, this time everything is numbered 1 and every row is returned. Wonky. This is exactly what the MySQL manual page on user variables warns about.

This technique is pretty much non-deterministic, because it relies on things that you and I don’t get to control directly, such as which indexes MySQL decides to use for grouping. However, if you need to use it — and I know there are some folks out there who do, because I’ve consulted for them — you can still tweak it. We’re getting into the realm of really bastardizing SQL, but the results above came from a table without indexes other than the primary key on (type, variety). What happens if I add an index MySQL can use for grouping?

alter table fruits add key(type, price);

Nothing changes, and EXPLAIN shows why: the query doesn’t use the index I just added. Why? Because the grouping is on three columns, and the index is only on two. In fact, the query is using a temp table and filesort anyway, so this is still not achieving the once-through goal. I can force it to use the index:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits force index(type)
group by type, price, variety
having row_number <= 2;

Let’s see if that works:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | fuji     |  0.24 |          1 | apple  | 
| apple  | gala     |  2.79 |          2 | apple  | 
| cherry | bing     |  2.55 |          1 | cherry | 
| cherry | chelan   |  6.33 |          2 | cherry | 
| orange | valencia |  3.59 |          1 | orange | 
| orange | navel    |  9.36 |          2 | orange | 
| pear   | bartlett |  2.14 |          1 | pear   | 
| pear   | bradford |  6.05 |          2 | pear   | 
+--------+----------+-------+------------+--------+

Ah, now we’re cooking! It did what I wanted, without a filesort or temporary table. Another way to do this, by the way, is to take variety out of the GROUP BY so it uses the index on its own. Because this selects a non-grouped column from a grouped query, this only works if you are running with ONLY_FULL_GROUP_BY mode turned off, which I hope you are not doing without good reason.

Other methods

Be sure to check the comments for user-contributed methods. There are some really novel approaches. I always learn so much from your comments… thank you!

Conclusion

Well, that’s it. I’ve shown you several ways of solving the common “get the extreme row from each group” query, and then moved on to how you can get the top N rows from each group in various ways. Then I dove into MySQL-specific techniques which some (including myself, depending on my mood) would regard as mildly foolish to utterly stupid. But if you need the last bit of speed out of your server, you sometimes have to know when to break the rules. And for those who think this is just MySQL foolishness, it’s not; I’ve seen people desperately do these types of things on other platforms too, such as SQL Server. There are hacks and tweaks on every platform, and people who need to use them.

I hope you’ve enjoyed and profited from this discussion. If you did, maybe you’d like to subscribe for convenient notification of future articles. Happy coding!

Written by Xaprb

December 7th, 2006 at 8:59 am

Posted in SQL

Tagged with

How to number rows in MySQL

with 18 comments

I wrote before about a generic, cross-platform way to simulate the SQL ROW_NUMBER() function in any RDBMS. There is a much more efficient way to do this on MySQL with user variables.

Background

Please see my previous article on how to simulate the ROW_NUMBER() function for the background. I’ll use the same table structure and data in this article.

Unfortunately, that’s a quadratic algorithm, so it’s not something I’d do much (though I once did it over small sets of data in SQL Server 2000 at a jobsite).

A more efficient method

In MySQL you can simultaneously assign to and read from user variables in a SELECT statement. This allows the following method of numbering rows:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as row_number,
   @type := type as dummy
from fruits;

+--------+------------+------------+--------+
| type   | variety    | row_number | dummy  |
+--------+------------+------------+--------+
| apple  | fuji       |          1 | apple  | 
| apple  | gala       |          2 | apple  | 
| apple  | limbertwig |          3 | apple  | 
| cherry | bing       |          1 | cherry | 
| cherry | chelan     |          2 | cherry | 
| orange | navel      |          1 | orange | 
| orange | valencia   |          2 | orange | 
| pear   | bartlett   |          1 | pear   | 
| pear   | bradford   |          2 | pear   | 
+--------+------------+------------+--------+

How does that work? I’m restarting the row number each time the type column changes, by keeping track of the value it had in the last row. And I’m simultaneously incrementing and selecting the row number in each row.

The spurious dummy column has to be there, but if your version of MySQL supports it, you can use a subquery in the FROM clause to eliminate columns you don’t want in the results.

Efficiency

All I’m doing is maintaining a bit of extra memory and performing a few small comparisons and assignments for each row, so this technique is very efficient.

Playing with fire

You can refer to the generated row_number column in a HAVING or GROUP BY clause, but don’t burn your fingers. This technique is very much like playing with fire. The result of assigning to a variable and using it in the same statement (in the HAVING, for example) depends on the query plan the server chooses, the phase of the moon, and probably other things too. Before you use this technique, you should read and understand the section on user-defined variables in the MySQL Manual, and decide whether it’s safe for your query.

Now that you’ve read that section of the manual, particularly the part about the aliased expression, you should understand why the following query might be a safer paradigm when using the result in the HAVING clause, even though it produces another dummy column:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as dummy_1,
   @type := type as dummy_2,
   @num as row_number
from fruits
group by type, variety
having row_number = 1;

+--------+----------+---------+---------+------------+
| type   | variety  | dummy_1 | dummy_2 | row_number |
+--------+----------+---------+---------+------------+
| apple  | fuji     |       1 | apple   | 1          | 
| cherry | bing     |       1 | cherry  | 1          | 
| orange | navel    |       1 | orange  | 1          | 
| pear   | bartlett |       1 | pear    | 1          | 
+--------+----------+---------+---------+------------+

(If I’m wrong about that, somebody please correct me).

A safer technique is to use a subquery in the FROM clause. This will cause the results to be materialized in a temporary table behind the scenes. It might be less efficient for some uses, though:

select type, variety
from (
   select
      type,
      variety,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
   from fruits
) as x
where row_number = 1;

+--------+----------+
| type   | variety  |
+--------+----------+
| apple  | fuji     | 
| cherry | bing     | 
| orange | navel    | 
| pear   | bartlett | 
+--------+----------+

Conclusion

This is an efficient, flexible way to generate and use row numbers in MySQL. I’ll leave it to you to find uses for it for right now, but I’m going to show you at least one application for this in an upcoming article.

Written by Xaprb

December 2nd, 2006 at 4:02 pm

Posted in SQL

Tagged with , , ,

How to simulate the SQL ROW_NUMBER function

with 4 comments

If your RDBMS does not support the ROW_NUMBER function, it is still possible to write some queries without it. ROW_NUMBER operates over a partition of the result set, and returns a number for each row in a partition (it is possible to specify numbering schemes besides consecutive integers).

Ranking and windowing functions are available in Microsoft SQL Server 2005, and are part of the newer SQL standards. I expect it will be some time before MySQL implements them. I don’t know about other systems such as PostgreSQL and Firebird. It has been a long time since I worked with Oracle, so I’m unsure about it, and viewing the Oracle documentation requires an account I’m unwilling to create.

An easy solution

One easy solution is a temporary table with an identity (AUTO_INCREMENT in MySQL) column. Insert into the temporary table, then select from it again and use the values from the identity column. This only works when the result set contains a single partition, but it works very well and is very efficient.

When the data has multiple partitions

If the data has multiple partitions, the solution is more complex. Here is one possible solution:

create table fruits (
    type varchar(10) not null,
    variety varchar(20) not null,
    primary key(type, variety));

insert into fruits values
('apple', 'gala'),
('apple', 'fuji'),
('apple', 'limbertwig'),
('orange', 'valencia'),
('orange', 'navel'),
('pear', 'bradford'),
('pear', 'bartlett'),
('cherry', 'bing'),
('cherry', 'chelan');

select l.type, l.variety, count(*) as num
from fruits as l
left outer join fruits as r
    on l.type = r.type
    and l.variety >= r.variety
group by l.type, l.variety;

+--------+------------+-----+
| type   | variety    | num |
+--------+------------+-----+
| apple  | fuji       |   1 |
| apple  | gala       |   2 |
| apple  | limbertwig |   3 |
| cherry | bing       |   1 |
| cherry | chelan     |   2 |
| orange | navel      |   1 |
| orange | valencia   |   2 |
| pear   | bartlett   |   1 |
| pear   | bradford   |   2 |
+--------+------------+-----+

Here I’ve partitioned the data by the type column, so there are four partitions (apple, cherry, orange, pear). Notice the num column starts at 1 for each partition and counts upward.

The drawback to this solution is the LEFT OUTER JOIN with the >= in the join condition. This effectively makes the join a CROSS JOIN, which is inefficient (O(n2)). It may be a good idea to avoid this except on small data sets.

Another option on MySQL, but only with the MyISAM and BDB storage engines, is to use an AUTO_INCREMENT column as the second column in the primary key, which makes MySQL restart the numbering for each group.

Written by Xaprb

September 27th, 2005 at 8:45 am

Posted in SQL

Tagged with