Xaprb

Stay curious!

How to select the Nth greatest/least/first/last row in SQL

with 11 comments

This is a continuation of my articles on how to select the desired rows from ranked data. A user recently posed a question in the comments that I thought was particularly intriguing:

What is the best way to query 1) Sum of min price of all types? 2) Sum of 2nd highest price of all types?

Sounds like fun! Let me start by saying the sum is the easy part. You can always do that like so:

select sum(price) from (
   -- find desired rows here
) as x;

Finding the desired rows is the hard part. In my previous articles I focused on extrema:

  • The single biggest/smallest/extremest row in each group. (Pretty easy.)
  • The N most extreme rows in each group. (Doable, but harder.)

In this article, we’re going to see how to get not the most extreme row, not the N most extreme rows, but — hold your breath — the single Nth most extreme row per group. (In a future article I might talk about how to get the Nth through Mth most extreme rows.)

The setup

Let’s create some sample data to get started.

drop table if exists fruits;

create table fruits (
   type varchar(20) not null,
   variety varchar(20) not null,
   price int not null,
   primary key (type, variety)
);

insert into fruits values
('apple',  'fuji',       1),
('apple',  'gala',       2),
('apple',  'limbertwig', 3),
('cherry', 'bing',       4),
('cherry', 'chelan',     5),
('orange', 'navel',      6),
('orange', 'valencia',   7),
('pear',   'bartlett',   8),
('pear',   'bradford',   9);

For convenience so it’s easier to see how they are ordered, I’ve just ordered the fruits alphabetically and given them unique prices.

The desired results — second-cheapest prices for each fruit — are as follows:

+--------+-----------------+
| type   | second_cheapest |
+--------+-----------------+
| apple  |               2 | 
| cherry |               5 | 
| orange |               7 | 
| pear   |               9 | 
+--------+-----------------+

The solution

The intuition you need here is that if you get the 2 cheapest fruits in each group, and then take the single most extreme from each group, you can get the Nth offset. Let’s begin with one of the queries from my earlier article. (You should be able to use any of them. I’m just using this one because it’s convenient and pretty clear.)

select type, variety, price
from fruits
where (
   select count(*) from fruits as f
   where f.type = fruits.type and f.price < fruits.price
) <= 1;
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |     1 | 
| apple  | gala     |     2 | 
| cherry | bing     |     4 | 
| cherry | chelan   |     5 | 
| orange | navel    |     6 | 
| orange | valencia |     7 | 
| pear   | bartlett |     8 | 
| pear   | bradford |     9 | 
+--------+----------+-------+

The result is the 2 cheapest fruits from each type. (Notice that all we really did was eliminate one row -- the most expensive apple.) Now let's get the second cheapest -- and what is that? It's simply the most expensive of the fruits we found in that query. And that's just a MAX().

select type, max(price) as second_cheapest
from (
   select type, variety, price
   from fruits
   where (
      select count(*) from fruits as f
      where f.type = fruits.type and f.price < fruits.price
   ) <= 1
) as x
group by type;
+--------+-----------------+
| type   | second_cheapest |
+--------+-----------------+
| apple  |               2 | 
| cherry |               5 | 
| orange |               7 | 
| pear   |               9 | 
+--------+-----------------+

That's it!

Sum of the second cheapest

By now you probably see the pattern: do it one step at a time, turning each thing into a simpler question that's easy to answer. So how do we sum the second cheapest prices for each type of fruit? First, we find them (done!), then we sum them.

select sum(second_cheapest) from (
   select type, max(price) as second_cheapest
   from (
      select type, variety, price
      from fruits
      where (
         select count(*) from fruits as f
         where f.type = fruits.type and f.price < fruits.price
      ) <= 1
   ) as x
   group by type
) as y;
+----------------------+
| sum(second_cheapest) |
+----------------------+
|                   23 | 
+----------------------+

Conclusion

In this post I showed you how to decompose the problem into simpler and simpler pieces. Often what's hardest about a complex query is trying to do it all at once. I have lots of tips elsewhere on this blog about how to make things faster -- this is not a particularly fast query -- but here I just wanted to show how to get the correct answer.

Written by Baron Schwartz

August 8th, 2008 at 11:40 am

Posted in Uncategorized

Tagged with

11 Responses to 'How to select the Nth greatest/least/first/last row in SQL'

Subscribe to comments with RSS

  1. I would like to know how to control joined rows per join condition. A common requirement is to pull rows from a table and to fetch some corresponding rows in a child table – but only 1 row per join condition, typically the most recent.

    For example a forum application would like to list a list of threads with some data on the last post in the thread. How can that be accomplished in a normalized setup?

    Eran Galperin

    8 Aug 08 at 7:34 pm

  2. Approach it the other direction.

    First, get the data on the last post in each thread. See my earlier posts for this.

    Then join to the threads.

    Xaprb

    8 Aug 08 at 10:00 pm

  3. As soon as MySQL will add analytic functions, it will be muuuuch easier :)

    Gints

    9 Aug 08 at 5:21 pm

  4. A simplification:

    select sum(price)
    from fruits f1
    where
    ( select count(*)
    from fruits f2
    where f1.type=f2.type and
    f2.price <= f1.price
    ) = 2

    If can exists multiple varieties with the same price then:

    select sum(price) from fruits f1
    where (
    select count(*)
    from fruits f2
    where f1.type=f2.type and
    ((f2.price < f1.price) or ( (f2.price=f1.price) and
    (f2.variety < f1.variety)))

    ) =1

    Jordi Baylina

    11 Aug 08 at 3:57 am

  5. Nice article, but isn’t the 2nd cheapest the _least_ expensive of the fruits found in the first query (so “min(price) as second_cheapest”)?

    Rijk van Wel

    12 Aug 08 at 4:03 pm

  6. Nope, because the first query finds the two cheapest prices for each fruit, so that would effectively be “find the min of the min” which is… the min, or the 1st cheapest.

    Try it and see!

    Xaprb

    12 Aug 08 at 8:36 pm

  7. Thank you Jordi for the technique to handle identical entries. I needed to select the lowest four golf scores from the last 6 rounds played and couldn’t do it reliably until I saw your method.

    If you revisit this site, perhaps you could explain how this works.

    Fred

    23 Jan 09 at 11:34 am

  8. The main idea is to have a “calculated field” for each fruit that indicates the position in the group. This is the inner select.

    Then, I just use this field to select the fruits that I am interested in order to make the sum, selection, or what ever.

    Jordi Baylina

    26 Jan 09 at 5:48 am

  9. A particular table in a database stores the time versus train speed data. This table keeps on growing with time as the train goes through a particular region (fixed in space).

    Basically trains enter the region, move through it and then leave a region. I am pulling in GPS data for that region (time versus speed) each time a train goes through the region. So the table keeps on growing as a train arrives and leaves , and then the next train arrives and leaves and so on…

    As the train goes through the region, it generates following type of data:

    Time Speed
    13:00 3 (say train enters the region at 13:00)
    13:01 1
    13:02 0
    13:03 0
    13:04 0
    13:05 1
    13:06 2
    13:07 4
    13:08 1
    13:09 0
    13:10 0
    13:11 0
    13:12 1
    . .
    . .
    . .

    The time the train enters the region is obviously not fixed. However another timestamp (say Train_Arrival from a difeerent table in the database) is available and it always indicates the latest train. (that is Train_Arrival timestamp will remain unchanged until the latest train has left the region and the next train arrives)

    For the latest train (say in the above example, Train_Arrival was 12:55), I need to find the first time the train speed goes to ’0′ when inside the region (i.e. 13:02 in the above example).

    It is not able to post process the data as the data is arriving in real-time and needs to be processed in real-time as well (a requirement for the application being built)

    If you can provide me the right direction, it shall be deeply appreciated.

    Kumar

    27 Jul 09 at 12:37 pm

  10. select min(Time)
    from speeds
    where Time>Train_Arrival and Speed=0

    Jordi Baylina

    28 Jul 09 at 2:54 pm

  11. Hi how do I do this if for example my fruits table comes from a select result set?

    Geocine

    5 Aug 10 at 4:19 am

Leave a Reply