Archive for the ‘user defined variables’ tag
MySQL challenge: LIMIT rows accessed, not rows returned
Dear reader, this is a challenge. How’s your MySQL prowess? You know about LIMIT: it cuts off the results at the specified number.
mysql>s; select actor_id from sakila.actor where actor_id % 5 = 0 limit 5; +----------+ | actor_id | +----------+ | 5 | | 10 | | 15 | | 20 | | 25 | +----------+ 5 rows in set (0.00 sec)
But that query actually accessed 25 rows. What if I want to say “return up to 5 rows, but don’t read any more than 20 rows to find them?”
Right now I’ve got the following:
mysql> select actor_id, @rows
-> from actor, (select @rows := 0) as x where
-> ((@rows := @rows + 1) <= 20)
-> and actor_id % 5 = 0
-> limit 5;
+----------+-------+
| actor_id | @rows |
+----------+-------+
| 5 | 5 |
| 10 | 10 |
| 15 | 15 |
| 20 | 20 |
+----------+-------+
4 rows in set (0.00 sec)
The derived table subquery x is only there to initialize the user variable at the beginning of the query.
This appears to work, but it doesn’t. If you profile this with SHOW STATUS, you see that it reads every row in the table (Handler_read_next = 200). This is actually worse, not better, than just LIMIT.
Any ideas?
I’ve got a few. But I don’t like them for various reasons. Extra props for really efficient solutions that don’t involve subqueries (so it’ll work on pre-4.0) or things that add extra overhead (subqueries, for example). I guess you probably see the direction I want to go with this — I don’t want to use subqueries.
How to write a lazy UNION in MySQL
The other day I was explaining options to someone who wanted to know about archiving data in MySQL. “So,” he said, “I might have to code my app to look for the data in two places?” The disadvantage of this is that his app might be more complex. Another disadvantage is that it might take two queries — if you look for a user in the usual location and it’s not there, you have to look for it elsewhere.
One way to deal with this, as long as the archived data is on the same server, is a UNION.
select user_id from user where user_id = 123 union all select user_id from user_archive where user_id = 123;
The benefit is that you don’t have to issue two queries. That saves network round trips, and makes your code shorter. But it has a disadvantage, too: you’re still querying the archive table when you don’t need to. Does this matter? Yes, it does. Your archive table may be very large and slow — perhaps stored on a big slow hard drive, perhaps on a SAN — and just peeking at it is kind of expensive in some cases.
Something occurred to me a couple of weeks ago: why not write a UNION that stops executing as soon as one part of it finds a row? Then you can UNION to your heart’s content and not incur the overhead of that second lookup unless you need it. For lack of a better term, I’m calling this a lazy UNION.
My idea here is to use a user variable. If the first part of the UNION finds a row, it sets the variable. The second part has the variable in its WHERE clause, and won’t execute if the variable has been set by the first part. To make the whole thing self-contained, I’m adding a third part of the UNION, which will always execute but never return any rows; it will set the variable back to its initial state of NULL.
Here’s a complete example:
drop table if exists user, user_archive; create table user(user_id int not null primary key); create table user_archive like user; insert into user(user_id) values(1); insert into user_archive(user_id) values(2); select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl from user where user_id = 1 union all select user_id as user_id, 'user_archive' as which_tbl from user_archive where user_id = 1 and @found is null union all select 1, '' from dual where ( @found := null ) is not null; select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl from user where user_id = 2 union all select user_id as user_id, 'user_archive' as which_tbl from user_archive where user_id = 2 and @found is null union all select 1, '' from dual where ( @found := null ) is not null;
You can play around with it and verify that indeed, the second part of the UNION never executes if the first part finds a row. There are several ways to do this: with EXPLAIN, by adding some more variables to show which part of the query executes, etc. The results of the above query are as follows:
+---------+-----------+ | user_id | which_tbl | +---------+-----------+ | 1 | user | +---------+-----------+ 1 row in set (0.00 sec) +---------+--------------+ | user_id | which_tbl | +---------+--------------+ | 2 | user_archive | +---------+--------------+ 1 row in set (0.00 sec)
I have not benchmarked this. My gut feeling is that whether it’s beneficial is going to depend on your workload. But it’s a fun little hack I thought I’d share with you. By the way, there’s no reason you have to stop at two; you could add any number of queries to the UNION.
How to select the first or last row per group in SQL
There is no “first” or “last” aggregate function in SQL. Sometimes you can use MIN() or MAX(), but often that won’t work either. There are a couple of ways to solve this vexing non-relational problem. Read on to find out how.
First, let’s be clear: I am posing a very non-relational problem. This is not about the minimum, maximum, top, most, least or any other relationally valid extreme in the group. It’s the first or last, in whatever order the rows happen to come. And we all know rows aren’t ordered — in theory. But in practice they are, and sometimes you need the first or last row in a group.
If you have a question this article doesn’t answer, you might like to read how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.
A MySQL user-variable solution
I’ll show a MySQL-specific solution with one of the queries I developed for MySQL Table Checksum.
Here’s the idea: crush an entire table down to a single checksum value by checksumming each row, mushing it together with the previous row’s checksum, and then checksumming the result again. It’s fairly easy to do this, but it’s hard to get the final result in one statement. This is necessary to use the statement in an INSERT .. SELECT, which I needed to do.
An example might clarify:
select * from fruit; +---------+ | variety | +---------+ | apple | | orange | | lemon | | pear | +---------+ set @crc := ''; select variety, @crc := md5(concat(@crc, md5(variety))) from fruit; +---------+-----------------------------------------+ | variety | @crc := md5(concat(@crc, md5(variety))) | +---------+-----------------------------------------+ | apple | ae6d32585ecc4d33cb8cd68a047d8434 | | orange | 7ec613c796f44ef5ccb0e24e94323e38 | | lemon | a2475f37be12cebf733ebfc7ee2ee473 | | pear | ec98fe57833bbd91790ebc7ccf84c7e9 | +---------+-----------------------------------------+
I want the “last” value of @crc after the statement is done processing. How can I do this? The solution I found is to use a counter variable. I’ll demonstrate:
set @crc := '', @cnt := 0; select variety, @cnt := @cnt + 1 as cnt, @crc := md5(concat(@crc, md5(variety))) as crc from fruit; +---------+------+----------------------------------+ | variety | cnt | crc | +---------+------+----------------------------------+ | apple | 1 | ae6d32585ecc4d33cb8cd68a047d8434 | | orange | 2 | 7ec613c796f44ef5ccb0e24e94323e38 | | lemon | 3 | a2475f37be12cebf733ebfc7ee2ee473 | | pear | 4 | ec98fe57833bbd91790ebc7ccf84c7e9 | +---------+------+----------------------------------+
The counter variable might make you want to write something like HAVING cnt = MAX(cnt), but that won’t work (try it!). Instead, I prefixed the checksum with the count so the last row is the stringwise maximum:
select variety,
@crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
md5(concat(right(@crc, 32), md5(variety)))) as crc
from fruit;
+---------+--------------------------------------------+
| variety | crc |
+---------+--------------------------------------------+
| apple | 0000000001ae6d32585ecc4d33cb8cd68a047d8434 |
| orange | 00000000027ec613c796f44ef5ccb0e24e94323e38 |
| lemon | 0000000003a2475f37be12cebf733ebfc7ee2ee473 |
| pear | 0000000004ec98fe57833bbd91790ebc7ccf84c7e9 |
+---------+--------------------------------------------+
You can see I also left-padded the count so a lexical sort will agree with a numeric sort, and so I can predict how many extra characters I’ll need to remove to get back the original value. Now I can use the MAX() function to select the last row, and simply lop off the leftmost ten digits (I use the RIGHT() function for convenience, but generally you want to use SUBSTRING()):
select right(max(
@crc := concat(lpad(@cnt := @cnt + 1, 10, '0'),
md5(concat(right(@crc, 32), md5(variety))))
), 32) as crc
from fruit;
+----------------------------------+
| crc |
+----------------------------------+
| ec98fe57833bbd91790ebc7ccf84c7e9 |
+----------------------------------+
Et voila, I got the last value in the group. By the way, this will work with ONLY_FULL_GROUP_BY in the server’s SQL mode.
Other methods
My solution relies on a MySQL user variable to do the counting, but there are many ways to number rows in SQL: you could simulate the ROW_NUMBER() function, for instance, or use techniques mentioned in the comments on how to number rows in MySQL (one of the comments shows a particularly clever solution with subqueries, but I didn’t want to use it because MySQL doesn’t support subqueries in older versions). Any of these should work one way or another. Of course, if you are using a product such as Microsoft SQL server 2005, which actually has the ROW_NUMBER() function, you can use that!
Conclusion
Finding the first or last row is a bit unintuitive, and it’s definitely non-relational, but sometimes it’s what you need. The technique I demonstrated in this article is easily adaptable to many kinds of queries. I hope it helped you!
If this article didn’t solve your problem, please read these before posting questions to the comments: how to select the first/least/max row per group in SQL and how to find the maximum row per group in SQL without subqueries.


