MySQL challenge: LIMIT rows accessed, not rows returned
Dear reader, this is a challenge. How’s your MySQL prowess? You know about LIMIT: it cuts off the results at the specified number.
mysql>s; select actor_id from sakila.actor where actor_id % 5 = 0 limit 5; +----------+ | actor_id | +----------+ | 5 | | 10 | | 15 | | 20 | | 25 | +----------+ 5 rows in set (0.00 sec)
But that query actually accessed 25 rows. What if I want to say “return up to 5 rows, but don’t read any more than 20 rows to find them?”
Right now I’ve got the following:
mysql> select actor_id, @rows
-> from actor, (select @rows := 0) as x where
-> ((@rows := @rows + 1) <= 20)
-> and actor_id % 5 = 0
-> limit 5;
+----------+-------+
| actor_id | @rows |
+----------+-------+
| 5 | 5 |
| 10 | 10 |
| 15 | 15 |
| 20 | 20 |
+----------+-------+
4 rows in set (0.00 sec)
The derived table subquery x is only there to initialize the user variable at the beginning of the query.
This appears to work, but it doesn’t. If you profile this with SHOW STATUS, you see that it reads every row in the table (Handler_read_next = 200). This is actually worse, not better, than just LIMIT.
Any ideas?
I’ve got a few. But I don’t like them for various reasons. Extra props for really efficient solutions that don’t involve subqueries (so it’ll work on pre-4.0) or things that add extra overhead (subqueries, for example). I guess you probably see the direction I want to go with this — I don’t want to use subqueries.



Just do WHERE actor_id IN (5,10,15,…) ?
Filling the list is trivial if you know the amount of rows wanted and the step size. You also have to know that there aren’t any gaps in the actor_id’s, but I think these queries are useless anyway if gaps are allowed.
Maarten
29 Jun 08 at 5:07 am
I can’t think of any good ways to do it without sub queries (sorry)…but I think a basic pagination query will work – it still reads a full 25, but it only reads it once…the following example should grab you the last 5 values in the set of 25…
select actor_id from (
select actor_id from (
select actor_id from sakila.actor where actor_id % 5 = 0 order by actor_id asc limit 25
) tmp order by actor_id desc limit 5
) tmp1 order by actor_id asc;
Kevin Marshall
29 Jun 08 at 8:10 am
for terminating the table scan early an optimizer has to prove this is correct. this is easy for oracle’s ROWNUM because it is known to be strictly monotonic increasing and once a ‘
strcmp
29 Jun 08 at 8:40 am
for terminating the table scan early an optimizer has to prove this is correct. this is easy for oracle’s ROWNUM because it is known to be strictly monotonic increasing and once a ‘<=’ criterion changes to ‘false’ it stays that way. if the optimizer doesn’t know about your innovative construct it won’t be able terminate the scan early (C compilers know to optimize this, but only because it is a known common construct). this includes explicitly ordered tests like “where if( ((@rows := @rows + 1) <= 20), actor_id % 5 = 0, 0)” where there can be no access to row data after hitting the limit at all.
the only features limiting number of scanned rows i know about are unique indexes (1) (exploited the IN (…)-expression) or using tables with only limited number of rows (2) or combinations thereof (JOINs). you can limit the number of generated rows with aggregate functions (3) (to get exactly one row) and/or GROUP BY constructs over a limited value set (3a) or LIMIT (4). the LIMIT optimization looks as good as you can get, i.e. it stops scanning completely once the limit is reached. this leaves the trivial solution SELECT … FROM (SELECT * FROM actors LIMIT 20) actors WHERE … or the equivalent code with a temporary table, i.e. using a combination of (4) and (2). what did i forget?
strcmp
29 Jun 08 at 8:42 am
SELECT actor_id
FROM (
SELECT actor_id
FROM actor
ORDER BY actor_id
LIMIT 20
) a
WHERE actor_id % 5 = 0;
If you do an explain on the query, it appears that the derived table will read all 200 rows, but according to handler_read_next, it only reads 20.
Andrew
29 Jun 08 at 11:33 am
mysql> SHOW SESSION STATUS LIKE ‘handler_read_next’;
——————- ——-
| Variable_name | Value |
——————- ——-
| Handler_read_next | 476 |
——————- ——-
1 row in set (0.00 sec)
mysql> SET @rows= 0;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT actor_id
-> FROM actor
-> WHERE (@rows := @rows 1) GROUP BY actor_id
-> HAVING actor_id % 5 = 0;
———-
| actor_id |
———-
| 5 |
| 10 |
| 15 |
———-
3 rows in set (0.02 sec)
mysql> SHOW SESSION STATUS LIKE ‘handler_read_next’;
——————- ——-
| Variable_name | Value |
——————- ——-
| Handler_read_next | 476 |
——————- ——-
1 row in set (0.00 sec)
Rob Wultsch
29 Jun 08 at 1:16 pm
Rob: what are the other Handler stats, though?
Xaprb
29 Jun 08 at 1:28 pm
Gah! I had forgotten to turn off query cache. This still increments handle_read_next by 200.
Back to the drawing board…
Rob Wultsch
29 Jun 08 at 1:45 pm
I understand you’re looking for the first 5 actor_ids that are divisible by 5. Since actor_id is auto_increment, it’s always ascending, so you’ll find your actor_ids in the first 20 rows like this:
select actor_id from actor where actor_id % 5 = 0 and actor_id < 26;
A quick explain shows only 20 rows accessed and no sub-queries:
mysql> explain select actor_id from actor where actor_id % 5 = 0 and actor_id < 26\G;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: actor
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 2
ref: NULL
rows: 20
Extra: Using where; Using index
1 row in set (0.00 sec)
Mark Schoonover
29 Jun 08 at 2:54 pm
I take it back, the sql cache was not the problem. MySQL did not want to do an index scan…
mysql> SHOW SESSION STATUS LIKE ‘handler%’;
—————————- ——-
| Variable_name | Value |
—————————- ——-
| Handler_commit | 0 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 0 |
| Handler_read_key | 0 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 14 |
—————————- ——-
15 rows in set (0.00 sec)
mysql> SET @rows= 0;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT SQL_NO_CACHE actor_id
-> FROM actor FORCE INDEX(PRIMARY)
-> WHERE (@rows := @rows 1) GROUP BY actor_id
-> HAVING actor_id % 5 = 0;
———-
| actor_id |
———-
| 5 |
| 10 |
| 15 |
———-
3 rows in set (0.00 sec)
mysql> SHOW SESSION STATUS LIKE ‘handler%’;
—————————- ——-
| Variable_name | Value |
—————————- ——-
| Handler_commit | 0 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 200 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 16 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 29 |
—————————- ——-
15 rows in set (0.02 sec)
So I am doing a full index scan. This should work for version above 4.0.9 (when FORCE INDEX was introduced, USE does not appear to be enough).
Rob Wultsch
29 Jun 08 at 5:33 pm
*MySQL did not want to consistently do an index scan after I did an analyze table…
Rob Wultsch
29 Jun 08 at 5:35 pm
I think dealing with the “gaps” case is important – otherwise you can simply do the IN (…) solution posted above or that other approach linked on planetmysql by assumming actor_id
ak47
30 Jun 08 at 5:49 am
In this example, surely adding ‘AND actor_id BETWEEN 1 AND 21′ to the where would do the trick? I appreciate this isn’t a blanket solution, but solves the immediate problem…
GBA
30 Jun 08 at 7:46 am
I should note that this particular example is just a simple case. The tool I want to use this for (mk-archiver) is general-purpose and its existing techniques will work on a table whose primary key is any number of columns with any datatype. For that matter, it doesn’t even have to use a primary key, it can use any index. That said, if there’s a special optimization that it could apply to a single-column auto_increment primary key, that might be worth doing. But that still needs to work when there are gaps.
My current thinking is something like this: if you want N rows back from the table, where potentially M > N rows might have to be scanned but you don’t want to scan the entire rest of the table in case there are no more rows that match at all (i.e. you are archiving 100 rows at a time in a million-row table and you expect that you only need to archive 50 chunks oldest-first before quitting) you can do something approximate — for example, WHERE … AND pk_column < @current_row 1000 LIMIT 100
Xaprb
30 Jun 08 at 8:16 am
Well, if you don’t require the solution to be a single SQL statement than your tool could use HANDLER to limit the rows returned. With HANDLER you can specify exactly what index you are retrieving from, and then execute HANDLER READ NEXT (without a where clause) until you decide you don’t want to archive any more. The cost of course is more network traffic (since you can’t PREPARE them) but the reads are very fast and non-blocking for MyISAM and InnoDB.
I’d have to know more about what you’re trying to do to know if it would actually fit, and it’s probably a long shot anyhow. But HANDLER is a very underused statement compared to its value, especially when you see people abusing cursors to get the same effect… but I digress. Good luck with your archiving solution.
Ryan Thiessen
1 Jul 08 at 12:35 am
That’s an excellent point. I had a TODO note to build some code that could figure out when HANDLER is appropriate to use. It has some limitations of course.
Xaprb
1 Jul 08 at 12:55 am
The biggest problem is that even though you’re looking at an index on a column, what you really want is an index on a function of a column — in this case, you want an index on (actor_id % 5).
As this is for an external tool I do not know how much schema changing you can do, nor do I know how much performance you want to sacrifice by having a copy of a table where you can index what you *really* want to index.
Subqueries won’t help because MySQL doesn’t support LIMIT in subqueries; what you’d really like is something like
explain select actor_id
from actor
where actor_id in (select actor_id from actor limit 20)
and actor_id % 5 = 0;
but that’s not possible with MySQL just yet.
Another way to look at it is using relational calculus. What you want is the intersection of 2 tables:
SELECT tbl1.actor_id FROM actor as tbl1 WHERE actor_id % 5 = 0 limit 5;
SELECT tbl2.actor_id FROM actor as tbl2 LIMIT 20;
And you can’t really have 2 LIMITs in one query.
Hacks like the max_concat trick might work. Another way to look at it is to see the query as “the top x values in a group” except it’s not the “top” x values.
If this were a client I’d recommend having a separate column that’s actor_id % 5 and put an index on that; although given the generic nature of your issue, that may not be the rightest solution.
Sheeri Cabral
1 Jul 08 at 6:32 am
Sheeri:
Your probably aware, but I think it is worth note just in case you are not that Postgres supports indexs on a function of a column.
http://www.postgresql.org/docs/8.3/interactive/indexes-expressional.html
IIRC Oracle supports something similar as well.
Rob Wultsch
1 Jul 08 at 8:03 am
Rob — in fact, I was aware. That’s why I said MySQL doesn’t have it “yet”, because indexes on functions and other similar features that stem from materializing data is something I think people will request/demand/pay for in MySQL sooner rather than later.
Sheeri Cabral
1 Jul 08 at 1:05 pm
the corect code is
SELECT * FROM pregunta order by idpre desc limit 0,10
Diseño Web Peru
12 Oct 08 at 4:40 pm