This is the third in a series of articles on profiling queries in MySQL (the second of two demonstrations of profiling techniques, but the third article overall). In this article I’ll present the same example as in the second article, but use a different approach to show details I didn’t include.
An astute reader’s comment reinforced my vague unease at my second article. In hindsight, I see I got distracted trying to choose an interesting example that’s not too hard to present in one article, yet has enough depth to usefully demonstrate the technique. It’s harder than I thought it would be. In real life I’ve learned from dozens of cases, and cramming everything into one example is probably not possible. In any case, I’m going to approach the same query from a different angle in this article, so you get a more complete picture.
Method and results
For the tests in this article, I restarted MySQL, then ran the query twice against the first table. The first run was ‘cold’ — no data or indexes had been read into memory yet. The second was ‘warm’ and should perform just as it did the last time I profiled. I measured both runs and found what I expected. So far, so good.
Then something really surprising happened. I had dropped the tables I used in the second article, and re-created them the same way for this article, but the queries against the redesigned table (clustered date-first, to optimize queries on a date range) performed almost identically to the queries against the table with a surrogate key. What happened?
I’m not sure, actually. For some reason, MySQL’s optimizer decided to use the client index, which is the same strategy as it used on the initial table design — in fact, the query plan was identical. I double-checked all the data and table structures to be sure, re-analyzed the table, and tried again; no dice. It still wanted to use a sub-optimal query plan! In the end, I rewrote the query with FORCE INDEX to make it use the primary key instead of the client index, and got good performance as I expected. If anyone has insight as to why this non-repeatable result happened, please leave a comment.
Here are all three sets of numbers. The three sets are the query against the initial table design, the redesigned table, and the redesigned table with FORCE INDEX. As I said, I ran each query twice: once cold, once warm.
| Variable_name | Before Redesign | After Redesign | Redesign, FORCE INDEX |
|||
|---|---|---|---|---|---|---|
| Cold | Warm | Cold | Warm | Cold | Warm | |
| Com_select | 1 | 1 | 1 | 1 | 1 | 1 |
| Created_tmp_tables | 1 | 1 | 1 | 1 | 0 | 0 |
| Handler_read_key | 1522 | 1522 | 1522 | 1522 | 2 | 2 |
| Handler_read_next | 49881 | 49881 | 49881 | 49881 | 0 | 0 |
| Handler_read_rnd | 31 | 31 | 31 | 31 | 31000 | 31000 |
| Handler_read_rnd_next | 32 | 32 | 32 | 32 | 0 | 0 |
| Handler_update | 1488 | 1488 | 1488 | 1488 | 0 | 0 |
| Handler_write | 31 | 31 | 31 | 31 | 0 | 0 |
| Innodb_buffer_pool_read_ahead_rnd | 1 | 1 | 2 | 2 | 1 | 0 |
| Innodb_buffer_pool_read_ahead_seq | 168 | 168 | 151 | 151 | 4 | 0 |
| Innodb_buffer_pool_read_requests | 204739 | 204733 | 204522 | 204361 | 4117 | 3963 |
| Innodb_buffer_pool_reads | 88 | 91 | 138 | 100 | 10 | 0 |
| Innodb_data_read | 45629440 | 45678592 | 42106880 | 41517056 | 1343488 | 0 |
| Innodb_data_reads | 257 | 260 | 295 | 255 | 16 | 0 |
| Innodb_pages_read | 2785 | 2788 | 2570 | 2534 | 82 | 0 |
| Innodb_rows_read | 49881 | 49881 | 49881 | 49881 | 31001 | 31001 |
| Select_range | 0 | 0 | 0 | 0 | 1 | 1 |
| Sort_rows | 31 | 31 | 31 | 31 | 0 | 0 |
| Sort_scan | 1 | 1 | 1 | 1 | 0 | 0 |
| Table_locks_immediate | 1 | 1 | 1 | 1 | 1 | 1 |
Analysis
You can see the queries that use the client index perform almost identically to each other. They use the same query plan, build the same temporary table, and so on. The main difference is there’s a different amount of data in the table and indexes after the redesign:
| Before Redesign | After Redesign | |
|---|---|---|
| Data Size | 45678592 | 40452096 |
| Index Size | 53067776 | 39944192 |
Using the surrogate key is less space-efficient in this case, so the redesigned table is smaller. However, each index is smaller in the table with the surrogate key, because the primary key is not as wide. If I had to guess, I wouldn’t know whether this would result in more or less data being read, which is why I don’t guess, I measure. It turns out InnoDB reads the same number of rows, but they fit in fewer pages after re-indexing, so it reads a couple hundred fewer pages. Still, either of the queries using the client index reads about 40 MiB of data, whether it’s run cold or warm.
The query that scans a range of the primary key reads 1.28 MiB cold, and zero when it’s warm — 82 page reads instead of thousands. It also makes many fewer requests to the buffer pool. And yet, it reads about 60% the number of rows. It’s just that these rows are contiguous within the table, and therefore much more efficient to read. It only makes two index reads, which is great. I assume these two are to find the beginning of the date range in the primary key. After that it just scans every row till it finds the end, which is why Handler_read_rnd is high — 31,000 in fact, which is the number of rows in the date range:
select count(*) from tracking where day between '2007-01-01' and '2007-01-31'; +----------+ | count(*) | +----------+ | 31000 | +----------+
This is one less than the number of rows InnoDB reports reading. I believe this is because InnoDB read an extra row, the one past the end of the date range, to determine where to stop scanning.
Finally, here are the Last_query_cost variables again. As before, the query optimizer thinks the clustered index scan is more expensive, but it’s wrong.
| Before Redesign | After Redesign | Redesign, FORCE INDEX |
|---|---|---|
| 21247.5 | 10526.9 | 86457.133551 |
Conclusion
This article measured the differences between running the query cold, and running it with the server warmed up and the data already in memory. I think this actually accentuates the second table design’s improved query performance, because even when it needs to read data and indexes from the disk, it doesn’t have to read as much data. I think I’ve explored the full depth of this example now.
Thanks for the feedback, and keep those comments coming! I’m by no means the expert on all this, so I hope you’ll teach me what you know. I’m still putting the finishing touches on a tool to profile queries easily, and will post an article on that shortly.
In the meantime, you can subscribe via e-mail or feeds to be notified when I do.
Technorati Tags:No Tags
I think there are several issues to consider here. First off, note that
the indexes used in step 1 and in step 2 of your previous article are
subtly different, with important consequences:
The table as defined at the start of the article has (amongst others) an
index like so:
key ad (ad, day)
But the table in step 2 uses
primary key (day, ad)
Note the change of order of variables! This has grave consequences, as
the MySQL manual tells us: “MySQL cannot use a partial index if the columns do not form a leftmost prefix of the index.“
Given that the WHERE clause restricts on day but not on ad, the index in
example 1 just cannot be used — period. In contrast, the index in example 2 could be used, since day is in fact a leftmost prefix of the primary key.
However, it will in fact not be used, as you mention in this article; my
own experience supports this observation. Why this is so is not clear to
me, either, because the manual explicitly says that BETWEEN clauses are
OK for using an index, and it also says that MySQL will “normally” use
the most selective index that finds the smallest number of rows. (In my
case, index client would select 36286 rows, but the index on (day,ad)
would select 24880, which is substantially less.)
Further experimentation with the paramaters of the query yields an
interesting result, though. Your original query used the WHERE clause
where client=11 and day between ‘2007-01-01′ and ‘2007-01-31′
When I tighten the date range to
where client=11 and day between ‘2007-01-01′ and ‘2007-01-21′
(note: only 21 instead of 31 days), I still get the same result: the
primary key will not be used. However, when I take a 20 day range
where client=11 and day between ‘2007-01-01′ and ‘2007-01-20′
then MySQL suddenly starts to use the primary key, as described in your
previous article!
A closer look at the rows considered in these cases shows:
So, strangely, MySQL considers 15305 less than 36286 (sounds plausible
to me…), but 15954 greater or equal than 36286. Interesting!
(By the way, since the table is filled with some degree of randomness,
your mileage may vary. But the results should be in comparable ranges.)
What becomes apparent is that the result is heavily influenced by the
optimizer’s idea of the current key distribution. For this reason,
ANALYZE TABLE is always A Good Thing after many inserts and/or deletes
on MyISAM tables. (For InnoDB, this should not be necessary.)
Now, by specifying USE INDEX(primary) as suggested by you in this
article (FORCE INDEX is a bit stronger but in fact not needed here), you
make up for some, ahem, non-standard comparison algorithm used by the MySQL query optimizer. However, contrary to what the article suggests, this trick will work both on the original table (with the artificial
primary key on id) and the modified one.
So, the message is not just “define a good primary key” (although that
is always a good idea) or “bad primary keys can prevent use of good
other indexes”, but rather threefold:
the individual fields.
the order on natural numbers.
Hi Gisbert, thanks for writing in. You are right about the (day,ad) and (ad,day) change — I may not have called that difference out clearly. It was crucial to getting better performance. I’ve written a lot about that in the past.
MySQL will choose a full scan in certain cases even when another index would be more selective. The reason? It’s cheaper to scan the data itself than to probe and index and look up, probe and look up, on and on. With InnoDB, secondary indexes are quite expensive compared to the primary key, because once you find an entry in the secondary index you only have a tuple from the primary key, which you have to navigate to find the actual row.
The ratio is not exactly clear from the documentation, but is said to be about 20 to 30 percent. In other words, if you have 100,000 rows and an index will select 35,000 of them, it’s probably cheaper to just scan the whole table, and MySQL will do that instead of using the index. I should look in the source code to find out where that decision is made, and maybe make it clearer what the magic number is (it also depends on statistics, which are generated from 8 random b-tree dives and so can be wildly inaccurate sometimes).
MySQL has now
acknowledged the strange preference for certain indexes as a bug.