How to optimize subqueries and joins in MySQL
I have written before about using joins instead of subqueries, especially for NOT IN queries, which can usually be rewritten as exclusion joins — sometimes with huge efficiency gains. In this article I’ll look more closely at the performance characteristics of a few queries I’ve optimized in MySQL 5.0.3. I’ll also show you some tricks you can use to get MySQL to optimize queries better when you know it’s being inefficient.
Updates in a join
I wrote recently about the theoretical problems caused by UPDATE statements with FROM clauses (Many-to-one problems in SQL). This particular query shows the performance difference between a “correct” query and a “bad, nonstandard” query I wrote recently at work.
The application for this query is to update a table with aggregated clicks per day from a table of click-tracking data for online advertising. I have pre-populated the aggregate table, with one row per ad per day, over the time period I’m interested in. I have a table called tracking that holds non-aggregated click data. Here are the simplified table structures:
create table aggregate(
day date not null,
ad int not null,
clicks int,
primary key(day,ad)
) type=InnoDB;
create table tracking (
tracking int not null auto_increment primary key,
day date not null,
ad int not null,
clicktype enum('real','fraud','unknown') not null,
clicks int not null,
unique index(day, ad, clicktype)
) type=InnoDB;
Since the tables are InnoDB, the clustered index is the primary key. Notice two things about the tracking table: it has a surrogate key so it’s not clustered on the natural primary key (day, ad, clicktype), and clicks are separated out by clicktype, so I’ll have to aggregate the clicks I want. These are the design constraints I have to work with; I didn’t design the table.
For this article, I populated my tables with some pseudo-random data. First, I filled my aggregate table with some ads and days over which I want to work. I want to aggregate ads 1 through 50 over the time period 2005-01-01 through 2005-01-10, so that’s 500 rows in the aggregate table. The tracking table is moderately large. I filled it with dates from 2003-04-07 to 2006-01-01 (1,000 days) for ads 1 through 1000, with one row for each click type. That’s a total of three million rows. The clicks are random numbers between 1 and 100.
Given a numbers table with at least 1,000 rows, here are scripts to populate the tables. The second query will probably take a while to run, and will create a medium-sized chunk of data, so don’t think something is wrong when it keeps on grinding (15 minutes isn’t unreasonable). By the way, it’s not a good idea to run these queries on a production server!
insert into aggregate(day, ad)
select date_add('2006-01-01', interval a.i - 1 day), b.i
from number as a
cross join number as b
where a.i <= 10 and b.i <= 50;
insert into tracking(day, ad, clicktype, clicks)
select date_add('2003-04-07', interval a.i - 1 day),
b.i,
c.type,
rand() * 100
from number as a
cross join number as b
cross join (
select 'real' as type
union all select 'fraud'
union all select 'unknown'
) as c
where a.i <= 1000 and b.i <= 1000;
Now that the tables are set up, I’ll move on to the queries. Since MySQL doesn’t provide really good tools to profile queries, I ran these queries several times, disregarding the first run because it may not have been cached the same way as subsequent runs. The first query is the “bad, evil” way to do it. The second is the official, standards-compliant way.
Notice that the first query is a join, but the second query is a dependent (correlated) subquery, where the subquery refers to values in the enclosing query. They are equivalent, but the query plan is likely to be completely different. My goal here is to measure the performance characteristics of the two methods.
update aggregate as a
inner join (
select day, ad, sum(t.clicks) as c
from tracking as t
where t.clicktype in ('real', 'unknown')
group by day, ad
) as t on a.ad = t.ad and a.day = t.day
set clicks = c;
update aggregate as a
set clicks = (
select sum(t.clicks)
from tracking as t
where t.day = a.day
and t.ad = a.ad
and t.clicktype in ('real', 'unknown')
);
The first query runs in 96.4 seconds on my machine, plus or minus .15 seconds. The second runs in 0.02 seconds consistently. That’s a pretty big performance difference. Since these are UPDATE statements, I can’t EXPLAIN them and see exactly what the optimizer is doing differently, but I know one thing: the first query fails entirely when the tracking table has 30 million rows instead of 3 million, because it runs out of space for locks.
I have seen cases where the join query is slightly less efficient on small sets of data, but more efficient on larger sets; depending on the data, the probing strategy might not scale very well as the size of the outer table grows.
The queries also perform differently depending on whether the values in aggregate are 50 ads and 10 days, or 5 ads and 100 days. This is because the indexes on tracking are leftmost-prefixed on day, not ad. Your mileage will vary. For example, when I insert 5,000 rows (10 days, 500 ads) into aggregate instead of 50, the joined query takes 104.45 seconds, but the subquery takes 0.04. When I insert 50,000 rows (1,000 days, 500 ads), the joined query runs in 98.71 seconds, and the subquery in 3.87. The performance depends heavily on the characteristics of the data, the way I’m aggregating, indexes, and so on.
If I were being really scientific, I’d run vmstat and iostat and count I/O and other statistics too, to see how the queries access the tables differently, but I leave that as an exercise for the reader.
In the query at work, I ended up using the join. It’s more efficient on that specific data. I was actually surprised when I wrote the test queries for this article and found the subquery performing so much better. The difference could be due to any number of things — data, indexes, the fact we haven’t optimized the tables at work because they’re too big to touch, different architecture (Xeon vs. my AMD64), server configuration, memory, disk speed… who knows.
Moral of the story: try different ways of doing the same thing!
Badly optimized subqueries
I have mentioned before, without going into specifics, that using joins instead of subqueries can be hugely more efficient. Even though two queries might mean the same thing, and even though you’re supposed to tell the server what to do and let it figure out how, sometimes you just have to tell it how. Otherwise, the query optimizer might choose a really stupid query plan. I ran into an example of this recently. I have a three-level hierarchy of tables: category¸ subcategory, and item. There are a few thousand rows in category, a few hundred thousand in subcategory, and millions in item. You can disregard the category table from now on; I mentioned it for context, but it’s not used in the queries. Here are the table create statements:
create table subcategory (
id int not null primary key,
category int not null,
index(category)
) engine=InnoDB;
create table item(
id int not null auto_increment primary key,
subcategory int not null,
index(subcategory)
) engine=InnoDB;
I’ll fill the tables with some sample data again:
insert into subcategory(id, category)
select i, i/100 from number
where i <= 300000;
insert into item(subcategory)
select id
from (
select id, rand() * 20 as num_rows from subcategory
) as x
cross join number
where i <= num_rows;
create temporary table t as
select subcategory from item
group by subcategory
having count(*) = 19
limit 100;
insert into item (subcategory)
select subcategory
from t
cross join number
where i < 2000;
Again, these queries may take a while to complete, and are not suitable for a production machine. The idea is to insert pseudo-random numbers of rows into item, so each subcategory has between 1 and 2018 items. It’s not a completely realistic distribution, but it’ll do.
I want to find all subcategories containing more than 2000 items in a given category. First, I find a subcategory with more than 2000 items, then use its category in the rest of the queries that follow. Here are queries that will do this:
select c.id
from subcategory as c
inner join item as i on i.subcategory = c.id
group by c.id
having count(*) > 2000;
-- choose one of the results, then
select * from subcategory where id = ????
-- result: category = 14
Now that I have found a suitable value (14), I’ll use it in subsequent queries. Here’s my query to find all subcategories with more than 2000 items in category 14:
select c.id
from subcategory as c
inner join item as i on i.subcategory = c.id
where c.category = 14
group by c.id
having count(*) > 2000;
In my specific data, there are 10 rows in the results, and the query finishes in a couple of seconds. EXPLAIN shows good index usage; it’s not a bad query at all, considering the table size. The query plan is to just run through ranges on some indexes and count the entries. So far, so good.
Now suppose I wanted to grab all columns from subcategory. I could join the results of the above query as a subquery, or select MAX or some such (since I know the values will be unique over the grouping set), but I could also do the following, right?
select * from subcategory
where id in (
select c.id
from subcategory as c
inner join item as i on i.subcategory = c.id
where c.category = 14
group by c.id
having count(*) > 2000
);
This query will finish about the time the Sun turns into a brown dwarf and swallows the Earth. I don’t know how long it will take because I’ve never felt like letting it run indefinitely. You’d think, from looking at the query, that it would a) evaluate the inner query and find the 10 values, b) go find those 10 rows in subcategory, which should be really fast to look up by their primary keys. Nope. Here’s the query plan:
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: subcategory
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 300783
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: c
type: ref
possible_keys: PRIMARY,category
key: category
key_len: 4
ref: const
rows: 100
Extra: Using where; Using index; Using temporary; Using filesort
*************************** 3. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: i
type: ref
possible_keys: subcategory
key: subcategory
key_len: 4
ref: c.id
rows: 28
Extra: Using index
If you’re not familiar with analyzing MySQL query plans, here’s the synopsis: MySQL has decided to run the query from the outside in, not the inside out. I’ll look at each part of the query in turn.
The outer query simply becomes select * from subcategory. Even though there’s a limitation on subcategory in the inner query (WHERE category = 14), MySQL isn’t applying that filter to the outer query for some reason. I don’t know why. All I know is it’s doing a table scan (that’s what type: ALL means), and not using any indexes. That’s a table scan over several hundred thousand rows.
For each row in the outer query, it’s performing the inner query, even though there are no references in the inner query to values in the enclosing scope, because it has “optimized” the inner query by rewriting it to refer to the outer query. At this point, the query plan becomes nested loops. For each loop in the outer query, the query probes into the inner query. Here is the query plan, after the optimizer rewrites it:
select * from subcategory as s
where <in_optimizer>(
s.id,<exists>(
select c.id
from subcategory as c
join item as i
where ((i.subcategory = c.id) and (c.category = 14))
group by c.id
having ((count(0) > 2000)
and (<cache>(s.id) = <ref_null_helper>(c.id))))
)
You can get the optimized query from EXPLAIN EXTENDED followed by SHOW WARNINGS. Notice the reference to the outer scope in the HAVING clause.
I’m not bringing this up to bash MySQL’s optimization strategy. It’s pretty common knowledge that MySQL doesn’t yet optimize subqueries very well in some cases, and this particular problem is widely reported. I’m just pointing out that it is up to the programmer to check queries and make sure they aren’t badly optimized. In most cases, it’s safer just to stay away from subqueries if they’re not needed — especially WHERE... IN() or WHERE... NOT IN() queries.
My new rule for myself is “when in doubt, EXPLAIN the query.” If it’s a big table, I’m automatically doubtful.
How to force the inner query to execute first
The query in the preceding section suffers because MySQL executes it from the outside in as a correlated subquery, instead of from the inside out without correlation. It’s possible to get MySQL to execute the inner query first, materialized as a temporary table, and avoid the huge performance penalty.
MySQL materializes subqueries in the FROM clause (commonly, and somewhat misleadingly, known as derived tables). This means MySQL executes the inner query first and saves the results in a temporary table, then uses it in the rest of the table. This is exactly the behavior I wanted to happen when I wrote that query! Here’s the modified query:
select * from subcategory
where id in (
select id from (
select c.id
from subcategory as c
inner join item as i on i.subcategory = c.id
where c.category = 14
group by c.id
having count(*) > 2000
) as x
);
All I did was wrap the subquery in another subquery. MySQL thinks there’s a dependent subquery, but now it’s only the wrapper query, which is probing into a temporary table with only a few rows, so it is fast anyway. At this point this is a silly optimization; it would probably be better to just rewrite the query as a join. Among other things, that would avoid the danger of someone noticing the wrapper subquery is superfluous and “cleaning up the code.”
This optimization can be used in a number of ways, for example to prevent MySQL from complaining about a subquery selecting data from a table being modified elsewhere in the query. Unfortunately, it doesn’t work to get around the restriction about temporary tables only appearing once in a query.

The bad optimization of sub-queries by MySQL is something that they don’t plan on fixing until MySQL 5.2, which is scheduled to be released as production code in 2007. Personally, I feel that this issue should be resolved much earlier than that. If you agree, add your comments to bug 9090.
voiperster
8 May 06 at 6:51 am
I was having this problem too, but my problem went deeper, as I needed two tables joined, and I need the data out of them – I couldn’t see how to make your code work in that scenario.
My SQL looked like this:
This recursively broke the mysql engine, as I have 100000 records in both tables (I’m trying to build a table from a name-value pairs table BTW). All the keys in the world didn’t optimize the subqueries.
Finally after much trial and error I found creating a view was the answer. A view describing the two subqueries now has an execution plan with only one table scan, and additional columns can be added without much performance hit. Full SQL now looks like:
Interestingly temporary tables didn’t work quite as well (still wanted a full row scan – but I may have missed a key), but I didn’t like them anyway as they are not dynamically updated.
Hope that helps someone!
Anonymous
18 Jun 06 at 6:02 am
Thank you for the walk through. I am in the process of creating an image database much like your sub-categories and items description. So far I have avoided sub queries, but I think I will be using them soon. I did not even know that MySQL would do it any other way than outside in. I am going to print this out to use as a reference as I move forward. If you have found any other discussion on the optimization of sub-queries I would be very interested. Thanks!
Peter
29 Nov 06 at 1:07 pm
Hey, I’m really enjoying your blog and have had success using this particular trick to optimize the subquery execution. Now I’m having a related issue, although not identical. Your feedback would be greatly appreciated.
I’m in MySQL 4.1.12a
There are three tables: interval_span, impression, and ad_unit. The idea is that the impression table has records that need to be counted based primarily on their network_id and ad_unit_type_id (which is retrieved via a join to ad_unit), and secondarily on the view_time being within the applicable date range.
Some table structure details:
Now, to the meat of it… Here’s the update query we’re attempting:
This is the smallest date range I can set up (interval_span begin/end date ranges are 30 minutes long). This update query returns: Query OK, 3 rows affected (2 min 13.50 sec)
Rows matched: 70 Changed: 3 Warnings: 0
I need to get this time WAY down. Here are some notes I took while testing:
1. Running inner select count query with constant values takes 2.2 sec
2. Running whole update without inner join on ad_unit takes much longer
3. Creating the impression_idx_2 index on impression(view_time, network_id) reduced execution by 10%
Ok, here’s some more notes, that hopefully will lend more insight. In order to try to use mysql’s explain tool, I re-wrote the query as a select, it runs in almost identical time, so I think I’ve got it right, but haven’t figured out how to optimize it further:
Any insight you could lend would be GREATLY appreciated! I’ve been working on this for quite some time. Please feel free to publish this note and your response, I’ve found your responses to others very helpful, I think this could help others as well.
Thanks,
Lon
Lon B.
6 Dec 06 at 6:17 pm
Hi. Im trying to run your example,
but when I populate the tables with the inserts, i get this error:
ERROR 1146 (42S02): Table ‘vy_test.number’ doesn’t exist
I tried in Mysql 4.1.20 and 5.0.27-standard.
any help appreciated!
Tim
27 Feb 07 at 5:42 am
I should have referred you to the numbers table in the article’s text, but apparently I forgot to link to it.
Xaprb
27 Feb 07 at 8:32 am
Awesome! this is the best mysql blog ive seen. thanks for taking the time to write.
Tim
27 Feb 07 at 3:33 pm
[...] Google tells me that this issue has been reported since 2004 in early version of MySQL like 4.x. The problem was not focused in any of the updates afterward. After that, I read the article on the topic How to optimize subqueries and joins in MySQL [...]
Great Technical Skills · MySQL subquery bad performance
7 Jun 07 at 4:16 am
Hello, just wanted to say thanks for this article (btw, currently #10 googling “dependent subquery” ;-)!
The SQL programming is a new field to me (being a programmer for two decades in other fields), and I just happened to get into the problem explained here yesterday.
Even if you say the problem ‘MySQL doesn’t yet optimize subqueries very well’ is common knowledge, nobody cares to state the problem along with good solutions, except your article which is also well readable to someone not being SQL guru.
Here is my query with the problem (find the most recent items’ details for each ‘cid’ value in a table):
SELECT cid,txt FROM gpoints WHERE id IN
(SELECT MAX(id) FROM gpoints GROUP BY cid)
With 13800 entries in the table, and 13 results for the inner subquery, this query ran for over 15 minutes.
EXPLAIN showed me the cause, but I found no other way around, so I separated this into two queries (inner query first, then execute the outer query with the IN() list populated with values).
Now your solution makes my query run in 0.05sec:
SELECT cid,txt FROM gpoints WHERE id IN
( SELECT mi FROM
(SELECT MAX(id) AS mi FROM gpoints GROUP BY cid)
AS x)
(As an outsider, I just cannot understand why MySQL doesn’t have a way to specify evaluation order for subqueries like this when the optimizer cannot recognize the 10000-fold time difference…)
BB
10 Jul 07 at 3:27 am
Good Day,
I have one question i’m eager to know.
How can we set a field of a table to be a clustered index using MYSQL?
Thx
stephen
31 Jul 07 at 1:11 am
The only storage engine that offers clustered indexes at this time is InnoDB. The primary key is automatically clustered. You cannot change which index is clustered; you cannot make a primary key without making it clustered; and if you don’t create a primary key on an InnoDB table, it creates an invisible 6-byte primary key and clusters on it.
Xaprb
31 Jul 07 at 7:26 am
Just found this article and your explanation has turned my 45 second query into a 0.1 second query. :)
DaveO
20 Sep 07 at 10:44 am
Was given a script today that took approximately 40 minutes to run. It now runs in under a minute. Thanks to this form of query optimization. Many more issues to hammer out but this was a huge one. Thanks a ton!
Nolan Hurlburt
28 Sep 07 at 1:36 pm
[...] http://www.xaprb.com/blog/2006/04/30/how-to-optimize-subqueries-and-joins-in-mysql/ [...]
Re: bookmarks and keywords | Bla.es
1 Oct 07 at 5:21 am
[...] en glimrende artikel om et lille MySQL [...]
p2e.dk/noter » Blog Archive » Optimering af subselects
21 Nov 07 at 5:40 pm
This is ENORMOUSLY helpful. I was wondering why my subqueries seemed to be re-executing the subquery for each row of the outer query, even though there were no dependencies. Thanks!
Chris Rogers
17 Nov 08 at 11:43 pm
Thank you for your very readable article – I had a bad feeling about the results that I was getting using a subquery on a system that i am developing – hopefully your pointers will get my worries resolved.
Best wishes!
Victor
18 Feb 09 at 8:12 am
Thanks for the post!
I used your technique to force the inner query to execute first of subqueries, and it works most of the time. I have a few that are still very slow and are being rendered in temporary tables on disk, which I assume is the problem. I seem to have plenty of memory available and I increased the tmp_table_size and max_heap_table_size to avoid this, without success. It still seems to be going to disk.
Any ideas on how I can force it to go into memory?
Matt
12 Mar 09 at 11:33 am
[...] might constuct a correlated subquery. The end result of your efforts may be ten lines of code, but read the mysql tips here first, then consult the enhancements in mysql 6.0 or review Microsoft’s [...]
The Art of Database Adminstration | Heroix Blog
10 Apr 09 at 10:48 am
Thanks for the great post – even though it’s 3 years old as of this comment. I was running into super lengthy query times because of this exact problem and worked around it by splitting it into 2 queries and creating a temporary table. Which is the same result as the sub-sub query that you mention here. I was just wanting to know why the original query was so badly optimized in the first place – now I know!
codesmith
2 May 09 at 11:34 pm
Thank you for the post. The sub-sub query did the trick. I was getting 17K% Filtered on my explain extended because my nested table was touching the same table my outer query was joining on. The sub-sub query took response times on my query down from 6s to ~10ms. Great article!
Clay vanSchalkwijk
1 Jun 09 at 11:56 am
Thanks for the post. really good one.
I have the following performance problem problem:
The tables:
CREATE TABLE `tasks` (
`id` bigint(4) unsigned NOT NULL auto_increment,
`id_cpes` int(10) unsigned default NULL,
`priority` tinyint(2) NOT NULL default ‘50′,
`scheduled_time` datetime NOT NULL,
`status` enum(’Not Started’,'In Progress’),
PRIMARY KEY (`id`),
INDEX `idx_tasks` USING BTREE (`priority` ASC, `scheduled_time` ASC, `status`
ASC),
KEY `id_cpes` (`id_cpes`)
) ENGINE=InnoDB;
CREATE TABLE `cpes` (
`id` int(10) unsigned NOT NULL auto_increment,
`jacs_name` varchar(20) default NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `cpe_id` (`cpe_id`),
KEY `jacs_name` (`jacs_name`)
) ENGINE=InnoDB;
I have performance problem with the following query:
SELECT tasks.id, tasks.priority, tasks.id_cpes FROM tasks JOIN cpes ON tasks.id_cpes = cpes.id WHERE (tasks.status=7 OR tasks.status=1) AND cpes.jacs_name=’JACS_1′ AND tasks.scheduled_time<=’2009-06-15 14:17:35′ ORDER BY tasks.priority ASC, tasks.scheduled_time
the explain for this query is:
+—-+————-+——-+——+——————-+———–+———+————–+——+———————————————————–+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——-+——+——————-+———–+———+————–+——+———————————————————–+
| 1 | SIMPLE | cpes | ref | PRIMARY,jacs_name | jacs_name | 23 | const | 1 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tasks | ref | id_cpes | id_cpes | 5 | jrms.cpes.id | 2 | Using where |
+—-+————-+——-+——
The tasks table is very big one, and I want to use the idx_tasks index, but it looks that because of the join it doesn’t do it.
Do you have a suggestion?
Regards, Yariv
yariv
15 Jun 09 at 12:37 pm
[...] How to optimize subqueries and joins in MySQL at Xaprb (tags: programming development article webdev performance sql join database) [...]
links for 2009-07-11 | NeXt
11 Jul 09 at 10:34 am
Hi,
I select the value of a table using type and I also want the mysql query to its next and previous type all this has to written in single query
select * from tb1 where type=’All calculators’,id=(all cal+1)
Is there is any option to do all this in one query.
Math
16 Oct 09 at 12:17 pm