How to optimize subqueries and joins in MySQL
Posted in Databases on Apr 30, 2006
I have written before about using joins instead of subqueries, especially for
NOT IN queries, which can usually be rewritten as exclusion joins – sometimes with huge efficiency gains. In this article I’ll look more closely at the performance characteristics of a few queries I’ve optimized in MySQL 5.0.3. I’ll also show you some tricks you can use to get MySQL to optimize queries better when you know it’s being inefficient.
I wrote recently about the theoretical problems caused by
UPDATE statements with
FROM clauses (Many-to-one problems in SQL). This particular query shows the performance difference between a “correct” query and a “bad, nonstandard” query I wrote recently at work.
The application for this query is to update a table with aggregated clicks per day from a table of click-tracking data for online advertising. I have pre-populated the
aggregate table, with one row per ad per day, over the time period I’m interested in. I have a table called
tracking that holds non-aggregated click data. Here are the simplified table structures:
create table aggregate( day date not null, ad int not null, clicks int, primary key(day,ad) ) type=InnoDB; create table tracking ( tracking int not null auto_increment primary key, day date not null, ad int not null, clicktype enum('real','fraud','unknown') not null, clicks int not null, unique index(day, ad, clicktype) ) type=InnoDB;
Since the tables are InnoDB, the clustered index is the primary key. Notice two things about the
tracking table: it has a surrogate key so it’s not clustered on the natural primary key (day, ad, clicktype), and clicks are separated out by clicktype, so I’ll have to aggregate the clicks I want. These are the design constraints I have to work with; I didn’t design the table.
For this article, I populated my tables with some pseudo-random data. First, I filled my
aggregate table with some ads and days over which I want to work. I want to aggregate ads 1 through 50 over the time period 2005-01-01 through 2005-01-10, so that’s 500 rows in the
aggregate table. The
tracking table is moderately large. I filled it with dates from 2003-04-07 to 2006-01-01 (1,000 days) for ads 1 through 1000, with one row for each click type. That’s a total of three million rows. The clicks are random numbers between 1 and 100.
Given a numbers table with at least 1,000 rows, here are scripts to populate the tables. The second query will probably take a while to run, and will create a medium-sized chunk of data, so don’t think something is wrong when it keeps on grinding (15 minutes isn’t unreasonable). By the way, it’s not a good idea to run these queries on a production server!
insert into aggregate(day, ad) select date_add('2006-01-01', interval a.i - 1 day), b.i from number as a cross join number as b where a.i <= 10 and b.i <= 50; insert into tracking(day, ad, clicktype, clicks) select date_add('2003-04-07', interval a.i - 1 day), b.i, c.type, rand() * 100 from number as a cross join number as b cross join ( select 'real' as type union all select 'fraud' union all select 'unknown' ) as c where a.i <= 1000 and b.i <= 1000;
Now that the tables are set up, I’ll move on to the queries. Since MySQL doesn’t provide really good tools to profile queries, I ran these queries several times, disregarding the first run because it may not have been cached the same way as subsequent runs. The first query is the “bad, evil” way to do it. The second is the official, standards-compliant way.
Notice that the first query is a join, but the second query is a dependent (correlated) subquery, where the subquery refers to values in the enclosing query. They are equivalent, but the query plan is likely to be completely different. My goal here is to measure the performance characteristics of the two methods.
update aggregate as a inner join ( select day, ad, sum(t.clicks) as c from tracking as t where t.clicktype in ('real', 'unknown') group by day, ad ) as t on a.ad = t.ad and a.day = t.day set clicks = c; update aggregate as a set clicks = ( select sum(t.clicks) from tracking as t where t.day = a.day and t.ad = a.ad and t.clicktype in ('real', 'unknown') );
The first query runs in 96.4 seconds on my machine, plus or minus .15 seconds. The second runs in 0.02 seconds consistently. That’s a pretty big performance difference. Since these are
UPDATE statements, I can’t
EXPLAIN them and see exactly what the optimizer is doing differently, but I know one thing: the first query fails entirely when the
tracking table has 30 million rows instead of 3 million, because it runs out of space for locks.
I have seen cases where the join query is slightly less efficient on small sets of data, but more efficient on larger sets; depending on the data, the probing strategy might not scale very well as the size of the outer table grows.
The queries also perform differently depending on whether the values in
aggregate are 50 ads and 10 days, or 5 ads and 100 days. This is because the indexes on
tracking are leftmost-prefixed on
ad. Your mileage will vary. For example, when I insert 5,000 rows (10 days, 500 ads) into
aggregate instead of 50, the joined query takes 104.45 seconds, but the subquery takes 0.04. When I insert 50,000 rows (1,000 days, 500 ads), the joined query runs in 98.71 seconds, and the subquery in 3.87. The performance depends heavily on the characteristics of the data, the way I’m aggregating, indexes, and so on.
If I were being really scientific, I’d run
iostat and count I/O and other statistics too, to see how the queries access the tables differently, but I leave that as an exercise for the reader.
In the query at work, I ended up using the join. It’s more efficient on that specific data. I was actually surprised when I wrote the test queries for this article and found the subquery performing so much better. The difference could be due to any number of things – data, indexes, the fact we haven’t optimized the tables at work because they’re too big to touch, different architecture (Xeon vs. my AMD64), server configuration, memory, disk speed… who knows.
Moral of the story: try different ways of doing the same thing!
I have mentioned before, without going into specifics, that using joins instead of subqueries can be hugely more efficient. Even though two queries might mean the same thing, and even though you’re supposed to tell the server what to do and let it figure out how, sometimes you just have to tell it how. Otherwise, the query optimizer might choose a really stupid query plan. I ran into an example of this recently. I have a three-level hierarchy of tables:
item. There are a few thousand rows in
category, a few hundred thousand in
subcategory, and millions in
item. You can disregard the
category table from now on; I mentioned it for context, but it’s not used in the queries. Here are the table create statements:
create table subcategory ( id int not null primary key, category int not null, index(category) ) engine=InnoDB; create table item( id int not null auto_increment primary key, subcategory int not null, index(subcategory) ) engine=InnoDB;
I’ll fill the tables with some sample data again:
insert into subcategory(id, category) select i, i/100 from number where i <= 300000; insert into item(subcategory) select id from ( select id, rand() * 20 as num_rows from subcategory ) as x cross join number where i <= num_rows; create temporary table t as select subcategory from item group by subcategory having count(*) = 19 limit 100; insert into item (subcategory) select subcategory from t cross join number where i < 2000;
Again, these queries may take a while to complete, and are not suitable for a production machine. The idea is to insert pseudo-random numbers of rows into
item, so each subcategory has between 1 and 2018 items. It’s not a completely realistic distribution, but it’ll do.
I want to find all subcategories containing more than 2000 items in a given category. First, I find a subcategory with more than 2000 items, then use its category in the rest of the queries that follow. Here are queries that will do this:
select c.id from subcategory as c inner join item as i on i.subcategory = c.id group by c.id having count(*) > 2000; -- choose one of the results, then select * from subcategory where id = ???? -- result: category = 14
Now that I have found a suitable value (14), I’ll use it in subsequent queries. Here’s my query to find all subcategories with more than 2000 items in category 14:
select c.id from subcategory as c inner join item as i on i.subcategory = c.id where c.category = 14 group by c.id having count(*) > 2000;
In my specific data, there are 10 rows in the results, and the query finishes in a couple of seconds.
EXPLAIN shows good index usage; it’s not a bad query at all, considering the table size. The query plan is to just run through ranges on some indexes and count the entries. So far, so good.
Now suppose I wanted to grab all columns from
subcategory. I could join the results of the above query as a subquery, or select
MAX or some such (since I know the values will be unique over the grouping set), but I could also do the following, right?
select * from subcategory where id in ( select c.id from subcategory as c inner join item as i on i.subcategory = c.id where c.category = 14 group by c.id having count(*) > 2000 );
This query will finish about the time the Sun turns into a brown dwarf and swallows the Earth. I don’t know how long it will take because I’ve never felt like letting it run indefinitely. You’d think, from looking at the query, that it would a) evaluate the inner query and find the 10 values, b) go find those 10 rows in
subcategory, which should be really fast to look up by their primary keys. Nope. Here’s the query plan:
*************************** 1. row *************************** id: 1 select_type: PRIMARY table: subcategory type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 300783 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: c type: ref possible_keys: PRIMARY,category key: category key_len: 4 ref: const rows: 100 Extra: Using where; Using index; Using temporary; Using filesort *************************** 3. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: i type: ref possible_keys: subcategory key: subcategory key_len: 4 ref: c.id rows: 28 Extra: Using index
If you’re not familiar with analyzing MySQL query plans, here’s the synopsis: MySQL has decided to run the query from the outside in, not the inside out. I’ll look at each part of the query in turn.
The outer query simply becomes
select * from subcategory. Even though there’s a limitation on
subcategory in the inner query (
WHERE category = 14), MySQL isn’t applying that filter to the outer query for some reason. I don’t know why. All I know is it’s doing a table scan (that’s what
type: ALL means), and not using any indexes. That’s a table scan over several hundred thousand rows.
For each row in the outer query, it’s performing the inner query, even though there are no references in the inner query to values in the enclosing scope, because it has “optimized” the inner query by rewriting it to refer to the outer query. At this point, the query plan becomes nested loops. For each loop in the outer query, the query probes into the inner query. Here is the query plan, after the optimizer rewrites it:
select * from subcategory as s where <in_optimizer>( s.id,<exists>( select c.id from subcategory as c join item as i where ((i.subcategory = c.id) and (c.category = 14)) group by c.id having ((count(0) > 2000) and (<cache>(s.id) = <ref_null_helper>(c.id)))) )
You can get the optimized query from
EXPLAIN EXTENDED followed by
SHOW WARNINGS. Notice the reference to the outer scope in the
I’m not bringing this up to bash MySQL’s optimization strategy. It’s pretty common knowledge that MySQL doesn’t yet optimize subqueries very well in some cases, and this particular problem is widely reported. I’m just pointing out that it is up to the programmer to check queries and make sure they aren’t badly optimized. In most cases, it’s safer just to stay away from subqueries if they’re not needed – especially
WHERE... IN() or
WHERE... NOT IN() queries.
My new rule for myself is “when in doubt,
EXPLAIN the query.” If it’s a big table, I’m automatically doubtful.
The query in the preceding section suffers because MySQL executes it from the outside in as a correlated subquery, instead of from the inside out without correlation. It’s possible to get MySQL to execute the inner query first, materialized as a temporary table, and avoid the huge performance penalty.
MySQL materializes subqueries in the
FROM clause (commonly, and somewhat misleadingly, known as derived tables). This means MySQL executes the inner query first and saves the results in a temporary table, then uses it in the rest of the table. This is exactly the behavior I wanted to happen when I wrote that query! Here’s the modified query:
select * from subcategory where id in ( select id from ( select c.id from subcategory as c inner join item as i on i.subcategory = c.id where c.category = 14 group by c.id having count(*) > 2000 ) as x );
All I did was wrap the subquery in another subquery. MySQL thinks there’s a dependent subquery, but now it’s only the wrapper query, which is probing into a temporary table with only a few rows, so it is fast anyway. At this point this is a silly optimization; it would probably be better to just rewrite the query as a join. Among other things, that would avoid the danger of someone noticing the wrapper subquery is superfluous and “cleaning up the code.”
This optimization can be used in a number of ways, for example to prevent MySQL from complaining about a subquery selecting data from a table being modified elsewhere in the query. Unfortunately, it doesn’t work to get around the restriction about temporary tables only appearing once in a query.