
How to optimize subqueries and joins in MySQL

I have written before about using joins instead of subqueries, especially for NOT IN queries, which can usually be rewritten as exclusion joins – sometimes with huge efficiency gains. In this article I’ll look more closely at the performance characteristics of a few queries I’ve optimized in MySQL 5.0.3. I’ll also show you some tricks you can use to get MySQL to optimize queries better when you know it’s being inefficient.
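
As a quick refresher, and purely as a generic sketch (the parent and child tables here are hypothetical, not part of this article’s examples), the rewrite usually looks like this:

-- Subquery form: find parents that have no children.
select p.id
from parent as p
where p.id not in (select c.parent_id from child as c);

-- Exclusion-join form of the same question, which the optimizer usually
-- handles much better. Note the two are not identical if child.parent_id
-- can be NULL: with NULLs present, the NOT IN version matches nothing.
select p.id
from parent as p
    left outer join child as c on c.parent_id = p.id
where c.parent_id is null;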

Updates in a join

I wrote recently about the theoretical problems caused by UPDATE statements with FROM clauses (Many-to-one problems in SQL). This particular query shows the performance difference between a “correct” query and a “bad, nonstandard” query I wrote recently at work.

The application for this query is to update a table with aggregated clicks per day from a table of click-tracking data for online advertising. I have pre-populated the aggregate table, with one row per ad per day, over the time period I’m interested in. I have a table called tracking that holds non-aggregated click data. Here are the simplified table structures:

create table aggregate(
    day date not null,
    ad int not null,
    clicks int,
    primary key(day,ad)
) engine=InnoDB;

create table tracking (
    tracking int not null auto_increment primary key,
    day date not null,
    ad int not null,
    clicktype enum('real','fraud','unknown') not null,
    clicks int not null,
    unique index(day, ad, clicktype)
) engine=InnoDB;

Since the tables are InnoDB, the clustered index is the primary key. Notice two things about the tracking table: it has a surrogate key so it’s not clustered on the natural primary key (day, ad, clicktype), and clicks are separated out by clicktype, so I’ll have to aggregate the clicks I want. These are the design constraints I have to work with; I didn’t design the table.

For this article, I populated my tables with some pseudo-random data. First, I filled my aggregate table with some ads and days over which I want to work. I want to aggregate ads 1 through 50 over the time period 2005-01-01 through 2005-01-10, so that’s 500 rows in the aggregate table. The tracking table is moderately large. I filled it with dates from 2003-04-07 through 2005-12-31 (1,000 days) for ads 1 through 1000, with one row for each click type. That’s a total of three million rows. The clicks are random numbers between 0 and 100.

Given a numbers table with at least 1,000 rows, here are scripts to populate the tables. The second query will probably take a while to run, and will create a medium-sized chunk of data, so don’t think something is wrong when it keeps on grinding (15 minutes isn’t unreasonable). By the way, it’s not a good idea to run these queries on a production server!
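
If you don’t already have a numbers table, here’s one way to build it. This is just a sketch based on what the scripts in this article expect: a table named number with a single integer column i. A million rows is more than enough for everything that follows.

-- Helper table of digits 0-9, used to generate consecutive integers.
create table digit (n int not null primary key);
insert into digit values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Cross join the digits six times to produce the integers 1 through 1,000,000.
create table number (i int not null primary key);
insert into number(i)
    select a.n + 10*b.n + 100*c.n + 1000*d.n + 10000*e.n + 100000*f.n + 1
    from digit as a
        cross join digit as b
        cross join digit as c
        cross join digit as d
        cross join digit as e
        cross join digit as f;

drop table digit;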

insert into aggregate(day, ad)        
    select date_add('2005-01-01', interval a.i - 1 day), b.i
    from number as a
        cross join number as b 
    where a.i <= 10 and b.i <= 50;

insert into tracking(day, ad, clicktype, clicks)
    select date_add('2003-04-07', interval a.i - 1 day),
        b.i,
        c.type,
        rand() * 100
    from number as a 
        cross join number as b
        cross join (
            select 'real' as type
            union all select 'fraud'
            union all select 'unknown'
        ) as c
    where a.i <= 1000 and b.i <= 1000;

Now that the tables are set up, I’ll move on to the queries. Since MySQL doesn’t provide really good tools to profile queries, I ran these queries several times, disregarding the first run because it may not have been cached the same way as subsequent runs. The first query is the “bad, evil” way to do it. The second is the official, standards-compliant way.

Notice that the first query is a join, but the second query is a dependent (correlated) subquery, where the subquery refers to values in the enclosing query. The two are nearly equivalent (the subquery version sets clicks to NULL for aggregate rows with no matching tracking rows, while the join version leaves those rows alone), but the query plans are likely to be completely different. My goal here is to measure the performance characteristics of the two methods.

update aggregate as a
    inner join (
        select day, ad, sum(t.clicks) as c
        from tracking as t
        where t.clicktype in ('real', 'unknown')
        group by day, ad
    ) as t on a.ad = t.ad and a.day = t.day
set clicks = c;

update aggregate as a
    set clicks = (
        select sum(t.clicks)
        from tracking as t
        where t.day = a.day
            and t.ad = a.ad
            and t.clicktype in ('real', 'unknown')
    );

The first query runs in 96.4 seconds on my machine, plus or minus .15 seconds. The second runs in 0.02 seconds consistently. That’s a pretty big performance difference. Since these are UPDATE statements, I can’t EXPLAIN them and see exactly what the optimizer is doing differently, but I know one thing: the first query fails entirely when the tracking table has 30 million rows instead of 3 million, because it runs out of space for locks.
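
One rough workaround is to EXPLAIN SELECT statements that do the same work as the UPDATEs. This is only an approximation, since the optimizer doesn’t have to choose the same plan for the actual UPDATE, but it at least hints at what each form is doing. Something like this:

-- Approximates the join version of the UPDATE.
explain
select a.day, a.ad, t.c
from aggregate as a
    inner join (
        select day, ad, sum(t.clicks) as c
        from tracking as t
        where t.clicktype in ('real', 'unknown')
        group by day, ad
    ) as t on a.ad = t.ad and a.day = t.day;

-- Approximates the correlated-subquery version.
explain
select a.day, a.ad,
    (
        select sum(t.clicks)
        from tracking as t
        where t.day = a.day
            and t.ad = a.ad
            and t.clicktype in ('real', 'unknown')
    ) as c
from aggregate as a;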

I have seen cases where the join query is slightly less efficient on small sets of data, but more efficient on larger sets; depending on the data, the probing strategy might not scale very well as the size of the outer table grows.

The queries also perform differently depending on whether the values in aggregate are 50 ads and 10 days, or 5 ads and 100 days. This is because the indexes on tracking are leftmost-prefixed on day, not ad. Your mileage will vary. For example, when I insert 5,000 rows (10 days, 500 ads) into aggregate instead of 500, the joined query takes 104.45 seconds, but the subquery takes 0.04. When I insert 50,000 rows (100 days, 500 ads), the joined query runs in 98.71 seconds, and the subquery in 3.87. The performance depends heavily on the characteristics of the data, the way I’m aggregating, indexes, and so on.

If I were being really scientific, I’d run vmstat and iostat and count I/O and other statistics too, to see how the queries access the tables differently, but I leave that as an exercise for the reader.

In the query at work, I ended up using the join. It’s more efficient on that specific data. I was actually surprised when I wrote the test queries for this article and found the subquery performing so much better. The difference could be due to any number of things – data, indexes, the fact we haven’t optimized the tables at work because they’re too big to touch, different architecture (Xeon vs. my AMD64), server configuration, memory, disk speed… who knows.

Moral of the story: try different ways of doing the same thing!

Badly optimized subqueries

I have mentioned before, without going into specifics, that using joins instead of subqueries can be hugely more efficient. Even though two queries might mean the same thing, and even though you’re supposed to tell the server what to do and let it figure out how, sometimes you just have to tell it how. Otherwise, the query optimizer might choose a really stupid query plan. I ran into an example of this recently. I have a three-level hierarchy of tables: category, subcategory, and item. There are a few thousand rows in category, a few hundred thousand in subcategory, and millions in item. You can disregard the category table from now on; I mentioned it for context, but it’s not used in the queries. Here are the table create statements:

create table subcategory (
    id int not null primary key,
    category int not null,
    index(category)
) engine=InnoDB;

create table item(
    id int not null auto_increment primary key,
    subcategory int not null,
    index(subcategory)
) engine=InnoDB;

I’ll fill the tables with some sample data again (these scripts need the number table to have at least 300,000 rows):

insert into subcategory(id, category)
    select i, i/100 from number
    where i <= 300000;

insert into item(subcategory)
    select id
    from (
        select id, rand() * 20 as num_rows from subcategory
    ) as x
        cross join number
    where i <= num_rows;

create temporary table t as
    select subcategory from item
    group by subcategory
    having count(*) = 19
    limit 100;

insert into item (subcategory)
    select subcategory
    from t
        cross join number
    where i < 2000;

Again, these queries may take a while to complete, and are not suitable for a production machine. The idea is to insert pseudo-random numbers of rows into item, so each subcategory ends up with anywhere from 0 to 2,018 items. It’s not a completely realistic distribution, but it’ll do.
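
If you’re curious how the data came out, a quick check like this shows the spread (subcategories that received no items at all won’t show up here, of course):

-- Smallest and largest item counts per subcategory in the generated data.
select min(c) as fewest, max(c) as most
from (
    select subcategory, count(*) as c
    from item
    group by subcategory
) as x;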

I want to find all subcategories containing more than 2000 items in a given category. First, I find a subcategory with more than 2000 items, then use its category in the rest of the queries that follow. Here are queries that will do this:

select c.id
from subcategory as c
    inner join item as i on i.subcategory = c.id
group by c.id
having count(*) > 2000;

-- choose one of the results, then
select * from subcategory where id = ????
-- result: category = 14

Now that I have found a suitable value (14), I’ll use it in subsequent queries. Here’s my query to find all subcategories with more than 2000 items in category 14:

select c.id
from subcategory as c
    inner join item as i on i.subcategory = c.id
where c.category = 14
group by c.id
having count(*) > 2000;

In my specific data, there are 10 rows in the results, and the query finishes in a couple of seconds. EXPLAIN shows good index usage; it’s not a bad query at all, considering the table size. The query plan is to just run through ranges on some indexes and count the entries. So far, so good.

Now suppose I wanted to grab all columns from subcategory. I could join the results of the above query as a subquery, or select MAX or some such (since I know the values will be unique over the grouping set), but I could also do the following, right?

select * from subcategory
where id in (
    select c.id
    from subcategory as c
        inner join item as i on i.subcategory = c.id
    where c.category = 14
    group by c.id
    having count(*) > 2000
);

This query will finish about the time the Sun turns into a red giant and swallows the Earth. I don’t know exactly how long that is, because I’ve never been willing to let it run to completion. You’d think, from looking at the query, that it would a) evaluate the inner query and find the 10 values, then b) go find those 10 rows in subcategory, which should be really fast to look up by primary key. Nope. Here’s the query plan:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: subcategory
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300783
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: c
         type: ref
possible_keys: PRIMARY,category
          key: category
      key_len: 4
          ref: const
         rows: 100
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 3. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: i
         type: ref
possible_keys: subcategory
          key: subcategory
      key_len: 4
          ref: c.id
         rows: 28
        Extra: Using index

If you’re not familiar with analyzing MySQL query plans, here’s the synopsis: MySQL has decided to run the query from the outside in, not the inside out. I’ll look at each part of the query in turn.

The outer query simply becomes select * from subcategory. Even though there’s a limitation on subcategory in the inner query (WHERE category = 14), MySQL isn’t applying that filter to the outer query for some reason. I don’t know why. All I know is it’s doing a table scan (that’s what type: ALL means), and not using any indexes. That’s a table scan over several hundred thousand rows.

For each row in the outer query, MySQL executes the inner query, even though the inner query as written doesn’t refer to any values in the enclosing scope, because the optimizer has “optimized” it into a correlated subquery that does refer to the outer query. The query plan therefore becomes nested loops: for each row of the outer query, MySQL probes into the inner query. Here is the query after the optimizer rewrites it:

select * from subcategory as s
where <in_optimizer>(
   s.id,<exists>(
   select c.id
   from subcategory as c
      join item as i
   where ((i.subcategory = c.id) and (c.category = 14))
   group by c.id
   having ((count(0) > 2000)
      and (<cache>(s.id) = <ref_null_helper>(c.id))))
)

You can get the optimized query from EXPLAIN EXTENDED followed by SHOW WARNINGS. Notice the reference to the outer scope in the HAVING clause.
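
For example, running this in the mysql command-line client shows the rewritten query for the statement above:

explain extended
select * from subcategory
where id in (
    select c.id
    from subcategory as c
        inner join item as i on i.subcategory = c.id
    where c.category = 14
    group by c.id
    having count(*) > 2000
)\G
show warnings\G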

I’m not bringing this up to bash MySQL’s optimization strategy. It’s pretty common knowledge that MySQL doesn’t yet optimize subqueries very well in some cases, and this particular problem is widely reported. I’m just pointing out that it is up to the programmer to check queries and make sure they aren’t badly optimized. In most cases, it’s safer just to stay away from subqueries if they’re not needed – especially WHERE... IN() or WHERE... NOT IN() queries.

My new rule for myself is “when in doubt, EXPLAIN the query.” If it’s a big table, I’m automatically doubtful.

How to force the inner query to execute first

The query in the preceding section suffers because MySQL executes it from the outside in as a correlated subquery, instead of from the inside out without correlation. It’s possible to get MySQL to execute the inner query first, materialized as a temporary table, and avoid the huge performance penalty.

MySQL materializes subqueries in the FROM clause (commonly, and somewhat misleadingly, known as derived tables). This means MySQL executes the inner query first and saves the results in a temporary table, then uses that temporary table in the rest of the query. This is exactly the behavior I wanted when I wrote that query! Here’s the modified query:

select * from subcategory
where id in (
    select id from (
        select c.id
        from subcategory as c
            inner join item as i on i.subcategory = c.id
        where c.category = 14
        group by c.id
        having count(*) > 2000
    ) as x
);

All I did was wrap the subquery in another subquery. MySQL still sees a dependent subquery, but the dependent part is now only the outer wrapper, which probes into a temporary table containing just a few rows, so it’s fast anyway. At this point, this is a somewhat silly optimization; it would probably be better to just rewrite the query as a join, as sketched below. Among other things, that would avoid the danger of someone noticing the wrapper subquery is superfluous and “cleaning up the code.”
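
For the record, here’s roughly what that join rewrite might look like. The derived table is materialized once and then joined to subcategory on its primary key, so it behaves much like the wrapped version above, without the decoy subquery:

select s.*
from subcategory as s
    inner join (
        select c.id
        from subcategory as c
            inner join item as i on i.subcategory = c.id
        where c.category = 14
        group by c.id
        having count(*) > 2000
    ) as big on big.id = s.id;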

The wrapper-subquery trick can be used in a number of ways, for example to prevent MySQL from complaining about a subquery that selects from the same table the statement is modifying, as shown below. Unfortunately, it doesn’t work to get around the restriction that a temporary table can only appear once in a query.
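
Here’s a contrived sketch of that first use, against the item table, just to show the error and the workaround; the SET clause is deliberately a no-op:

-- Without the wrapper, MySQL rejects this with error 1093
-- ("You can't specify target table 'item' for update in FROM clause"):
--
--   update item
--       set subcategory = subcategory
--       where id in (select max(id) from item);
--
-- Wrapping the subquery in a derived table forces MySQL to materialize it
-- first, and the statement is accepted:
update item
    set subcategory = subcategory
    where id in (select id from (select max(id) as id from item) as x);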

Posted on Sun, Apr 30, 2006. Approximately 2500 Words.
