Xaprb

Stay curious!

How to write a lazy UNION in MySQL

with 16 comments

The other day I was explaining options to someone who wanted to know about archiving data in MySQL. “So,” he said, “I might have to code my app to look for the data in two places?” The disadvantage of this is that his app might be more complex. Another disadvantage is that it might take two queries — if you look for a user in the usual location and it’s not there, you have to look for it elsewhere.

One way to deal with this, as long as the archived data is on the same server, is a UNION.

select user_id from user where user_id = 123
union all
select user_id from user_archive where user_id = 123;

The benefit is that you don’t have to issue two queries. That saves network round trips, and makes your code shorter. But it has a disadvantage, too: you’re still querying the archive table when you don’t need to. Does this matter? Yes, it does. Your archive table may be very large and slow — perhaps stored on a big slow hard drive, perhaps on a SAN — and just peeking at it is kind of expensive in some cases.

Something occurred to me a couple of weeks ago: why not write a UNION that stops executing as soon as one part of it finds a row? Then you can UNION to your heart’s content and not incur the overhead of that second lookup unless you need it. For lack of a better term, I’m calling this a lazy UNION.

My idea here is to use a user variable. If the first part of the UNION finds a row, it sets the variable. The second part has the variable in its WHERE clause, and won’t execute if the variable has been set by the first part. To make the whole thing self-contained, I’m adding a third part of the UNION, which will always execute but never return any rows; it will set the variable back to its initial state of NULL.

Here’s a complete example:

drop table if exists user, user_archive;
create table user(user_id int not null primary key);
create table user_archive like user;
insert into user(user_id) values(1);
insert into user_archive(user_id) values(2);

select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl
   from user where user_id = 1
union all
select user_id as user_id, 'user_archive' as which_tbl
   from user_archive where user_id = 1 and @found is null
union all
select 1, '' from dual where ( @found := null ) is not null;

select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl
   from user where user_id = 2
union all
select user_id as user_id, 'user_archive' as which_tbl
   from user_archive where user_id = 2 and @found is null
union all
select 1, '' from dual where ( @found := null ) is not null;

You can play around with it and verify that indeed, the second part of the UNION never executes if the first part finds a row. There are several ways to do this: with EXPLAIN, by adding some more variables to show which part of the query executes, etc. The results of the above query are as follows:

+---------+-----------+
| user_id | which_tbl |
+---------+-----------+
|       1 | user      | 
+---------+-----------+
1 row in set (0.00 sec)

+---------+--------------+
| user_id | which_tbl    |
+---------+--------------+
|       2 | user_archive | 
+---------+--------------+
1 row in set (0.00 sec)

I have not benchmarked this. My gut feeling is that whether it’s beneficial is going to depend on your workload. But it’s a fun little hack I thought I’d share with you. By the way, there’s no reason you have to stop at two; you could add any number of queries to the UNION.

Written by Xaprb

June 15th, 2008 at 2:48 pm

16 Responses to 'How to write a lazy UNION in MySQL'

Subscribe to comments with RSS or TrackBack to 'How to write a lazy UNION in MySQL'.

  1. Very creative technique. Could you explain how is it exactly that the greatest() function sets the variable if no rows are found?

    Eran Galperin

    15 Jun 08 at 3:26 pm

  2. It sets the variable if rows ARE found. The greatest() is just there to “hide” the variable assignment by doing two things at once.

    Xaprb

    15 Jun 08 at 3:38 pm

  3. Very nice hack indeed!
    One little thing I do not fully understand though - you’re using the third part of the union only to set the user variable back to NULL?
    If so, isn’t it better to use “SET @found := null”, OR “DO @found:=null” performance/clarity wise?

    Shlomo Priymak

    15 Jun 08 at 5:38 pm

  4. Isn’t this as undefined as f(i++, i++) or i = i++; in other programming languages? Is the server allowed to use multiple cpus by running all parts of the UNION ALL in parallel, for example? Or do you have to change your application if something like this gets implemented? I always write this as IFNULL(MAX(user_id),()) but of course that makes it harder to return multiple columns or rows without tricks like concatenation or simply saving the round trip with a stored procedure.

    strcmp

    15 Jun 08 at 5:49 pm

  5. Shlomo, I’m using the third part of the UNION to do this because a separate query for such a simple operation could be unacceptable overhead. The overhead of sending another query across the wire, parsing, optimizing and executing it, and returning the answer back is likely to dwarf the cost of that third part of the UNION, in my opinion. However, as I said I have NOT benchmarked this! Finding a real-life situation to benchmark this on is the only real answer as to whether it’s a good idea.

    Xaprb

    15 Jun 08 at 7:25 pm

  6. strcmp, user variables are always a loaded gun because their behavior is supposed to be indeterminate, but I find it works pretty well in most cases. MySQL does not do any intra-query paralellization. You can’t use MAX() because it’s an aggregate function.

    Xaprb

    15 Jun 08 at 7:27 pm

  7. oh sorry, the contents of the () disappeared, there was a subselect: IFNULL(MAX(user_id),(SELECT user_id FROM …)) i.e. the query on the 2nd table is only considered if MAX() was NULL, and you have to use an aggregate function (or an outer join) here to get a row at all. and user_id just looked like the primary key for your user table making MAX() a NOP if there is a match.

    i know MySQL does not do any parallelization (making it lose tons of performance for serialized ‘huge’ queries on parallel hardware) but it would be a simple enough strategy to consider. the trick just looked very server version dependent and hard to detect in case of changed server behavior, because you won’t get an error message, just strange results or bad performance. IFNULL() makes the intended evaluation order explicit.

    strcmp

    16 Jun 08 at 1:25 am

  8. hehe, submitting the parts of a UNION ALL in parallel and coalescing the result sets would be perfectly possible in something like mysql proxy. just a quick thought.

    strcmp

    16 Jun 08 at 1:45 am

  9. Heh, lots of things are possible using the mysql proxy, but the solution here involves just the bare basics: mysql server and nothing else. Thanks for the idea, Baron.

  10. This is a great idea, fun!

    Jenni

    16 Jun 08 at 4:31 pm

  11. Any particular reason why you are avoiding merge tables / table partitioning?

    xzilla

    16 Jun 08 at 10:55 pm

  12. Benchmark would be interesting.

    As far as I remember MySQL as of MySQL 5.1 always creates temporary table if UNION is used, even for UNION all, which may cause overhead even if second UNION clause is not executed.

    Very creative !, though I would not like to support application which uses a lot of such techniques as it is not very transparent.

    Peter Zaitsev

    16 Jun 08 at 11:16 pm

  13. Nice hack! Would you happen to know if this would work with Oracle as well?

    Victor

    17 Jun 08 at 12:05 pm

  14. Right, MySQL always uses a temp table for a UNION. That’s why a benchmark would be so important: what’s the cost of that temp table? It depends, and I wouldn’t like to guess it. For example, if the UNION has a TEXT column, it might be pretty high because the temp table would go to an on-disk MyISAM table right away.

    Someone with a real-life use for this may want to share their results.

    Xaprb

    17 Jun 08 at 12:41 pm

  15. Sorry, I don’t know if this will work with Oracle. Try it and post your results!

    Xaprb

    17 Jun 08 at 12:41 pm

  16. I would typically use a stored procedure to achieve this type of thing without the overhead of 2 trips to the db. I haven’t used SPs yet with MySQL — are they frowned upon?

    Craig

    1 Jul 08 at 1:17 pm

Leave a Reply