Xaprb

Stay curious!

Why large IN clauses are problematic

with 17 comments

I’ve seen a lot of SQL code that uses an IN clause as a place to put a variable number of parameters, allowing the query to be more flexible. There are several downsides to this technique. This article discusses those downsides and explains how to avoid them.

Introduction

I work in a Perl shop at the moment. We use two SQL utility modules from CPAN extensively: Class::DBI (which I’ll discuss in another article) and Ima::DBI. Ima::DBI allows us to keep our SQL layer in one place and helps abstract away a lot of the drudgery of connecting, preparing, and executing.

Just for the record, I’m not a huge fan of it for a variety of reasons, but I won’t go into that; it’s a bit off-topic.

Ima::DBI allows defining sql statements as subroutines, like this:

__PACKAGE__->set_sql('foo', 'select * from foo', 'conn');
# elsewhere:
$statements->sql_foo->execute();

That’s code for “create a subroutine named sql_foo, which will execute the SELECT against a connection named conn“. Later, the code executes that subroutine.

There’s a lot more that can be done with this. ? placeholders can go in the SQL definition, like so:

...'select * from foo where bar = ?'...
# elsewhere:
$statements->sql_foo->execute(5);

That’s standard DBI prepared-statement syntax for inserting a ‘5′ where the question mark is, but look at this:

'select * from foo where bar in (%s)

That’s a string substitution parameter, sprintf style, which gets used at runtime to alter the statement before executing it, like so:

$sth = $statements->sql_foo("?, ?, ?");
$sth->execute(5, 6, 7);

This last usage results in the statement

select * from foo where bar in (5, 6, 7)

It’s very easy to slip into this coding style when working with lists of things. For example, a program that accepts a list of account numbers on the command line. The traffic data roll-up system I’ve mentioned works this way; we run the roll-up program with a list of client IDs.

I’m using Perl as an example in this article, but I’ve seen this type of coding in many languages. This usage is problematic, and that’s what I want to talk about in this article.

Problem 1: Performance

The first problem with a large IN clause is performance. IN is equivalent to OR. For example, bar in(5, 6, 7) is the same as bar=5 or bar=6 or bar=7. That might not seem like a performance problem, but a bunch of OR clauses are much harder for the server to optimize than other methods of limiting results. Because an OR clause can have 1 to infinity parts, no single optimization strategy can always apply, and analyzing the clauses to find out which strategy would be best is probably not realistically possible. Therefore, every RDBMS I know of just evaluates each comparison until it finds a true result. That can be much less efficient than, say, a join, which might be able to use an index.

One solution to this is to move the IN clause to the FROM clause. It may be counter-intuitive, but giving the query a ‘table’ to act as a filter can be much more efficient, depending on the platform. Here’s the above query, re-written:

select * from foo
   inner join (
      select 5 as bar
      union all select 6
      union all select 7
   ) as x on foo.bar = x.bar

One example where this worked well on MySQL is explained in a recent e-mail from my coworker:

… that was a good tip on replacing ‘in’ clauses with joins to subqueries in the from clause. The queries in reporting were totally hitting a wall, so I took the query as shown in mytop, moved the subquery from the where to the from, and it went from 1M 56seconds to 20 seconds.

I wouldn’t say a factor-of-six improvement is revolutionary, but every little bit helps, especially when the query is executed a lot. Your mileage may vary. I know some situations where the improvement is dramatic.

Problem 2: Maintenance and debugging

Performance may or may not be a real problem, but maintainability definitely is. It’s really hard to debug or understand what queries are doing when the query text isn’t written until runtime (With Ima::DBI, it’s even harder because the subroutines get written as closures, which the debugger can’t step into). The code to work with these types of queries also gets really ugly. This is onerous:

@params = $something_from_arguments;
$placeholders = join(',', '?' x scalar(@params));
$sth = $statements->sql_foo1($placeholders);
$sth->execute(@params);
# ... do that 15 times

And when I see the query being executed at runtime, with 250 question marks and 250 variables to take their places, I really want to pull my hair out. Debugging statements don’t help. I rewrote one such application that had obviously been hard to debug, because it printed debugging output all over the place, ostensibly to help the programmer ensure the correct number of question marks was being created to accept the correct number of variables (there were other parameters to the query besides the IN clause, making it even more complex).

There’s an easy solution to this: start the set of queries by storing all those numbers in a temporary table, and join against the temporary table wherever needed to filter the results. I re-wrote the rollup program with this style of programming and eliminated a lot of code, leaving both the program and the queries much clearer and easier to debug.

Summary

Large IN clauses are an easy tool in the toolbag, but they don’t scale well, from both a performance and maintainability point of view. I recommend transforming them into a join to a temporary table, which can be filled with the data that would have gone into the IN clause to begin with. Once the temporary table is populated with a known set of data, queries are easy to write and understand.

Written by Xaprb

June 28th, 2006 at 7:29 pm

Posted in Coding, SQL

17 Responses to 'Why large IN clauses are problematic'

Subscribe to comments with RSS

  1. I’m afraid your statement about IN() being equivalent to multiple ORs is incorrect with regards to MySQL Server. What actually happens is that the values in the IN() list are either looked up in a relevant index, or sorted and matched in reverse. That is, if you have a value, the server does a quick binary search in the list to see if that value is part of it. So, a large IN() would be much faster than an equivalent OR construct.

    Of course, you should use a join where appropriate. Different uses.

    Arjen Lentz

    9 Jul 06 at 7:10 pm

  2. Arjen, thanks for that correction. In Microsoft SQL Server the query optimizer actually transforms the IN() into ORs, though I can’t remember how to prove that now (a trainer once showed me the rewritten query in some query buffer or other). I should have said “every RDBMS I know internal details about” instead of “every one I know of”. I’m glad to hear MySQL is smarter than that.

    Does the binary search optimization also apply in a query written like this?

    select * from table where 5 in (col1, col2, col3, col4)

    Xaprb

    9 Jul 06 at 8:53 pm

  3. Well most RDBMS will actually try to be clever here, or they will be clever simply due to having a clever caching algorithm. Obviously an IN statement does a lot of OR’s in a single column, which is something to exploit.

    Aside from that, it’s really annoying that people are using emulated prepared statements to do things that are normally not supported by native prepared statements. I have had a lot of that during PEAR::MDB2 development. In the end I put in a switch to allow people to keep their beloved emulated prepared statements where they should instead just use sprintf() to construct the SQL to begin with.

    Lukas

    10 Jul 06 at 3:01 am

  4. I was asking about this on the #mysql irc channel, and I was told that IN is much faster than doing a bunch of unions. Do you have any benchmarks to back this up, or is this based on MS SQL?

    Jordan

    Jordan

    13 Jul 06 at 6:29 pm

  5. I decided to verify this by benchmarking and found that you are correct. I generated a table with 90,000 rows in it, and did a whole bunch of selects using both IN and the JOIN method described above. The JOIN method was consistently faster than the IN method. This is on MySQL 5.0.22.

    Whether you’re only retrieving a few ids, or a whole bunch of them, in my testing the JOIN method was about 30% faster.

    Jordan

    14 Jul 06 at 2:05 pm

  6. Wow, I’m glad this thing is moderated. Uhhh, I got those results totally backwards in the last post.

    In my testing, using the IN method is consistently between 10% and 25% faster than using the JOINs whether we’re retrieving a few ids, or a whole bunch of them. I tested on a table with 90,000 rows, selecting anywhere between 10 rows, 100 rows 1000 rows and 10,000 rows in each query, and the IN method is consistently faster.

    This is with MySQL 5.0.22, so maybe the optimizer has gotten better. But it does seem that just using a big IN clause is faster than doing the JOINs.

    Jordan

    14 Jul 06 at 2:11 pm

  7. [...] Just a few days ago Baron Schwartz wrote an article on how to replace large IN clauses with a UNION SELECT as a subquery in a JOIN (Why large IN clauses are problematic). Unfortunately this trick will not work anymore in MySQL 5.0.23, at least when you haven’t selected a default database previously. As we used similar approaches in our code, bug #21002 introduced in 5.0.23 is breaking some of our applications that were running fine since early 4.1 releases of MySQL. [...]

  8. Thanks everyone for your insight. You’re helping me learn a lot. Plus you’re encouraging me not to stop at just noticing something, and I’m reading a lot of source these days. Sorry for the slow moderation — I was away for a week.

    Some of my experience is with Microsoft SQL Server, which definitely, at least in SQL Server 2000, did NOT optimize IN well. I also realize now that I didn’t write the article that clearly, distinguishing between all the different types of things that can go into an IN. There can be lists of numbers, a subselect, lists of columns, and maybe more. Subselects are not that well optimized in MySQL, as discussed ad nauseum elsewhere. But it looks like lists of numbers are very well optimized, which is great.

    Xaprb

    19 Jul 06 at 9:14 pm

  9. Xaprb,

    I love your posts. However, as others have pointed out, you’re offbase wrt INs and mysql. I’ll go one further and say that I have quite often been able to optimize painful joins away with INs at my consulting clients. However, my other finding is that the gain is reversely proportional to the clue coefficient of the engineers who designed the db schema ;]

    Anonymous

    7 Aug 06 at 10:08 am

  10. Yes, after further thought I also realize the pain that prompted this article is mainly the maintainability issue. I’m not averse to using IN(), but the ways I see it commonly used (at leat the ways I notice it — perhaps I only notice it when I don’t like it) are hard to maintain.

    It would be great if you can post an example of the kind of join you mention.

    Xaprb

    7 Aug 06 at 1:29 pm

  11. Hi, today tried to test for myself, if this union-select-join is actually slower or faster for my use case. And it turned out to be faster. But when I tested with really big in-clauses, i saw this:

    ERROR 1064 (42000): parser stack overflow near '21053740 ) as aids on a.id = aids.aid
      JOIN kategorie k JOIN hersteller h JOIN ' at line 5334
    Segmentation fault

    Now, I couldn’t reproduce the segfault, but the parse-error occurred reliably.

    My mysql-version is 5.0.22.

    ephes

    9 Aug 06 at 6:51 am

  12. On a postgreSQL-DB (8.1), I have this results (1,000,000 of records):

    search 1000 recs via:

    IN: 270ms
    OR: 270ms
    JOIN: 200ms (excluded INSERT time)

    BASH script:

    #!/bin/bash
    
    M=1000
    SQLB="SELECT AVG(id) FROM x ";
    s=$[1000000/$M]
    v=0
    
    #
    # IN test
    #
    SQL="$SQLB WHERE id IN("
    for((i=1;i<$M;i++)); do
      SQL="$SQL$v,"
      v=$[$v+$s]
    done
    
    SQL="$SQL$v);"
    time /usr/bin/psql -d test -c "$SQL"
    
    #
    # OR test
    #
    v=0
    SQL="$SQLB WHERE "
    
    for((i=1;i<$M;i++)); do
      SQL="$SQL id=$v OR"
      v=$[$v+$s]
    done
    
    SQL=$SQL" id=$v;"
    time /usr/bin/psql -d test -c "$SQL"
    
    #
    # JOIN table test
    #
    v=0
    SQL="DROP TABLE tx;CREATE TABLE tx (idFK INTEGER);"
    SQL=$SQL"BEGIN; "
    
    for((i=1;i<$M;i++)); do
      SQL=$SQL"INSERT INTO tx VALUES($v);"
      v=$[$v+$s]
    done
    
    SQL=$SQL"INSERT INTO tx VALUES($v);COMMIT;"
    psql -d test -c "$SQL"
    SQL="$SQLB INNER JOIN tx ON tx.idFK=x.id"
    time /usr/bin/psql -d test -c "$SQL"

    Roberto Colmegna

    rcolmegna

    5 Nov 06 at 5:22 am

  13. In postgreSQL, the suggestion to use inner joins can result in dramatic improvements, because postgreSQL does use ORs as described and proved above.

    In our case, we went from ~4500ms for the query to ~300ms

    An order of magnitude improvement!

    Thanks for the great post!!

    Ben Truitt

    4 Jun 08 at 12:13 pm

  14. I just wanted to share additional news on this.

    My DBA had a really hard time believing my solution in the first email in this thread should be faster. So after digging some, he ran an ANALYZE on the tables in question.

    This helps the PostgreSQL planner out to create more optimal query plans.

    It turns out that after analyze the original query that uses an “IN” takes 1ms.

    So we went from:
    IN Clause: ~4500ms
    Inner Join: ~300ms
    Analyze with IN Clause: ~1ms

    Holy cow.

    The lesson is: use explain, use analyze, and consult your DBAs. They are your friends.

    Ben Truitt

    4 Jun 08 at 5:57 pm

  15. Baron….I thought you were opposed to using temp tables? I love them (especially memory) as a method to get variable data from the app layer with no schema dependency. I’d love to see a post from you regarding when you find them risky, and when this approach is ok.

    Dewey

    16 Jun 08 at 10:36 am

  16. I think I wrote this post when I was seeing a particular problem, and before I’d started to see the problems with temp tables. So my dogma switched as I learned :-)

    Xaprb

    16 Jun 08 at 12:05 pm

  17. yes your article is good enough for optimization but i want to know wheather we make in Clause as Fast enough with creating appropriate indexes

    Venkat

    19 Nov 09 at 7:50 am

Leave a Reply