Stay Curious!

How to use ORDER BY and LIMIT on multi-table updates in MySQL

One of my colleagues recently redesigned a system for scheduling work, and needed to write a multiple-table UPDATE with ORDER BY and LIMIT, neither of which is supported for multiple-table UPDATE in MySQL. This article explains how to do it anyway, and shows how to rewrite a first attempt for hundreds of times better performance.

Introduction

Without revealing too much about my employer’s internal systems, I want to give a little bit of context. First, we have a table that holds work to be done:

create table work_to_do (
   client int not null,
   work_unit int not null,
   priority float not null,
   processor int not null default 0,
   primary key(client, work_unit),
   key(priority),
   key(processor, priority)
) engine=InnoDB;

Each row in this table represents a job for a client. The actual work is defined elsewhere, as rows in another table that can be looked up from work_unit. When there’s some work to be done, the row is inserted with processor = 0 and priority set to some number calculated by another process. The idea is to do the highest-priority jobs first.

We have a program that may have multiple running instances, each of which looks at the table and tries to claim a set of work for itself, to be handled by no other process. In order to do this, each process tries to update some rows so processor is its UNIX process ID. Once processor is non-zero, the row is reserved for that process to do.

There’s another table that defines which clients are eligible for processing, whether they’re in the work_to_do table or not.

The first attempt

The query to claim a set of rows for a process needs to update the ten highest-priority unclaimed rows for eligible clients. Since it needs to access more than one table, and needs to order and limit the results, at first it seems you can’t do this query at all, because ORDER BY and LIMIT aren’t allowed for multi-table updates:

For the multiple-table syntax, UPDATE updates rows in each table named in table_references that satisfy the conditions. In this case, ORDER BY and LIMIT cannot be used.

However, a standard searched UPDATE statement with a subquery doesn’t count as a multi-table update, so you can do it like this:

-- @process_id is actually passed in as a parameter
-- set @process_id = 17603;
update work_to_do
   set processor = @process_id
   where processor = 0
      and client in (select client from eligible_client)
   order by priority desc
   limit 10

There’s only one problem: performance sucks. You are probably aware that IN() subqueries are converted by the in_optimizer to correlated subqueries, sometimes with severe performance penalties.

I created the tables on my home computer, and ran the above query with 5,000 rows in work_to_do. With 50 rows in eligible_client the query completes in a 0.13 seconds, but with 5,000 the correlated subquery becomes a real performance problem; it takes 9.8 seconds, which is very slow for this machine. It means this may be an O(n^2) algorithm.

The query after optimizing

To eliminate the badly optimized subquery, you need to rewrite the subquery as a join, but how can you do that and retain the LIMIT and ORDER BY? One way is to find the rows to be updated in a subquery in the FROM clause, so the LIMIT and ORDER BY can be nested inside the subquery. In this way work_to_do is joined against the ten highest-priority unclaimed rows of itself. Normally you can’t self-join the update target in a multi-table UPDATE, but since it’s within a subquery in the FROM clause, it works in this case.

update work_to_do as target
   inner join (
      select w. client, work_unit
      from work_to_do as w
         inner join eligible_client as e on e.client = w.client
      where processor = 0
      order by priority desc
      limit 10
   ) as source on source.client = target.client
      and source.work_unit = target.work_unit
   set processor = @process_id;

Now the query runs in 0.04 seconds. That’s a factor of 250 speedup, and it’s even several times faster than the un-optimized query when there were only 50 rows in the eligible_client table.

There is one downside: the rows are not locked in primary key order. This may help explain the occasional deadlock we get on this table (see my article about a “little-known” way to cause a deadlock). Do you have any suggestions for how to work around this?

Summary

In this article I’ve shown you how to use subqueries to work around limitations on multi-table UPDATE statements, and then I demonstrated how an alternate way to rewrite the query performs about 250 times better on a small data set than the obvious solution.

If this article was useful to you, subscribe via e-mail or feeds to receive future articles conveniently and free.

Posted on Thu, Aug 10, 2006. Approximately 800 Words.

Databases