One of my colleagues recently redesigned a system for scheduling work, and needed to write a multiple-table
ORDER BY and
LIMIT, neither of which is supported for multiple-table UPDATE in MySQL. This article explains how to do it anyway, and shows how to rewrite a first attempt for hundreds of times better performance.
Without revealing too much about my employer’s internal systems, I want to give a little bit of context. First, we have a table that holds work to be done:
create table work_to_do ( client int not null, work_unit int not null, priority float not null, processor int not null default 0, primary key(client, work_unit), key(priority), key(processor, priority) ) engine=InnoDB;
Each row in this table represents a job for a client. The actual work is defined elsewhere, as rows in another table that can be looked up from
work_unit. When there’s some work to be done, the row is inserted with
processor = 0 and
priority set to some number calculated by another process. The idea is to do the highest-priority jobs first.
We have a program that may have multiple running instances, each of which looks at the table and tries to claim a set of work for itself, to be handled by no other process. In order to do this, each process tries to update some rows so
processor is its UNIX process ID. Once
processor is non-zero, the row is reserved for that process to do.
There’s another table that defines which clients are eligible for processing, whether they’re in the
work_to_do table or not.
The first attempt
The query to claim a set of rows for a process needs to update the ten highest-priority unclaimed rows for eligible clients. Since it needs to access more than one table, and needs to order and limit the results, at first it seems you can’t do this query at all, because
ORDER BY and
LIMIT aren’t allowed for multi-table updates:
For the multiple-table syntax, UPDATE updates rows in each table named in table_references that satisfy the conditions. In this case, ORDER BY and LIMIT cannot be used.
However, a standard searched
UPDATE statement with a subquery doesn’t count as a multi-table update, so you can do it like this:
-- @process_id is actually passed in as a parameter -- set @process_id = 17603; update work_to_do set processor = @process_id where processor = 0 and client in (select client from eligible_client) order by priority desc limit 10
There’s only one problem: performance sucks. You are probably aware that
IN() subqueries are converted by the
in_optimizer to correlated subqueries, sometimes with severe performance penalties.
I created the tables on my home computer, and ran the above query with 5,000 rows in
work_to_do. With 50 rows in
eligible_client the query completes in a 0.13 seconds, but with 5,000 the correlated subquery becomes a real performance problem; it takes 9.8 seconds, which is very slow for this machine. It means this may be an O(n^2) algorithm.
The query after optimizing
To eliminate the badly optimized subquery, you need to rewrite the subquery as a join, but how can you do that and retain the
ORDER BY? One way is to find the rows to be updated in a subquery in the
FROM clause, so the
ORDER BY can be nested inside the subquery. In this way
work_to_do is joined against the ten highest-priority unclaimed rows of itself. Normally you can’t self-join the update target in a multi-table
UPDATE, but since it’s within a subquery in the
FROM clause, it works in this case.
update work_to_do as target inner join ( select w. client, work_unit from work_to_do as w inner join eligible_client as e on e.client = w.client where processor = 0 order by priority desc limit 10 ) as source on source.client = target.client and source.work_unit = target.work_unit set processor = @process_id;
Now the query runs in 0.04 seconds. That’s a factor of 250 speedup, and it’s even several times faster than the un-optimized query when there were only 50 rows in the
There is one downside: the rows are not locked in primary key order. This may help explain the occasional deadlock we get on this table (see my article about a “little-known” way to cause a deadlock). Do you have any suggestions for how to work around this?
In this article I’ve shown you how to use subqueries to work around limitations on multi-table
UPDATE statements, and then I demonstrated how an alternate way to rewrite the query performs about 250 times better on a small data set than the obvious solution.
If this article was useful to you, subscribe via e-mail or feeds to receive future articles conveniently and free.