Archive for June, 2006

Why large IN clauses are problematic

I’ve seen a lot of SQL code that uses an IN clause as a place to put a variable number of parameters, allowing the query to be more flexible. There are several downsides to this technique. This article discusses those downsides and explains how to avoid them.

Introduction

I work in a Perl shop at the moment. We use two SQL utility modules from CPAN extensively: Class::DBI (which I’ll discuss in another article) and Ima::DBI. Ima::DBI allows us to keep our SQL layer in one place and helps abstract away a lot of the drudgery of connecting, preparing, and executing.

Just for the record, I’m not a huge fan of it for a variety of reasons, but I won’t go into that; it’s a bit off-topic.

Ima::DBI allows defining sql statements as subroutines, like this:

__PACKAGE__->set_sql('foo', 'select * from foo', 'conn');
# elsewhere:
$statements->sql_foo->execute();

That’s code for “create a subroutine named sql_foo, which will execute the SELECT against a connection named conn“. Later, the code executes that subroutine.

There’s a lot more that can be done with this. ? placeholders can go in the SQL definition, like so:

...'select * from foo where bar = ?'...
# elsewhere:
$statements->sql_foo->execute(5);

That’s standard DBI prepared-statement syntax for inserting a ‘5′ where the question mark is, but look at this:

'select * from foo where bar in (%s)

That’s a string substitution parameter, sprintf style, which gets used at runtime to alter the statement before executing it, like so:

$sth = $statements->sql_foo("?, ?, ?");
$sth->execute(5, 6, 7);

This last usage results in the statement

select * from foo where bar in (5, 6, 7)

It’s very easy to slip into this coding style when working with lists of things. For example, a program that accepts a list of account numbers on the command line. The traffic data roll-up system I’ve mentioned works this way; we run the roll-up program with a list of client IDs.

I’m using Perl as an example in this article, but I’ve seen this type of coding in many languages. This usage is problematic, and that’s what I want to talk about in this article.

Problem 1: Performance

The first problem with a large IN clause is performance. IN is equivalent to OR. For example, bar in(5, 6, 7) is the same as bar=5 or bar=6 or bar=7. That might not seem like a performance problem, but a bunch of OR clauses are much harder for the server to optimize than other methods of limiting results. Because an OR clause can have 1 to infinity parts, no single optimization strategy can always apply, and analyzing the clauses to find out which strategy would be best is probably not realistically possible. Therefore, every RDBMS I know of just evaluates each comparison until it finds a true result. That can be much less efficient than, say, a join, which might be able to use an index.

One solution to this is to move the IN clause to the FROM clause. It may be counter-intuitive, but giving the query a ‘table’ to act as a filter can be much more efficient, depending on the platform. Here’s the above query, re-written:

select * from foo
   inner join (
      select 5 as bar
      union all select 6
      union all select 7
   ) as x on foo.bar = x.bar

One example where this worked well on MySQL is explained in a recent e-mail from my coworker:

… that was a good tip on replacing ‘in’ clauses with joins to subqueries in the from clause. The queries in reporting were totally hitting a wall, so I took the query as shown in mytop, moved the subquery from the where to the from, and it went from 1M 56seconds to 20 seconds.

I wouldn’t say a factor-of-six improvement is revolutionary, but every little bit helps, especially when the query is executed a lot. Your mileage may vary. I know some situations where the improvement is dramatic.

Problem 2: Maintenance and debugging

Performance may or may not be a real problem, but maintainability definitely is. It’s really hard to debug or understand what queries are doing when the query text isn’t written until runtime (With Ima::DBI, it’s even harder because the subroutines get written as closures, which the debugger can’t step into). The code to work with these types of queries also gets really ugly. This is onerous:

@params = $something_from_arguments;
$placeholders = join(',', '?' x scalar(@params));
$sth = $statements->sql_foo1($placeholders);
$sth->execute(@params);
# ... do that 15 times

And when I see the query being executed at runtime, with 250 question marks and 250 variables to take their places, I really want to pull my hair out. Debugging statements don’t help. I rewrote one such application that had obviously been hard to debug, because it printed debugging output all over the place, ostensibly to help the programmer ensure the correct number of question marks was being created to accept the correct number of variables (there were other parameters to the query besides the IN clause, making it even more complex).

There’s an easy solution to this: start the set of queries by storing all those numbers in a temporary table, and join against the temporary table wherever needed to filter the results. I re-wrote the rollup program with this style of programming and eliminated a lot of code, leaving both the program and the queries much clearer and easier to debug.

Summary

Large IN clauses are an easy tool in the toolbag, but they don’t scale well, from both a performance and maintainability point of view. I recommend transforming them into a join to a temporary table, which can be filled with the data that would have gone into the IN clause to begin with. Once the temporary table is populated with a known set of data, queries are easy to write and understand.

Technorati Tags:No Tags

You might also like:

  1. How to write SQL JOIN clauses more compactly
  2. How to simulate optional parameters in SQL
  3. How to write a SQL exclusion join
  4. How to write subqueries without using subqueries in SQL
  5. How to avoid an extra index scan in MySQL

How to convert text to columns in OpenOffice.org Calc

Unlike Microsoft Excel, OpenOffice.org 2.0 Calc doesn’t have a built-in “text to columns” feature, which is hard to live without once you’re used to it. OpenOffice.org has an extensible add-on architecture, and someone has written a “text to columns” add-on, but installation may be confusing. In this article I’ll explain how to install the add-on.

I hope this feature will be added into the office suite at some point. Oddly, it seems to already be implemented, but not in the way it’s needed. Writer has a text-to-columns feature already, and Calc’s Open process has what looks to me like the needed functionality too — when opening a delimited text file, it brings up a dialog that does exactly what I’d do with the text-to-columns feature in Excel.

Update

Here is something I’ve noticed under GNU/Linux since writing this article: if I’ve copied text to the primary selection, for example by highlighting it in a terminal window, I can get OO.org to “text to columns” the text just by middle-click-pasting it into the spreadsheet. Instead of actually pasting it, this opens up the “Text Import (Pasted Data)” dialog, which lets me choose delimiters, etc — exactly what I need. (Of course, it’d be nice if it were even smarter and auto-detected that for me). So far I have not found any other way to cause this dialog to appear, which is puzzling.

Getting the add-on

The add-on is available through SourceForge and the OOoMacros site, which is both a source of add-ons and help and examples for those wishing to make their own add-ons. You can download text to columns for Calc here. There are actually two add-ons, but apparently the “Text2Columns fixed width” one is redundant, since the “Text to Columns” one does that too.

How to install an add-on in OpenOffice.org

This is the part I found confusing. The difference between macros and add-ons in OpenOffice.org isn’t very clear to me, and the tools to manage them aren’t either. I went down the wrong path with macros until I realized this isn’t a macro, it’s an add-on. Then I tried to learn how to install an add-on. I found lots of references to something called unopkg, but nothing about where it’s installed. It wasn’t in my PATH. I searched my filesystem and found it in /usr/lib/openoffice/program/unopkg. Then I ran it:

xaprb $ /usr/lib/openoffice/program/unopkg gui

OpenOffice.org Package Manager

Lo and behold, it brought up the same dialog I can access through the Tools > Package Manager menu entry. All that searching for nothing. I recommend not running it from the command-line; just run it through Tools > Package Manager!

Installing and using the package

Once I found it, installing the package was as easy as selecting the My Packages entry and pressing Add.. to browse for the file. It installs itself and shows up under the Tools > Add-Ons menu. This is really easy to do, but it took me a while to abandon my misdirected efforts to install it as a macro.

Here’s a screenshot of what it looks like (click the screenshot for a full-size look).

OpenOffice.org text to columns

So far it has worked fine for myself and my coworker. I hope it’s useful to you too.

Update For those of you using GNU/Linux, Gnumeric has a built-in text-to-columns converter that’s very nice. Gnumeric also loads much faster and runs with much less memory than OpenOffice.org.

Technorati Tags:No Tags

You might also like:

  1. How to label Excel and OpenOffice.org XY scatter plots
  2. Excel vs. OpenOffice.org Calc in number formatting
  3. How to create stepping slides in OpenOffice.org Impress
  4. How to update a GCC profile on Gentoo
  5. How to auto-mount removable devices in GNU/Linux

How to select from an update target in MySQL

MySQL doesn’t allow referring to a table that’s targeted for update in a FROM clause, which can be frustrating. There’s a better way than creating endless temporary tables, though. This article explains how to update a table while selecting from it in a subquery.

The problem

Suppose I want to update a table with data from a subquery that refers to the same table. I might want to do this for a variety of reasons, such as trying to populate a table with its own aggregate data (this would require assignment from a grouped subquery), updating one row from another row’s data without using non-standard syntax, and so on. Here’s a contrived example:

create table apples(variety char(10) primary key, price int);

insert into apples values('fuji', 5), ('gala', 6);

update apples
    set price = (select price from apples where variety = 'gala')
    where variety = 'fuji';

The error message is ERROR 1093 (HY000): You can't specify target table 'apples' for update in FROM clause. The MySQL manual mentions this at the bottom of the UPDATE documentation: “Currently, you cannot update a table and select from the same table in a subquery.”

It’s pretty easy to work around the problem in this contrived example, but there are times when it’s not possible to write the query without subqueries that refer to the update target. There is a workaround, though.

The workaround

Since MySQL materializes subqueries in the FROM clause (“derived tables”) as temporary tables, wrapping the subquery into another inner subquery in the FROM clause causes it to be executed and stored into a temporary table, then referenced implicitly in the outer subquery. The following statement will succeed:

update apples
   set price = (
      select price from (
         select * from apples
      ) as x
      where variety = 'gala')
   where variety = 'fuji';

If you want to know more about how this works, read the relevant section in the MySQL Internals Manual.

Problems this trick doesn’t avoid

One common frustration this doesn’t solve is the issue of badly optimized queries in the IN() clause, which are rewritten as correlated subqueries, sometimes (usually?) causing terrible performance. Wrapping the subquery in another subquery doesn’t prevent the optimizer from rewriting it as a correlated subquery, though, unless I go to extremes. In any case it’s better to just rewrite such a query as a join.

Another thing it won’t do is allow a query to refer to a temporary table more than once. Neither of these issues is solvable with the “wrap it in a subquery” trick because they are created at query compile time, whereas the update issue I was able to solve above happens at query run time.

If you enjoyed this article, subscribe via feeds or e-mail to receive my articles regularly.

Update 2006-08-29 The queries I’ve given here are sloppy, performance-wise. You don’t want to just select * from table in the subquery in real life; I just wanted to keep the examples simple. In reality you should only be selecting the columns you need in that innermost query, and adding a good WHERE clause to limit the results, too.

Technorati Tags:No Tags

You might also like:

  1. How to write a SQL exclusion join
  2. How to understand SQL joins
  3. How to simulate FULL OUTER JOIN in MySQL
  4. What is a SQL blind insert?
  5. How to write multi-table, cross-database deletes with aliases in MySQL

How to do efficient forward-only SQL maintenance jobs

I’ve recently written about techniques for archiving, purging, and re-indexing huge database tables. These techniques exploit both data structure and usage patterns. In this article I’ll develop that theme further, and explain how to write more efficient non-backtracking maintenance jobs when the update and insertion patterns permit.

Motivation

In my current employment, I’ve been optimizing databases for size, speed, and consistency. As part of my regular monitoring, I checked our master MySQL server for deadlock information and found that a nightly cron job’s query had caused other queries to time out or deadlock, then became a deadlock victim itself and died, after loading the server for a long time.

The query was performing an update in a table scan on a non-indexed column. The table is very large, and is the business’s core table, so it’s constantly accessed. There’s a column that indicates something true or false about each row, and the nightly job updates that column by joining with a regular expression match against another table. The query looks like this:

update core_table as c
   inner join client_patterns as p on c.client = p.client
      and c.phrase rlike p.pattern
   set c.important_phrase = 1;

None of the relevant columns in core_table is indexed. word is a large VARCHAR that could definitely stand to be indexed, but the table is too large to index at this point, and I can’t nibble it down to size (since it’s the core table, it’s very hard to know when a row is archivable; maybe at some point I’ll figure out how). This is one of the tables that remains troublesome. Until we find a suitable archive strategy, we’ve just got to walk on eggshells around it.

A table scan in an InnoDB table like this reads along the entire table, placing read locks on every row, and write locks on rows that need to be updated. When it runs into another transaction’s locks, whether read or write, it blocks and waits for the locks it needs to proceed. That’s one reason table scans are so bad for concurrency.

The query eventually ran into some other transaction’s locks and timed out, but by the time it did, it had accumulated tens of thousands of InnoDB lock structures (I don’t yet understand the exact relationship between locks and lock structures, but it doesn’t seem to be one-to-one).

Long-running transactions must die!

The solution

My first thought was to write a nibbler to do the job one row at a time. This would avoid concurrency issues, but it would take many hours. I had some other thoughts too, such as gathering the eligible rows into a temporary table, then nibbling from that, but none of this seemed right.

For one thing, not all the matching rows needed to be updated. In fact, as I thought about it more, and asked the people who knew, I realized this table is just like our cost rollup tables: only rarely do we need to reach way back and update a lot of rows. We usually just need to update the new rows that have been inserted every day. Sometimes the second table in the join is updated and everything needs to be re-examined, but that’s very rare.

In short, the data is randomly accessed, but we add and update at the end. This can be exploited to gain efficiency.

I created a table to hold a “bookmark” of the last row we processed. This could theoretically be stored on disk or by some other means (clay tablets anyone?), but the database is a good place. I wrote a job which runs nightly and does the following:

  1. make a snapshot of all new rows in the table
  2. update the new rows
  3. store the marker again

The “snapshot” is a temporary table that holds all the rows greater than or equal to the marker. Here are the queries I wrote:

create temporary table snapshot as 
   select distinct c.id from core_table as c
      inner join client_patterns as p on c.client = p.client
         and c.phrase rlike p.pattern
      where c.id > (
         select coalesce(min(last_row), 0)
         from last_processed_row
         where job = 'update core_table'
      );

update snapshot as s
   inner join core_table as c on c.id = s.id
      set c.important_phrase = 1;

insert into last_processed_row(job, last_row)
   select "update core_table", max(id)
   from snapshot
   on duplicate key update
      last_row = greatest(last_row, values(last_row));

There are some subtleties in these queries that bear explanation. First, I just chose a string value — any arbitrary value will do — to hold the “job” in the last_processed_row bookmark table. That “job” value is the table’s primary key. As long as no other job uses the same value, and every query in a single job uses the same value, it’ll work.

The first query uses a subquery in the WHERE clause to select the job’s bookmark value. Why do all that fancy COALESCE(MIN()) stuff? The job’s ID is the bookmark table’s primary key, so there’s only one row, right? I should be able to just select that row. I don’t need to take the MIN() of a single value.

That’s mostly true, but what if there is no such row in the bookmark table? In that case, there’d be zero rows in the subquery. Using an aggregate function like MIN() or MAX() will always return a value, “propping open” the subquery to make sure it doesn’t “collapse” to zero rows. If there are zero rows, the result is NULL, so COALESCE() makes sure zero rows equates to a value of zero.

Finally, the last query uses some non-standard MySQL features to insert a row into the bookmark table if it doesn’t exist, and update to the most recent value if it does exist. I do not use REPLACE because it may decrease the value, and I want the value to be monotonically increasing — the point of this algorithm is to avoid backtracking. You can read more about these and other magical “do many things at once” queries in my article on flexible inserts and updates in MySQL.

Results

The new queries run very quickly, of course. Daily updates take a small fraction of a second instead of grinding on for half an hour. This is mostly because updates are constrained to a comparatively small data set, but it’s also because the updates are much better for concurrency. When the working set is off in a temporary table, the query isn’t trying to do everything at once.

We decided to let the full update happen once a week, just in case t2 gets some update through the week. This only takes a couple minutes, even without indexes on columns that really need indexes.

Conclusion

I hope this article has given you some ideas on how to take advantage of any patterns you know to be true about data. Not only data structure, but data usage patterns, can create great opportunities for making things better.

Technorati Tags:No Tags

You might also like:

  1. How to give locking hints in MySQL
  2. How to avoid unique index violations on updates in MySQL

Announcement: Xaprb scripts are re-licensed

I have re-licensed some of my scripts under the LGPL, which means you can use them as part of other non-GPL software.

The scripts in question are:

Why the change?

I’ve always released my work under the GPL, but now that I’m writing and making available smaller scripts that might be used as part of a larger software system, the GPL isn’t as appropriate, because of the licensing requirements it places on the rest of the systems with which the scripts may be used.

I’m not making the decision casually. I’m a firm believer in free software, as in software freedom. Not only should software be open and un-encumbered by restrictions, but this freedom should be guaranteed even to the extent of prohibiting someone from making a restricted non-free work based on it. Software is encoded knowledge that belongs to humanity as a whole, just as certainly as mathematical formulae. That’s why I oppose non-free software and software patents.

However, for certain small scripts I’ve written, this rigid approach doesn’t make sense. I’m not releasing these scripts as “software.” The scripts mainly serve as working examples of principles and methods I’m trying to illustrate. From that point of view, the knowledge encoded in the scripts is already free — it’s in the articles I write to introduce the scripts, techniques and algorithms. Plus, I’m not breaking any new ground with the articles, either. Nothing I say here is revolutionary.

I read Richard Stallman’s writing carefully. Though he has a reputation for being a hard-liner, I value his opinions and judgement highly. When I’ve had contact with him, he’s come across just as he does in his essays: as a man of principles, which he will not yield. I respect that. One of his essays is about not using the LGPL. For the reasons stated above, I don’t think the points raised in that essay are applicable to the scripts I’ve released on this website. I believe these scripts are best released under a more permissive license.

The future

This is also a forward-looking change. I’m not trying to be grandiose about these little scripts I’ve written. They’re small snippets and they won’t change the world. But I’ve been working on some other things that are more significant, and I want everything I release to be licensed consistently with my beliefs. When I release these other projects, I’ll be careful to license them in accordance with what they are, what they do, and how others might find them useful.

I hope you, or the projects you’re working on, will find my work useful. Please feel free to contact me if you have any questions, comments, or improvements.

Technorati Tags:No Tags

You might also like:

  1. How to read the clipboard from JavaScript
  2. MySQL Community Member of the Year
  3. The truth about MySQL Community and Enterprise
  4. Why I write Free Software

3 ways to write UPSERT and MERGE queries in MySQL

This article is a quick pointer on MySQL’s three techniques for UPSERT (update/insert) or MERGE queries. I’ve written about this before, but since SQL 2003 has a MERGE statement (which Oracle and DB2 support), many people are asking for similar functionality in MySQL without realizing it already exists. I hope this article helps you find what you need.

The three tools are

  1. REPLACE
  2. INSERT ... ON DUPLICATE KEY UPDATE ...
  3. Transactional UPDATE followed by INSERT

I’ve discussed them in great detail in my article on flexible insert and update statements in MySQL.

There are other methods too, such as INSERT IGNORE, but these are the three most important.

Technorati Tags:No Tags

You might also like:

  1. How to implement a queue in SQL
  2. How to write flexible INSERT and UPDATE statements in MySQL
  3. How to avoid unique index violations on updates in MySQL
  4. How to write INSERT IF NOT EXISTS queries in standard SQL
  5. A progress report on MySQL Table Sync

How to avoid unique index violations on updates in MySQL

There is a bug in MySQL that causes an UPDATE to fail with a unique index violation, even though the statement doesn’t create duplicate values. In this article I’ll explain when this bug can happen, and how to work around it.

The bug

This is easiest to demonstrate with SQL:

create table t (i int not null primary key);
insert into t(i) values (1), (2), (3), (4);
update t set i = i + 1;
-- ERROR 1062 (23000): Duplicate entry '2' for key 1

The bug is caused by MySQL’s method of updating the values. It updates the first row (in index order), then checks for index violations. Since there is now a duplicate row, it fails. The correct standards-compliant behavior would be to update all the rows, then check for violations, but that is much more difficult and less efficient, so MySQL does not follow the standard.

The workaround

The solution is to update the rows in a different order. MySQL allows an ORDER BY clause on UPDATE statements:

update t set i = i + 1 order by i desc;

Now the query updates 4 to 5, then 3 to 4, and so on, avoiding any conflicts.

More complex cases

There are cases where the workaround can’t be as simple as the above:

update t set i = case when i > 2 then i + 1 else i - 1 end;
-- ERROR 1062 (23000): Duplicate entry '4' for key 1
update t set i = case when i > 2 then i + 1 else i - 1 end order by i desc;
-- ERROR 1062 (23000): Duplicate entry '1' for key 1

I can’t find a foolproof way to work around this. Here’s one statement that works on this particular situation:

update t
   set i = case when i > 2 then i + 1 else i - 1 end
   order by case when i > 2 then 1000 - i else i end

Depending on the data, it might not be that easy. There are cases where no ordering can work at all, such as when swapping numbers:

update t
   set i = case when i = 1 then 2 else 1 end
   where i in (1,2);

I’ll write another post on swapping numbers in MySQL.

Beware a half-updated table

Unfortunately, this bug can sometimes cause the table to be left in an inconsistent state. In the more complex example above, if the table is InnoDB, the entire update is rolled back when it fails. If the table is MyISAM, half the updates are successful and half fail. This is the case with all updates or inserts that fail in MyISAM, since the storage engine is not transactional.

Technorati Tags:No Tags

You might also like:

  1. How to reverse a sequence in SQL
  2. An introduction to InnoDB error handling
  3. How to write flexible INSERT and UPDATE statements in MySQL
  4. How to implement a queue in SQL
  5. Maatkit version 1297 released

How to re-index a large database table

In recent articles I explained how I’ve optimized queries against large datasets at my current employer, and how I’ve written efficient archiving and purging jobs to trim the tables down to a manageable size. This article explains how I re-indexed some of those tables without taking the server offline.

The technique I used is suitable for when a table has gotten “too fat to exercise.” In other words, the table is in critical condition, but that very condition makes it hard to do anything with the table, much less re-index or re-arrange the data. Before trying any of the techniques I outline in this article, I suggest a thorough analysis of data and index size, typical queries, row format and length, and how much data needs to be in the table (e.g. can it be nibbled down to a manageable size with an archiving job?).

While this specific situation involved MySQL, the techniques I’ll discuss below apply equally to any RDBMS.

The situation

We have a series of tables holding traffic data for online advertisements. Some of them hold the raw data from the source. These are then merged into one cost-by-day table, which is rolled up into various kinds of aggregation — ad by week, client by day, and so forth.

Though the archiving jobs were nibbling the data steadily, and some of the tables had even reached their target maintenance size, there was still too much data, and it wasn’t indexed optimally. One of the most important tables was still an order of magnitude too big, and archiving it was slow work. Some of the queries were taking too long and causing frequent timeouts and deadlocks, and a beefy server with high-speed disks was hamstrung by repeated multi-gigabyte table scans. Up to 85% of CPU time was spent waiting for disk reads and writes.

This is an illustration of when not to use surrogate keys on InnoDB tables. The tables were originally designed with surrogate keys as their primary key, which meant they weren’t clustered on the most frequently queried criteria: date ranges. Since the rollups and queries almost always happened by date, clustering them date-first meant any given query would be restricted to just part of the table. But they were clustered on the surrogate key — totally useless for the intended purpose. In fact, the surrogate keys were never referenced in any queries except the archiving jobs. As a result, most queries had to scan the entire table.

Some of the tables had indexes on the important columns — usually date and ad ID — but in a few of the largest, most important tables, the indexes were not date-first, so they weren’t helpful, as I verified with EXPLAIN. Massive multi-gigabyte indexes weren’t even being used in any queries. They were just slowing down anything that updated, inserted or deleted (and making our backups enormous and slow, always another cause of anxiety).

We had also just signed a new client, one of the Internet’s largest advertisers. A huge pile of new data was getting ready to body-slam us. We feared there was no way we could take on that additional load; the rollups alone might bring us down.

We needed those tables re-indexed, and soon. We were prepared to take the servers offline and put everything on hold for days while we re-indexed them, if it came to that. We didn’t want to, but we worked on a strategy just in case.

The offline re-indexing strategy

I’ve mentioned before that every secondary index in InnoDB tables holds values from the clustered index (primary key) at its leaf nodes. This has profound implications for any attempt to change the clustered index, because changing the clustered index requires re-writing every secondary index. We definitely wanted to re-cluster our tables, date-first. But doing it in one fell swoop would be a terrible idea! Re-writing just one large index is bad enough. Re-writing all of them in one single succeed-or-fail operation, plus physically re-arranging every row in the table along the new clustered index? Forget it! It might take weeks per table, even if nothing else were touching the database.

And it wasn’t just the clustered index that needed to change; many of the other indexes needed to be changed or removed, and we needed new indexes too.

The only strategy that we thought could work on such large tables would be to

  1. Stop everything from accessing the database
  2. Drop every secondary index, one at a time
  3. Drop the primary key
  4. Drop the unwanted columns
  5. Create a new primary key
  6. Create new secondary indexes, one at a time

Based on my tests with some smaller tables, I estimated this process would take days on some of the tables. It could take much longer; I don’t know how the work scales relative to the size of the table. The best I could hope for is linear scaling, and it might be exponential, in which case we’d be in bad trouble. To give an idea, dropping a single index would probably take at least three hours on some of the tables.

This could not be done while the server was online, because all sorts of things might go wrong — way too risky. There was one table this strategy wouldn’t work for, anyway: a table that had no unique constraint on the columns we needed to cluster (every other table was OK). Not only did we need to re-index this table, we needed to aggregate it by the data that would be its new primary key.

This still looked pretty gloomy, but we were prepared to do it if necessary. Before we went to such extremes, I wanted to try one strategy I thought might let us do all this without taking the server offline, much more cheaply.

The constraints

Here are the ground rules again: If we re-indexed without taking the server offline, we had to ensure that

  1. The server didn’t get slammed with huge queries
  2. There were no long-running transactions or locks

By “huge” and “long-running” I mean on the order of hours. Those are simple constraints, but it’s not obvious how to work within them.

The don’t-go-offline strategy

My idea hinged on three things: the data access patterns, the cost of dropping a table, and the ability to lock and rename tables.

The first thing that gave me hope we could do this online is the way data is updated. Since this is historical data, we usually work a day at a time, adding new data only, then rolling up. Only in exceptional circumstances do we reach into the past and update things. Therefore, most of the data never changes, and didn’t need to be locked to ensure its integrity (a simple “don’t roll up tables!” email took care of the human side of things).

Second, it’s easy to drop tables — even large ones — because we keep a fixed-size tablespace, so it doesn’t require a lot of disk activity.

Finally, it’s inexpensive to lock tables, and they can be renamed while they’re locked (though that does drop the locks).

For each table that needed to be re-indexed, my idea was to

  1. Create a new table with the desired structure and indexes
  2. Insert the old data into the new table in small chunks, except for the most recent few days
  3. Lock both tables
  4. Insert the last few days into the new table
  5. Swap the tables out by renaming them
  6. Check for data integrity
  7. Drop the old table

I didn’t create the new tables without indexes and then add the indexes later. As I’ve said before, this hits the server way too hard. It’s much better to do the work gradually as the data is inserted (in fact, I did miss an index on one of the tables, and it took a very long time to add it after the fact).

I tried this with some of the smaller tables, and it seemed to go well. To insert the data from the old table into the new “in small chunks,” I wrote the “nibbler” archiving job in a shell script. Instead of doing it one row at a time, I did one day’s worth of data at a time. I ran it early in the morning before anyone else was on the system, and each day’s worth of data seemed to take from a few seconds to a few minutes, depending on which table. That was an encouraging sign. Here’s what the shell script looked like:

#!/bin/bash

for day in `seq 90 -1 2`;
do
   echo "loading $day";
   cmd="insert into new_table (cols) select cols from old_table where day = date_sub(current_date, interval $day day)";
   if [ -f /home/xaprb/oktorun ]; then
      time echo $cmd | mysql;
   fi
done

I used a file as an “OK” signal for the script. If I decided to stop in the middle of the process, all I had to do was remove the file and the script would loop through the remaining days without actually sending the query to the server. I did do this once or twice when an automated job started.

Apparently inserting data at the end of the table is fairly cheap, because the inserts took practically the same time as querying without inserting. Then all I had to do was connect with the command-line client, lock the tables, insert the remaining days manually, and rename the tables. It was easy to check the integrity of the new table against the old and delete it when I was convinced it was all OK.

The results

I was surprised at how quickly it went. Some of the larger tables took many hours to “nibble” into the new structure, but that was OK. We wanted to keep 90 days of data for the day-level tables, and they were nibbled down to that size, so it was easy to just check in on the script every little bit, then finish them when the script quit. All in all, I think I spent a week and a half analyzing queries, choosing table structures and indexes for the new tables, and moving the data to the new tables. Much of this time I worked on other things as well.

The real tests were yet to come, however. Did the re-indexing save space? Are queries faster? Is there less contention? Are timeouts and deadlocks avoided? I had a pretty good idea these would all be “yes,” but my calculations (and I did do calculations!) could have been grievously wrong. As it turned out, though, they were right. Rollups that used to run for hours before deadlocking literally take three to ten minutes now (I re-wrote optimized queries for them, which helps too). Common queries finish in seconds or minutes instead of half a day; I was just told a 4-hour query runs in 48 seconds now. When the server is heavily used, it actually pegs the CPU, not the disk! SHOW TABLE STATUS shows tables are much smaller (sometimes 1/4th the space for the same data!), and the indexes are much smaller, too. In some cases I was able to trim table size by a factor of 20 or more with the combination of nibbling and re-indexing.

This strategy helped in another key area. One table wasn’t fully archived, and it was one of the most frequently queried and updated, so the archiving job often got the short end of the stick. We had 24 million rows to go. I wanted to wait as long as possible on this one and see how far it would get, but in the end I needed to go ahead and build the new table with the most recent 90 days of data, and leave the old table in place so we can keep archiving from it until the archive table has everything the new table doesn’t have. One benefit of this decision is that the old table is no longer accessed, so the archiving is going much faster now. At some point, the new table and the archive table will together constitute the entire data set, the archiving job will end, and I’ll drop the old table. In the meantime this data is spread out over three tables, but that’s OK; we’ll eventually get that data where it needs to be.

Most importantly, our database is under control, and we can just do routine maintenance from now on. There are still some large tables to purge and/or archive and re-index, but they aren’t the core tables the business depends on. Our new client isn’t going to crush the database server.

Technorati Tags:No Tags

You might also like:

  1. When to use surrogate keys in InnoDB tables
  2. How to write efficient archiving and purging jobs in SQL
  3. How to do efficient forward-only SQL maintenance jobs
  4. Archive strategies for OLTP servers, Part 1
  5. How to exploit MySQL index optimizations

Benchmarks for DATE operations in MySQL

This article compares the relative speed of extracting the date part of a value in MySQL with LEFT() and with the DATE() function.

LEFT() is faster than DATE(). To prove this, I inserted two million un-indexed sequential values into a table and selected the minimum and maximum values. Both queries are table scans, so it does read through all the records. The table below lists the time in seconds for MAX() on my computer. I tested with three data types: DATE, TIMESTAMP and DATETIME.

I don’t know why it’s faster to use LEFT() than DATE(). I would assume the reverse to be true, but clearly it’s not, at least on the systems I’ve tested.

The time for MIN() is one or two milliseconds faster than MAX(), probably because the values are sequential and only one assignment is performed, whereas the MAX() query must perform two million assignments.

Technorati Tags:No Tags

You might also like:

  1. Why I use explicit date math in SQL

How to find data distributions with SQL

In an earlier article I wrote about grouping data into ranks with a catch-all bucket. In this article I’ll show you how to group the data into variable-sized buckets any way you please.

This query came up when a business partner asked me to send over the distribution of some hierarchical data. It’s the same category/subcategory/item data as in my article about optimizing joins and subqueries. The partner wanted to know, broadly speaking, “how many subcategories have very small and very large numbers of items”.

I could have done a simple query:

select category, count(*) as num
from item group by category

That would have resulted in one row for every category, which would have been thousands of rows — not very useful for answering the question. I needed just a few rows, showing how many subcategories are large and how many are small. I started by filling a temporary table with my desired size ranges:

create temporary table ranges (
   s int not null, -- start of range
   e int not null  -- end of range
);

insert into ranges (s, e) values
   (1, 1),
   (2, 10),
   (11, 50),
   (51, 100),
   (101, 200),
   (201, 500),
   (501, 1000),
   (1000, 9999);

Then I grouped the data by subcategory, joined it against the ranges by size, and grouped again by range, counting and summing the sizes of each of the subcategories to get totals. In the query below, I analyze the distribution of items in category 14:

set @category := 14;

select concat(s, '-', e) as range, sum(num) as total, count(*) as num
from ranges
inner join (
   select s.id, count(*) as num
   from subcategory as s
      inner join item as i on i.subcategory = s.id
   where s.category = @category
   group by s.id
) as x on x.num between ranges.s and ranges.e
group by ranges.s, ranges.e

The results look roughly like this:

The distribution is clearly biased towards single-item categories, with the occasional huge category. Part of the goal was to rewrite our grouping algorithm to chunk things together in groups of 20 to 80 (depending on a variety of complex factors I won’t explain here). This query helped us get a realistic picture before and after the algorithm change.

Technorati Tags:No Tags

You might also like:

  1. How to optimize subqueries and joins in MySQL
  2. How to write subqueries without using subqueries in SQL
  3. How to find contiguous ranges with SQL
  4. How to group data in SQL with a catch-all bucket
  5. How to find duplicate rows with SQL