Xaprb

Stay curious!

Archive for the ‘archiving’ tag

How to write a lazy UNION in MySQL

with 20 comments

The other day I was explaining options to someone who wanted to know about archiving data in MySQL. “So,” he said, “I might have to code my app to look for the data in two places?” The disadvantage of this is that his app might be more complex. Another disadvantage is that it might take two queries — if you look for a user in the usual location and it’s not there, you have to look for it elsewhere.

One way to deal with this, as long as the archived data is on the same server, is a UNION.

select user_id from user where user_id = 123
union all
select user_id from user_archive where user_id = 123;

The benefit is that you don’t have to issue two queries. That saves network round trips, and makes your code shorter. But it has a disadvantage, too: you’re still querying the archive table when you don’t need to. Does this matter? Yes, it does. Your archive table may be very large and slow — perhaps stored on a big slow hard drive, perhaps on a SAN — and just peeking at it is kind of expensive in some cases.

Something occurred to me a couple of weeks ago: why not write a UNION that stops executing as soon as one part of it finds a row? Then you can UNION to your heart’s content and not incur the overhead of that second lookup unless you need it. For lack of a better term, I’m calling this a lazy UNION.

My idea here is to use a user variable. If the first part of the UNION finds a row, it sets the variable. The second part has the variable in its WHERE clause, and won’t execute if the variable has been set by the first part. To make the whole thing self-contained, I’m adding a third part of the UNION, which will always execute but never return any rows; it will set the variable back to its initial state of NULL.

Here’s a complete example:

drop table if exists user, user_archive;
create table user(user_id int not null primary key);
create table user_archive like user;
insert into user(user_id) values(1);
insert into user_archive(user_id) values(2);

select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl
   from user where user_id = 1
union all
select user_id as user_id, 'user_archive' as which_tbl
   from user_archive where user_id = 1 and @found is null
union all
select 1, '' from dual where ( @found := null ) is not null;

select greatest(@found := -1, user_id) as user_id, 'user' as which_tbl
   from user where user_id = 2
union all
select user_id as user_id, 'user_archive' as which_tbl
   from user_archive where user_id = 2 and @found is null
union all
select 1, '' from dual where ( @found := null ) is not null;

You can play around with it and verify that indeed, the second part of the UNION never executes if the first part finds a row. There are several ways to do this: with EXPLAIN, by adding some more variables to show which part of the query executes, etc. The results of the above query are as follows:

+---------+-----------+
| user_id | which_tbl |
+---------+-----------+
|       1 | user      | 
+---------+-----------+
1 row in set (0.00 sec)

+---------+--------------+
| user_id | which_tbl    |
+---------+--------------+
|       2 | user_archive | 
+---------+--------------+
1 row in set (0.00 sec)

I have not benchmarked this. My gut feeling is that whether it’s beneficial is going to depend on your workload. But it’s a fun little hack I thought I’d share with you. By the way, there’s no reason you have to stop at two; you could add any number of queries to the UNION.

Written by Xaprb

June 15th, 2008 at 2:48 pm

Posted in Uncategorized

Tagged with , , , ,

Maatkit version 1709 released

with 4 comments

Download Maatkit

This release contains bug fixes and new features. It also contains a new tool: my implementation of Paul Tuckfield’s relay log pipelining idea. I have had quite a few responses to that blog post, and requests for the code. So I’m releasing it as part of Maatkit.

Changelog for mk-archiver:

2008-01-24: version 1.0.7

   * Added --quiet option.
   * Added --plugin option.  The plugin interface is not backwards compatible.
   * Added --bulkins option.
   * Added --bulkdel option.
   * Added --nodelete option.
   * Changed negatable --ascend option to --noascend.

Changelog for mk-parallel-dump:

2008-01-24: version 1.0.5

   * The fix for bug #1863949 added an invalid argument to gzip (bug #1866137)
   * --quiet caused a crash.

Changelog for mk-parallel-restore:

2008-01-24: version 1.0.4

   * The -D option was used as a default DB for the connection (bug #1870415).

Changelog for mk-slave-prefetch:

2008-01-24: version 1.0.0

   * Initial release.

Changelog for mk-table-sync:

2008-01-24: version 1.0.4

   * Made the --algorithm option case-insensitive (bug #1873152).
   * Fixed a quoting bug.
   * Made the UTF-8 options configurable.

Written by Xaprb

January 24th, 2008 at 9:11 am

Posted in Uncategorized

Tagged with , , , ,

MySQL Archiver can now archive each row to a different table

without comments

One of the enhancements I added to MySQL Archiver in the recent release was listed innocently in the changelog as “Destination plugins can now rewrite the INSERT statement.” Not very exciting or informative, huh? Keep reading.

If you’ve used plugins with MySQL Archiver you know that I created a series of “hooks” where plugins can take some action: before beginning, before archiving each row, etc etc. This lets plugins do things like create new destination tables, aggregate archived rows to summary tables during archiving (great for building data warehouses, though not as sophisticated as Kettle), and so on. Well, this release added a new hook for plugins: custom_sth.

This lets a plugin override the prepared statement the tool will use to insert rows into the archive. By default the prepared statement just inserts into the destination table. But the custom_sth hook lets the plugin inspect the row that’s about to be archived and decide what to do with it. This lets it do interesting things like archive rows to different tables.

This came up because some of the tables I’m archiving to suddenly hit the bend in the hockey-stick curve. I diagnosed the problem very simply: inserts began taking most of the time during archiving. As you might know, MySQL Archiver has a statistics mode where it profiles every operation and reports the stats at the end. I’m archiving out of InnoDB into MyISAM; take a look at the stats:

Action          Count       Time        Pct
inserting      800584 12722.8245      88.35
deleting       800584  1464.1040      10.17
print_file     800584    58.3453       0.41
commit           3204    29.4391       0.20
select           1602     8.5654       0.06
other               0   116.5321       0.81

Inserting suddenly took 88% of the time spent archiving, when it had been taking a very small fraction of the time. I’d been meaning to split the archived data out by date and/or customer, and this convinced me it was time to stop procrastinating. There are columns in the archived rows for both of these dimensions in the data, so it shouldn’t be hard. So I added the custom_sth hook, wrote a 40-line plugin, and did it. Results:

Action             Count       Time        Pct
deleting           51675   525.2777      87.62
inserting          51675    49.3903       8.24
print_file         51675     4.4639       0.74
commit               208     2.1553       0.36
custom_sth         51675     1.4575       0.24
select               104     0.6714       0.11
before_insert      51675     0.1135       0.02
before_begin           1     0.0001       0.00
plugin_start           1     0.0000       0.00
after_finish           1     0.0000       0.00
other                  0    15.9868       2.67

(You can see the effect of having a plugin, because the time taken for all the hooks is listed in the stats. There was no plugin previously.)

Now inserting takes only 8% of the time required to archive. Put another way, it used to insert 63 rows per second, now it inserts 1046 rows per second. This is single-row inserts. (It is not intended to archive fast; it is intended to archive without disturbing the OLTP processes. Obviously this server can do a lot more inserts and deletes than this.)

What had happened? The MyISAM tables on the destination end had just gotten too big for their indexes to fit in memory, and the inserts had suddenly slowed dramatically. I didn’t want to give them a lot more memory, because I want the memory to be used for the InnoDB data on that machine. This is the same kind of thing, I’d guess, that Kevin Burton just wrote about.

Oh yeah, while I was at it, I totally rewrote the archiver with unit-tested, test-driven, test-first, other-buzzword-compliant code. I added a lot of other improvements, too. For example, it can now archive tables that have much harder keys to optimize efficiently, such as nullable non-unique non-primary keys.

Written by Xaprb

November 5th, 2007 at 9:28 am