Archive for the ‘data warehousing’ tag
MySQL Archiver can now archive each row to a different table
One of the enhancements I added to MySQL Archiver in the recent release was listed innocently in the changelog as “Destination plugins can now rewrite the INSERT statement.” Not very exciting or informative, huh? Keep reading.
If you’ve used plugins with MySQL Archiver you know that I created a series of “hooks” where plugins can take some action: before beginning, before archiving each row, etc etc. This lets plugins do things like create new destination tables, aggregate archived rows to summary tables during archiving (great for building data warehouses, though not as sophisticated as Kettle), and so on. Well, this release added a new hook for plugins: custom_sth.
This lets a plugin override the prepared statement the tool will use to insert rows into the archive. By default the prepared statement just inserts into the destination table. But the custom_sth hook lets the plugin inspect the row that’s about to be archived and decide what to do with it. This lets it do interesting things like archive rows to different tables.
This came up because some of the tables I’m archiving to suddenly hit the bend in the hockey-stick curve. I diagnosed the problem very simply: inserts began taking most of the time during archiving. As you might know, MySQL Archiver has a statistics mode where it profiles every operation and reports the stats at the end. I’m archiving out of InnoDB into MyISAM; take a look at the stats:
Action Count Time Pct inserting 800584 12722.8245 88.35 deleting 800584 1464.1040 10.17 print_file 800584 58.3453 0.41 commit 3204 29.4391 0.20 select 1602 8.5654 0.06 other 0 116.5321 0.81
Inserting suddenly took 88% of the time spent archiving, when it had been taking a very small fraction of the time. I’d been meaning to split the archived data out by date and/or customer, and this convinced me it was time to stop procrastinating. There are columns in the archived rows for both of these dimensions in the data, so it shouldn’t be hard. So I added the custom_sth hook, wrote a 40-line plugin, and did it. Results:
Action Count Time Pct deleting 51675 525.2777 87.62 inserting 51675 49.3903 8.24 print_file 51675 4.4639 0.74 commit 208 2.1553 0.36 custom_sth 51675 1.4575 0.24 select 104 0.6714 0.11 before_insert 51675 0.1135 0.02 before_begin 1 0.0001 0.00 plugin_start 1 0.0000 0.00 after_finish 1 0.0000 0.00 other 0 15.9868 2.67
(You can see the effect of having a plugin, because the time taken for all the hooks is listed in the stats. There was no plugin previously.)
Now inserting takes only 8% of the time required to archive. Put another way, it used to insert 63 rows per second, now it inserts 1046 rows per second. This is single-row inserts. (It is not intended to archive fast; it is intended to archive without disturbing the OLTP processes. Obviously this server can do a lot more inserts and deletes than this.)
What had happened? The MyISAM tables on the destination end had just gotten too big for their indexes to fit in memory, and the inserts had suddenly slowed dramatically. I didn’t want to give them a lot more memory, because I want the memory to be used for the InnoDB data on that machine. This is the same kind of thing, I’d guess, that Kevin Burton just wrote about.
Oh yeah, while I was at it, I totally rewrote the archiver with unit-tested, test-driven, test-first, other-buzzword-compliant code. I added a lot of other improvements, too. For example, it can now archive tables that have much harder keys to optimize efficiently, such as nullable non-unique non-primary keys.
High Performance MySQL, Second Edition: Backup and Recovery
Progress on High Performance MySQL, Second Edition is coming along nicely. You have probably noticed the lack of epic multi-part articles on this blog lately — that’s because I’m spending most of my spare time on the book. At this point, we have significant work done on some of the hardest chapters, like Schema Optimization and Query Optimization. I’ve been deep in the guts of those hard optimization chapters for a while now, so I decided to venture into lighter territory: Backup and Recovery, which is one of the few chapters we planned to “revise and expand” from the first edition, rather than completely writing from scratch.
Since we decided to take that approach, I began by following the outline from the first edition, and figured I’d re-read the first edition’s chapter and re-outline, then add more material as appropriate. To my surprise, I found this chapter in the first edition is one of the most cursory (I don’t mean to criticize too much — you’ll see where I’m going with this in a second). It’s quite short and doesn’t really discuss recovery at all, despite the chapter title. There’s one sub-section titled “Recovery,” but it’s only a few paragraphs, and mostly discusses dumping, not recovery! [Edit: whoops, I see each subsection in the "Tools and Techniques" has a few words about how to restore backups created with that specific tool. But there's still not much general advice about how to restore backups.]
The chapter devotes a lot of space to code listings and such, and not enough on how to do high-performance backups in a high-performance application, in my opinion. I quickly decided it needs to be significantly expanded, not just updated, and I scrapped the original text and became more liberal with the outline. I’m referring to the first edition as I write, but I’m not keeping any of the text. Chalk it up to perfectionism.
The outline, as I have it so far, is as follows. If you compare it to the first edition, you’ll see I’ve rearranged it quite a bit:
1 Why Backups? (very brief, even more so than the first edition) 2 Considerations and Tradeoffs 2.1 How Much Can You Afford to Lose? 2.2 Online or Offline? 2.3 Dump or Raw Backup? 2.4 Onsite or Offsite? 2.5 What to Back Up 2.6 Storage Engines and Consistency 2.7 Replication 3 Restoring from a Backup 3.1 Copying Files Across the Network 3.2 Starting MySQL 3.3 Point-In-Time Recovery 4 Tools and Techniques 4.1 mysqldump 4.2 mysqlhotcopy 4.3 Zmanda Recovery Manager 4.4 InnoDB Hot Backup 4.5 Offline Backups 4.6 Filesystem Snapshots 4.7 MySQL Global Hot Backup 4.8 Automating and Scripting Backups 5 Rolling Your Own Backup Script
At this point, I have written sections 1, 2 and 3, which are about 11 pages in OpenOffice.org (compare to 6 pages on paper in the first edition). I’m sure this will only grow as other things occur to me. The outline of section 4 is completely open to change, and section 5 might not even happen; if you can script, you can script. Otherwise, you might want to use one of the tools listed in section 4. All in all, I’d say we’re looking at about 25 to 30 pages, just based on what’s in my head and not yet written down.
Now, to come to my point: what would be helpful to you? Are there any challenges you’d like me to cover, such as how you back up a data warehouse with terabytes of data? (I’ve already done that, in What To Back Up, but feel free to ask anyway.) Are there challenges you have had to solve, which you think would be very helpful to others? This chapter is largely open to suggestion at this point. If you tell me/us what you’d like to see, this is your opportunity to get at least four experts to solve your problems in-depth.
The usual disclaimers apply: no guarantees, this is all open to change, this is top-secret pre-production material anyway and you never saw this web page. What is the first rule of Fight Club, again?
I’m looking forward to your feedback.
Archive strategies for OLTP servers, Part 3
In the first two articles in this series, I discussed archiving basics, relationships and dependencies, and specific archiving techniques for online transaction processing (OLTP) database servers. This article covers how to move the data from the OLTP source to the archive destination, what the archive destination might look like, and how to un-archive data. If you can un-archive easily and reliably, a whole new world of possibilities opens up.
For your reference, here are links to part 2 and part 1, and the original article on efficient SQL queries for archiving, which is the basis for this whole series.
How to move the data
At some point you have to actually take the data from the source and put it into the archive. This is a three-step process:
- Find archivable rows
- Insert them into the archive
- Delete them from the source
I wrote an article on how to find archivable rows efficiently, so I won’t go into it more here. Inserting and deleting are usually straightforward, but there are subtleties and tricks that can lead to nifty solutions.
Transactions
The most important question about actually moving the data is how to do it safely, with or without transactions. Even if the source and archive are on different servers, you can do distributed transactions, either in your application logic or with a two-phase commit supported by your database product. For most purposes, I’ve found it just as reliable and more efficient to handle the transaction in your application logic.
For many of the reasons mentioned in the second article in this series, I would recommend relaxing the consistency requirements between source and archive, so you can keep the archived data out of the source’s transaction. You can do this safely by performing the operations in the order I listed above: insert, delete, commit the insert, then commit the delete. If you are archiving to a file at the same time, you can also write to the file before the insert.
Your archive might also be non-transactional. If you’re using MySQL, you should think about using a faster non-transactional engine that stores the data more compactly, such as MyISAM, for the archived data.
Use replication to unload the OLTP server
One of the most effective ways to archive an OLTP server without impacting it too much is to do the hard work of finding and inserting on a slave server, then performing the delete on the master and letting it flow through replication. Here’s an example from a past employer: we replicated the order table to a “transfer” database on the data warehouse server. A job on the data warehouse server looked for orders that had completed and shipped, and thus could be archived. It copied these in a transaction to the long-term storage, then deleted on the OLTP server. This delete flowed through replication back to the data warehouse, and removed the rows from the transfer database.
The archive server
I’ve already mentioned some ways you might design the archive server, but there are a few other things I want to cover briefly. The first is what happens when you don’t use a different server at all, and just archive to adjacent tables on the OLTP server. This can be a fine way to do things. As long as the data isn’t used, it’s just taking up space on disk. However, it might make backups more difficult.
If you use a different server to hold the archived data, you should probably consider some kind of partitioning scheme. If your server doesn’t support partitioned tables natively, you might want to archive into a different table every so often, building views over all the tables to present them as a single logical table. There are some important advantages to this, especially if you eventually want to purge really old data. It is much easier to drop an entire table when it ages out than to delete rows from a huge table.
This gets into the topic of how to build a large data warehouse, so I’ll just leave it at this: if you forsee the archived data getting large, start planning for it now.
Duplicated data
Unless you use distributed transactions or some clever way to guarantee atomicity, there’s a chance you’ll insert to your archive but fail to delete from the source. Now you have duplicate data. How will you handle this?
First, decide if you can tolerate the situation. (I told you we hadn’t seen the last of the consistency requirements!) I suggest you take it as a given, if at all possible. Design your archiving jobs so they can tolerate existing data in case they get terminated uncleanly or otherwise have errors. If they try to insert rows that exist, you should probably overwrite the existing rows with new ones, which might have changed on the OLTP server. Make sure you don’t lose data from this, one way or another.
If you are archiving summary tables, you might need to be careful. A row that’s built incrementally on the OLTP system might need to be re-aggregated, instead of replaced, if it already exists in the archive.
Duplicated data makes some queries hard to get consistent. For instance, a view that takes the union of archived and un-archived data will tell a lie if a row exists in both places. You need to factor this in when deciding how to do the archiving. Duplicates can also happen during un-archiving.
Un-archiving
Why would you ever want to un-archive?
Here are some reasons you might benefit from being able to un-archive easily:
- You treat all the data as equally important, but some of it as more likely to be accessed
- You know there’s unused data but it’ll be inefficient to figure out which rows
- You can’t get an exact answer on whether rows are archivable
Think of it this way: the ability to un-archive lowers the risk of archiving the wrong data, and allows you to archive things you might otherwise be unable to touch. It takes away or mitigates the downside.
This goes back to my analogy of archiving as flushing from a cache. You probably don’t treat databases as a multi-tier cache, and that’s a good thing. If the data isn’t where you expect, your applications would need to look elsewhere and retrieve it. Unless you write a wrapper around your database access layer that handles it automatically, this is probably infeasible.
However, you can still use the concept of retrieving missing data under certain circumstances. Does the following sound workable?
- Make most applications tolerate missing data and just do what they can with the data they have
- Identify points of entry where incoming data is a signal to un-archive something
- Hook an un-archiving system into the points of entry
- Archive freely and let un-archiving bring back anything it needs to
Here’s a concrete example from the advertising system I mentioned previously. This system archives ads eagerly; if they don’t have any traffic in a while, it moves them away. There are limited points of entry for ad-related data: traffic data, and a record of a sale that is attributed to an ad. The systems that load this incoming data can simply check whether all referenced ads exist, and if not, attempt to un-archive any missing ones. This happens as part of the routine checks for bad incoming data. This approach is fairly new for us, and might have some kinks we haven’t yet discovered, but there is virtually no downside. The data isn’t gone, it’s just elsewhere. Now we can archive data we couldn’t before, because it was too hard to get a definite “yes, this can be archived” answer. (It’s often easy to get a “no,” but hard to get a “yes.”)
Un-archiving is non-trivial. In fact, depending on your schema, you might need to be more careful about consistency requirements than you are with your archiving strategy. However, if you’re archiving correctly, un-archives should be few and far between, so you can afford a more expensive process.
In many ways, your options for un-archiving strategies might be similar to archiving strategies. In the systems I’ve worked on, we take the depth-first-search, dependencies-first, all-one-chunk approach I think is too inefficient to use for archiving.
If your archive is non-transactional, be careful to commit the insert into the OLTP system before you delete from the archive. Otherwise your delete will be permanent, but your insert might be rolled back, and the data is lost.
You don’t have to un-archive
If you don’t want to build an un-archiving system, you can build your applications to look in the archive for data they need and can’t find in the OLTP system. If you do this seldom enough, it might work fine. One order-history system I know of does this to find orders that aren’t in the OLTP server.
Archiving miscellany
To round out this series, here is a collection of notes and references I didn’t want to mix in along the way, but I think belong here somewhere.
First, if you’re using MySQL, I’ve written tools that can handle much of the work I’ve described here. The first is MySQL Archiver, which can find and move archivable data very efficiently. It is full-featured and should handle most needs as-is, but it’s also extensible, so you can plug your own code in to handle complex archivability checks, dependency-first archiving, and so forth. Another of my tools, MySQL Find, can monitor and create historical records of table sizes, so you can get a sense of which tables are largest and/or growing the fastest (this is a one-liner that can go into your daily crontab). If you are archiving from InnoDB tables, you might want to record deadlocks with MySQL Deadlock Logger, for obvious reasons.
All these tools are part of the MySQL Toolkit, which is Free Software. Another useful tool is innotop, a real-time interactive MySQL and InnoDB monitor I’ve written.
A couple more notes on MySQL: the choice of storage engines makes MySQL especially attractive for single-server archiving solutions, in my opinion. It’s quite practical to run your intense OLTP workload on InnoDB tables (or another transactional storage engine, such as Paul McCullagh’s PBXT) and store the archive tables side-by-side in MyISAM or ARCHIVE tables. If you’re using MySQL 5.1, you also get partitioned tables; if you’re not that bleeding-edge, you might consider the strategy I suggested at the recent MySQL Conference and Expo: archive to a small InnoDB table for high concurrency, then regularly convert it to MyISAM and place it into a MERGE collection, with a view that unions the InnoDB and MERGE tables (Sheeri Critzer blogged about this also, though I’m not sure how many people are actually doing it).
I don’t really like triggers and foreign key magic, so I relegated this suggestion to here: you can use triggers and/or foreign key constraints with ON UPDATE actions to help with archiving and purging. I don’t like these approaches because I think they’re hidden, dangerous code. In Microsoft SQL Server I usually used stored procedures to archive, but in MySQL these days I use MySQL Archiver (linked above).
MySQL’s Edwin DeSouza wrote me to bring my attention to some of Craig Mullins’ recent articles about archiving. Craig’s insight is valuable if you’re researching archiving.
I think that’s it for the miscellany.
Conclusion
This series has covered what I believe to be the full scope of archiving strategies, from requirements to specific techniques you can use to archive and un-archive data from your OLTP systems. As a reminder, the larger context here is to offer scaling back as an alternative to the usual scale-up-or-scale-out dichotomy. There are always more options than you think! Archiving can be difficult and complex, but the potential gains for some types of data can be large, especially compared to some of the more frequently-discussed scaling tactics. Like other solutions, it doesn’t work for all situations, but if you forsee a huge amount of data coming at you, you should consider archiving along with other scaling techniques.





