Archive for the ‘archiving’ tag
In May 2005, I wrote a widely-referenced article about how to efficiently archive and/or purge data from online transaction processing (OLTP) database servers. That article focused on how to write efficient archiving SQL. In this article I’ll discuss archiving strategy, not tactics. OLTP servers tend to have complex schemas, which makes it important and sometimes difficult to design a good archiving strategy.
The 50,000-foot view
Archiving is actually a very large topic! My goal is to at least mention many of the things to consider, and go into some of them in detail. Here’s what I’ll cover in this and upcoming articles:
- Goals of archiving
- Where to store the archived data
- How to choose which rows are archivable
- How to deal with complex data relationships and dependencies
- How to actually archive the data
- Un-archiving strategy
Archiving: why do it?
Archiving is about scaling. There are several different views on how to scale a system, whether it runs MySQL or anything else. Opinions vary on the best way for any given usage, but the popular strategies are:
- Buy bigger, more expensive, faster hardware (vertical scaling)
- Split data into logical partitions or shards, and distribute these among many machines (horizontal scaling)
- Use federated views, clusters, or other distributed technologies that pretend there’s one server, but underneath there are actually many (federation)
Each strategy has its place, but each focuses on how to handle the same or more data while providing redundancy or high availability, improved performance, and so on. Sometimes the elephant in the room is the obvious but easily overlooked strategy: don't scale up or out, scale back. This may not be possible, but when it is, it's often preferable.
How do you know if you can do this? Just ask whether you need all the data. Most of the time you don’t. Think about how true this is throughout computing as a whole. Caching is everywhere in computing, and it works precisely because you don’t need all the data, or at least you don’t need all of it all the time. Most data has some spatial or temporal hot spots, and often there’s a pretty large chunk of it that’s never accessed at all. If you don’t need all the data all the time, you might be able to move some of it away to bigger, cheaper, slower storage. That’s archiving.
If you can scale back rather than up or out, you may be able to keep your OLTP servers lean and fast without changing anything. This can easily buy you an order of magnitude. Getting the same gain from the other strategies is generally much harder: to give each server an order of magnitude less data by scaling out, you may need ten times as many servers. Scaling up tends to be even more expensive; the price-to-performance ratio climbs very sharply as you get into big-iron machines. You also need to consider the cost of rack space, cooling, power, maintenance, and so forth.
When people are a roadblock
I think it’s important to resist building a system that will always provide all the data that ever existed, just in case you need it. If you’re worried you might need the data someday, keep it on DVD, but you don’t need to keep it on your OLTP servers. And don’t compare yourself to Amazon or Google. You probably have limited resources, and though it may make for good bragging rights, trying to engineer something massive is likely bad for your business. Sometimes people want to build unreasonable systems because they have something to prove, or they have second-system syndrome. Know your limits as an individual, and if you’re a manager, take a moment to assess whether your engineers are advocating more than is good for the company.
Gather requirements for the archive
If you’ve identified that you can archive some of your data, I think the next important step is to figure out what you require of the stored data. I’ve worked in a place where the only requirement was legal; we needed to be able to retrieve the data only if we got audited or had to go to court. By contrast, my current employer needs infrequent access to archived data for long-term analysis, but it needs to be queryable live and quickly.
You need to balance convenience, speed, space, expense, durability, legal obligations, and whatever other requirements you identify. Chances are there’s a fairly obvious solution for you:
- If you never need the data again, you can simply purge it.
- If you only need it for legal requirements, you can archive it to a file, then burn the files to CDs and put them in a safe deposit box.
- If you need it for infrequent queries that can be slow, you can archive to a table, then burn the table to CD. When you need it back, you can mount the CD and have your server read the table right from CD. One thing to beware of, though: future server versions aren’t guaranteed to read old table formats!
- If you need it for frequent or fast querying, you might build a data warehouse, or you may be able to simply move the data to “adjacent” tables on the same server.
Any of these ideas could be mixed and matched. Let me give some examples of what I’ve done in real life.
At one company using SQL Server, we archived data to a beefy OLAP (online analytical processing) server via replication. I’ll go through the exact strategy in an upcoming article, but basically we replicated the data to a staging area on the OLAP server reserved for in-transit data. From there it moved into the long-term storage tables transactionally, and analytics applications read from views that unioned the in-transit and long-term storage tables.
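The union-view pattern above can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for the SQL Server setup; the table and column names are hypothetical, not the ones we actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_in_transit (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE orders_long_term  (id INTEGER PRIMARY KEY, total REAL);

    -- Analytics applications read from this view, so archived rows are
    -- visible whether or not they've reached long-term storage yet.
    CREATE VIEW orders_all AS
        SELECT id, total FROM orders_in_transit
        UNION ALL
        SELECT id, total FROM orders_long_term;
""")
conn.execute("INSERT INTO orders_in_transit VALUES (1, 9.99)")
conn.execute("INSERT INTO orders_long_term  VALUES (2, 4.50)")
rows = conn.execute("SELECT id FROM orders_all ORDER BY id").fetchall()
print(rows)  # both rows appear through the view
```

The point of the view is that readers never need to know which stage a row is in; moving rows from the staging area to long-term storage is invisible to them.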
At my current employer using MySQL, we archive transactional InnoDB tables to adjacent tables on the same server, which use a different storage engine and sometimes fewer indexes. At the same time we archive to files, which we periodically burn to CD and put in safe storage. This would have worked well even if we’d used the InnoDB storage engine for the archive tables, simply because the OLTP tables would have been smaller, but archiving to another storage engine has the added benefits of moving the data out of the InnoDB tablespace and storing it more compactly.
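The core of the adjacent-table approach is a copy-then-delete inside a transaction. Here is a minimal sketch, again with SQLite standing in for MySQL and hypothetical table names and cutoff:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments         (id INTEGER PRIMARY KEY, created TEXT);
    CREATE TABLE payments_archive (id INTEGER PRIMARY KEY, created TEXT);
""")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [(1, "2005-01-01"), (2, "2005-06-01"), (3, "2006-06-01")])

cutoff = "2006-01-01"
with conn:  # copy and delete in one transaction so no row is lost in between
    conn.execute("INSERT INTO payments_archive "
                 "SELECT * FROM payments WHERE created < ?", (cutoff,))
    conn.execute("DELETE FROM payments WHERE created < ?", (cutoff,))

live = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM payments_archive").fetchone()[0]
print(live, archived)  # the two pre-cutoff rows moved to the archive table
```

In production you would do this in small batches rather than one big statement, for the contention reasons discussed below.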
In this article I surveyed the preliminaries of archiving: motivations to archive and requirements of the archived data. While some of these decisions will actually depend on things I’ll write about in upcoming articles, it’s good to have several options in mind before you evaluate specific strategies.
In the next article I’ll discuss how to select which data is archivable, and how to deal with the complexities of OLTP schemas during the archiving process.
This release fixes some minor bugs and adds a plugin mechanism. Now you can easily extend MySQL Archiver with your own code. You could use this to run setup and tear-down, hook code into the archiving process, and more. Possibilities include building summary tables in a data warehouse during archiving, handling dependencies such as foreign keys before archiving each row, or applying advanced logic to determine which rows to archive.
The documentation contains full details about the plugin interface, including example code.
MySQL Archiver is the implementation of the efficient forward-only archiving and purging strategies I wrote about more than a year ago. It nibbles rows from a table, then inserts them into another table and/or writes them to a file. The goal is to do this without interfering with critical online transaction-processing (OLTP) queries.
Several people have asked me to release this code, which I originally wrote for my employer. As it turns out, the delay has been fruitful. I learned a lot more about query optimization during this time, found bugs with my original approach, and got exposure to different archiving needs and techniques. As a result, this tool runs something like four to ten times faster than the code I wrote last year.
I decided to write and release it now because my employer has grown to the point where we need to archive more data, faster and more flexibly. Instead of just open-sourcing the code I wrote last year, I have rewritten it from the ground up. We run exactly the same code internally, and hope to benefit from community feedback and improvements.
I think the result is a good tool that does a lot of work for you:
- It automatically writes efficient queries by inspecting table structures and indexes.
- It handles transactions, lock timeouts and deadlocks.
- It writes archived data to a file in the same format LOAD DATA INFILE uses by default.
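For reference, LOAD DATA INFILE's default text format is tab-separated fields, newline-terminated lines, backslash escapes, and \N for NULL. A small sketch of a writer for that format (my own illustration, not the tool's actual code):

```python
def to_load_data_row(values):
    """Format one row of values in LOAD DATA INFILE's default format."""
    def field(v):
        if v is None:
            return "\\N"                       # NULL is written as \N
        s = str(v)
        return (s.replace("\\", "\\\\")        # escape backslashes first,
                 .replace("\t", "\\t")         # then the field separator
                 .replace("\n", "\\n"))        # and the line terminator
    return "\t".join(field(v) for v in values) + "\n"

print(repr(to_load_data_row([1, "alice", None])))
```

A file of such lines can be reloaded with a plain LOAD DATA INFILE statement, no FIELDS or LINES clauses needed.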
It has a lot of options and functionality, so I won’t go into it too much here. I also have several ideas I want to implement in the future, but I want to see what the community thinks of what I’ve done so far before I work on it too much more.
Despite the improvements, the basic approach remains the same: it finds the first row(s), and then on subsequent queries, it continues from where it left off, rather than scanning the whole table from the start. This makes it efficient to archive in small “nibbles,” which avoids contention with OLTP queries.
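The "continue from where it left off" approach is essentially keyset pagination on the primary key. Here is a minimal sketch of the idea, with SQLite standing in for MySQL and a hypothetical log table; the real tool generates the queries from the table's actual indexes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?)",
                 [(i, "msg %d" % i) for i in range(1, 11)])

def nibble(conn, last_id, size):
    """Fetch the next small batch of rows after last_id -- an index range
    scan, not a scan of the whole table from the start."""
    return conn.execute(
        "SELECT id, msg FROM log WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size)).fetchall()

last_id, batches = 0, 0
while True:
    rows = nibble(conn, last_id, 3)
    if not rows:
        break
    # ...archive and delete this batch, then remember where we stopped.
    last_id = rows[-1][0]
    batches += 1
print(batches, last_id)
```

Because each batch is tiny and positioned by index, locks are held briefly and OLTP queries can run in between the nibbles.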
I’ve put almost 30 extracurricular hours into this recently. Most of the time has gone into making sure every type of archiving job my employer needs can be generated as efficiently as possible with a minimum of fuss, such as a simple command-line option or two. I’m eager to hear what you think of it, whether it meets your needs, and how it can be improved. And I’m glad I’ve finally gotten it done after all this time!
About MySQL Toolkit
MySQL Toolkit is a set of essential tools for MySQL users, developers and administrators. The project’s goal is to make high-quality command-line tools that follow the UNIX philosophy of doing one thing and doing it well. They are designed for scriptability and ease of processing with standard command-line utilities such as