Tag Archive for 'scaling'

How to scale writes with master-master replication in MySQL

This post is SEO bait for people trying to scale MySQL’s write capacity by writing to both servers in master-master replication. The short answer: you can’t do it. It’s impossible.

I keep hearing this line of reasoning: “if I make a MySQL replication ‘cluster’ and move half the writes to machine A and half of them to machine B, I can increase my overall write capacity.” It’s a fallacy. All writes are repeated on both machines: the writes you do on machine A are repeated via replication on machine B, and vice versa. You don’t shield either machine from any of the load.

In addition, doing this introduces a very dangerous side effect: in case of a problem, neither machine has the authoritative data. Neither machine’s data can be trusted, but neither machine’s data can be discarded either. This is a very difficult situation to recover from. Save yourself grief, work, and money. Never write to both masters.

Technorati Tags:, , ,

You might also like:

  1. How to sync tables in master-master MySQL replication

Using BASE instead of ACID for scalability

My editor Andy Oram recently sent me an ACM article on BASE, a technique for improving scalability by being willing to give up some other properties of traditional transactional systems.

It’s a really good read. In many ways it is the same religion everyone who’s successfully scaled a system Really Really Big has advocated. But this is different: it’s a very clear article, with a great writing style that really cuts out the fat and teaches the principles without being specific to any environment or sounding egotistical.

He mentions a lot of current thinking in the field, including the CAP principle, which Robert Hodges of Continuent first turned me onto a couple months ago. It has been proven formally, though I have not read the proof myself.

One of the most important concepts he advances is giving up the illusion of control. As programmers and DBAs, I think we may tend to like control too much. Foreign keys are a perfect example. I think the point here is that these things make you feel safe, but they don’t really make you safe. Just as with so many things in life, recognizing our inability to really control the systems we build is key to working with their strengths — instead of trying to bind them with iron bands.

Another great point is idempotency. This is a great way to help avoid problems with MySQL replication, by the way. I’ll leave the “why” as an exercise for the reader, but let me just point out that the file MySQL uses to remember its current position in replication is not synced to disk, so it will almost certainly get out of whack if MySQL dies ungracefully. (Google has solved this problem.)

A highly recommended read — worth more than most case studies about how specific companies have scaled their specific systems.

Technorati Tags:, , , , , , , , , , ,

You might also like:

  1. Get a free sample chapter of High Performance MySQL Second Edition

More progress on High Performance MySQL, Second Edition

Whew! I just finished a marathon of revisions. It’s been a while since I posted about our progress, so here’s an update for the curious readers.

I just finished revising the last two major chapters that Peter Zaitsev hasn’t yet reviewed. Peter has been essentially going through the chapters like a very thorough technical reviewer. He makes corrections, points out where things aren’t clear or need examples, and adds more material.

By “finished revising,” I mean finished expanding the outline into a full chapter. We’re still working at the level of “this chapter is mostly there, but we might decide to revise it more.” We will most certainly do so in many cases. There are some chunks of material that I’ve marked TODO to put into other chapters, for example. We’re not at the level of a final draft with any chapter except the chapter on MySQL’s architecture, but we’re getting close with the others now.

Most of the chapters are in tech review now, and we’ve gotten a few of them back. The comments from the reviewers have been very helpful. We expanded the Replication chapter quite a bit after tech review. (And then Peter reviewed it and we expanded it even more). When the tech reviewers return comments on the other chapters, we’ll revise some more.

We’re up to 529 pages in OpenOffice.org now. At my calculated ratio of 1 page = 1.1 pages in print, that’s about 582 pages in print. And that’s not counting the Replication chapter, which doesn’t have all of its illustrations yet. I predicted we’d break 500 pages; we might get close to 600. These are very, very densely written, too. No offense to the first edition, but the tone is quite different; much less light-hearted banter, much more compressed information. Peter is a walking encyclopedia, and never seems to run out of details we really ought to include because they’re important (and they are).

We may, or may not, go to production in the next few weeks. Regardless, I think we’re still on track to have the book on shelves by the MySQL Conference & Expo in April. Look for me there. I’ll be easy to find: I’ll be the tall guy with a permanent silly grin. (You’d grin too if you finished writing a book that’s been this much work!)

I’ve posted rough outlines for many of the other chapters. The two Peter and I just finished working on are the Scaling/HA/Load-Balancing/Failover chapter, and the Application-Level Optimization chapter. The Scaling/HA chapter is pretty long and very involved, and goes into a lot of detail on scaling in particular, especially horizontal scaling via sharding. (We use “sharding” because it’s less confusing than calling it “partitioning,” which already means too many different things in databases).

The Application-Level Optimization chapter is a little shorter. It’s mostly about caching strategies, how to make a web server run well, and so on. These aren’t what the book focuses on directly, but you can either help or hurt the database server a lot with your application design. Our goal here is to help people avoid the common mistakes.

For the curious, here’s the current outline for these two chapters:

Scaling and High Availability
  Terminology
  Scaling MySQL
    Planning for Scalability
    Buying Time Before Scaling
    Scaling Up
    Scaling Out
      Functional Partitioning
      Data Sharding
      Choosing a Partitioning Key
        Multiple Partitioning Keys
      Querying Across Shards
      Allocating Data, Shards, and Nodes
        Arranging Shards on Nodes
      Fixed Allocation
      Dynamic Allocation
        Mixing Dynamic and Fixed Allocation
      Explicit Allocation
      Sidebar: Re-Balancing Shards
      Tools for Sharding
    Scaling Back
      Keeping Active Data Separate
    Scaling by Clustering
      Clustering
      Federation
  Load Balancing
    Connecting Directly
      Splitting Reads and Writes in Replication
      Changing Application Configuration
      Changing DNS Names
      Moving IP Addresses
    Introducing a Middleman
      MySQL Proxy
      Load Balancers
    Load Balancing Algorithms
      Adding and Removing Servers in the Pool
    Load Balancing with a Master and Multiple Slaves
  High Availability
    Planning for High Availability
    Adding Redundancy
      Shared-Storage Architectures
      Replicated-Disk Architectures
      Synchronous MySQL Replication
    Failover and Failback
      Promoting a Slave or Switching Roles
      Virtual IP Addresses or IP Takeover
      MySQL Master-Master Replication Manager
      Middleman Solutions
      Handling Failover in the Application

And here’s the outline for the Application-Level Optimization chapter:

Application-Level Optimization
  Application Performance Overview
    Find the Source of the Problem
    Look for Common Problems
  Web Server Issues
    Finding the Optimal Concurrency
  Caching
    Sidebar: Caching Doesn't Always Help
    Caching Below the Application
    Application-Level Caching
    Cache Control Policies
    Cache Object Hierarchies
    Pre-Generating Content
  Extending MySQL
  Alternatives to MySQL

The thing that makes me the happiest right now is that we’re clearly going to make it. For a while, there was just so much work left to do that it was impossible to estimate how much. (Ask my wife: I was wrong many times when she asked how long it would take me to finish a chapter). I also didn’t know how much revision would be necessary, which is very scary; revising takes about four times as long as writing a first draft, by my reckoning. At this point, the remaining work is much smaller, and much easier to estimate. And now I no longer flip-flop daily between “I think we can, I think we can” and “please don’t ask, because I don’t know and I want a vacation.”

Subversion shows me that Peter has the Security chapter locked right now. This one is not a huge one, and Arjen Lentz has already reviewed it as well, so I don’t expect it to be a huge amount of work to revise. After that, it’s minor chapters and appendices. (We might actually convert the chapters on Server Status and Tools into appendices, since they got cannibalized when we realized their material fit better elsewhere. They also don’t have a very chapter-ish feel; they feel more like appendices). We’ve added a few more appendices, including one on EXPLAIN and one on debugging server and storage-engine locking problems. These are all great reference material.

See you at the conference in April!

Technorati Tags:, , , , , , , , , , , , ,

You might also like:

  1. High Performance MySQL, Second Edition: Replication, Scaling and High Availability
  2. Progress report on High Performance MySQL, Second Edition
  3. Progress on High Performance MySQL, Second Edition
  4. High Performance MySQL, Second Edition: Advanced SQL Functionality
  5. High Performance MySQL, Second Edition: Backup and Recovery

Progress report on High Performance MySQL, Second Edition

It’s been a while since I’ve written about progress on the book. I actually stopped working on it as much at the beginning of the month, because on October 31st I managed to finish a first draft of the last big chapter (Scaling and High Availability)! Now I’m back to full-time work at my employer, and I’m working on the book in the evenings and weekends only.

This doesn’t mean the book is close to being done, though. The editor is sending out some chapters for technical review, and there’s still a lot more writing and revising to be done.

Last weekend I revised the Security chapter from the first edition, which I think will be the only chapter that we’ll just revise and update, rather than completely rewriting (well, maybe the Architecture chapter could be considered a revision instead of a rewrite, but it’s a stretch; we changed it a lot). I removed a lot of the material that repeated the MySQL manual, and added a lot of information and best practices on grants, new privileges and objects in MySQL 5, common tasks, common mistakes, and so on. The chapter ended up being nearly as long, even though I stripped out all the code listings and so on from the first edition (in fact, I reduced the first edition’s material to a few paragraphs).

Beyond that, though, there are little details to finish out in many of the chapters. Examples that need to be finished, figures that need to be re-drawn, material that doesn’t quite fit and needs to be re-arranged or even moved to another chapter; it’s a lot of work. Peter Zaitsev has been reviewing some of the core chapters on query and schema optimization etc, and I’m revising them in response to his comments. That’s what I spent today doing.

I think the biggest chunks of work that remain are going to be making chapters 3, 4, 5, 6 and 7 (benchmarking, profiling, schema, indexing, query optimization, advanced features, and server tuning) flow together well. The challenge here is how to organize the vast amount of material so it reads well, without too many forward references, and still be useful as a reference work. The detail we’ve gone into is incredible. It makes it very hard to find the single best place to mention each little bit of wisdom, because all of this material is completely inter-related. It’s tough to flatten the graph of knowledge into a one-dimensional narrative.

It’s not just these chapters that have a lot of inter-related material, of course. It’s hard to talk about tuning the server settings (chapter 7) without bringing the OS and hardware (chapter 8) into it, and whenever you do this you also need to think about measuring and monitoring status information (chapter 14). Of course, you need to do that for benchmarking and profiling, too (chapter 3). I’m sure you see the dilemma!

The good news is, if we succeed in doing this well, you will find the book enormously useful. Stay tuned!

Technorati Tags:, , , , , , , , , , , , , , ,

You might also like:

  1. Organizing High Performance MySQL, 2nd Edition
  2. Progress on High Performance MySQL, Second Edition
  3. High Performance MySQL, Second Edition: Backup and Recovery
  4. High Performance MySQL, Second Edition: Advanced SQL Functionality
  5. More progress on High Performance MySQL, Second Edition

High Performance MySQL, Second Edition: Replication, Scaling and High Availability

Continuing in the tradition, which I hope has been as helpful to you as it has been to me, I’m opening the floor for suggestions on chapter 9 of the upcoming High Performance MySQL, Second Edition. Unlike the other chapters for which I’ve listed outlines, this one isn’t substantially written yet. It’s in detailed outline form at this point (a tactic that has worked very well for us so far — I’ll write about that someday).

I’m trying to get feedback much earlier in this chapter’s lifecycle, for several reasons. Two of the most important are that this is one of the first chapters I’ve had a chance to really take from scratch, and the chapters I haven’t written from scratch have been harder to organize, as you’ve probably seen from the last few outlines I posted. There’s a lot of value in working top-down on this deep encyclopedia-style material.

The outline, as it stands now, is basically headings with bulleted lists of important details. Here are the top-level headings:

[Intro]
Scaling and High Availability Requirements
Replication Overview
Configuring Replication
Under the Hood of Replication
Replication Topologies
Replication Administration and Maintenance
Replication Problems and Solutions
The Future of MySQL Replication
Scaling MySQL Horizontally
Clustering with MySQL
   MySQL Cluster
   Other Clustering Solutions
Load Balancing

Just a few notes. These sections are top-level, and will likely be split into many sub-sections like other chapter outlines I’ve posted. A typical section has a couple dozen bullet-points in it, at a high level of granularity, such as “Using DRBD for log replication only.” I think we’ll also add in a separate section on fail-over and fail-back, but that’s not in the outline as of right now (what do you think belongs in it?).

I don’t know what it’s like for you to read outlines and see little bits of the book being assembled, but the process of writing this book is just fascinating to me. It’s endlessly interesting and educational — just the process of writing, let alone the subject matter! This is a really fun project. A heck of a lot of work, but fun nonetheless, and the openness of the project makes it even more fun for me. I’ve learned a lot of surprising and interesting things about writing. I keep wishing I had time to write about this process, but I really need to keep my eye on the deadlines and put that off for later.

Anyway, the usual requests apply: what’s missing, what do you think is cool and should be included, etc etc? Thanks, as usual, for your time and feedback.

Technorati Tags:, , , , , , ,

You might also like:

  1. More progress on High Performance MySQL, Second Edition
  2. What are your favorite MySQL replication filtering rules?
  3. Progress report on High Performance MySQL, Second Edition
  4. High Performance MySQL, Second Edition: Backup and Recovery
  5. Progress on High Performance MySQL, Second Edition

Coming soon: High Performance MySQL, Second Edition

We’ve begun writing the second edition of the now-classic High Performance MySQL. “We” means co-authors Arjen Lentz (formerly of MySQL), Baron Schwartz (that’s me), and Vadim Tkachenko and Peter Zaitzev, both formerly of MySQL’s high-performance team and now partners at Percona, a high-performance MySQL consultancy firm and host of the popular MySQL Performance Blog. Neither of the first edition’s authors (Jeremy Zawodny and Derek Balling) is working on this project, but they’re with us in spirit, I think. O’Reilly is still the publisher, and Andy Oram is still the editor.

Though we’re theoretically revising and updating the first edition, we’re actually starting from scratch and re-writing the book. We’re expanding it from the first edition’s 265 pages to 384, according to the contract, but my unofficial guess is it’ll go well over 400 pages. A lot has changed since Jeremy and Derek wrote the first edition — high performance MySQL is a bigger subject today, with different techniques, tools and technologies, and of course a much more complicated MySQL server. The second edition will remain the definitive reference for building high-performance, scalable systems with MySQL.

We’re early in the process, so it’s hard to know how far into the future we can safely look. Still, just to whet your appetite, here’s the table of contents:

  1. Preface
  2. Back to basics
  3. MySQL Architecture
  4. Finding Bottlenecks: Profiling and Benchmarks
  5. Schema Optimization and indexing
  6. Query Performance Optimization
  7. Advanced SQL Functionality
  8. Optimizing Server Settings
  9. Operating System and Hardware Optimization
  10. Scaling and High Availability
  11. Application Level Optimization
  12. Backup and Recovery
  13. Security
  14. Analyzing Server Status
  15. Tools for High Performance

Stay tuned for more news as the book progresses. The four of us plan to blog as we go.

Technorati Tags:, , , , , , , , ,

You might also like:

  1. Progress on High Performance MySQL, Second Edition
  2. Get a free sample chapter of High Performance MySQL Second Edition
  3. More progress on High Performance MySQL, Second Edition
  4. High Performance MySQL, Second Edition: Backup and Recovery
  5. High Performance MySQL, Second Edition: Advanced SQL Functionality

Archive strategies for OLTP servers, Part 3

In the first two articles in this series, I discussed archiving basics, relationships and dependencies, and specific archiving techniques for online transaction processing (OLTP) database servers. This article covers how to move the data from the OLTP source to the archive destination, what the archive destination might look like, and how to un-archive data. If you can un-archive easily and reliably, a whole new world of possibilities opens up.

For your reference, here are links to part 2 and part 1, and the original article on efficient SQL queries for archiving, which is the basis for this whole series.

How to move the data

At some point you have to actually take the data from the source and put it into the archive. This is a three-step process:

  1. Find archivable rows
  2. Insert them into the archive
  3. Delete them from the source

I wrote an article on how to find archivable rows efficiently, so I won’t go into it more here. Inserting and deleting are usually straightforward, but there are subtleties and tricks that can lead to nifty solutions.

Transactions

The most important question about actually moving the data is how to do it safely, with or without transactions. Even if the source and archive are on different servers, you can do distributed transactions, either in your application logic or with a two-phase commit supported by your database product. For most purposes, I’ve found it just as reliable and more efficient to handle the transaction in your application logic.

For many of the reasons mentioned in the second article in this series, I would recommend relaxing the consistency requirements between source and archive, so you can keep the archived data out of the source’s transaction. You can do this safely by performing the operations in the order I listed above: insert, delete, commit the insert, then commit the delete. If you are archiving to a file at the same time, you can also write to the file before the insert.

Your archive might also be non-transactional. If you’re using MySQL, you should think about using a faster non-transactional engine that stores the data more compactly, such as MyISAM, for the archived data.

Use replication to unload the OLTP server

One of the most effective ways to archive an OLTP server without impacting it too much is to do the hard work of finding and inserting on a slave server, then performing the delete on the master and letting it flow through replication. Here’s an example from a past employer: we replicated the order table to a “transfer” database on the data warehouse server. A job on the data warehouse server looked for orders that had completed and shipped, and thus could be archived. It copied these in a transaction to the long-term storage, then deleted on the OLTP server. This delete flowed through replication back to the data warehouse, and removed the rows from the transfer database.

The archive server

I’ve already mentioned some ways you might design the archive server, but there are a few other things I want to cover briefly. The first is what happens when you don’t use a different server at all, and just archive to adjacent tables on the OLTP server. This can be a fine way to do things. As long as the data isn’t used, it’s just taking up space on disk. However, it might make backups more difficult.

If you use a different server to hold the archived data, you should probably consider some kind of partitioning scheme. If your server doesn’t support partitioned tables natively, you might want to archive into a different table every so often, building views over all the tables to present them as a single logical table. There are some important advantages to this, especially if you eventually want to purge really old data. It is much easier to drop an entire table when it ages out than to delete rows from a huge table.

This gets into the topic of how to build a large data warehouse, so I’ll just leave it at this: if you forsee the archived data getting large, start planning for it now.

Duplicated data

Unless you use distributed transactions or some clever way to guarantee atomicity, there’s a chance you’ll insert to your archive but fail to delete from the source. Now you have duplicate data. How will you handle this?

First, decide if you can tolerate the situation. (I told you we hadn’t seen the last of the consistency requirements!) I suggest you take it as a given, if at all possible. Design your archiving jobs so they can tolerate existing data in case they get terminated uncleanly or otherwise have errors. If they try to insert rows that exist, you should probably overwrite the existing rows with new ones, which might have changed on the OLTP server. Make sure you don’t lose data from this, one way or another.

If you are archiving summary tables, you might need to be careful. A row that’s built incrementally on the OLTP system might need to be re-aggregated, instead of replaced, if it already exists in the archive.

Duplicated data makes some queries hard to get consistent. For instance, a view that takes the union of archived and un-archived data will tell a lie if a row exists in both places. You need to factor this in when deciding how to do the archiving. Duplicates can also happen during un-archiving.

Un-archiving

Why would you ever want to un-archive?

Here are some reasons you might benefit from being able to un-archive easily:

  • You treat all the data as equally important, but some of it as more likely to be accessed
  • You know there’s unused data but it’ll be inefficient to figure out which rows
  • You can’t get an exact answer on whether rows are archivable

Think of it this way: the ability to un-archive lowers the risk of archiving the wrong data, and allows you to archive things you might otherwise be unable to touch. It takes away or mitigates the downside.

This goes back to my analogy of archiving as flushing from a cache. You probably don’t treat databases as a multi-tier cache, and that’s a good thing. If the data isn’t where you expect, your applications would need to look elsewhere and retrieve it. Unless you write a wrapper around your database access layer that handles it automatically, this is probably infeasible.

However, you can still use the concept of retrieving missing data under certain circumstances. Does the following sound workable?

  1. Make most applications tolerate missing data and just do what they can with the data they have
  2. Identify points of entry where incoming data is a signal to un-archive something
  3. Hook an un-archiving system into the points of entry
  4. Archive freely and let un-archiving bring back anything it needs to

Here’s a concrete example from the advertising system I mentioned previously. This system archives ads eagerly; if they don’t have any traffic in a while, it moves them away. There are limited points of entry for ad-related data: traffic data, and a record of a sale that is attributed to an ad. The systems that load this incoming data can simply check whether all referenced ads exist, and if not, attempt to un-archive any missing ones. This happens as part of the routine checks for bad incoming data. This approach is fairly new for us, and might have some kinks we haven’t yet discovered, but there is virtually no downside. The data isn’t gone, it’s just elsewhere. Now we can archive data we couldn’t before, because it was too hard to get a definite “yes, this can be archived” answer. (It’s often easy to get a “no,” but hard to get a “yes.”)

Un-archiving is non-trivial. In fact, depending on your schema, you might need to be more careful about consistency requirements than you are with your archiving strategy. However, if you’re archiving correctly, un-archives should be few and far between, so you can afford a more expensive process.

In many ways, your options for un-archiving strategies might be similar to archiving strategies. In the systems I’ve worked on, we take the depth-first-search, dependencies-first, all-one-chunk approach I think is too inefficient to use for archiving.

If your archive is non-transactional, be careful to commit the insert into the OLTP system before you delete from the archive. Otherwise your delete will be permanent, but your insert might be rolled back, and the data is lost.

You don’t have to un-archive

If you don’t want to build an un-archiving system, you can build your applications to look in the archive for data they need and can’t find in the OLTP system. If you do this seldom enough, it might work fine. One order-history system I know of does this to find orders that aren’t in the OLTP server.

Archiving miscellany

To round out this series, here is a collection of notes and references I didn’t want to mix in along the way, but I think belong here somewhere.

First, if you’re using MySQL, I’ve written tools that can handle much of the work I’ve described here. The first is MySQL Archiver, which can find and move archivable data very efficiently. It is full-featured and should handle most needs as-is, but it’s also extensible, so you can plug your own code in to handle complex archivability checks, dependency-first archiving, and so forth. Another of my tools, MySQL Find, can monitor and create historical records of table sizes, so you can get a sense of which tables are largest and/or growing the fastest (this is a one-liner that can go into your daily crontab). If you are archiving from InnoDB tables, you might want to record deadlocks with MySQL Deadlock Logger, for obvious reasons.

All these tools are part of the MySQL Toolkit, which is Free Software. Another useful tool is innotop, a real-time interactive MySQL and InnoDB monitor I’ve written.

A couple more notes on MySQL: the choice of storage engines makes MySQL especially attractive for single-server archiving solutions, in my opinion. It’s quite practical to run your intense OLTP workload on InnoDB tables (or another transactional storage engine, such as Paul McCullagh’s PBXT) and store the archive tables side-by-side in MyISAM or ARCHIVE tables. If you’re using MySQL 5.1, you also get partitioned tables; if you’re not that bleeding-edge, you might consider the strategy I suggested at the recent MySQL Conference and Expo: archive to a small InnoDB table for high concurrency, then regularly convert it to MyISAM and place it into a MERGE collection, with a view that unions the InnoDB and MERGE tables (Sheeri Critzer blogged about this also, though I’m not sure how many people are actually doing it).

I don’t really like triggers and foreign key magic, so I relegated this suggestion to here: you can use triggers and/or foreign key constraints with ON UPDATE actions to help with archiving and purging. I don’t like these approaches because I think they’re hidden, dangerous code. In Microsoft SQL Server I usually used stored procedures to archive, but in MySQL these days I use MySQL Archiver (linked above).

MySQL’s Edwin DeSouza wrote me to bring my attention to some of Craig Mullins’ recent articles about archiving. Craig’s insight is valuable if you’re researching archiving.

I think that’s it for the miscellany.

Conclusion

This series has covered what I believe to be the full scope of archiving strategies, from requirements to specific techniques you can use to archive and un-archive data from your OLTP systems. As a reminder, the larger context here is to offer scaling back as an alternative to the usual scale-up-or-scale-out dichotomy. There are always more options than you think! Archiving can be difficult and complex, but the potential gains for some types of data can be large, especially compared to some of the more frequently-discussed scaling tactics. Like other solutions, it doesn’t work for all situations, but if you forsee a huge amount of data coming at you, you should consider archiving along with other scaling techniques.

Technorati Tags:, , , , , , , , , , , , , , ,

You might also like:

  1. Archive strategies for OLTP servers, Part 1
  2. MySQL Archiver can now archive each row to a different table
  3. Archive strategies for OLTP servers, Part 2
  4. How to write a lazy UNION in MySQL
  5. MySQL Archiver 0.9.1 released

Archive strategies for OLTP servers, Part 2

In the first article in this series on archiving strategies for online transaction processing (OLTP) database servers, I covered some basics: why to archive, and what to consider when gathering requirements for the archived data itself. This article is more technical. I want to help you understand how to choose which rows are archivable, and how to deal with complex data relationships and dependencies. In that context, I’ll also discuss a few concrete archiving strategies, their strengths and shortcomings, and how they can satisfy your requirements, especially requirements for data consistency, which as you will see is one of the most difficult problems in archiving.

Remember I’m basing these articles on the nibbling principle I explained in my very first article on archiving strategies. The goal is not to move away tables or take gigantic chunks out of tables manually. If you need to archive, you’ll need to do it frequently, perhaps continuously. That means you need to write automated, incremental jobs that nibble small chunks of unwanted data away without impacting the OLTP work you’re trying to optimize.

It’s a different matter if you’re archiving or purging from an OLAP system such as a data warehouse, of course.

What data should you archive?

You can’t archive until you know what you’re going to archive. You need to prioritize your efforts, and then for each type of data, which typically means each table, you need to know whether each row is archivable.

Prioritize

The first thing to do is determine priorities. You might know some data can be archived, but come back to that later. Focus on what needs to go away to make your transactional processes run more quickly. What gets queried the most? Consider tables “heavily queried” if they have extremes of any type of query — many small queries can cause just as much work as frequent long-running ones. Which tables have the most rows, the most data, the largest indexes? Which tables are growing the most quickly? Think also about access patterns; if you have tables that are frequently scanned partially or wholly for aggregates, those too are candidates.

Identify the worst offenders and think about which of them you need to do first. If you know some frequently-queried tables are growing very fast, I would consider prioritizing them, even if they’re not yet too large. I’ve seen tables reach a point where they become too large to query, and thus difficult to archive.

Another thing you might consider is how easy it will be to archive a row, especially if you are running close to capacity. If you can archive the second-most-important table easily, and it will give the server significant performance headroom, you might want to do that first. You can archive the more work-intensive most-important table later, when your server has some capacity to spare.

Is it archivable?

Now that you’ve identified which tables to focus on, you need to determine which rows are candidates for archiving. Rather than query the table, I would focus on identifying rules, which ideally you should be able to express as a WHERE clause in a query. The simpler the better. Here are some examples.

  • Is the row related to something that’s no longer current? For example, is it a product you don’t sell anymore, or related to a client with whom you no longer work?
  • Is it scratch data that never got used? For example, an abandoned shopping cart, an order that was invalidated and cancelled, or data that was prepared for upload to a business partner but was rejected by their systems?
  • Is it data that used to record a fact, but has now been “cancelled” and just records “zero?” For example, one system I’ve worked on records advertising traffic. For various reasons it can end up with rows that say “on this day, this advertisement was clicked zero times and caused zero sales.” This happened because of artifacts in the rollup processes. If a row records the same thing as the absence of a row, it can go (actually it should probably be purged, not archived).
  • Is it an orphan (its parent record is gone) or a widow[1] (it has no child records)?
  • Does it fall into some time window that makes it no longer likely to be accessed?

This last case is probably one of the most complex, but it’s important because data can often be archived by this criteria when it doesn’t meet any other. One example is summary tables to support decisions on OLTP systems (as opposed to a decision support system running from a data warehouse). If you do many calculations on the most recent data — say, the most recent quarter — you can probably archive previous quarters to slower storage.

Of course, you might not be able to get a definite answer on whether a row can be archived. Often the true answer is too expensive to get, or is in the outside world, perhaps even in the future. Sometimes the answer is “I think it can go, but if the client asks, it has to be there” or “I think it can go, as long as we don’t get some improbable event.” In these cases you can build your systems to cope with cache misses, as it were, and go ahead and archive the data based on probability. This is why I made the comparison with caching in the first article.

An example of this comes from the advertising system I mentioned. It is designed so an advertisement that doesn’t generate any traffic for some time can be archived, but retrieved if it gets traffic in the future. I’ll write more about un-archiving in the next article.

Finally, you might not need to specify criteria for each type of data by itself. OLTP systems often have parent-child relationships among tables, so in addition to orphan or widow checks, you can decide transitively. If a row is archivable when its parent is older than a certain age, then the row’s children can be archived by looking at their “grandparent” or “uncle.” I will call this a “transitive check” from now on.

How to handle data dependencies

Relationships among data, and ensuring consistency with respect to the relationships, usually make archiving itself complex and difficult. Just as you did with archived storage requirements, you need to gather requirements for the instantaneous state of the system during archiving. There are several strategies for meeting different requirements, depending on your data’s relationships.

Types of relationships, consistency guarantees, and efficiency

I already mentioned parent/child relationships. In general, whenever you have some data in one table that’s used to look up data in another table, you have one of several possibilities:

  1. There’s an actual database-enforced foreign key between the tables
  2. Some business logic, whether in queries, stored procedures, constraints or just application code, expects certain data in one table when it sees data in another
  3. Your systems treat it as a pleasant surprise when they find corresponding data in the two tables

My guess is most of you have a fair amount of data that’s described by cases one and two, and much less that falls into case three. Wouldn’t you know it, case three is the easiest to handle!

There are several levels of consistency you might choose to follow, and you can mix and match them depending on the data. These flow from the three kinds of relationships:

  • No orphans. Foreign keys in most systems enforce this rule. If you attempt to move a parent row away, the child will become an orphan, and most foreign keys will raise an error. Your application code might also forbid orphans. In this case you’ll have to archive the children before the parent.
  • Orphans are okay. In this case you can archive the parents first, and the children next. In some systems you can disable database-enforced foreign keys, if they’re the only reason for a no-orphans-allowed situation.
  • No matter. In this case you can archive however you please.

You need to balance your consistency requirements against the need to archive efficiently. If you require 100% consistent database state all the time, you might end up doing a lot of database transactions. Transactions are built to ensure consistency, but as I mentioned in previous articles, your archiving jobs need to be designed so they are the victims if there’s ever a conflict between them and OLTP processes. Huge transactions to ensure consistency may not be practical while archiving!

Strategies for archiving

Once you’ve decided what level of inconsistency you can tolerate, you can choose archive strategies. Just as with everything else, archive strategies are a trade-off. In this case the compromise is usually between efficiency and consistency.

Here are three basic types of archiving strategies. There may be others, but these are the ones I’ve used and/or considered using most of the time:

  1. Archive parents and children with recursion.
  2. Leave orphans, then clean them up later.
  3. Archive at the leaves of the dependency tree and work toward the root.

The first strategy is a classic computer-science depth-first-traversal problem. You typically start with a central table — the root of the tree or the root of a subtree — and for each row you recursively archive its children, only archiving the parent after all the children are gone. This can be difficult, especially since your OLTP schema might not be a tree; it can be a directed acyclic graph, or it can even be a graph with cycles. Satisfying all the dependencies without introducing infinite recursion can be a hard problem. You might also find that not all the dependencies are themselves archivable. You might need to gather a list of all data that is archivable before archiving any of it, which actually means two tree traversals. Finally, you might end up with a chunk of data that has to be archived all at once, which I think is impractical in many OLTP systems. The advantage, if you can pull this off, is that you get a consistent state as long as you do it all in one transaction.

By the way, it’s typically hard to archive tables that have inter-row relationships, such as nested-set or other tree representations, without violating consistency requirements.

If you can get away with it, it’s generally easier to archive the core table and leave orphans. There are a couple advantages to this. First, you can archive in single-row (or some other small amount) chunks. This satisfies the requirement to keep it low-impact. Second, it’s simpler and more efficient to determine whether the child rows can be archived. If they are orphans, they can be. The downside is your applications might not behave well, particularly if they are working with rows in the child table whose parent rows are disappearing.

A final strategy is to archive at the leaves of the dependency tree and work “up.” In this situation, you are sort of doing the first strategy — making sure dependency relationships aren’t broken — but you’re not trying to boil the ocean. Depending on how deep the dependency tree is, you can usually decide whether a row is archivable either transitively, or by checking whether it’s a widow. When you’re working at the leaves, you check whether the root is archivable; when you’re on an internal node you check if the node is a widow. There are a couple potential drawbacks, including efficiency and consistency requirements in the archive storage, which I’ll cover next.

It’s never that simple

Alas, there’s still more complexity to conquer. You have to consider how efficient it is to determine whether a row is archivable, and what sort of consistency you need to guarantee between the unarchived and the archived data.

Consistency, again

I’ll start with the consistency problem first. This is a different sort of consistency, which is why I didn’t mention it before. It would have confused things.

It’s not always enough to consider consistency in the OLTP system. You might have to think about the archived storage too, and even worse, the two of them in combination. Consider this scenario: you have a parent row in one table, some of whose children in another table can be archived. Do they need a parent row in their destination, too? In other words, do you need to copy the row from source.parent to archive.parent before you can insert into archive.child? And if so, does it bother you that the row in source.parent cannot be archived yet because some of its children are not archivable?

If you must have a row in archive.parent before inserting into archive.child, you will probably have to resign yourself to having two copies of the parent row — one in the source, one in the archive. Otherwise, you are enforcing an all-or-nothing constraint on the whole “family” of rows, and this will probably reduce the amount of data you can archive. In practice, I think most people will end up having no foreign keys or other consistency guarantees on the archived data, so child rows can be freely inserted into the archive. If your source and archive are on the same server, don’t foreign key the archive’s child table to the source’s parent table, or you won’t be able to archive from the parent! Along with this, you’ll probably need to treat an archived row as “dirty” or “un-authoritative” if it duplicates a row in the source. Upcoming articles will dig into this more.

Efficient archivability checks

The second potential complication of the strategies above is performance while checking whether a row can be archived. Both transitive checks and widow checks can have high cost.

Transitive checks for archivability can cause a heavy query load on certain tables. Here’s a real example, from the systems I described above. An ad is archivable if it has no traffic in the past 18 months. Tables that depend on ads, and which contain an ad’s ID, can be archived by checking the traffic table for any traffic for that ad in the last 18 months. This means a lot of queries against the traffic table, which is large and heavily queried anyway. It would be more efficient to archive the ad, and then archive its dependent tables by checking whether the ad doesn’t exist anymore. This works best if the ad table can be archived to a small fraction of its former size, which makes index lookups faster.

Another problem with transitive checks is if several relationships must be traversed. In this case you might have to look up values in intermediate tables before checking the final table. This adds more overhead.

The transitive checks are often “upward” checks in the dependency tree, and have a single parent row in a single parent table; or sometimes “up and across” to another table. By contrast, widow checks are downward, and might have to check many child tables for child rows. This can be expensive, particularly in tables that are close to the root of the dependency tree. The ads table is a good example. It has literally scores of dependent tables. Querying each of them before archiving a row would be prohibitively expensive. Even worse, a child table might not have an index on its parent column. If your application doesn’t traverse the relationship from parent to child, such an index wouldn’t be needed, and adding one just to support archiving queries might not be wise.

One compromise I have used occasionally, but not analyzed rigorously, is to build a table of parent row IDs that can be archived, and then access that from child archiving jobs, instead of doing transitive checks. This way at least the “traffic” table, or its equivalent, only takes the hit once.

Conclusion

I’ve covered a lot of ground in this article, starting with “deciding archivability” and “handling relationships.” Both require knowing your data, applications, schema, and consistency requirements. It helps to understand how they interact with various archiving strategies, and in turn how those strategies can violate or support your consistency requirements. Efficiency is a big deal — it’s the point of archiving, after all — so you need to include it when determining what you really require in terms of consistency.

The next article will cover some specific strategies for actually moving the data from the source to the archive, how to do that while fulfilling data consistency requirements (again!), and finally how to un-archive. This last point is crucial for scaling by “scaling back,” because it lets you archive data you’re not sure is archivable, yet pay virtually no penalty.


[1] I realize “widow” is not the opposite of “orphan,” but in computer science it is sometimes used to mean “childless,” especially for typesetting algorithms. Joe Celko also uses the term to mean “childless” in SQL For Smarties.

Technorati Tags:, , , , , , , , ,

You might also like:

  1. Archive strategies for OLTP servers, Part 1
  2. Archive strategies for OLTP servers, Part 3
  3. MySQL Archiver can now archive each row to a different table
  4. MySQL Archiver 0.9.1 released
  5. Why does InnoDB create two indexes per foreign key?

Archive strategies for OLTP servers, Part 1

In May 2005, I wrote a widely-referenced article about how to efficiently archive and/or purge data from online transaction processing (OLTP) database servers. That article focused on how to write efficient archiving SQL. In this article I’ll discuss archiving strategy, not tactics. OLTP servers tend to have complex schemas, which makes it important and sometimes difficult to design a good archiving strategy.

The 50,000-foot view

Archiving is actually a very large topic! My goal is to at least mention many of the things to consider, and go into some of them in detail. Here’s what I’ll cover in this and upcoming articles:

  • Goals of archiving
  • Where to store the archived data
  • How to choose which rows are archivable
  • How to deal with complex data relationships and dependencies
  • How to actually archive the data
  • Un-archiving strategy

Archiving: why do it?

Archiving is about scaling. There are several different views on how to scale systems with MySQL in particular, and any given system in general. Opinions vary on the best way for any given usage, but popular strategies are

    Buy bigger, more expensive, faster hardware (vertical scaling)
  • Split data into logical partitions or shards, and distribute these among many machines (horizontal scaling)
  • Use federated views, clusters, or other distributed technologies that pretend there’s one server, but underneath there are actually many (federation)

Each strategy has its place, but each focuses on how to handle the same or more data while providing redundancy or high availability, improved performance, and so on. Sometimes the elephant in the room is the obvious but easy to overlook strategy of don’t scale up or out, scale back. This may not be possible, but when it is, it’s often preferable.

How do you know if you can do this? Just ask whether you need all the data. Most of the time you don’t. Think about how true this is throughout computing as a whole. Caching is everywhere in computing, and it works precisely because you don’t need all the data, or at least you don’t need all of it all the time. Most data has some spatial or temporal hot spots, and often there’s a pretty large chunk of it that’s never accessed at all. If you don’t need all the data all the time, you might be able to move some of it away to bigger, cheaper, slower storage. That’s archiving.

If you can scale back rather than up or out, you may be able to keep your OLTP servers lean and fast without changing anything. This can easily buy you an order of magnitude. It’s generally not easy to do that; if you want to get an order of magnitude with other strategies, you may need ten times as many servers so each has an order of magnitude less data. Scaling up tends to be even more expensive; the price-to-performance ratio climbs very sharply as you get into big-iron machines. You also need to consider the cost of rack space, cooling, power, maintenance and so forth.

When people are a roadblock

I think it’s important to resist building a system that will always provide all the data that ever existed, just in case you need it. If you’re worried you might need the data someday, keep it on DVD, but you don’t need to keep it on your OLTP servers. And don’t compare yourself to Amazon or Google. You probably have limited resources, and though it may be good bragging rights, trying to engineer something massive is likely bad for your business. Sometimes people want to build unreasonable systems because they have something to prove, or they have second-system syndrome. Know your limits as an individual, and if you’re a manager, take a moment to assess whether your engineers are advocating for more than is good for the company.

Gather requirements for the archive

If you’ve identified that you can archive some of your data, I think the next important step is to figure out what you require of the stored data. I’ve worked in a place where the only requirement was legal; we needed to be able to retrieve the data only if we got audited or had to go to court. By contrast, my current employer needs infrequent access to archived data for long-term analysis, but it needs to be queryable live and quickly.

You need to balance convenience, speed, space, expense, durability, legal, and other requirements you identify. Chances are there’s a fairly obvious solution for you:

  • If you never need the data again, you can simply purge it.
  • If you only need it for legal requirements, you can archive it to a file, then burn the files to CDs and put them in a safe deposit box.
  • If you need it for infrequent queries that can be slow, you can archive to a table, then burn the table to CD. When you need to get it back, you can mount the CD and get your server to read the table right from CD. One thing to beware, though: future server versions aren’t guaranteed to read old table formats!
  • If you need it for frequent or fast querying, you might build a data warehouse, or you may be able to simply move the data to “adjacent” tables on the same server.

Any of these ideas could be mixed and matched. Let me give some examples of what I’ve done in real life.

At one company using SQL Server, we archived data to a beefy OLAP (online analytical processing) server via replication. I’ll go through the exact strategy in an upcoming article, but basically we replicated the data to the OLAP server into a spot reserved for in-transit data. From there it went into the long-term storage tables transactionally, and analytics applications read from views that gave a union over the in-transit and the long-term storage.

At my current employer using MySQL, we archive transactional InnoDB tables to adjacent tables on the same server, which use a different storage engine and sometimes fewer indexes. At the same time we archive to files, which we periodically burn to CD and put in safe storage. This would have worked well even if we’d used the InnoDB storage engine for the archive tables, simply because the OLTP tables would have been smaller, but archiving to another storage engine has the added benefits of getting it out of the InnoDB tablespace and giving more compact storage.

Conclusion

In this article I surveyed the preliminaries of archiving: motivations to archive and requirements of the archived data. While some of these decisions will actually depend on things I’ll write about in upcoming articles, it’s good to have several options in mind before you evaluate specific strategies.

In the next article I’ll discuss how to select which data is archivable, and how to deal with the complexities of OLTP schemas during the archiving process.

Technorati Tags:, , , , ,

You might also like:

  1. Archive strategies for OLTP servers, Part 2
  2. Archive strategies for OLTP servers, Part 3
  3. How to write a lazy UNION in MySQL
  4. MySQL Archiver 0.9.1 released
  5. MySQL Archiver 0.9.2 released

How to write efficient archiving and purging jobs in SQL

Sometimes it’s a terrible idea to do set-based operations in SQL. There are real-world constraints, especially in heavily used OLTP or large databases, that require doing things a tiny bit at a time. In this article I’ll show you how to write a job that can purge data from a huge table without impacting critical processes, filling up the transaction log, or causing deadlocks.

I have released a tool that does a fantastic job of archiving and purging MySQL tables, as part of MySQL Toolkit. If you’re using MySQL, you should take a look at it. Not only can it automatically design efficient queries, it handles a lot of tricky edge cases and allows plugins too.

Motivation

Mission-critical database servers can’t be taken offline for maintenance tasks such as purging or archiving historical data, yet high-volume OLTP databases need to be small to stay responsive, which creates the need for purging or archiving. The requirements of OLTP and purging processes are quite different. OLTP processes must have immediate access to the database, with nothing else getting long-standing locks that could cause waits, timeouts or deadlocks. Purging is necessary to keep the server fast, but if something else is in the way, the purge processes should yield — and once a purge process gets access to the data, it should not hold a lock for long.

At my current and previous employers, we’ve used similar tactics to purge and archive old data without detrimentally impacting the database server. In my current job, we have a set of very large, heavily-used tables that have never been purged or archived until recently. At my previous employer, purge and archive jobs were standard procedure, and moved data steadily off the OLTP servers to analysis servers.

First try: failure

When I started work at my current job, nothing was failing, but queries were too slow. One person was already attempting to solve the problem by archiving off all but the most recent data to another database, in MyISAM tables for better performance. The procedure was to dump a table to a file and load it into a MyISAM table in another database, using LOAD DATA INFILE because it’s very fast. But loading 6 gigabytes into a table is slow no matter what. It took several hours — I don’t know exactly how long — to load with the indexes disabled. Then came the disaster: enabling the indexes. It never completed, so several days later we killed the query, which completely corrupted the table. We had no idea how long it would have taken, either. We might have been two seconds or two weeks away from success.

The problem is, that wouldn’t have helped much anyway. We still needed to delete the unwanted data from the original table. That in itself would take a very long time. We decided to try the approach that worked well at my former employer.

In my previous life…

At my former employer, an all-Microsoft shop, expert DBAs had developed a strategy that worked well for archiving and purging data from Microsoft SQL Server in a variety of situations. The approach was to scan the clustered index linearly through the table, archiving one (or sometimes a few) rows at a time. The jobs I wrote, which were patterned after other jobs and guided by the DBAs, were stored procedures that looked something like this:

create procedure archive_table as
declare @id int
-- Find the minimum value that satisfies the archive condition
select @id = min(id) from table where [conditions]
while @id is not null
begin
    begin transaction
    insert into archive_table
        select * from table where id = @id
    delete from table where id = @id
    commit transaction
    -- Find the next greater minimum value
    select min(id) from table where id > @id and [conditions]
end

The important points to notice are:

  1. The initial find-min-value query is a table scan along the clustered index. As soon as it finds one row that satisfies the conditions, it’s done scanning.
  2. The get-next-row query is not a full table scan — or should not be, anyway (always check the query plan!). There are two parts to the WHERE clause. One of them should be useful for seeking into the clustered index. The other should be the archive conditions. This query should be a clustered index seek, then a scan over a certain range of the clustered index.
  3. The get-next-row query is the really important one to optimize.
  4. The actual changes to the database are very small — usually just one row at a time, but sometimes a few, depending on the data.

Note: I’ll use the terms “scan” and “seek” very specifically in this article. Scanning is reading through rows, trying to find one that satisfies a condition. Seeking is using the index to navigate quickly to a desired row.

We found these archive jobs worked well even when the rows being archived were in the middle of the table. This is because of the seek-first-then-scan query plan. The jobs took a while to run, but they had almost zero impact on the system, even if they ran for hours or even days, or if many jobs were running at the same time.

Applying these principles to MySQL

With this background in mind, I wrote a Perl script that emulates these stored procedures from my past employer. The script goes through pretty much the same steps as above, with the additional step of printing the rows to a file as they are moved or deleted. Since I wrote it for work, of course I can’t post it here, but it’s pretty simple, so I’m sure you can write your own if you want. Most of the time I spent on the script was analyzing MySQL’s behavior to see if it is similar to SQL Server’s, and that part I consider very appropriate and valuable to share here.

I wanted to answer several questions. Will MySQL use the clustered index to seek before scanning? Will one-at-a-time delete transactions be cheap and fast, and avoid blocking or other contention among queries?

To answer the first question, I wrote my get-next-row query’s WHERE clause to find only every tenth row in the table, based on a non-indexed column. That meant it would delete a row, skip nine rows, delete a row… pretty soon it would be working in the middle of the table, not at the beginning. If it seeks before scanning, the speed ought to stay fairly constant as it goes through the table. If it starts a scan at at the beginning every time, it ought to get slower as it goes (ignoring a slight speed increase due to the table getting smaller).

After testing a variety of scenarios, I concluded that MySQL will indeed seek, then start scanning. Within the limits of my ability to detect, the speed seemed to stay fairly constant as the query progressed through the table.

Once satisfied that the query is efficient, I ran several at once and observed their impact on the system. While it’s harder to profile MySQL than SQL Server, I was able to observe some impact on the system. Specifically, committing the DELETE statement seems to cause most of the work. When several instances of the program were archiving various tables simultaneously, I noticed I/O saturation on the disk, even though the CPU was running pretty cool. The archiving tasks slowed down noticeably, too; when one ran alone, it might process 1,000 rows in 14 to 20 seconds during periods of little activity, but three running together slowed that same process down to 30 or so seconds. I should mention that the server was under heavy load from other processes during most of my tests.

Good, better, best

While the queries seem to do OK archiving rows from the middle of the table, there’s an even better scenario. If the query can simply pop rows off the beginning of the table, it doesn’t have to do much work at all — the first row it finds is always the one it wants. It will be weeks before the tables are down to a reasonable size, so in the meantime I chose an absolute max value for the clustered index. That means I can rewrite the queries so they don’t include any non-clustered columns in the WHERE clause. Until these tables are pretty small, the get-next-row query looks like this:

select min(id)
from table
where id < ?

This is very efficient indeed. EXPLAIN says “Select tables optimized away,” which means the table is completely optimized out of the query. You can’t get any better than that.

Keep it small

There are several reasons to work a row at a time. One is that it avoids creating lots of locks. Locks are expensive to create, and they impact other queries. Keeping the transaction small also helps keep things low-impact. Changing a bunch of rows at a time requires putting lots of data into the transaction log, which is definitely something to avoid if possible. Ideally, the transaction log can be flushed to the disk in small bits at a time, which really helps keep things moving freely. And finally, if there’s a deadlock for some reason, rolling back the transaction needs to be as cheap as possible.

There’s also the issue of table scans. I’m not sure how MySQL does it, but in SQL Server a table scan will block at the first exclusive lock it comes to. If a table scan (whether it’s an archiving query or another query) can can seek to a point in the table before scanning, it might not block on any locks held elsewhere in the table. Archiving a row at a time means less chance of a) blocking other table scans and b) being blocked by another table scan. Even if the archive does block another query, it’s only going to delete a single row from the table, so in theory it should be getting out of the way very quickly, without causing the other query to time out.

Keep it large enough

(This section added after considerably more experience).

It turns out things will sometimes do much better if transactions aren’t committed after every row. We found that tons of tiny transactions are actually a bottleneck. We’re still archiving a row at a time, but only committing occasionally (the commit interval varies from 50 to 1000 rows).

A scary possibility

I want to discuss an alternate outcome here. Imagine the query is smart enough to seek before scanning, but isn’t smart enough to recognize that the first row it finds to satisfy the WHERE clause actually has to be the minimum value for the clustered index in the rest of the table. In this case, the query would continue scanning to the end of the table, and conclude the obvious: the first row it found in step 2 is the minimum. In general, this is no worse than scanning from the beginning every time. In either case, the query would be scanning, on average, half the table (ignoring table shrinkage due to archiving). That is, it would scan (n * (n/2)) - n rows, which is O(n2) — too slow.

Though this scenario is no worse in the general case, in the specific case for which I’m writing my archiving queries, it would be worst-case. I want my queries to pop the first row off the table; if the hypothetical “stupid” query insisted on scanning through the rest of the table, this would be a disaster. Not only would it be scanning, but it’d be starting the scan at the very first row.

You may think this is silly, and no query would be that badly optimized, but read through my past articles and you’ll see some queries that are optimized even worse than this. My point is, it’s absolutely necessary to test and see. Assuming the query will be efficient is a very bad idea.

How can I tell if the query is scanning the rest of the table, rather than stopping at the first row it finds? Simple: EXPLAIN the query and see if it’s using a table scan. Also, if this were happening, the delete-every-tenth-row query would get faster as it went, because it would have less and less of the table to scan each time, but that’s just a heuristic; I don’t need to rely on that because I can use EXPLAIN to get right to the truth.

Optimizing the table scans

After it occurred to me that MySQL might not optimize the scan along the clustered index, as I mentioned above, I decided to test it. I used EXPLAIN to look at the query plans. Lo and behold, it said it would scan the entire table rather than stopping at the first row! This was very bad news. I wasn’t sure if it would really do that, or if that was just the anticipated query plan, so I tested by looking at the very first row in a large table and writing a WHERE clause it would satisfy. For example, suppose my table had a clustered key called id and an un-indexed column called name. The first row is (1, 'Xaprb'), followed by a few million rows. I wrote the following query:

select min(id) from table
where name = 'Xaprb'

This query’s plan was a table scan, and it took several minutes to finish, indicating it read through the entire table to find the first row. I tried some other variations to encourage it to use the clustered index:

select min(id) from table
where name = 'Xaprb' and id > 0

No dice. This time, the query plan said it would use the clustered index and the WHERE clause, but it estimated it would have to scan exactly half the table, which just means it’s doing the math and saying its best guess is average-case. The performance and query plan are actually identical.

Why isn’t it smart enough to scan the clustered index and stop at the first row that satisfies, which is guaranteed to be the minimum? I think the answer might be that InnoDB is the only storage engine that even has a clustered index, and the optimizer isn’t engine-specific, so it only uses optimizations that would apply to all storage engines.

The good news is, there’s a way to get around these n2 performance penalties! Since I know the first row it finds is the one it wants, even though it doesn’t know that, I can tell it to scan the clustered index and stop after one row. Here’s the query:

select id from table
where name = 'Xaprb' limit 1;

Notice I’m no longer using the MIN() function. This is really important. If I use MIN(), it will scan the rest of the table.

In this case I don’t have to do anything to tell it to scan the clustered index — it is already doing that on its own — but in some queries I might have to use FORCE INDEX or IGNORE INDEX to make it do so. Adding an ORDER BY id might also work.

Know when you’re really done

After some experience with this technique, I discovered that using the LIMIT 1 optimization can have unintuitive results. In some cases, the query will start on a row with value 5 (for example). The “get next greater, limit one” query may then skip rows when it gets the next value. I haven’t understood yet how this happens, but it may, for example, get a row with 8 as its value, skipping over 6 and 7. At some point it returns no results, even though there may actually be more archive-able rows in the table.

This is basically an artifact of not using MIN(), which causes the table scans I desperately need to avoid.

This hasn’t impacted the archiving I’m doing at this point. When I need to be really sure the archiving is clean, I restart the job and let it clean out the rows it missed, continuing until there is no result for the “get-first-row” query. If I don’t do this, there’s a “ragged edge” in the data. For my purposes, the speed trade-off makes this a completely acceptable compromise.

Acknowledgements

I didn’t do this myself… I’m standing on the shoulders of giants, particularly my co-worker John, who taught me about this stuff and went through all this analysis with me. So, even though I say “I” a lot in this article, it’s not really just me.

Technorati Tags:, , , , , , ,

You might also like:

  1. When to use surrogate keys in InnoDB tables
  2. MySQL Archiver 0.9.1 released
  3. How to do efficient forward-only SQL maintenance jobs
  4. How to re-index a large database table
  5. How to exploit MySQL index optimizations