Dirty pages, fast shutdown, and write combining
Posted in Databases on May 25, 2010
One of the things that makes a traditional transactional database hard to make highly available is a relatively slow shutdown and start-up time. Applications typically delegate most or all writes to the database, which tends to run with a lot of “dirty” data in its (often large) memory. At shutdown time, the dirty memory needs to be written to disk, so the recovery routine doesn’t have to run at startup. And even upon a clean startup, the database probably has to warm up, which can also take a very long time.
Some databases let the operating system handle most of their memory management needs. This has its own challenges, especially if the operating system’s design doesn’t align exactly with the database’s goals.
Other databases take matters into their own hands. InnoDB (the de facto transactional MySQL storage engine) falls into this category; when properly configured to take advantage of modern hardware, it will use basically all of the server’s memory in a huge buffer pool, with files opened in O_DIRECT mode, bypassing the operating system for I/O operations.
The design choices, and the results, are worth thinking about. Assuming you shut down and restart infrequently, the choice to hold a lot of dirty memory has huge performance benefits, but that has to be balanced against the desire for fast shutdown and recovery. In InnoDB, there are a few things you can configure that change the startup and shutdown behaviors, but you should understand their performance effects during normal operation.
First, let’s look at why it’s nice to run with lots of dirty data in memory.
Most databases have a concept called a page, buffer, or block. This is a physical unit of data, which can typically store many logical units (rows). InnoDB defaults to 16KB pages of data. Imagine that your typical row is only 80 bytes long. On the order of 200 such rows can fit into a single 16KB page, before page and row overhead.
Suppose you insert, update, or delete a row. Should InnoDB write the result to disk right away? If it does, it has to write the entire 16KB page, plus any index pages and so forth that changed, which can add up to a lot of pages. That’s a lot of work for a little bitty 80-byte row! Instead, InnoDB leaves the pages dirty in its memory. When you commit the transaction, the write-ahead log ensures that if there’s a crash, the change is still permanent. (The log has very compact entries and is not page-oriented.)
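To put rough numbers on that, here is a back-of-the-envelope sketch using the 16KB page and the 80-byte row from the example. The function names are mine, and the math ignores page headers and per-row overhead, so treat the results as upper bounds rather than anything InnoDB reports.

```python
PAGE_SIZE = 16 * 1024   # bytes; InnoDB's default page size, per the text
ROW_SIZE = 80           # bytes; the example row size from the text


def rows_per_page(page_size=PAGE_SIZE, row_size=ROW_SIZE):
    """Upper bound on rows per page, ignoring page and row overhead."""
    return page_size // row_size


def write_amplification(pages_touched, page_size=PAGE_SIZE, row_size=ROW_SIZE):
    """Bytes written per byte of row data changed, if every touched page
    were flushed to disk immediately after a single-row change."""
    return pages_touched * page_size / row_size


if __name__ == "__main__":
    print(rows_per_page())           # 204
    print(write_amplification(1))    # 204.8 (data page only)
    print(write_amplification(3))    # 614.4 (data page plus two index pages)
```

The count of two index pages is just an assumption for illustration; the real number depends on how many indexes the row change touches.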
Now suppose you make another little change. In many cases, there’s a decent probability that both of the changes touched the same page(s). In fact, if you had the statistics to prove it, you would probably see that the vast majority of your changes focus on a small fraction of the total pages, or even a small fraction of the rows. Most workloads have a very tall head and a very long tail. Tens, hundreds, even thousands of times more changes go to those same pages and rows, as compared to the less-active ones.
Eventually, our favorite “hot page” gets written so a checkpoint can complete. Tons of changes were written in a single write. This is write combining, and it’s a huge efficiency. Huge! Servers can accept many tens of thousands of writes per second, and guarantee ACID properties, because of write combining. If they didn’t combine writes, they’d be asked to do many more I/O operations per second.
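Here is a minimal sketch of what write combining buys you, counting row changes against page flushes for one checkpoint interval. It is a toy accounting exercise with a made-up workload, not how InnoDB tracks dirty pages internally; combine_writes and hot_page_workload are hypothetical names.

```python
from collections import Counter


def combine_writes(page_ids):
    """Summarize one checkpoint interval.

    page_ids lists the page touched by each row change. A page is
    flushed once at the checkpoint, however often it was dirtied.
    Returns (row_changes, page_flushes).
    """
    changes_per_page = Counter(page_ids)
    row_changes = sum(changes_per_page.values())
    page_flushes = len(changes_per_page)
    return row_changes, page_flushes


def hot_page_workload(changes=1000, hot_pages=10):
    """Hypothetical workload: every change lands on one of a few hot pages."""
    return [i % hot_pages for i in range(changes)]


if __name__ == "__main__":
    row_changes, page_flushes = combine_writes(hot_page_workload())
    print(row_changes, page_flushes)     # 1000 10
    print(row_changes / page_flushes)    # 100.0 changes absorbed per page write
```

Without combining, those 1,000 changes would have cost 1,000 page writes instead of 10.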
The downside to this is the number of dirty pages in memory, which have to be written out during shutdown. Shutdown is equivalent to a forced checkpoint. The server has been lazily delaying lots of work, because it knows it’s going to be able to combine writes. Suddenly, all the bills come due at once: time to write tons of data to disk! And the problem here is that the server’s memory can actually be mostly dirty data. By default, InnoDB lets the buffer pool get up to 90% dirty before it starts to get worried and work hard to flush pages.
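To get a feel for the size of that bill, here is a rough estimate of how long the shutdown flush could take. The 32GB buffer pool and the 1,000 page writes per second are numbers I made up for illustration, and the model assumes every dirty page costs one write at a steady rate, which real storage will not honor exactly.

```python
GIB = 1024 ** 3


def shutdown_flush_seconds(buffer_pool_bytes, dirty_fraction,
                           page_size, pages_per_second):
    """Estimate the time to flush all dirty pages at a steady write rate."""
    total_pages = buffer_pool_bytes // page_size
    dirty_pages = total_pages * dirty_fraction
    return dirty_pages / pages_per_second


if __name__ == "__main__":
    seconds = shutdown_flush_seconds(
        buffer_pool_bytes=32 * GIB,   # hypothetical buffer pool size
        dirty_fraction=0.90,          # the 90% limit mentioned in the text
        page_size=16 * 1024,          # InnoDB's default page size
        pages_per_second=1000,        # hypothetical sustained flush rate
    )
    print(round(seconds / 60, 1))     # 31.5 (minutes)
```

Half an hour of flushing before the server can even stop is the kind of number that makes high availability hard.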
If most writes go to the hottest pages, why should there be so many dirty pages? The answer is the long tail. The few writes that don’t go to the tall head go to a very scattered long tail. Again, this is hard to prove, but many of those one-off writes dirty an entire page each, and those pages will not be dirtied by any other writes. So the long tail is full of 16KB pages that had only 80 bytes written to them. This ends up being a lot of pages of data.
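A small sketch makes the imbalance concrete. The workload below is invented: 95% of writes go to 10 hot pages, and the remaining 5% each land on a page of their own. The names long_tail_workload and tail_shares are hypothetical, and real distributions are messier than this.

```python
def long_tail_workload(hot_writes=9500, hot_pages=10, tail_writes=500):
    """Hypothetical workload: a tall head of hot pages plus one-off tail writes."""
    head = [i % hot_pages for i in range(hot_writes)]
    tail = [hot_pages + i for i in range(tail_writes)]   # one write per page
    return head + tail, range(hot_pages)


def tail_shares(page_ids, hot_pages):
    """Return (share of writes in the tail, share of dirty pages in the tail)."""
    hot = set(hot_pages)
    dirty = set(page_ids)
    tail_writes = sum(1 for page in page_ids if page not in hot)
    tail_dirty = len(dirty - hot)
    return tail_writes / len(page_ids), tail_dirty / len(dirty)


if __name__ == "__main__":
    write_share, dirty_share = tail_shares(*long_tail_workload())
    # prints: 5.0% of writes, 98.0% of dirty pages
    print(f"{write_share:.1%} of writes, {dirty_share:.1%} of dirty pages")
```

In this toy case the tail accounts for a sliver of the writes but nearly all of the pages that have to be flushed.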
If you want your database to be able to shut down quickly if needed, what can you do about this? This is a tough question to answer. There are a few different strategies you might take.
InnoDB is a complex system that is trying to balance a lot of different factors for efficiency, while giving nice ACID properties. And it’s actually doing a pretty decent job of it by default. When you say you’d like more or less of such-and-such performance characteristic, then something else gets traded off. This is a really hard problem, and I’m not aware of anyone who has a brilliant solution to it, although I am far from a database research specialist.
Even the question of how much data to write, and how quickly, is a hard one. It’s hard and expensive to answer accurately, because the real answer requires knowledge of things such as the frequency and distribution of page dirtying. Therefore, InnoDB sidesteps the question and lets you configure its “I/O capacity” and “dirty page percent” and maybe a few other things, depending on which version you use. These are just models that approximate the true answers to the real questions. All models are wrong. Some models are useful. InnoDB employs useful models that work a lot of the time.
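To show how two knobs like these interact, here is a deliberately crude toy model. It is not InnoDB's flushing algorithm: the parameters io_capacity and max_dirty_pct are stand-ins for the two settings mentioned above, and the rule (flush up to io_capacity pages in any second where the dirty percentage exceeds the limit) is my own simplification. It also assumes every dirtied page is a new one, so it ignores write combining entirely.

```python
def dirty_pages_after(seconds, dirtied_per_second, io_capacity,
                      max_dirty_pct, pool_pages):
    """Toy model: dirty pages left in the pool after running for `seconds`."""
    dirty = 0
    for _ in range(seconds):
        # The workload dirties some pages, capped at the pool size.
        dirty = min(pool_pages, dirty + dirtied_per_second)
        # Over the limit: flush as many pages as the I/O budget allows.
        if dirty * 100 > max_dirty_pct * pool_pages:
            dirty -= min(io_capacity, dirty)
    return dirty


if __name__ == "__main__":
    # Same hypothetical workload, two different dirty page limits.
    print(dirty_pages_after(1000, 10, 50, 90, 1000))   # 900
    print(dirty_pages_after(1000, 10, 50, 10, 1000))   # 100
```

Even in a model this simple, the lower limit leaves a much smaller backlog to flush at shutdown. What it cannot show is the other side of the trade: pages flushed early lose the chance to absorb more changes before they are written.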