How InnoDB performs a checkpoint
InnoDB’s checkpoint algorithm is not well documented. It is too complex to explain in even a long blog post, because to understand checkpoints, you need to understand a lot of other things that InnoDB does. I hope that explaining how InnoDB does checkpoints in high-level terms, with simplifications, will be helpful. A lot of the simplifications are because I do not want to explain the complexities of how the simple rules can be tweaked for optimization purposes, while not violating the ACID guarantees they enforce.
A bit of background: Gray and Reuter’s classic text on transaction processing introduced two types of checkpoints beginning on page 605. There is a sharp checkpoint, and there is a fuzzy checkpoint.
A sharp checkpoint is accomplished by flushing all modified pages for committed transactions to disk, and writing down the log sequence number (LSN) of the most recent committed transaction. Modified pages for uncommitted transactions should not be flushed — that would violate the rule of write-ahead logging. (This is a deliberate and gross over-simplification; I will not draw attention to further simplifications I make.) Upon recovery, log REDO can start from the LSN at which the checkpoint took place. A sharp checkpoint is called “sharp” because everything that is flushed to disk for the checkpoint is consistent as of a single point in time — the checkpoint LSN.
A fuzzy checkpoint is more complex. It flushes pages as time passes, until it has flushed all pages that a sharp checkpoint would have done. It completes by writing down two LSNs: when the checkpoint started and when it ended. But the pages it flushed might not all be consistent with each other as of a single point in time, which is why it’s called “fuzzy.” A page that got flushed early might have been modified since then, and a page that got flushed late might have a newer LSN than the starting LSN. A fuzzy checkpoint can conceptually be converted into a sharp checkpoint by performing REDO from the starting LSN to the ending LSN. Upon recovery, then, REDO can begin from the LSN at which the checkpoint started.
It is often said that InnoDB does fuzzy checkpointing. The truth is, it does both types. When the database shuts down, it performs a sharp checkpoint. During normal operation, it performs fuzzy checkpoints. And InnoDB’s implementation of fuzzy checkpoints is not exactly the same as that described in Gray & Reuter.
Here is where the weeds get deep: I will try to explain some of the subtleties that let InnoDB provide uniform quality of service by performing checkpoints almost constantly, instead of checkpoints being significant events that occur periodically. It can be said that InnoDB actually never does a checkpoint during normal operation. Instead, the state of the database on disk is a constantly advancing fuzzy checkpoint. The advances are performed by regular flushing of dirty pages as a normal part of the database’s operation. The details are far too many and complex to write here, in part because they have changed significantly as new versions have been released, but I will try to sketch the outline.
InnoDB maintains a large buffer pool in memory with many database pages, and doesn’t write modifications to disk immediately. Instead, it keeps dirty pages in memory, hoping that they will be modified many times before they are written to disk. This is called write combining, and is a performance optimization. InnoDB keeps track of the pages in the buffer pool through several lists: the free list notes which pages are available to be used, the LRU list notes which pages have been used least recently, and the flush list contains all of the dirty pages in LSN order, least-recently-modified first.
These lists are all important, but for the simplified explanation here, I will focus on the flush list. InnoDB has limited space in the buffer pool, and if there aren’t any free spots to store a page that InnoDB needs to read from disk, it must flush and free a dirty page to make room. This is slow, so InnoDB tries to avoid the need for this by flushing dirty pages continually, keeping a reserve of clean pages that can be replaced without having to be flushed. It flushes the oldest-modified pages from the flush list on a regular basis, trying to keep from hitting certain high-water marks. It chooses the pages based on their physical locations on disk and their LSN (which is their modification time).
In addition to avoiding the high-water marks, InnoDB must avoid a very important low-water mark as well. The transaction logs (aka REDO logs, WAL logs) in InnoDB are fixed-size, and are written in a circular fashion. But spaces in the logs cannot be overwritten if they contain records of changes to a dirty page that hasn’t been flushed yet. If that happened and the server crashed, all records of those changes would be lost. Therefore, InnoDB has a limited amount of time to write out a given page’s modifications, because the ongoing transaction logging is hungry for space in the logs. The size of the logs imposes the limit. If the log writing activity wraps around in a circle and bumps into its own tail, it will cause a very bad server stall while InnoDB scrambles to free up some room in the logs. This is why InnoDB generally chooses to flush in oldest-modification order: the oldest-modified pages are the furthest behind in the logs, and will be bumped into first. The oldest unflushed dirty page’s LSN is the low-water mark in the transaction logs, and InnoDB tries to raise that low-water mark to keep as much room available in the transaction logs as it can. Making the logs larger reduces the urgency of freeing up log space and and permits various performance optimizations to do the flushing more efficiently.
And now, with that simplified explanation in place, we can understand how InnoDB actually makes a fuzzy checkpoint. When InnoDB flushes dirty pages to disk, it finds the oldest dirty page’s LSN and treats that as the checkpoint low-water mark. It then writes this to the transaction log header. You can see this in the functions log_checkpoint_margin() and log_checkpoint(). Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system. And that is how continual fuzzy checkpointing is implemented without ever “doing a checkpoint” as a separate event. If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.
When InnoDB shuts down, it does some additional work. First, it stops all updates to data; then it flushes all dirty buffers to disk; then it writes the current LSN to the transaction logs. This is the sharp checkpoint. Additionally, it writes the LSN to the first page of each data file in the database, which serves as a signal that they have been checkpointed up to that LSN. This permits further optimizations during recovery and when opening these data files.
There is a lot more to study if you want to learn how it’s really done in detail; there are many fine points to the process. This is one area where the usually excellent manual is a bit lacking. Some of the best resources are as follows:
- Gray and Reuter’s book
- Mark Callaghan’s note on fuzzy checkpoints
- Peter Zaitsev’s post on why the flushing algorithm in older InnoDB used to cause spikes
- Mark Callaghan’s slides from Percona Performance Conference 2009
A related topic that is equally as complex is how InnoDB flushes dirty pages at the right speed to keep up with the database’s workload. Too fast and the server does too much work; too slow and it gets behind and hurries to catch up, causing spikes of furious flushing activity and degraded quality of service. Percona Server has arguably the most advanced and effective algorithms for this, in the XtraDB storage engine (a variant of InnoDB). Percona Server calls it “adaptive checkpointing.” InnoDB followed suit by implementing something similar, but harder to tune correctly. InnoDB calls it “adaptive flushing,” which is a more accurate name. Much (and I do mean much!) has been written about this. I know that Vadim has done hundreds of benchmarks to analyze how flushing and checkpointing works, some of them many hours long to study long-term performance characteristics. I will point you to a couple of pages that I think are the most succinct summaries of the implementation and how it performs:
- Dimitri Kravtchuk’s blog post about adaptive flushing and the innodb_io_capacity variable
- Vadim’s benchmarks of Percona Server and MySQL 5.5.8, showing how to tune so that “adaptive flushing” works well
- Percona Server documentation for adaptive checkpointing
- My own blog post about balancing dirty page flushing and write combining
If these types of topics interest you, you should attend Percona Live in San Francisco in a couple of weeks. Both Peter Zaitsev and Mark Callaghan will be speaking there on topics such as InnoDB internals, along with a variety of other speakers; there is a several-hour class on InnoDB internals.



Out of curiosity, why was this posted as PostgreSQL, not MySQL?
Thom Brown
29 Jan 11 at 1:24 pm
Understanding alternative methods of checkpointing should be of interest to the Postgres community. It would be nifty if someone from the pg world would write up the process for pg. I think of particular interest would be how checkpoint completion target is achieved without constant directio and why pages must be read back into shared buffers after being flushed.
Rob Wultsch
29 Jan 11 at 2:02 pm
I posted it to both the MySQL and PostgreSQL blog aggregators. I am trying to continue engaging both communities to learn from each other. I was thinking about writing it as a dual post about both systems but I decided it’s better if someone who knows more about Postgres’s checkpointing writes that part of it. I know just enough about Postgres’s checkpoint algorithm to make a fool of myself.
Xaprb
29 Jan 11 at 2:24 pm
Thanks for the article! It’s perhaps worth mentioning that there is a very good treatment of checkpointing in “Transactional Information Systems” by Weikum and Vossen. This is the successor in the field to Gray and Reuter, which is now a little dated but still an amazing book. I keep both of them next to each other (and within very easy reach) on my bookshelf.
Robert Hodges
29 Jan 11 at 3:08 pm
Robert, don’t wear open-toed shoes when you take them off the shelf!
Xaprb
29 Jan 11 at 3:35 pm
Baron – engaging both communities to learn from each other is good and very much welcome but please keep in mind that the listing on planet postgresql has a policy(http://planet.postgresql.org/policy.html) attached that blogs aggregated there need to follow.
This article is very interesting but afaiks it does not really contain any postgresql content so maybe it should not have been tagged as such…
Stefan Kaltenbrunner
29 Jan 11 at 4:47 pm
Stefan, I probably should have made it clearer that this is really a direct follow-on to in-person conversations with several PostgreSQL core committers who were interested in studying and discussing InnoDB’s checkpoint algorithms. I do appreciate that it’s a fringe topic for some other PostgreSQL users, so I hope I have not offended or bothered anyone. I saw it as a continuation of the conversation in posts such as http://rhaas.blogspot.com/2010/11/mysql-vs-postgresql-part-1-table.html and hope that someone will respond with “How PostgreSQL performs a checkpoint,” which I will certainly promote to Planet MySQL if it doesn’t get there on its own.
Xaprb
29 Jan 11 at 5:07 pm
@Stefan
“All blogs should be about PostgreSQL or closely related technologies.”
Discussions of InnoDB internals is a closely related technology, arguing otherwise is intellectually dishonest. More importantly, stifling the cross FOSS RDMS is not helpful to either community.
Perhaps you could write up a similar entry and post it to Planet MySQL?
Rob Wultsch
29 Jan 11 at 6:49 pm
I concur with the reference to W&V as background to checkpointing. I first had to deal with it with the Progress “database” in the early 90′s. Yes, that long ago. The methods used, at a high level, were clearly documented. To some extent, checkpoint implementation so far as the engine writing community is concerned, is a solved problem; although not necessarily in the form of shared source.
That discussion of the topic should be viewed as controversial is puzzling; checkpointing is very much as new as pocket watches and spats. If one reviews the bibliographies of G&R and W&V, there are at least dozens, if not hundreds, of citations; many go back to the 1970′s. This isn’t rocket science (for an experienced engine writer, at least), and is well documented.
It may well be that current developers of OS RDBMS simply don’t have direct experience with this aspect of engine writing. I’d wager that a good deal of wheelage is being re-invented in source; much as a competent C coder knows how to write a loop, a database engine writer knows all about checkpointing (or not to). This is not a criticism of those who write OS RDBMS engines today; I certainly have no aptitude to do what they do and would not deign to tell them what to do.
By the mid-90′s, the Progress wizard (Gus Bjorkland) had moved away from “sharp” checkpoints, and Progress was, and is, a minor player in the database arena. Here’s a public doc, from 1997:
http://www.fast4gl.com/downloads/monographs/logging/logging.html
There are, likely, similar documents from that time for other commercial databases; ARIES, if no other. The topic is well worn and shouldn’t be a big deal to engine writers.
Robert Young
29 Jan 11 at 10:26 pm
I referred to Gray & Reuter because that is what the InnoDB source refers to occasionally; there are no mentions of Weikum and Vossen, although I am sure Heikki Tuuri was aware of that text as well as many previously published.
I don’t think that “how to take a checkpoint” is a big deal. How to do it smoothly seems to be. I’ve seen checkpoint spikes and stalls in lots of products, both new and mature.
Xaprb
30 Jan 11 at 6:59 am
As one of the main people working on updating checkpoint code in PostgreSQL, I will vouch for the fact that this discussion is completely on topic for Planet PostgreSQL. The immediate predecessor to this article on that feed was something from me discussing this same topic at a lower level. This fuzzy/sharp checkpoint theory and how this process is handled by other databases is all helpful background for members of the PostgreSQL community not already familiar with these topics already.
I don’t have a full article about the transaction log side of the checkpoint mechanics in PostgreSQL handy, but the slides at http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf cover the main things that are really different in the buffer cache handling. The clock-sweep method used in Postgres for allocating new pages means that the sort of heavy attention paid to the log sequence ordering done by InnoDB doesn’t happen. Write combining happens automatically every time a page’s usage ticks up its usage count, which gives it another free pass for being evicted. The end result is that the most heavily used pages in the buffer cache will only be written once per checkpoint. The main problem with the PostgreSQL design is that those fsync calls are done in a bunch at the end of each checkpoint right now, which can force way too much cache out at that point.
Unfortunately this is still not a nicely solved problem at all, and it was actually much easier to do well on earlier systems when less memory was involved. I laughed pretty hard at the suggestion that current engine writers are ignorant of the history here, given that my own ancient page at http://www.westnet.com/~gsmith/content/progress.html first linked to the original home of Gus Bjorkland’s “Engine Monographs” articles circa 1996. The reason
That I didn’t worry about this problem nearly as much when running Progress in the 90′s had nothing to do with its checkpoint mechanics, it was because my server only had a few megabytes of RAM. And when I wrote stuff out to disk, the seek times were few milliseconds just like today, but sequential rates were not the >100MB/s I can get from a $40 drive now, which has had a subtle impact on the rest of the system.
Modern operating systems increasingly rely on having large write caches active in order to convert random I/O into an elevator sorted form that’s more sequential to achieve good performance, as the gap between seq/random speeds has widened. If you don’t let those work, you can trash overall throughput in the process.
In a database, at some point you have to ensure pages have made their way onto disk via something like an fsync system call though. And there’s no good way to do that without introducing some possibility of a stall / I/O spike if the underlying storage is busy. The potential length of the stall is proportional to how much RAM is in the server (and to a lessor extent the underlying hardware caches), a number which has been growing much faster than hard drive random write speeds for some time now.
Greg Smith
31 Jan 11 at 1:36 am
> When InnoDB shuts down, it does some additional work.
What happens if the server loses power and InnoDB doesn’t get a chance to do this?
I ask because someone claimed their entire database install was unrecoverable after the plug was pulled. I’m skeptical. Regardless of how rudely things were shut down, the database should be able to recover to a consistent state without needing to be rebuilt or restored from a dump!
D. B.
24 Feb 11 at 4:13 pm
InnoDB is fully ACID and does automatic crash recovery of all committed transactions. It generally does not corrupt data unless the hardware has severe problems such as writing to bad media. (There are sometimes obscure bugs, but in general InnoDB’s crash recovery and data integrity checks are all but bulletproof.)
So I’m skeptical too. I think you friend had bad memory, a bad hard drive, or did something very wrong.
Xaprb
24 Feb 11 at 4:22 pm
Clarification on that last point. InnoDB won’t corrupt data even if you do something wrong, it has to be a hardware failure. I’m talking about using non-ECC memory, for example. That’s the kind of thing you can do wrong that will cause corruption.
Xaprb
24 Feb 11 at 4:23 pm