How PostgreSQL protects against partial page writes and data corruption
I explored two interesting topics today while learning more about Postgres.
Partial page writes
PostgreSQL’s partial page write protection is configured by the following setting, which defaults to “on”:
full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint… Storing the full page image guarantees that the page can be correctly restored, but at a price in increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)
Trying to reduce the cost of full-page writes by increasing the checkpoint interval highlights a compromise. If you decrease the interval, then you’ll be writing full pages to the WAL quite often. This should in theory lead to surges in the number of bytes written to the WAL, immediately following each checkpoint. As pages are revisited over time for further changes, the number of bytes written should taper off gradually until the next checkpoint. Hopefully someone who knows more can confirm this. Does anyone graph the number of bytes written to their WAL? That would be a nice illustration to see how dramatic this surging is.
Decreasing the checkpoint interval seems a bit scary, and is bound to have its own costs, for all the usual reasons. A massive checkpoint once in a while should be really expensive, and would lead to a bad worst-case time for recovery. Does the new bgwriter implementation in 8.3 fix any of this? In theory it could, but I don’t know enough yet to say. I have heard conflicting opinions on this point. I have a lot more to read about it before I form my own opinion.
Storing full pages might not really be that expensive. It could bloat the WAL, but is the cost (in terms of time) really that high? InnoDB (in MySQL) protects against partial page writes with a double-write strategy: a region in the tablespace is called the doublewrite buffer. Page writes are first sent to the doublewrite buffer, then to their actual location in the data file. I don’t remember where, but I’ve seen benchmarks showing that this doesn’t hurt performance, even though it seems counter-intuitive. Modern disks can do a lot of sequential writes, and the way InnoDB writes its data makes a lot of things sequential. I doubt that putting full pages into the PostgreSQL WAL is forced to cost a lot, unless there is an implementation-specific aspect that makes it expensive.
The TODO has some items on the WAL, which look interesting — “Eliminate need to write full pages to WAL before page modification” and a couple more items. I need to understand PostgreSQL’s recovery process better before I know what these really mean.
Detecting data corruption
I was able to verify that the WAL entries have a checksum. It is a CRC32. This is in xlog.c.
However, as far as I can understand, the answer for detecting data corruption in normal data pages is “Postgres doesn’t do that.” I was told on the IRC channel that normal data pages don’t have checksums. I am not sure how to verify that, but if it’s true it seems like a weakness. I’ve seen hardware-induced corruption on InnoDB data many times, and it could sometimes only be detected by page checksums.
What happens when a page is corrupt? It probably depends on where the corruption is. If a few bytes of the user’s data is changed, then I assume you could just get different data out of the database than you inserted into it. But if non-user data is corrupted then do you get bizarre behavior, or do you get a crash or error? I need to learn more about PostgreSQL’s data file layout to understand this. Imagining (I haven’t verified this) that a page has a pointer to the next page, what happens if that pointer is flipped to refer to some other page, say, a page from a different table? If TABLE1 and TABLE2 have identical structures but different data, could SELECT * FROM TABLE1 suddenly start showing rows from TABLE2 partway through the results? Again I need to learn more about this.



IIRC I heard in IRC that there has been experimentation with adding checksums but it was found to be prohibitively expensive. I wonder if testing was done in an era when cpu was more of a limiting fact than disk speed.
I have wished that innodb had a setting like zero_damaged_pages when dealing with checksum mismatches.
Does Percona do Pg consulting?
Rob Wultsch
8 Feb 10 at 4:45 pm
Now it means, we should start becoming an expert of postgress now as oracle bought mysql ? :))
I mean, u know, its used to be posts here about mysql . ;)
Istvan
9 Feb 10 at 4:16 am
Rob, InnoDB can skip over checksum mismatches with innodb_force_recovery=1:
http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
I think checksums is a nice idea. If I’m not mistaken there’s situations that both InnoDB and Pg can’t detect corruption on – such as if the Operating System lies about syncs, and writes data out of order.
Morgan Tocker
9 Feb 10 at 12:46 pm
There is a significant difference in how the two systems deal work with a known corrupt force_recovery requires significantly more work and potentially downtime. Yes, after corruption has happened a dump and reload is a best practice, but being able to just zero the page is nice.
zero_damaged_pages (boolean)
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current command. Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page, and continue processing. This behavior will destroy data, namely all the rows on the damaged page. But it allows you to get past the error and retrieve rows from any undamaged pages that might be present in the table. So it is useful for recovering data if corruption has occurred due to hardware or software error. You should generally not set this on until you have given up hope of recovering data from the damaged page(s) of a table. The default setting is off, and it can only be changed by a superuser.
1 (SRV_FORCE_IGNORE_CORRUPT)
Let the server run even if it detects a corrupt page. Try to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables.
Rob Wultsch
9 Feb 10 at 1:11 pm
There was quite a discussion on pgsql-hackers recently about adding checksums to each page. I’ll have to admit I can’t immediately find a link to it, though. Sorry.
eggyknap
9 Feb 10 at 1:50 pm
Rob is on target; corrupted pages lead to errors when accessing the specific table involved (and more specifically, when accessing the page within that table; you can select around that data and even around the corrupted bit if you work at it). Granted the full nature of the problem probably depends to what extent it is corrupted, and which file is corrupted; ie. if its a page in a system table, you’re in for much more pain. if your corruption is in an index, you just reindex.
IIRC the last time page level checksums came up was about 2 years ago, which imo would mean close enough to the era of i/o over cpu as a general tradeoff. There’s still a performance hit, and some people weren’t willing to accept it. Part of that is because these type of corruption problems are extremely rare with Postgres so it isn’t considered a significant problem worth penalizing everyone for. I think it would be nice if we could make it a runtime option, or even initdb option if that’s what it took. In the meantime we run on lots of zfs, which checksums everything anyway, so we get some protection that way.
Robert Treat
9 Feb 10 at 1:57 pm
Actually the most recent (intense) discussions on that topic just happened two months ago – starting at: http://archives.postgresql.org/pgsql-hackers/2009-11/msg01932.php (see also http://archives.postgresql.org/pgsql-hackers/2009-12/msg00057.php and the other messages in the following month)
Stefan Kaltenbrunner
9 Feb 10 at 2:36 pm
One of the advatage of doing full_page_writes is that every WAL write is always blocksize (8K) which is great if you are putting WAL on Flash drives which may have higher than 512B blocksize in their internal implementation (4K as in case with some flash drives). This always 8K works out better in terms of response time of writes than the random WAL write with full_page_writes=off which can misalign the blocks and hence result in high response time for writes (and also for the readback if required which is not a concern here). Of course the trade off is always the “unchanged data” of the page being written again.
Jignesh Shah
9 Feb 10 at 3:09 pm
ZFS is very nice, and also has built-in protection against partial page writes, but I don’t believe that the majority of Postgres installations are running on Solaris.
I sense a general tone of “Postgres is good at not corrupting your data, so this isn’t much of a problem.” I think this might be a too-casual answer to the issue. Postgres is very good at not letting crap into the database, not truncating or silently substituting default values, and so on. But I’m talking about hardware-induced corruption here. Even on high-end drives and RAID controllers, this is a reality. And in my direct experience dealing with people trying to recover from corruption, it’s more of a problem than I think the comment thread here has acknowledged. For a database that says “correct is better than fast,” this should be a priority.
But don’t do it the way InnoDB does it. It’s not as good as it could be. InnoDB puts a checksum at the end of the page (two, actually!) and thus, you’re relying on the page to tell you whether it believes itself to be corrupt. Do it the way ZFS does it: there is external verification of page integrity. This is much stronger protection.
I think a good approach would be optional page checksums that the administrator could disable if desired, whether for performance reasons (misguided?) or because the files are on a reliable storage mechanism (overly trusting?).
About performance, I don’t know if anyone has looked at these yet, but Domas and Mark are two of the people who really know what they’re talking about:
http://mysqlha.blogspot.com/2009/05/innodb-checksum-performance.html
http://mituzas.lt/2008/05/29/on-checksums/
http://www.facebook.com/note.php?note_id=217758745932
I’ll read the -hackers thread next… thanks for the links.
Xaprb
9 Feb 10 at 3:14 pm
Jignesh, if a block is changed after a checkpoint and written to the WAL, and then changed again, as I understand it, it’s not fully written to the WAL the next time. So I’m not sure it’s true that WAL writes are always 8k in size, although I’m probably the least qualified person here to answer that!
Xaprb
9 Feb 10 at 3:19 pm
Oh my, the -hackers thread is hundreds of messages. For those who like MarkMail, here’s a link:
http://markmail.org/thread/upwvgyurnyxkuw6h
Xaprb
9 Feb 10 at 3:25 pm
heh – you asked for it :)
Stefan Kaltenbrunner
9 Feb 10 at 3:28 pm
Whew, 368 messages later I remember why I unsubscribed from -hackers at one point. Well, I had about 10 thoughts and responses in my head, but the responsible thing for me to do is learn the page format and then participate on the mailing list, not run my mouth on a blog post.
Xaprb
9 Feb 10 at 4:22 pm
Changes in 8.3 were very effective, not just the bgwriter changes mentioned. We concentrated on removing response time spikes to make performance smoother and it worked well.
WAL is written sequentially, so the intuitive idea that writing more WAL reduces performance hasn’t yet been measured as having a performance effect. Logically, if writing sequentially works for InnoDB then we can understand how it can work for Postgres also. I get beaten up all the time for theorising about potential problems rather than measuring them and investigating causes, so I empathise.
Robert was explaining how corruptions are detected at present, so your comment that “Postgres doesn’t do that” nor does it care is a misunderstanding. Postgres doesn’t currently detect all possible corruptions and there is interest in improving that, as 368 posts shows. Postgres would have it already though it looks like it will mean changing the page format and there has been much debate about how to do that efficiently. I expect that to happen in next release given the level of interest.
Not sure about the “correct is better than fast” thing. The numbers I’ve seen show Postgres is faster, so correctness doesn’t necessarily slow you down. The Postgres option for async commit provides the speed of MyIsam when you want it or the protection of InnoDB when you want it, as a transaction-level option not just a static table-level option. So Postgres provides more flexibility than MySQL along the recoverability v performance dimension.
simon@2ndQuadrant.com
10 Feb 10 at 10:39 pm
Thank you Simon. I’ve read a few of the things you’ve written about the changes in 8.3, but I need to learn more about Postgres internals to understand what it really means. My current belief is that some of the changes are analogous to what Percona did for InnoDB, to smooth out “furious flushing” and solve other problems.
I think that although I surely misunderstand Postgres, you might also misunderstand InnoDB. You can configure it to do async commits, in a couple different ways. These might be different from Postgres’s, and I need to learn more about that.
Some of Percona’s improvements also brought back group commit, which was a feature in earlier MySQL versions that was broken after MySQL 4.1. Does Postgres have group commits? Another item for my TOLEARN list.
I generally don’t think about MySQL recovery per se. MySQL doesn’t really do recovery, but InnoDB does.
My comment about “Postgres doesn’t do that” should be limited to “Postgres doesn’t checksum normal data pages.” I understand that corrupt page headers can be detected and handled, although just zeroing out the page doesn’t seem ideal for some cases. I think it might be beneficial to have more options than that.
Hopefully others who read this comment thread will also read the mailing list thread, linked to in previous comments. That thread makes it obvious how much people care. Reading it was also interesting from the technical angle.
Xaprb
10 Feb 10 at 11:28 pm
What happens in the period just after a PostgreSQL checkpoint ends is actually quite interesting to graph. The full page writes turn into the most active period for the WAL. But any data block writes are going to be dumping into a completely fresh OS write cache (all old blocks fsync’d away). Since those scattered data blocks are normally much more difficult to write out than the sequential WAL ones, this turns into less aggregate actual disk activity to deal with.
So unexpectedly what happens is that throughput goes *up*, sometimes a whole lot, during the early period just after a checkpoint. When the most full page writes are happening, the OS is caching all the database writes–leaving just the sequential stream of the full-page fattened WAL. It’s a few seconds later, when the OS realizes that it needs to really start writing data out of the write cache, that things slow down. By then, if you’re lucky, full images of enough of the most popular pages that get modified often will have already made it out to the WAL by then to have pushed you past the worst of it.
Let me know if you’re in the area around Baltimore sometime soon like we talked about that one time, be happy to explain in whatever level of detail you want how the checkpoint and background writer fit into all this. I’m just a little familiar with the changes in that area in 8.3.
Greg Smith
12 Feb 10 at 1:02 am
Hey Baron, a couple of points of clarification.
1. When I say “In the meantime we run on lots of zfs, which checksums everything anyway”, by “we” I don’t mean the Postgres community, I mean the more specific we of OmniTI. (Yeah, we run Postgres on Linux/BSD/etc too, but we prefer Solaris, and the zfs checksumming is one reason why).
2. zfs is now available at a BSD near you, in case you are Solaris averse.
3. The idea that this hasn’t been done is not a reflection of how awesomer postgres is, but more a reflection on the mob mentality of open source development. Since the problem is uncommon, we don’t get a large groundswell of support asking for it. OmniTI gets calls about fixing this type of thing, so it certainly happens, it’s just uncommon.
4. WRT group commits, check out the commit_delay and commit_siblings GUCs: http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html#GUC-COMMIT-SIBLINGS
5. The march BWPUG will be closer to DC/VA, near the i66 and i495 intersection. Still a bit of a trek, but if you’re up for it, I think Greg Smith and I will both be there.
Robert Treat
12 Feb 10 at 7:02 pm
Robert,
Yup, yup, yup, thanks for the links I’ll read’em, and finally, when in March is the BWPUG?
Xaprb
12 Feb 10 at 8:37 pm
(Lest I seem lazy — I did look at http://pugs.postgresql.org/bwpug, but it’s quite stale.)
Xaprb
12 Feb 10 at 8:38 pm
BWPUG meetings are typically the 2nd Wednesday of the month, which makes next month’s expected on March 9. The pugs page is quite stale; the list archives at http://archives.postgresql.org/bwpug/ are more useful.
Given most people who attend the BWPUG and Baron will all be at PGEast, it may not be worth his trouble to hit the PUG when we’ll all see each other a few weeks later anyway.
Greg Smith
14 Feb 10 at 3:55 am
Let me summarize the giant pgsql-hackers thread…there’s a whole stack of reasons there’s not CRC checks on the data blocks in PostgreSQL yet, bu the major three are:
1) The database uses these things called hint bits: http://wiki.postgresql.org/wiki/Hint_Bits to help reduce the work involved in checking visibility information on the page. Changes to the hint bit info is not considered important–if they get screwed up, it can only be in the way that required doing more work; can’t have any corruption issues from it. Accordingly, hint bit changes aren’t logged in the WAL, and the database is pretty loose about letting people change them. Tightening that up, so that all hint bit changes go through the same WAL logging as other data on the page, is going to introduce a performance hit that everybody pays, whether or not they have CRCs turned on.
2) Torn pages, where only part of a page is written out, are handled pretty well by the current recovery design. This gets much more complicated with CRCs.
3) Adding this feature requires expanding the header of data pages, which is going to add a new type of task for an in-place upgrade that introduces this feature. This is difficult to put the work into resolving when the value of this feature seems so small. Most don’t care about it, and those who do already have options like ZFS (which avoids torn page issues at a lower level the database can’t really match).
Greg Smith
14 Feb 10 at 4:35 am
I won’t be able to be at the BWPUG meeting this time, but maybe another time.
As far as I can understand, hint bits don’t belong on disk or in the WAL anyway. They say nothing about the data; they only reflect the current state of transactions in a way that’s not meaningful for recovery. But I appreciate why they’re in the pages and not elsewhere. Changing the page format looks like a good reason to delay page CRCs until someone’s ready with a way to help with in-place upgrades (read old version, write new version).
Xaprb
18 Feb 10 at 10:25 pm