Archive for the ‘checksums’ tag
How PostgreSQL protects against partial page writes and data corruption
I explored two interesting topics today while learning more about Postgres.
Partial page writes
PostgreSQL’s partial page write protection is configured by the following setting, which defaults to “on”:
full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint… Storing the full page image guarantees that the page can be correctly restored, but at a price in increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)
Trying to reduce the cost of full-page writes by increasing the checkpoint interval highlights a compromise. If you decrease the interval, then you’ll be writing full pages to the WAL quite often. This should in theory lead to surges in the number of bytes written to the WAL, immediately following each checkpoint. As pages are revisited over time for further changes, the number of bytes written should taper off gradually until the next checkpoint. Hopefully someone who knows more can confirm this. Does anyone graph the number of bytes written to their WAL? That would be a nice illustration to see how dramatic this surging is.
Decreasing the checkpoint interval seems a bit scary, and is bound to have its own costs, for all the usual reasons. A massive checkpoint once in a while should be really expensive, and would lead to a bad worst-case time for recovery. Does the new bgwriter implementation in 8.3 fix any of this? In theory it could, but I don’t know enough yet to say. I have heard conflicting opinions on this point. I have a lot more to read about it before I form my own opinion.
Storing full pages might not really be that expensive. It could bloat the WAL, but is the cost (in terms of time) really that high? InnoDB (in MySQL) protects against partial page writes with a double-write strategy: a region in the tablespace is called the doublewrite buffer. Page writes are first sent to the doublewrite buffer, then to their actual location in the data file. I don’t remember where, but I’ve seen benchmarks showing that this doesn’t hurt performance, even though it seems counter-intuitive. Modern disks can do a lot of sequential writes, and the way InnoDB writes its data makes a lot of things sequential. I doubt that putting full pages into the PostgreSQL WAL is forced to cost a lot, unless there is an implementation-specific aspect that makes it expensive.
The TODO has some items on the WAL, which look interesting — “Eliminate need to write full pages to WAL before page modification” and a couple more items. I need to understand PostgreSQL’s recovery process better before I know what these really mean.
Detecting data corruption
I was able to verify that the WAL entries have a checksum. It is a CRC32. This is in xlog.c.
However, as far as I can understand, the answer for detecting data corruption in normal data pages is “Postgres doesn’t do that.” I was told on the IRC channel that normal data pages don’t have checksums. I am not sure how to verify that, but if it’s true it seems like a weakness. I’ve seen hardware-induced corruption on InnoDB data many times, and it could sometimes only be detected by page checksums.
What happens when a page is corrupt? It probably depends on where the corruption is. If a few bytes of the user’s data is changed, then I assume you could just get different data out of the database than you inserted into it. But if non-user data is corrupted then do you get bizarre behavior, or do you get a crash or error? I need to learn more about PostgreSQL’s data file layout to understand this. Imagining (I haven’t verified this) that a page has a pointer to the next page, what happens if that pointer is flipped to refer to some other page, say, a page from a different table? If TABLE1 and TABLE2 have identical structures but different data, could SELECT * FROM TABLE1 suddenly start showing rows from TABLE2 partway through the results? Again I need to learn more about this.
How MySQL replication got out of sync
I created MySQL Table Checksum because I was certain replication slaves were slowly drifting out of sync with their masters, and there was no way to prove it. Once I could prove it, I was able to show that replication gets out of sync for lots of people, lots of times. (If you really want to hear war stories, you should probably talk to one of the MySQL support staff or consulting team members; I’m sure they see this a lot more than I do).
I finally figured out what was causing one of my most persistent and annoying out-of-sync scenarios. It turns out to be nothing earth-shaking; it’s just an easy-to-overlook limitation of statement-based replication. You could call it a bug, but as far as I can see, there’s no way to fix it with statement-based replication. (I’d love to be proven wrong). Read on for the details.
The setup
Here’s the table I saw getting out of sync, usually within hours of being synced:
CREATE TABLE `workpriority` ( `client` smallint(5) unsigned NOT NULL, `workunit` bigint(20) NOT NULL, `priority` float NOT NULL, `processor` int unsigned NOT NULL default '0', PRIMARY KEY (`client`,`workunit`), KEY `priority` (`priority`), KEY `processor` (`processor`,`priority`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1
This table is essentially a queue of work that needs to be done, along with the priority of each item (workunit refers to another table, where the application retrieves the work to do). Applications work against this table in parallel. They run an UPDATE statement to claim some work for themselves, then fetch the rows the statement affected. To avoid any race conditions, the “token” is the result of the CONNECTION_ID() function. Here’s the statement:
update workpriority as p
inner join (
select client, workunit
from workpriority
where processor = 0
order by priority desc
limit 10
) as chosen using(client, workunit)
set p.processor = $cxn_id;
The nested SELECT statement finds 10 unclaimed (processor = 0) rows in order of descending priority. The outer UPDATE statement claims these rows by setting processor to CONNECTION_ID().
Now the application can find out what work it claimed with a simple SELECT with the token in the WHERE clause. Later, after the application processes each row, it issues the following statement to clean out the table:
delete from workpriority where client = ? and workunit = ? and processor = ?;
The problem
The problem seemed to be that some binary log events were not getting replayed on the slave. This table accumulated extra rows on the slaves, as though the DELETE statements weren’t getting to the slaves. To test this, I compared the logs and determined that it’s not a logging issue; the binary log events are getting to the slave and replaying just fine. I can see them in the slave’s relay log and in the slave’s binary log (I have log_slave_updates enabled).
So if that’s not the problem, what is?
The bug
I already showed you the bug. If you didn’t see it, well, neither did I — for a year.
If you still don’t see it, here’s a hint: the slaves get out of sync in totally different ways. In other words, the slaves don’t even match each other after a little while.
The problem is that ORDER BY... LIMIT is non-deterministic. If several rows are tied for priority, the slaves might (and do!) order them differently than the master did. Then the the UPDATE statement claims different rows on the slaves. Some rows that have been claimed on the master are still marked as 0 on the slave. Then they don’t get deleted by the DELETE statement. I was able to confirm this by running a script that does a checksum on this table every few minutes, then as soon as it finds differences dumps the whole table on both the master and the slave. I was able to find some rows that the application hadn’t deleted yet. Sure enough, some of them weren’t claimed on the slave.
The fix
Very simple: resolve the ties. The query now causes a filesort because it can’t use the index to sort, but it’s not that big a table and this query doesn’t run that often. Here’s the fixed query:
update workpriority as p
inner join (
select client, workunit
from workpriority
where processor = 0
order by priority desc, workunit
limit 10
) as chosen using(client, workunit)
set p.processor = $cxn_id;
This limitation of statement-based replication is so basic and simple, and I’ve known about it for a long time, but it’s so innocuously hidden in plain sight that it took me forever to see it.


