Xaprb

Stay curious!

Archive for February, 2010

A review of Understanding MySQL Internals by Sasha Pachev

with 7 comments

Understanding MySQL Internals

Understanding MySQL Internals

Understanding MySQL Internals. By Sasha Pachev, O’Reilly 2007. Page count: about 227 pages. (Here’s a link to the publisher’s site).

I should have read this book a long time ago, and it’s my loss that I didn’t. Although the title makes it sound like it should only benefit those who’ll be changing the MySQL server’s own code, that’s not true. To the contrary, at least parts of this book should be required reading for DBAs and developers who use MySQL, after they gain a moderate level of familiarity with how to use the server.

The book does indeed start off with a few chapters on the source code, how to work with it, and the core structures that make up the MySQL server at the source code level. However, even these topics hold value for users such as DBAs. If you don’t know how the server really works, you are lost when you are faced with a problem or asked to understand some behavior. Peter Zaitsev refers to this as “X-Ray Vision,” something a good DBA or consultant needs. I think the first few chapters of this book are a great way to develop that X-Ray Vision into MySQL.

The next couple of chapters are on the client/server API, configuration variables, and thread-based architecture. Although the first is probably not a core competency for DBAs, the others are. I sure wish I’d had the client/server protocol chapter handy when I was working with the protocol, though. It is variously more useful than, and a good supplement to, the internals document on the MySQL wiki.

The following chapters cover the storage engine interface, the server-level lock manager and how it interacts with the storage-engine locking, and the parser and optimizer. These are absolutely core knowledge for DBAs and developers in my opinion. The server/storage-engine division is one of the things that makes MySQL different from other databases, and is mandatory to understand deeply. This applies equally well to the rest of the chapters, which cover the parser, optimizer, various storage engines (as opposed to just the server’s interface to them), transactions, and replication. Mandatory, every one.

What’s missing? I found that the book is kind of funny in one major way. It doesn’t talk much about MySQL 5.0. Instead, it delves into 4.x and 5.1. Most of the new features in 5.0 are not mentioned at all. Stored procedures, the INFORMATION_SCHEMA, triggers, and so on are absent, as are most discussions of changes to the optimizer and so forth. Some 5.0 topics are covered: index merge, for example. But by and large, there’s not a lot of coverage here. The 5.1-specific topics are those such as the new storage engine API and row-based binary logging. Events are not covered, nor are changes to other types of logging. Honestly, I feel this is appropriate in a book this size; the stuff that hasn’t changed since 4.x days is more important to understand.

There’s little discussion of exactly how certain features work, such as the different sorting algorithms. But that’s OK. These are covered pretty well by the MySQL manual, and even by my own book High Performance MySQL 2nd Edition. I think some other major topics might be missing, but I can’t quite think of them now.

The book is really well written. I expected it to be dry but it’s not at all. It’s actually engaging and interesting. I also found a curious thing happening as I read: I became more aware of how much legacy cruft there is inside MySQL, and how much that has contributed to various shortcomings. This made me actually feel sad, and made me yearn for the bright pure clean exciting vision that Drizzle strives towards. But at the same time I kind of felt nostalgic, kind of fell a little more in love with MySQL, for its strengths and for the countless hours of work and the really monumental genius that it embodies, warts and all. It was quite a cognitive dissonance experience, to tell the truth!

For those who have any inclination to reading it, I’d say: do it. It’ll benefit you a lot more than you think. And if possible, do it with a copy of the MySQL source code available and actually take the time to look through it and explore the things Sasha suggests. I read this book on an airplane, far from a computer, and I need to read it again and look at source code as I do so. I am positive I’ll get more than twice as much benefit from this second reading than I did from the first. I say that because I have a shin-deep exposure to the MySQL source code myself, so I knew just enough about it to recognize that I really would get a lot more from going and looking at the code Sasha cross-referenced. It was a bit like speaking Spanish without a dictionary, but having had a few weeks of intensive instruction ten years ago. I remembered some things well, other things just hazily.

Overall, this book is easily a high 4 stars on a scale of 5, and again, anyone seriously using MySQL should have it.

Written by Xaprb

February 19th, 2010 at 7:13 pm

How PostgreSQL protects against partial page writes and data corruption

with 23 comments

I explored two interesting topics today while learning more about Postgres.

Partial page writes

PostgreSQL’s partial page write protection is configured by the following setting, which defaults to “on”:

full_page_writes (boolean)

When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint… Storing the full page image guarantees that the page can be correctly restored, but at a price in increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)

Trying to reduce the cost of full-page writes by increasing the checkpoint interval highlights a compromise. If you decrease the interval, then you’ll be writing full pages to the WAL quite often. This should in theory lead to surges in the number of bytes written to the WAL, immediately following each checkpoint. As pages are revisited over time for further changes, the number of bytes written should taper off gradually until the next checkpoint. Hopefully someone who knows more can confirm this. Does anyone graph the number of bytes written to their WAL? That would be a nice illustration to see how dramatic this surging is.

Decreasing the checkpoint interval seems a bit scary, and is bound to have its own costs, for all the usual reasons. A massive checkpoint once in a while should be really expensive, and would lead to a bad worst-case time for recovery. Does the new bgwriter implementation in 8.3 fix any of this? In theory it could, but I don’t know enough yet to say. I have heard conflicting opinions on this point. I have a lot more to read about it before I form my own opinion.

Storing full pages might not really be that expensive. It could bloat the WAL, but is the cost (in terms of time) really that high? InnoDB (in MySQL) protects against partial page writes with a double-write strategy: a region in the tablespace is called the doublewrite buffer. Page writes are first sent to the doublewrite buffer, then to their actual location in the data file. I don’t remember where, but I’ve seen benchmarks showing that this doesn’t hurt performance, even though it seems counter-intuitive. Modern disks can do a lot of sequential writes, and the way InnoDB writes its data makes a lot of things sequential. I doubt that putting full pages into the PostgreSQL WAL is forced to cost a lot, unless there is an implementation-specific aspect that makes it expensive.

The TODO has some items on the WAL, which look interesting — “Eliminate need to write full pages to WAL before page modification” and a couple more items. I need to understand PostgreSQL’s recovery process better before I know what these really mean.

Detecting data corruption

I was able to verify that the WAL entries have a checksum. It is a CRC32. This is in xlog.c.

However, as far as I can understand, the answer for detecting data corruption in normal data pages is “Postgres doesn’t do that.” I was told on the IRC channel that normal data pages don’t have checksums. I am not sure how to verify that, but if it’s true it seems like a weakness. I’ve seen hardware-induced corruption on InnoDB data many times, and it could sometimes only be detected by page checksums.

What happens when a page is corrupt? It probably depends on where the corruption is. If a few bytes of the user’s data is changed, then I assume you could just get different data out of the database than you inserted into it. But if non-user data is corrupted then do you get bizarre behavior, or do you get a crash or error? I need to learn more about PostgreSQL’s data file layout to understand this. Imagining (I haven’t verified this) that a page has a pointer to the next page, what happens if that pointer is flipped to refer to some other page, say, a page from a different table? If TABLE1 and TABLE2 have identical structures but different data, could SELECT * FROM TABLE1 suddenly start showing rows from TABLE2 partway through the results? Again I need to learn more about this.

Written by Xaprb

February 8th, 2010 at 3:36 pm

How often should you use OPTIMIZE TABLE?

with 12 comments

Many times I’ve heard people advise on “best practices” for a MySQL database. This often includes routine maintenance, such as “you should run OPTIMIZE TABLE on all of your InnoDB tables once a week to defragment them for better performance.”

But this advice is unsubstantiated and could even be detrimental. Here are some of the obvious problems that I can think of:

  • The optimized table compacts the primary key (clustered index) to its default 15/16ths fill factor per page. But other indexes will be built in pseudo-random order and are likely to end up just as fragmented afterwards as before. Which indexes are more important for performance? Maybe the primary key is just a dummy value that’s not even used, and the secondary indexes are the ones that would benefit from compacting.
  • Suppose the primary key is the important one, and SELECT queries will perform more quickly if it’s defragmented. Why does it get fragmented? Because of changes to the table. Now these changes could suddenly slow down dramatically as they are forced to split pages at a much higher rate due to the more compact data layout.

Why do people make a blanket “you should defragment” statement without supporting it with hard facts? It sounds like something you’d hear from a naive Windows user who buys a $99 piece of software to make his PC “boot faster” or “fix his registry” or something. Maybe it ain’t broke and should not be fixed.

I believe we hear advice like this because there isn’t easy-to-get data that can tell us the truth. To make decisions about defragmenting tables responsibly, we need either performance data on that table (hard to get in most cases), or failing that, information about cost and frequency of page splits in general (not available from InnoDB at present). It would help to have these metrics, and I think it might not be very hard to add page-split instrumentation to InnoDB.

Written by Xaprb

February 7th, 2010 at 6:39 pm