Relational Database Index Design and the Optimizers. By Tapio Lahdenmaki and Mike Leach, Wiley 2005. (Here’s a link to the publisher’s site).
I picked this book up on the advice of an Oracle expert, and after one of my colleagues had read it and mentioned it to me. The focus is on how to design indexes that will produce the best performance for various types of queries. It goes into quite a bit of detail on how databases execute specific types of queries, including sort-merge joins and multiple index access, and develops a generic cost model that can be used to produce a quick upper-bound estimate (QUBE) for the execution time of a query. The book focuses on DB2, Oracle, and SQL Server, but applies equally well to MySQL and PostgreSQL.† I learned a lot from this book, and will add it to my list of essential books.
There are too many myths and rules of thumb about index design. This book debunks them pretty thoroughly. It walks the reader through the process of understanding what a database does to execute a query, and how much that costs; and then what a database does to execute a data modification, and how much that costs. Given this knowledge, you can answer questions such as “what is the ideal index for each of these two queries?” and “should the queries have separate indexes, or is it better to find a compromise that will be good for both of them?” and even “how much slower will the compromise be for each query?” In many cases, the results are non-obvious, and often don’t agree with the rules of thumb you might have been taught. Generally, the book concludes, we should use indexes much more than we often do, and we should not hold irrational fears about the cost of maintaining indexes.
After reading this book, you’ll understand what makes an index good or bad for a query (a three-star ranking system), what makes a query possible or impossible to index ideally, the quick upper-bound estimate of execution time, the Basic Question, finding the cheapest adequate index, difficult predicates, index slices, and a host of other valuable concepts. In addition, there’s an entire chapter on a method for finding queries that are not well indexed. Some of the methods in this book are things I already had notes to implement in Maatkit tools, but others are new to me. The method of finding promising culprits is something I learned in this book, and I think it’s very valuable for a tool such as mk-query-digest with the Percona enhancements to the slow query log.
There are a few things I’ll point out so this doesn’t read as an unqualified endorsement. First, the book is not as easy to read as it could be. The editors should have removed 99% of the places where the authors italicized or otherwise emphasized words; there’s a lot of emphasis on relatively unimportant or random words, and barely a sentence is free of italics. Second, the book was written in 2005 and today’s machines have much more memory. (This generally makes the book’s points more valid, not less.) Finally, the cost model is based on spinning disks, and the QUBE method needs different parameters to work correctly on solid-state storage, or indeed on many modern SANs. However, that’s not a big deal: just measure your storage system’s performance, plug in the correct random and sequential access times, and the model is still valid.
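To make the "plug in your own numbers" point concrete, here is a minimal sketch of a QUBE-style calculation. The review above doesn’t spell out the formula, so this assumes the form usually quoted from the book: local response time = TR × 10 ms + TS × 0.01 ms + F × 0.1 ms, where TR is the number of random touches, TS the number of sequential touches, and F the number of fetch calls. The `StorageProfile` class, the function name, and especially the solid-state figures are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageProfile:
    """Per-operation costs in milliseconds (all names here are illustrative)."""
    random_touch_ms: float      # cost of one random touch (TR)
    sequential_touch_ms: float  # cost of one sequential touch (TS)
    fetch_ms: float             # cost of one fetch call (F)


# Assumed 2005-era spinning-disk figures, as commonly quoted from the book.
SPINNING_DISK = StorageProfile(random_touch_ms=10.0,
                               sequential_touch_ms=0.01,
                               fetch_ms=0.1)

# Made-up numbers for illustration only; measure your own storage instead.
HYPOTHETICAL_SSD = StorageProfile(random_touch_ms=0.1,
                                  sequential_touch_ms=0.005,
                                  fetch_ms=0.1)


def qube_ms(random_touches: int, sequential_touches: int, fetches: int,
            profile: StorageProfile = SPINNING_DISK) -> float:
    """Quick upper-bound estimate of local response time, in milliseconds."""
    return (random_touches * profile.random_touch_ms
            + sequential_touches * profile.sequential_touch_ms
            + fetches * profile.fetch_ms)


if __name__ == "__main__":
    # An index slice of 1,000 rows where every row also needs a random
    # table touch: 1 + 1,000 random touches, 1,000 sequential, 1,000 fetches.
    print(qube_ms(1001, 1000, 1000))                    # spinning disk
    print(qube_ms(1001, 1000, 1000, HYPOTHETICAL_SSD))  # hypothetical SSD
```

With the assumed spinning-disk figures, the random touches dominate the estimate, which is why changing that one parameter changes the picture so much on faster storage.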
† Note that although PostgreSQL does not yet support index-only queries, which is a major focus of the book, the various cost models apply equally well. One must simply account for the cost of the table access, and not assume that the index is the only thing that’s touched by the query. In general, you’re going to need to know the internals of your database server to apply this book’s wisdom.
Guerrilla Capacity Planning. By Neil J. Gunther, Springer 2007.
Of all the books I’ve reviewed, this one has taken me the longest to study first. That’s because there is a lot of math involved, and Neil Gunther knows a lot more about it than I do. Here’s the short version: I’m learning how to use this in the real world, but that’s going to take many months, probably years. I’ve already spent about 10 months studying this book, and have read it all the way through twice — parts of it five times or more. Needless to say, if I didn’t think this was a book with value, I wouldn’t be doing that. But you’ll only get out of this book what you put in. If you want to learn a wholly new way to understand software and hardware scalability, and how to do capacity planning as a result, then buy the book and set aside some study time. But don’t think you’re going to breeze through this book and end up with a simple N-step method to take capacity forecasts to your boss. If you want that, buy John Allspaw’s book instead. (If you’re reading this blog post, you need that book.)
I don’t want to spend a lot of time talking about Neil’s method, because honestly the book isn’t about the method first and foremost, and I think many readers will have a hard time digging the capacity planning method out of the math-ness. This book is, in a sense, a textbook or workbook for his training courses. It begins with a lot of general topics, such as how managers think about capacity, risk, what’s needed in the world of businesses that are driven by Wall Street, ITIL, and so on. Then there’s the mathematical background for the rest of the book, things like significant digits and expressing error.
The part of the book that I’m still studying begins in Chapter 4, which introduces ways to quantify scalability. The math begins with Amdahl’s Law, which you may have heard of. It turns out that not only can this be used to understand how much overall speedup is possible by speeding up part of a process, which is how I’m used to using it, but it can be used to model what is possible with parallelization. (I think I actually learned this in my university classes, but I’d forgotten its uses in parallel computing since then.) Anyway, it’s a straightforward model that makes intuitive sense and is easy to accept. I believe in it because it’s so logical and simple, and because I’ve worked with it for a long time. That’s the last bit of math in this book that I can understand so solidly, because after that, we get into a lot of things that have to do with interaction between concurrently performed work, and nothing is ever intuitive about that domain.
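For readers who want to see the parallelization form of Amdahl’s Law in action, here is a small sketch. It uses the standard textbook formula, speedup = p / (1 + σ(p − 1)), where σ is the fraction of the work that must run serially; the function name and the 5% example value are my own choices, not taken from the book.

```python
def amdahl_speedup(processors: int, serial_fraction: float) -> float:
    """Speedup predicted by Amdahl's Law for a given number of processors.

    serial_fraction is the share of the work (0..1) that cannot be
    parallelized. The speedup can never exceed 1 / serial_fraction.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return processors / (1.0 + serial_fraction * (processors - 1))


if __name__ == "__main__":
    # With 5% serial work, doubling the processor count gives less and
    # less benefit, and the speedup never reaches 20.
    for p in (1, 2, 4, 8, 16, 32, 64, 1024):
        print(p, round(amdahl_speedup(p, 0.05), 2))
```

The diminishing returns are the whole point: the serial part of the work sets a hard ceiling no matter how much hardware you add.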
Now, when you’re talking about scalability, you are generally working with the scalability of concurrent systems, and queueing theory is Topic Number One. Queueing is correctly modeled, under certain very restricted conditions, by the Erlang C formula. This is a complex bit of math, and although I believe in it, I don’t understand it well enough to know how it’s derived or proven to be correct. Well, there’s no Erlang C math in this book. Neil Gunther goes in a completely different direction. Instead of modeling the impact of queueing through the math that describes it, he creates a new model. Let’s leave the model for later, and just look at what’s nice about not using Erlang C math to model computer system scalability:
- The Erlang C formula requires complex calculations.
- It is valid only in restricted conditions, and it’s a lot of work to prove that your workload conforms.
- It models queueing delay, but it doesn’t model coherency delay.
- It requires inputs such as service time, which are difficult or impossible to measure accurately.
Someone once said that all models are wrong, but some models are useful. Neil Gunther heads in the direction of a more useful model. First, he proves that two parameters are necessary and sufficient to create a realistic model. Next, he introduces another parameter into Amdahl’s Law to account for coherency delay. The resulting (still simple) equation models serial delay (the reverse of parallel speedup) and coherency delay. Now we have a model for how a system scales under a given workload as you increasingly parallelize the hardware. This is the universal scalability model. From the mathematical point of view, it’s the crowning achievement of this book. I’m very much summarizing, by the way. There’s a lot to think about in developing such a model, so the reader gets quite a tour de force here. Along the way Neil shows how you can arrive at the same surprising result through an entirely different route, without even using Amdahl’s Law as a starting point.
There are other models. Neil discusses these. They all have problems. Some don’t model what we know can happen in the real world — retrograde scaling — where performance can decrease when you add more power to a system. Others are physically impossible, predicting negative speedup. Negative speedup means the system’s performance goes below zero. As in, you ask it to do work, and it, uh, takes back work it’s already done? Impossible. So it certainly looks like Neil’s model is the strongest contender. By the way, Craig Shallahamer’s book on forecasting Oracle performance uses the universal scalability model, although without the mathematical rigor.
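Here is a rough sketch of the two-parameter model described above, in the form it is usually written: C(N) = N / (1 + σ(N − 1) + κN(N − 1)). The symbols σ (serial or contention delay) and κ (coherency delay) follow the conventional notation, which the text above doesn’t spell out, and the example values are invented purely to show the shape of the curve.

```python
import math


def usl_capacity(n: float, sigma: float, kappa: float) -> float:
    """Relative capacity at size n under the universal scalability model.

    sigma models serial (contention) delay; kappa models coherency delay.
    With kappa == 0 this reduces to Amdahl's Law.
    """
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def peak_size(sigma: float, kappa: float) -> float:
    """Size at which capacity peaks; beyond it, scaling is retrograde."""
    if kappa == 0:
        return math.inf  # no coherency delay, so the curve never turns over
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    sigma, kappa = 0.05, 0.001  # invented values, for illustration only
    print("capacity peaks near n =", round(peak_size(sigma, kappa), 1))
    for n in (1, 8, 16, 31, 60, 120):
        print(n, round(usl_capacity(n, sigma, kappa), 2))
```

Two properties mentioned above show up directly. Capacity falls after the peak, which is retrograde scaling. And the result stays positive for any positive n, so the model never predicts the impossible negative speedup.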
Now, the problem is how to apply this in the real world. To model a system’s performance, you have to know the values of those two magical parameters. How on earth can you find them? This seems to be just as hard as Erlang C math. But Neil shows the second most remarkable thing: if you rearrange the universal scalability model a bit, you get a polynomial of degree two. This is exciting because you can take some measurements of your system’s observed performance at different points on the scalability curve (holding the work per processor constant, and adding more processors), transform those measurements in the same way, and fit a regression curve through the resulting points. Then you reverse the transformation, plug in the coefficients of the quadratic that resulted from the curve-fitting, and out come the parameters you need for the universal scalability equation! Final result: you can extrapolate beyond your observations and predict the system’s scalability.
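The fitting procedure can be sketched roughly as follows. This is my own minimal reading of it, not code from the book. It assumes the model has the usual form C(N) = N / (1 + σ(N − 1) + κN(N − 1)), so that with x = N − 1 and y = N / C(N) − 1 the model becomes y = κx² + (σ + κ)x, a quadratic through the origin. The function name is hypothetical, and the least-squares fit is written out by hand so the example needs nothing beyond the standard library.

```python
def fit_usl(measurements):
    """Estimate (sigma, kappa) from (n, throughput) measurements.

    The measurements must include n == 1, which is used to normalize
    throughput into relative capacity, plus at least two other sizes.
    """
    base = dict(measurements)[1]
    xs, ys = [], []
    for n, throughput in measurements:
        if n == 1:
            continue  # maps to the origin, so it adds nothing to the fit
        capacity = throughput / base
        xs.append(n - 1)
        ys.append(n / capacity - 1)

    # Least squares for y = a*x^2 + b*x with no intercept term:
    # solve the 2x2 normal equations directly.
    s4 = sum(x ** 4 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s2 = sum(x ** 2 for x in xs)
    sy2 = sum(x * x * y for x, y in zip(xs, ys))
    sy1 = sum(x * y for x, y in zip(xs, ys))
    det = s4 * s2 - s3 * s3
    if det == 0:
        raise ValueError("need at least two distinct sizes besides n == 1")
    a = (sy2 * s2 - sy1 * s3) / det
    b = (s4 * sy1 - s3 * sy2) / det

    # Reverse the transformation: a = kappa, b = sigma + kappa.
    return b - a, a


if __name__ == "__main__":
    # Synthetic data generated from sigma = 0.05, kappa = 0.001, with
    # 100 units of throughput at n == 1. Real measurements will be noisy.
    data = [(n, 100.0 * n / (1 + 0.05 * (n - 1) + 0.001 * n * (n - 1)))
            for n in (1, 2, 4, 8, 16, 32)]
    sigma, kappa = fit_usl(data)
    print(round(sigma, 4), round(kappa, 4))
```

On real data the fitted values are only as good as the measurements, so it is worth checking that both come out non-negative before extrapolating from them.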
We’re not done. All of this was about hardware scalability: “how much faster will this system run if I add more CPUs?” Software scalability is next. Neil goes back to the basics, starting with how Amdahl’s Law applies to software speedup, and essentially covers all the same ground we’ve already covered, but this time modeling what happens when you hold the hardware constant, and increase the concurrency of the workload the software is serving. It turns out that exactly the same scalability model holds for software as it did for hardware. This is why he calls it the universal scalability model. But not only that, it works for multi-tier architectures of arbitrary complexity.
And this is why I say I am not competent to really prove or disprove the validity of the whole thing. It makes sense to me that even a multi-tier architecture can conform to a model with two parameters. As we know in the real world, there is usually a single worst bottleneck, a weakest link. And therefore no matter how complex the architecture, the dominant factors limiting scalability are still coherency and/or queueing at the bottleneck, and how much you can parallelize (Amdahl’s Law). Thus, the universal scalability model intuitively might be valid for such architectures. But proving it — wow, that’s way beyond me. I know my limits. I’m taking it all on faith, experience, and intuition at this point.
In my mind, the results Neil Gunther derives up to this point in the book would have been plenty. However, there’s lots more left in the book. The rest of the book is about how to use the model for capacity planning, but surprisingly, it’s not about just how to use the universal scalability model. It’s about Guerrilla Capacity Planning in the real world. Right after exploring software scalability, he dives into virtualization for a whole chapter — and then shows you how to measure, model, and predict the scalability of various virtualization technologies. Next chapter: web site capacity planning. After that? “Gargantuan Computing: GRIDs and P2P.” Yep, he analyzes the scalability limits of Gnutella and friends. And then, apparently just because he can, he dissects arguments about network traffic in general (read: “how scalable is the Internet?”). I can’t pretend to understand all this myself. I’m just following along.
I have a feeling that Neil Gunther is kind of like Einstein: his real gift is his ability to create thought experiments that make the model accessible to mortals. Maybe someday he’ll be a legend you learn about in CS101 classes, or maybe someday he’ll be proven wrong like Newton, who knows. In the meantime, I’m going to keep working on applying it all in the real world, especially to MySQL, and see what comes of it. The fact that I’m still doing that bears out what I said earlier: you aren’t going to just waltz through this book and come away with a clear picture of how to work through a capacity planning method. You’ll have some work to do. If you want an elegant and simple capacity planning method, then you should buy John Allspaw’s The Art Of Capacity Planning instead.
Cloud Application Architectures. By George Reese, O’Reilly 2009.
This is a great book on how to build apps in the cloud! I was happy to see how much depth it went into. It’s short — 150 pages plus some appendixes — so I was expecting it to be a superficial overview. But it isn’t. It is thorough. And it is also obviously built on his own experience building very specific applications that he uses to run his business — he isn’t preaching about stuff he doesn’t know first-hand. Finally, George Reese is a good writer! It’s impressive. This is how he covers so much ground with so much depth in so few pages, and it all makes sense. He takes a side trip every now and then, but it’s always in the right place at the right time — how to do a snapshot for backups, for example — and isn’t distracting. For a technical book, it has an amazing narrative flow.
The book begins with an intro to cloud computing in general, with definitions and an explanation of different models, plus cost estimates of traditional IT, managed hosting, and cloud computing for an app. There’s a brief overview of the Amazon platform. This book is mostly about Amazon, and states that up front. There are references and comparisons to other providers throughout, and later there’ll be two appendixes on GoGrid and Rackspace, each written by a representative of that company. I was happy that the author brought in people to write those, instead of doing it himself. They are non-promotional in nature, and quite short. That adds value to the book, which would have been fine without them, honestly.
Back to chapter two now — a deeper introduction to Amazon, moving through all the major components, but especially EC2, S3, and EBS. Here we also start to see a focus on the platform as a whole — availability zones, security, redundancy, reliability. These topics are treated fairly and woven into every chapter. It’s clear that the author doesn’t want to isolate these topics, but rather explain them in context so your mind is always on them as each new topic is introduced. Chapter 3 picks all this up again: considering a move into the cloud? More cost comparisons, more explanations of concepts such as availability and how they translate into the Amazon cloud. Performance, disaster recovery and a few other topics show up here.
Chapter 4 is about how to build an app in the cloud: web app design, making multiple machines work together, handling failure, building AMIs, privacy, and operating databases (especially MySQL) in the cloud. The privacy section is particularly good. I’d recommend this to anyone building an app that might process personally identifiable information or financial information, in or out of the cloud. And as I said already, this is one of the types of things he weaves into the whole book. Chapter 5 picks right up and keeps going: it’s about security. Data security, regulatory compliance, network security, host security, how to respond if there’s a breach. And then Chapter 6 is on disaster recovery: planning, implementing, managing.
Chapter 7 is titled “scaling,” but it’s more than that. It starts with capacity planning. Here’s one of my favorite quotes: “some think they no longer need to engage in capacity planning… [others] think of tens or hundreds of thousands of dollars in consulting fees. Both thoughts are dangerous myths…” There’s a reference to John Allspaw’s excellent book on capacity planning. (I saw that he was a tech reviewer for this book, too.) This chapter covers how you predict and provision for capacity needs in the cloud, including the “automatic scaling” holy grail, how it can bite you, and how to keep that from happening. It also talks about how you scale vertically in the cloud. It doesn’t talk about why it’s hard to really be sure about your capacity needs in the cloud, but that’s okay given the other material covered in the chapter.
And that’s it! After this, there are three appendixes. One is an AWS reference, and then there are the two on GoGrid and Rackspace.
What’s to criticize? Well, not a lot really. I read every word in this book, I promise. Here’s what I noticed: he talked about database corruption from unexpected shutdowns — he should have said “use InnoDB,” because that’s pretty much a MyISAM problem. He talked about taking backups from replication slaves — he should have said “don’t just trust replication, verify it with mk-table-checksum.” I also think he encourages a little too much trust that cloud providers are always magically going to have the capacity you need; it felt a bit naive, but this is actually a fundamental point in whether you’re going to use the cloud or not. Nobody knows how much excess capacity Amazon has, and as we know, weird things happen. But if you’re going to embrace a cloud platform, you’re going to have to trust to a certain extent.
A couple of other things to nitpick: in Chapter 1, when talking about availability, he writes “[if] even 1 minute of downtime in a year is entirely unacceptable, you almost certainly want to opt for a managed services environment… [if] 99.995% is good enough, you can’t beat the cloud.” But these numbers are unrealistic and don’t have enough context to explain what he means. Finally, in a couple of places he talks about algorithms for generating unique identifiers and dealing with concurrent access, but these don’t have a deep enough explanation to prevent novices from shooting themselves in the foot with wrong assumptions, such as assuming that a timestamp will always increase between successive accesses. But a savvy developer will recognize those problems and won’t be bitten.
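For context on the availability figures quoted above, the arithmetic is easy to check. This little sketch converts between an availability percentage and minutes of downtime per year; it assumes a 365-day year, and the function names are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


def availability_percent_for(downtime_minutes: float) -> float:
    """Availability percentage implied by a yearly downtime budget."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # 99.995% availability allows roughly 26 minutes of downtime a year.
    print(round(downtime_minutes_per_year(99.995), 2))
    # One minute of downtime a year is roughly 99.9998% availability.
    print(round(availability_percent_for(1.0), 4))
```

So the two thresholds in the quote are about 26 minutes and 1 minute of downtime per year, which is a very narrow band on which to base a choice between hosting models.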
This book is the first one to go onto my list of essential books in a while. I’ll be keeping this one on my own bookshelf.