Archive for the ‘Capacity Planning’ tag
Of all the books I’ve reviewed, this one has taken me the longest to study before writing about it. That’s because there is a lot of math involved, and Neil Gunther knows a lot more about it than I do. Here’s the short version: I’m learning how to use this in the real world, but that’s going to take many months, probably years. I’ve already spent about 10 months studying this book, and have read it all the way through twice — parts of it five times or more. Needless to say, if I didn’t think this was a book with value, I wouldn’t be doing that. But you’ll only get out of this book what you put in. If you want to learn a wholly new way to understand software and hardware scalability, and how to do capacity planning as a result, then buy the book and set aside some study time. But don’t think you’re going to breeze through this book and end up with a simple N-step method to take capacity forecasts to your boss. If you want that, buy John Allspaw’s book instead. (If you’re reading this blog post, you probably need that book anyway.)
I don’t want to spend a lot of time talking about Neil’s method, because honestly the book isn’t about the method first and foremost, and I think many readers will have a hard time digging the capacity planning method out of the math-ness. This book is, in a sense, a textbook or workbook for his training courses. It begins with a lot of general topics, such as how managers think about capacity, risk, what’s needed in the world of businesses that are driven by Wall Street, ITIL, and so on. Then there’s the mathematical background for the rest of the book, things like significant digits and expressing error.
The part of the book that I’m still studying begins in Chapter 4, which introduces ways to quantify scalability. The math begins with Amdahl’s Law, which you may have heard of. It turns out that not only can this be used to understand how much overall speedup is possible by speeding up part of a process, which is how I’m used to using it, but it can be used to model what is possible with parallelization. (I think I actually learned this in my university classes, but I’d forgotten its uses in parallel computing since then.) Anyway, it’s a straightforward model that makes intuitive sense and is easy to accept. I believe in it because it’s so logical and simple, and because I’ve worked with it for a long time. That’s the last bit of math in this book that I can understand so solidly, because after that, we get into a lot of things that have to do with interaction between concurrently performed work, and nothing is ever intuitive about that domain.
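To make that concrete, here is a minimal sketch of Amdahl’s Law in Python (my own illustration, not code from the book): even a small serial fraction makes speedup flatten out no matter how many processors you add.

```python
def amdahl_speedup(processors, serial_fraction):
    """Amdahl's Law: overall speedup on `processors` CPUs when
    `serial_fraction` of the work cannot be parallelized.
    Algebraically the same as 1 / (f + (1 - f) / p), where f is
    the serial fraction and p the processor count."""
    return processors / (1 + serial_fraction * (processors - 1))

# With just 5% serial work, 128 processors yield only about 17x
# speedup, and the limit as processors grow is 1 / 0.05 = 20x.
for p in (1, 2, 8, 32, 128):
    print(p, round(amdahl_speedup(p, 0.05), 2))
```

The numbers make the intuition obvious: the serial fraction, not the hardware budget, sets the ceiling.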
Now, when you’re talking about scalability, you generally are working with scalability of concurrent systems, and queueing theory is Topic Number One. Queueing behavior is modeled correctly, under certain very restricted conditions, by the Erlang C formula. This is a complex bit of math, and although I believe in it, I don’t understand it well enough to know how it’s derived or proven correct. Well, there’s no Erlang C math in this book. Neil Gunther goes in a completely different direction. Instead of modeling the impact of queueing through the math that describes the model, he creates a new model. Let’s leave the model for later, and just look at what’s nice about not using Erlang C math to model computer system scalability:
- The Erlang C formula requires complex calculations.
- It is valid only in restricted conditions, and it’s a lot of work to prove that your workload conforms.
- It models queueing delay, but it doesn’t model coherency delay.
- It requires inputs such as service time, which are difficult or impossible to measure accurately.
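For reference, the formula those bullets refer to is compact enough to sketch. This is the textbook Erlang C probability that an arriving request has to queue in an M/M/m system (standard queueing-theory material, not code from Gunther’s book):

```python
from math import factorial

def erlang_c(servers, offered_load):
    """Erlang C: probability that an arriving request must wait,
    for an M/M/m queue. `offered_load` is lambda/mu (in erlangs)
    and must be less than `servers` for the queue to be stable."""
    a, m = offered_load, servers
    numer = (a ** m / factorial(m)) * (m / (m - a))
    denom = sum(a ** k / factorial(k) for k in range(m)) + numer
    return numer / denom

# Sanity check: with one server this collapses to the M/M/1 result,
# where the probability of waiting is simply the utilization.
print(erlang_c(1, 0.5))  # 0.5
```

Even this small version illustrates the bullet points: the math is fiddly, it only holds under Markovian assumptions, and it models queueing delay but says nothing about coherency delay.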
George Box famously said that all models are wrong, but some are useful. Neil Gunther heads in the direction of a more useful model. First, he proves that two parameters are necessary and sufficient to create a realistic model. Next, he introduces another parameter into Amdahl’s Law to account for coherency delay. The resulting (still simple) equation models serial delay (the reverse of parallel speedup) and coherency delay. Now we have a model for how a system scales under a given workload as you increasingly parallelize the hardware. This is the universal scalability model. From the mathematical point of view, it’s the crowning achievement of this book. I’m very much summarizing, by the way. There’s a lot to think about in developing such a model, so the reader gets quite a tour de force here. Along the way Neil shows how you can arrive at the same surprising result through an entirely different route, without even using Amdahl’s Law as a starting point.
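For the curious, the published form of the model is simple to state: relative capacity C(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1)), where sigma is the contention (serialization) parameter and kappa the coherency-delay parameter. A sketch in Python, with invented parameter values:

```python
def usl_capacity(n, sigma, kappa):
    """Gunther's universal scalability model: relative capacity at
    concurrency n. sigma models contention (serialization), kappa
    models coherency delay. With kappa > 0 the curve eventually
    turns back down: retrograde scaling."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With invented parameters sigma=0.02 and kappa=0.0005, capacity
# peaks at a finite concurrency and declines beyond it.
peak_n = max(range(1, 200), key=lambda n: usl_capacity(n, 0.02, 0.0005))
print(peak_n)  # 44, close to Gunther's sqrt((1 - sigma) / kappa)
```

The kappa term is what Amdahl’s Law lacks: it is the part that lets the model bend back down and reproduce retrograde scaling.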
There are other models. Neil discusses these. They all have problems. Some don’t model what we know can happen in the real world — retrograde scaling — where performance can decrease when you add more power to a system. Others are physically impossible, predicting negative speedup. Negative speedup means the model predicts throughput below zero. As in, you ask it to do work, and it, uh, takes back work it’s already done? Impossible. So it certainly looks like Neil’s model is the strongest contender. By the way, Craig Shallahamer’s book on forecasting Oracle performance uses the universal scalability model, although without the mathematical rigor.
Now, the problem is how to apply this in the real world. To model a system’s performance, you have to know the values of those two magical parameters. How on earth can you find them? This seems just as hard as the Erlang C math. But Neil shows the second most remarkable thing: if you rearrange the universal scalability model a bit, you get a polynomial of degree two. This is exciting, because if you take some measurements of your system’s observed performance at different points on the scalability curve (holding the work per processor constant and adding more processors), and then transform those measurements in an equivalent manner, you can fit a regression curve through those points. Now you can reverse the transformations, plug in the coefficients of the quadratic equation that resulted from the curve-fitting, and out come the parameters you need for the universal scalability equation. Final result: you can extrapolate beyond your observations and predict the system’s scalability.
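One way to sketch that procedure: dividing N by C(N) gives N/C(N) = 1 + sigma*(N - 1) + kappa*N*(N - 1), and substituting x = N - 1 and y = N/C(N) - 1 turns it into the quadratic y = kappa*x^2 + (sigma + kappa)*x, which passes through the origin. Fit that to measurements and read the parameters back off the coefficients. Here is a sketch with numpy, using fabricated data points rather than real measurements:

```python
import numpy as np

def fit_usl(n_values, capacities):
    """Estimate (sigma, kappa) from measured relative capacities by
    fitting y = a*x**2 + b*x, where x = n - 1 and y = n / C(n) - 1.
    Then kappa = a and sigma = b - a."""
    n = np.asarray(n_values, dtype=float)
    c = np.asarray(capacities, dtype=float)
    x, y = n - 1, n / c - 1
    design = np.column_stack([x ** 2, x])  # quadratic with no intercept
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    return b - a, a

# Fabricated "measurements" generated from sigma=0.05, kappa=0.002;
# the regression recovers the parameters we baked in.
ns = [1, 2, 4, 8, 16, 32]
cs = [n / (1 + 0.05 * (n - 1) + 0.002 * n * (n - 1)) for n in ns]
sigma, kappa = fit_usl(ns, cs)
print(round(sigma, 4), round(kappa, 4))  # 0.05 0.002
```

With real data the fit won’t be exact, of course; the recovered parameters come with all the usual caveats of regression on noisy measurements.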
We’re not done. All of this was about hardware scalability: “how much faster will this system run if I add more CPUs?” Software scalability is next. Neil goes back to the basics, starting with how Amdahl’s Law applies to software speedup, and essentially covers all the same ground we’ve already covered, but this time modeling what happens when you hold the hardware constant, and increase the concurrency of the workload the software is serving. It turns out that exactly the same scalability model holds for software as it did for hardware. This is why he calls it the universal scalability model. But not only that, it works for multi-tier architectures of arbitrary complexity.
And this is why I say I am not competent to really prove or disprove the validity of the whole thing. It makes sense to me that even a multi-tier architecture can conform to a model with two parameters. As we know in the real world, there is usually a single worst bottleneck, a weakest link. And therefore no matter how complex the architecture, the dominant factors limiting scalability are still coherency and/or queueing at the bottleneck, and how much you can parallelize (Amdahl’s Law). Thus, the universal scalability model intuitively might be valid for such architectures. But proving it — wow, that’s way beyond me. I know my limits. I’m taking it all on faith, experience, and intuition at this point.
In my mind, the results Neil Gunther derives up to this point in the book would have been plenty. However, there’s lots more left in the book. The rest of the book is about how to use the model for capacity planning, but surprisingly, it’s not about just how to use the universal scalability model. It’s about Guerrilla Capacity Planning in the real world. Right after exploring software scalability, he dives into virtualization for a whole chapter — and then shows you how to measure, model, and predict the scalability of various virtualization technologies. Next chapter: web site capacity planning. After that? “Gargantuan Computing: GRIDs and P2P.” Yep, he analyzes the scalability limits of Gnutella and friends. And then, apparently just because he can, he dissects arguments about network traffic in general (read: “how scalable is the Internet?”). I can’t pretend to understand all this myself. I’m just following along.
I have a feeling that Neil Gunther is kind of like Einstein: his real gift is his ability to create thought experiments that make the model accessible to mortals. Maybe someday he’ll be a legend you learn about in CS101 classes, or maybe someday he’ll be proven wrong like Newton, who knows. In the meantime, I’m going to keep working on applying it all in the real world, especially to MySQL, and see what comes of it. The fact that I’m still doing that bears out what I said earlier: you aren’t going to just waltz through this book and come away with a clear picture of how to work through a capacity planning method. You’ll have some work to do. If you want an elegant and simple capacity planning method, then you should buy John Allspaw’s The Art Of Capacity Planning instead.
Forecasting Oracle Performance. By Craig Shallahamer, Apress 2007. Page count: about 250 pages. (Here’s a link to the publisher’s site). Short version: buy it and read it, but make sure you don’t rely on it alone; deepen your knowledge through other sources.
I bought and read this book because I’m interested in performance, performance forecasting, and capacity planning. I’m not interested in forecasting Oracle performance per se. However, I have noticed that there is a lot of good literature in the Oracle arena that can apply to other databases (*cough* MySQL), and even systems of any type. Oracle and its practitioners are at least a decade ahead of MySQL in terms of treating performance scientifically.
This book is a compendium of performance forecasting techniques. It begins with an introduction to performance forecasting with simple models, and gradually gets into the more advanced techniques such as queuing theory, which match the real world better. It ends with chapters on ratio modeling, linear regression modeling, and scalability.
The book is fairly straightforward and easy to read. Chapter summaries are well written, and the structure is clear and well thought through. It has frequent case studies to show the topics through examples. I appreciated this; I think it makes things pretty clear, although it is a bit wordy sometimes. Some of my colleagues did not like the case studies at all. There really are a lot of case studies, so maybe he just went too far for some people’s taste. Some of them seemed a bit magical, too: “given that the sky is blue and grass is green, then e-to-the-i-pi plus one equals zero, and we’ll see why that’s so later.”
Chapter 1 discusses several different types of models, including mathematical, benchmark, and simulation models. It introduces the challenges in forecasting performance. Chapter 2 begins with definitions of transactions, arrival rate, and other notions that are essential to understanding performance. It begins to discuss the familiar response time curve and queueing at this point. It shows the difference between CPU and I/O subsystems in terms of their queueing models. Later in the chapter, it introduces what it calls essential mathematics for performance forecasting. These are a handful of formulas that the author uses to model performance under changing circumstances. I have an issue with these formulas. All of the definitions and math that we have seen so far in the book make it seem as though we are talking about the formal queueing math that many of us are perhaps used to. However, the formulas shown here under the essential mathematics heading are not Erlang C formulas. They are approximations that are not accurate at all. The author does not disclose this, and a lazy reader such as myself might assume that he is simply skipping some of the more advanced aspects of queueing theory and presenting the functions simplified down to their most important forms. Indeed, this is what I thought at first. The functions looked a little bit funny, but I did not check the math; I thought he was skipping details (hence the word “essential”?), and I was confused. Readers need to beware that this chapter plays fast and loose with the response time mathematics. The formulas are not “of the essence” at all.
In Chapter 3, the author introduces modeling gotchas, several forecasting models, and how to choose among them. At this point the book also begins to talk about more correct response time mathematics, such as the Erlang C formulas. There is a lot of discussion of the difference between these formulas and the so-called essential formulas presented earlier. I think he should have just stuck with the Erlang C formulas and skipped the “essential” stuff, or at least presented it later as simplifications that are easier to work out by hand for back-of-envelope math, rather than making it seem like The Answer without qualification.
Chapter 4 continues with basic forecasting statistics, including definitions of samples and populations, skew, and other things that will be familiar to you if you’ve taken statistics or probability courses. Chapter 5 follows with an introduction to queueing theory. There is a good overview of Little’s Law and Kendall’s notation. There are lots of graphs in this chapter, showing how the response time curves change under different circumstances. The book also begins to use a spreadsheet, available from the author’s website, to show how response time varies in particular examples. The spreadsheet shows a lot of output that the author never explains mathematically, such as the standard deviation of response time. How does one forecast the standard deviation of response time given the input parameters? I am not sure. I wish the book had told me, so I could form an opinion on whether it is valid and useful. Another thing I think this book glosses over is validating that the workload can be modeled accurately with queueing theory. The distributions of interarrival and service times matter a lot, but they really were not mentioned prominently.
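Little’s Law, at least, is the one piece of the chapter anyone can verify on a napkin: the mean number of requests in the system equals the arrival rate times the mean time each request spends there, L = lambda * W. A tiny illustration with invented numbers:

```python
arrival_rate = 20.0    # requests per second (invented)
residence_time = 0.25  # mean seconds a request spends in the system

# Little's Law: L = lambda * W, the average number of requests
# in flight. Notably, it holds regardless of the distributions
# of arrivals or service times.
in_system = arrival_rate * residence_time
print(in_system)  # 5.0
```

That distribution-independence is exactly what the rest of queueing theory lacks, which is why the validation the book glosses over matters so much.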
I would consider Chapter 6 to be something most people will want to skip. It is a bit of an advertisement for the author’s own consulting method, and I don’t think it is concrete enough for most people to put into action. In fact, Chapter 7, which is about characterizing the workload, is much the same way. After reading it, I was unclear on exactly how to apply it. Maybe I just needed to read it more times. It felt to me like he was kind of insistent about “you must characterize your workload!” and then… we’re all waiting… yes? Oh, here is the chapter summary. Letdown.
Chapter 8 introduces ratio modeling, which is essentially a set of rules of thumb that predict how a system might perform based on intuition and experience with similar systems. I am not sure how useful this is, because the ratios seem overly simplistic. However, I am willing to accept that because systems are so hard to model, ratios might be just as good as formal queueing math.
Chapter 9 is about linear regression modeling. There is a lot of good stuff in here about how to take a list of measurements and fit it to a curve. There are examples of residual analysis, how to get rid of statistical outliers, and how to understand the correlation strength.
Chapter 10, Scalability, begins with a definition of something I think most people get wrong. “A solid definition is that scalability is a function that represents the relationship between workload and throughput.” I agree with this definition, and I’m glad that he stated it so clearly (although it’s not the only useful definition of scalability). The chapter continues by defining effective CPUs, another relevant topic in the world of hyperthreading and virtualization. Then it introduces several scalability models: Amdahl, geometric, quadratic, and super-serial. Just as with the essential forecasting formulas shown earlier, some of these are clearly ridiculous and do not model real systems at all. The quadratic model is a good example. I think readers can see this easily, so he doesn’t necessarily need to spell it out, but I don’t think the amount of space devoted to it was really warranted. I also think that he is too casual about Amdahl’s Law. This last chapter will be familiar to readers of Neil J. Gunther’s work, although I value Gunther’s approach more highly.
I am a bit skeptical about this book. There is too much rabbit-out-of-hat with the math, so much so that I ended up taking almost everything with a grain of salt and thinking “I’ll make a mental note about that, and if I ever encounter a situation where it could be of use, I’ll have to do the work and prove or research proofs myself.” Too many of the foundational bits are swept under the rug, so you get a book that kind of says “This is hard stuff, but just trust me and my magic spreadsheet and you’ll be all right.” Also, in many places where the rubber meets the road, the book stops just short of really showing how to apply the material. It’s kind of hard to explain what I mean, but I get the feeling he withholds a bit to promote his business and himself. In the end he doesn’t really show what to do with the Scalability chapter; it isn’t included in his Patented Method ™ and so it seems like a waste, or a revelation that the stuff you’ve learned so far in the book is going to turn out to be an oversimplification after all (a feeling I got a lot in this book). There were too many “oh hey, so this invalidates the earlier stuff” surprises in the book for me. And some of the things that he kind of insists are SO IMPORTANT are the parts he doesn’t really cover properly or give you a good take-away for, in my view. Validation of precision of results is one of those.
In the end, despite my reservations, I think this book is worth buying because I haven’t yet seen a better book on performance forecasting. I have seen better books on capacity planning (check my list of essential books), but that’s not the same thing. Although not everything is explained fully and there is not enough mathematical rigor to satisfy me, the applications of the techniques are worth learning, provided you do not rely on this book alone.
This is an outstanding book. As far as I know Ewen Fortune was the first Perconian to read it, and it’s been spreading amongst us since then. I got my copy last week, and read it last night when I couldn’t sleep for some reason. It took me about 90 minutes to read.
This book doesn’t teach in generalities — it shows you exactly what to do. Rather than outlining the process of capacity planning (and it is a process!) and then letting you figure out how to apply it, the book shows you the process and then walks you through it several times with real examples.
The book is also intensely practical, focusing on what makes the application and the business successful. It doesn’t get any more straightforward than this: “You don’t want to be caught unprepared when growth takes place… Conversely, the company financial officers will not hold you in high regard either when you’ve purchased a lot of equipment that lay idle, only to see its price drop a few months later.”
There are several discussions of special cases, such as when database and web server reside on the same hardware. These side trips serve two purposes: they help you see how to apply the process of capacity planning in more complex situations, and they cement the importance of the process even in the straightforward cases, so you learn it better.
Let me summarize the process:
- Define your goals, so you can measure whether performance is acceptable.
- Measure and graph everything, especially metrics that show whether you’re meeting the goal.
- Inspect and correlate historical metrics to learn the limiting factor (I/O, CPU, network bandwidth, etc).
- Use real-world load (not lab tests) to discover the ceiling of that factor. Measure this by increasing the load and observing when performance stops meeting the goal.
- Use curve-fitting on historical metrics to derive an equation that describes your growth.
- Project the curve into the future and find out how long it’s going to be until you hit the capacity ceiling, and therefore when performance will become unacceptable.
- Given the knowledge of how long it’ll take you to deploy more capacity, work backwards and see when you need to start the procurement process.
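The curve-fitting and projection steps above can be sketched in a few lines of Python. All the numbers here are fabricated; assume weekly samples of peak CPU utilization and a ceiling of 80% discovered from real-world load:

```python
import numpy as np

# Fabricated history: weekly peak CPU utilization, in percent.
weeks = np.arange(8)
peak_cpu = np.array([41, 44, 46, 50, 53, 55, 59, 62], dtype=float)

# Fit a straight line through the historical trend. A real workload
# might call for a higher-order polynomial or exponential fit instead.
slope, intercept = np.polyfit(weeks, peak_cpu, 1)

# Project forward to the measured ceiling; subtracting procurement
# lead time from this gives the date to start ordering hardware.
ceiling = 80.0
weeks_until_ceiling = (ceiling - intercept) / slope
print(round(weeks_until_ceiling, 1))  # about 13 weeks of headroom
```

The point isn’t the particular fit; it’s that once you have a trend and a ceiling, the procurement deadline falls out of simple arithmetic.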
This isn’t the whole story. For example, some things aren’t about performance, they’re about literal capacity, such as available disk storage. But I’m summarizing. The point is to figure out what resource limits you, and predict when you’re going to run out of it. This is so much simpler than any approach I’ve seen before. Queueing theory impresses me too, but I think this is much more practical, and likely to be more accurate.
The book ends with a chapter on deployment, and a few useful appendixes. I thought the chapter on deployment was a little less useful than the rest of the book, because it’s not specific and actionable enough. However, there’s still a lot to learn from it.
I highly recommend this book. Everyone on the operations team should probably have their own copy.