I wrote a chapter for Web Operations (O’Reilly, edited by John Allspaw and Jesse Robbins), and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions. It’s a good book. A group effort like this is bound to have some differences in style and even some overlapping content, but overall it works very well. It includes chapters from some really smart people, some of whom I was not previously familiar with; John and Jesse obviously have good connections. A lot of the contributors are from Flickr.
Here are the highlights in my opinion.
- Theo Schlossnagle, whose book is on my list of essentials, opens things with an overview of what web operations really is, and why it’s hard. Don’t skip this. Theo’s introduction is concise and thoughtful.
- Eric Ries discusses the benefits of continuous deployment, and he is right on the money. Right out of college I spent three years as a developer at a company with very little engineering discipline, then left for a company built by a small, ace team practicing extreme programming, so I’ve lived the contrast Eric describes. He really gets it. I hadn’t heard of Eric before, but now I’ve subscribed to his blog.
- John Allspaw (whose book on capacity planning is also on my list of essentials) and Richard Cook discuss how complex systems fail. This chapter appeared in part as a whitepaper and blog post on John’s blog, and is expanded in this book. I have spent a lot of time examining failures for clients, and as VP of Consulting, also a lot of time examining Percona’s own mistakes. I fully agree with the conclusions in this chapter. A few key points: there is never a single root cause; our desire to find one blinds us and keeps us from learning; true failures are inherently unpredictable and happen only when several things go wrong together; avoiding failure requires experience with failure. This echoes another book I’ve read recently, The Black Swan.
- Brian Moon’s chapter on unexpected traffic spikes. If you get a chance to hear Brian speak, take it. He’s an engaging guy with interesting and relevant stories to tell. Stories are always a better experience than bullet points.
- Jake Loomis’s chapter on postmortems. My own research into prevention of emergencies agrees almost perfectly with his list of things to do on page 225. Read this chapter carefully! Now, knowing how to put this into action is hard — very hard — but at least you’ll have a place to start. The worst compliment I ever got after fixing a system that’d run out of hard drive space (due to utter lack of basic monitoring) was that I’d “saved the day.” Baloney. Postmortems can be a great way to learn your infrastructure’s weaknesses and prevent emergencies in the future. I’m fully confident that this particular client will again deploy new servers without adding them into Nagios, and the results will be predictable.
- Naturally, my chapter about choosing a relational database architecture for web applications (skewed towards MySQL). There is a chapter on NoSQL databases by Eric Florenzano as well, but it is more introductory.
What wasn’t so good? I didn’t get a lot of value out of John’s interview with Heather Champ about community management and web operations; I didn’t think the interview format worked well in a book full of essays, but that might just be me. Also, a couple of places in two or three chapters felt a bit rant-ish, without much clear, actionable advice, and I don’t think readers will get as much out of those.
Overall, though, this is a great book, badly needed, on a topic that is simply not yet recognized for its true importance. As Theo writes, we’re seeing the emergence of web operations as a very large profession; it’s one whose definition is not yet formalized or agreed-upon, but that’ll change. It’s too important not to. Jesse’s introduction repeats this sentiment: the world now relies on the web, and so the world relies also on the engineers who make it run. Web operations is work that matters.
A common problem I see people running into when using a cloud computing service is the trap of under-provisioning. There’s a chain effect that leads to this result:

1. People don’t understand how virtualization works, and therefore
2. they don’t realize how little computing resource they’re really buying, so
3. they assume they’re entitled to more than they really are, and
4. they under-provision.

A few other causes and effects come into play here, too. For example, the choice to use the cloud is sometimes founded on economic assumptions that frequently turn out to be wrong: under-provisioning makes the cloud service look more economically attractive than it really is.
Let’s get back to this idea that people under-provision. How do I know that’s happening? I’ll use anecdotal evidence to illustrate. Here’s a real quote from a recent engagement about database (MySQL) performance problems:
Do you think it’s likely that the underlying hardware is simply worse than average? If you think this will be an ongoing problem, maybe we should try our luck with a new instance/storage cluster?
The fundamental assumption here is that some clusters are overloaded and are giving poor quality of service. We’re trained to think this way because we are familiar with services such as shared hosting, where other users on your particular server really might be abusive and claim resources that should be yours. But this isn’t how virtualization works on the common cloud platforms. On those platforms, other tenants can’t take away the share you’ve paid for: you are guaranteed to get what you deserve. No kidding, this actually works.
If that’s true, then why does performance fluctuate so much? The answer lies in how resources are parceled out. Assume there are 10 units of computing resources on a physical machine, and you’re paying for one of them: you’ve bought 1/10th of the machine’s power. But it just happens that you’re the only virtual instance running on that server, and you fire up an intense job. How much power do you get? You paid for 1 unit, but you get 10, because no one else is using the other 9. This is the way most virtualization platforms work: they give you extra resources if they’re available and not claimed by anyone else’s instance. (What would be the point of wasting that power, really?) This guarantees that you’ll never get less than you deserve, but leaves open the possibility that you’ll get much more. Under-provisioning is the flip side of this over-delivery: the platform routinely gives you more than you bought, and you quietly start planning as if the bonus were guaranteed.
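To make the mechanics concrete, here’s a minimal sketch in Python of a work-conserving, proportional-share allocator. This is not any particular provider’s scheduler (those details are exactly what the providers don’t publish); it just models the general idea: every instance is guaranteed its paid share, and idle capacity is handed to whoever is asking for more.

```python
def allocate(capacity, shares, demands):
    """Work-conserving proportional-share allocation (simplified model).

    Each instance first gets min(its demand, its guaranteed slice of
    capacity); capacity that would otherwise sit idle is then divided
    among the instances that still want more.
    """
    alloc = [min(d, s * capacity) for s, d in zip(shares, demands)]
    spare = capacity - sum(alloc)
    while spare > 1e-9:
        hungry = [i for i in range(len(alloc)) if alloc[i] < demands[i]]
        if not hungry:
            break  # nobody wants the leftovers; they go idle
        piece = spare / len(hungry)
        spare = 0.0
        for i in hungry:
            extra = min(piece, demands[i] - alloc[i])
            alloc[i] += extra
            spare += piece - extra  # hand back what instance i couldn't use
    return alloc

# Ten equal tenants on a physical box with 10 units of capacity.
shares = [0.1] * 10

# Quiet neighbors: you (instance 0) run an intense job and get all 10 units.
print(allocate(10.0, shares, [10.0] + [0.0] * 9))  # [10.0, 0.0, 0.0, ...]

# Busy neighbors: everyone wants the whole machine, so you fall back to
# your guaranteed 1 unit, which is all you ever actually paid for.
print(allocate(10.0, shares, [10.0] * 10))         # [1.0, 1.0, 1.0, ...]
```

The fluctuation you observe in the cloud is just the first case degrading toward the second as your neighbors wake up.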
First-generation hyperthreading gave the same illusion of more resources than really existed, by the way. It made you think there were two processors, when in fact there was one set of execution units with two sets of registers. Hyperthreading is a form of virtualization, too.
What typically happens is that people run their cloud instances on machines whose underlying physical hardware is not fully utilized, and they get used to a level of performance they’re not really paying for. Alas, you can’t really know whether this is happening! But it surely is in many (most?) cases, which is why you occasionally get an instance that seems much slower than you’re accustomed to, and you think it’s “too slow.” Not so. Your other instances are “too fast.”
I have a theory that if you really knew the true capacity you were buying, you’d view the price-to-performance ratio much less favorably. But it’s almost impossible to know that, and it doesn’t help that the cloud service providers are rather vague about how much power a given instance size really buys you. (They aren’t being malicious; it’s just the way virtualization works.) Under-provisioning is almost forced on users because they have no alternative. You could plan for worst-case performance, and you’d be doing the right thing, but how will you ever know you’ve really hit rock bottom and the worst case gets no worse? How can you benchmark and do proper capacity planning if you don’t know what you’re benchmarking? This should give you serious pause. You should be thinking, “Wait, I’m basing my capacity planning and provisioning on luck and the law of large numbers. What if my luck runs out and I get a Black Swan event?” The question is not “what if,” but “when.”
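If you want to plan against the floor rather than against luck, the arithmetic looks roughly like this back-of-the-envelope sketch. Every number here is made up, it pessimistically assumes your benchmark ran on an otherwise idle host, and it assumes throughput scales with your compute share; the guaranteed share is precisely the figure providers rarely tell you.

```python
import math

observed_qps = 1000      # what your benchmark showed (host may have been idle)
guaranteed_share = 0.10  # hypothetical: your paid slice of the physical machine
worst_case_qps = observed_qps * guaranteed_share  # the 100 qps floor

peak_demand_qps = 1500   # hypothetical peak load you must survive

print(math.ceil(peak_demand_qps / observed_qps))    # 2 instances, per the benchmark
print(math.ceil(peak_demand_qps / worst_case_qps))  # 15 instances, per the floor
```

The gap between those two numbers is the under-provisioning, and it’s also why the service looked so cheap.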
I also think that the lack of transparency encourages people to use cloud computing services for the wrong reasons altogether. I could write about this, but I think Theo Schlossnagle said it pretty well already.
Having written about what I think is cool about the upcoming MySQL Conference and the MySQL Camp, I now want to finish up with what I’d like to see at the Percona Performance Conference. Just to recap, this is a conference we created to serve those who want to learn about performance: not “learn about MySQL,” not “learn about database performance,” just learn about performance, period.
I want to see everything. I think this is going to be the single best conference I’ve ever been to. Even the way the conference is organized is exciting. For example, it runs from early morning till late at night, nonstop, and the sessions are (mostly) only 25 minutes long. This means that if a session turns out not to be interesting, you haven’t lost much time, and you won’t have long to wait for the next one.
So here is a small sample of the sessions:
- CouchDB: Behind the Buzz (Jan Lehnardt)
- Performance Instrumentation: Beyond What You Do Now (Cary Millsap)
- Hive: Distributed Data Warehousing with Hadoop (Ashish Thusoo and Prasad Chakka, Facebook)
- High Performance Erlang (Jan Henry Nystrom)
These are not just people who’ve learned about something and want to talk at you. These are the inventors, the originators, the gurus. It is truly a who’s who, and that’s just a few of them. If you aren’t familiar with those names, Google them and see. And after that, why not Google Theo Schlossnagle, Eric Burton, Monty Widenius, Andrew Aksyonoff, and a few others.
I hope to see you there. Bring your business cards and introduce yourself to me!