The bigger they are, the harder they fall
I see that a lot of people just don’t get it when they start talking about high availability, redundancy, failover, etc. This is probably not going to change, but maybe I can try anyway.
Let’s think about how you can survive a massive AWS failure. You build your application to automatically move services to another part of the infrastructure that’s still up. Great! Now assume that everyone else is smart, too. Their applications move, too. What happens next?
The whole AWS cloud melts to the ground. Have you never seen this happen, where one instance of something fails and others pick up the load and fail in turn? I have. OK, so let’s say that you’re really smart, and you also have the ability to move to an entirely different provider. Now suppose that other people are smart too. Next stop — Rackspace Cloud is down, and so is Joyent, and so on.
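If you haven't watched this happen, here is a toy sketch of it in Python. Everything in it is an assumption made for illustration: the name `simulate_cascade`, the even split of load among survivors, and the numbers. It is not a model of how AWS or any real provider balances load.

```python
def simulate_cascade(capacities, total_load, first_failure):
    """Toy model: the total load is split evenly among surviving nodes.

    Any survivor whose capacity is below its share fails, and the load
    is split again among whoever is left. Returns (failure_order, survivors).
    """
    alive = set(range(len(capacities)))
    alive.discard(first_failure)
    failure_order = [first_failure]

    while alive:
        share = total_load / len(alive)
        overloaded = sorted(i for i in alive if capacities[i] < share)
        if not overloaded:
            break  # the survivors can carry the load; the cascade stops
        failure_order.extend(overloaded)
        alive.difference_update(overloaded)

    return failure_order, sorted(alive)


# Four nodes of capacity 100 each (hypothetical numbers).
# At 70% utilization, losing one node is survivable.
print(simulate_cascade([100, 100, 100, 100], 280, first_failure=0))
# At 80% utilization, the same single failure takes out everything.
print(simulate_cascade([100, 100, 100, 100], 320, first_failure=0))
```

The only difference between the two runs is how much slack was left before the first node died.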
You can’t just pretend that “the cloud” is infinite. It isn’t. Stop trying! In “the cloud,” you still have to do capacity planning, even though it’s hard or impossible, and you still have to think about the possibility that the resources you assume are there aren’t. Let’s think about cloud computing’s older name — utility computing. Can you think of any utilities that have had capacity shortages, brownouts, or even cascading failures? I worked a bunch of case studies on them in my engineering classes, but I also lived through some of them myself.
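The capacity-planning question can be written down as a back-of-the-envelope check. This is a minimal sketch with made-up numbers; `can_absorb_failover` and its parameters are hypothetical names, and in real life you usually can't see the provider's capacity or your neighbors' demand at all, which is exactly the hard part.

```python
def can_absorb_failover(region_capacity, region_usage, incoming_demands):
    """Check whether a surviving region's spare capacity covers everyone
    who fails over to it at the same time.

    Returns (fits, margin), where margin is spare capacity minus the total
    incoming demand. A negative margin means the region is oversubscribed.
    """
    spare = region_capacity - region_usage
    needed = sum(incoming_demands)
    return needed <= spare, spare - needed


# Hypothetical region: 1000 units of capacity, 850 already in use.
# If only you move, there is room to spare.
print(can_absorb_failover(1000, 850, [40]))
# If two other tenants have the same failover plan, there is not.
print(can_absorb_failover(1000, 850, [40, 60, 90]))
```

Your own demand is the only number in that list you control.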
This is why some old-fashioned, stupid, clueless people still own their own hardware. Those dumb clod-jumpers aren’t hip enough to move into the cloud where everything is magical. I bet they have kerosene lanterns for when the lights go out, too.
With economies of scale come failures at scale. You can’t have it both ways.



I suppose over-optimization of resource allocation is also at fault here: AWS almost certainly doesn’t have anything analogous to the reserved capacity of the POTS.
Tim McCormack
25 Apr 11 at 4:03 pm
I have no inside knowledge of AWS, but from the outside I think I can see that there is not much reserve capacity, at least in certain regions. One commenter on http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html spoke of the impact from people moving services to other regions. The lack of reserve capacity is probably part of where the economy of scale comes from.
Someone inside AWS is probably wanting to slap me right now for my stupid speculation about things I’m not qualified to talk about, so please ignore me.
Xaprb
25 Apr 11 at 7:01 pm
From http://blog.mongodb.org/post/4982676520: “If instead the entire east coast region were lost, then you would still have a full copy of data on C. If you decided that you were going to make the west coast your primary data center for the duration, you would just bring up a couple more nodes there, and make a new replica set with the data from C.”
The word “just” is the wishful thinking I’m talking about in this post. It assumes infinite capacity.
Xaprb
27 Apr 11 at 11:11 am
“Just wait . . . if you can grow them big enough they’ll eventually be too big to fail. Then we won’t have to worry about downtime. The best solution for a broken basket is to put more eggs in it.”
- Ben Bernanke, Sys Admin for Amazon
awksedgreep
11 May 11 at 10:32 pm