Xaprb

Stay curious!

Is 100% uptime really possible?

This post isn’t about NuoDB, although it was prompted by the phrase “100% uptime” that I’ve seen them use a few times. I want to suggest that people think slightly differently about uptime and availability.

The key to understanding uptime and thinking clearly about it, in my opinion, is to think instead about downtime. Uptime is the absence of downtime. Therefore, focus your attention on reducing downtime through a two-pronged approach. First, increase the mean time between failures (MTBF). Second, reduce the mean time to recovery (MTTR) when downtime happens. The techniques for achieving these goals are quite different; the second tends to be a technical solution, whereas the first usually requires a management solution.
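To make the two prongs concrete, here is a minimal Python sketch. It assumes the standard steady-state formula, availability = MTBF / (MTBF + MTTR), which the post itself does not state; the function names and the sample numbers are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in one year under the same assumption."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical baseline: one failure roughly every 999 hours, 1 hour to recover.
    baseline = availability(999, 1)
    # Prong one: fail half as often. Prong two: recover twice as fast.
    longer_mtbf = availability(1998, 1)
    shorter_mttr = availability(999, 0.5)
    print(f"baseline:     {baseline:.6f}")
    print(f"double MTBF:  {longer_mtbf:.6f}")
    print(f"halve MTTR:   {shorter_mttr:.6f}")
    print(f"baseline downtime: {downtime_hours_per_year(999, 1):.2f} hours/year")
```

Under this formula only the ratio of MTTR to MTBF matters, so doubling MTBF and halving MTTR buy the same availability. The result reaches exactly 1.0 only if MTTR is zero or MTBF is infinite.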

Now, back to uptime. Is 100% uptime even possible? It depends on how you define it. Play funny with the definition, and you can draw a box around a period on your timeline where there was no downtime. During that period, presto! 100% uptime! But that’s not legitimate, and we all know it.

What we mean when we say 100% uptime is that there will never be any downtime in the time period starting now and extending to infinity. You might legitimately tweak that definition a little bit — you could say no unplanned downtime, for example, instead of just no downtime, period.

So here’s the thing. If there is any chance at all of downtime, then in the time period extending from now to infinity, the chance becomes a certainty. If we claim 100% uptime, we’re claiming zero chance of downtime. That isn’t “very small,” but “zero.” That is, we’re claiming that downtime is literally impossible.
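A rough way to see this numerically, under a simplifying assumption that is mine rather than part of the argument above: treat time as a series of independent periods, each with the same small probability p of an outage. The chance of at least one outage in n periods is then 1 - (1 - p)^n. The function name and the sample probability below are hypothetical.

```python
def prob_at_least_one_outage(p_per_period: float, periods: int) -> float:
    """Chance of one or more outages across `periods` independent periods.

    Assumes every period has the same outage probability and that
    periods are independent, which real systems only approximate.
    """
    if not 0.0 <= p_per_period <= 1.0:
        raise ValueError("p_per_period must be between 0 and 1")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    return 1.0 - (1.0 - p_per_period) ** periods


if __name__ == "__main__":
    p = 1e-6  # hypothetical: a one-in-a-million chance of an outage per day
    for days in (1, 1_000, 1_000_000, 10_000_000):
        print(f"{days:>10} days: {prob_at_least_one_outage(p, days):.6f}")
    # With p exactly zero the result stays at zero however long you wait,
    # which is the claim that "100% uptime" really makes.
    print(f"p = 0: {prob_at_least_one_outage(0.0, 10_000_000):.6f}")
```

For any finite number of periods the result stays below 1 when p is below 1; it only tends toward 1 as the number of periods grows without bound. With p equal to zero it never moves from zero.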

In the specific case of, say, NuoDB’s database software, a 100% uptime claim means that it is impossible to take NuoDB offline regardless of what happens. That means that it can survive complete failure of its networking, storage, and compute facilities. That means that NuoDB itself has zero bugs that could ever cause problems or crash it. That means that NuoDB can’t be screwed up by a human who makes a mistake.

Again, this isn’t about NuoDB, which I’m sure is a remarkable piece of software. I just want to examine in clear terms what it really means to achieve such a goal. I would say 100% uptime is impossible to achieve. I don’t think it is possible to completely eliminate every chance of failure in any system, and over a long period of time, a chance becomes a certainty. Downtime is not only always possible, but always inevitable.

Written by Baron Schwartz

July 3rd, 2012 at 2:28 pm

Posted in SQL

13 Responses to 'Is 100% uptime really possible?'

  1. So now we have death, taxes, and downtime.

    Jeff Cornejo

    3 Jul 12 at 7:21 pm

  2. Excellent points, but do not feed the marketing trolls.

    Mark Callaghan

    3 Jul 12 at 10:36 pm

  3. While I agree with your conclusion, I believe you are overstretching the arguments:
    “That means that NuoDB itself has zero bugs that could ever cause problems or crash it. That means that NuoDB can’t be screwed up by a human who makes a mistake.”

    Yes, and also that means the service would withstand a world-wide disaster like a meteorite crashing into earth obliterating humanity.

    Now I’m stretching it even further, of course, but the point is that we obviously do not hold up meteorites against someone who says “100%”, nor “99.999”, 4 nines or 3 nines. It’s just not part of the equation.

    Likewise, a human that makes a mistake like “DROP DATABASE” does not fall into the equation (hey, the service is still up, right? Not the service’s fault).

    If someone were to turn off the power in the US, that’s not the service’s fault either.

    So in this respect I think your criticism was somewhat unfair.

    It does boil down to the very thing you suggested: the definition is vague.

    Shlomi Noach

    4 Jul 12 at 1:24 am

  4. The definition is actually pretty simple and clear: Availability = (MTBF+MTTR)/MTBF

    I don’t know anything about NuoDB’s recovery time, but the argument is about the mean time between failures anyway … and to get to 100% (and not just close to it with a lot of nines) you need to stretch out MTBF to infinity (unless you can get MTTR down to zero ;).

    And when approaching infinity, running into “that one last bug that is always still in there”, being struck by a meteor, or all CPUs’ register banks being hit by a cosmic radiation particle at the same time is indeed something that needs to be taken into account. You usually simply don’t bother doing this, as other sorts of failure are much more likely, so this level of detail doesn’t really change the result …

    hartmut

    4 Jul 12 at 4:51 am

  5. I am from NuoDB.

    Our official position is that in the case of planet earth being obliterated by some cataclysm you would need to have at least one NuoDB Storage Manager running somewhere else for your database to still be live :-). You also might want to be elsewhere yourself.

    Of course there are assumptions in what NuoDB is saying. We need machines to run on, we need electricity for those machines, we need sufficient space to store the data, and so on. And of course you could think of situations in which a software bug or combination of bugs can bring down dozens of machines simultaneously. If that is the point of the blog post then we agree. But the point we’re making about uptime is a little different.

    Expectations for current database systems are a million miles from the above issues, both in terms of failure-downtime and administrative-downtime.

    Traditional databases model reliability as a chain, in which any weak link will cause failure of the chain. The band-aid solution is to have a second or third chain with “fail-over”. A better way is to model reliability as a flock of birds, in which to stop the flock you have to stop every bird. This is how NuoDB works and it consequently has an architectural solution to failure-downtime that is of a different order.

    As relates to administrative downtime, have you ever tried upgrading a traditional database to a new version? It can take many days. Have you ever tried upgrading your hardware or operating system on a traditional database server? That’s right, you have to stop the database system to do so. Have you ever tried changing a schema in a traditional database? You have to take the system down, and you may have to export, transform and import your data before going live again. Even for backup, people frequently implement database replication hacks to get a high integrity backup without stopping the database system from running. There is administrative downtime that is not measured in milliseconds or minutes per year, but days per year. For cloud and webscale that is not OK.

    As an aside, to the point about user error, we actually are resilient to that. NuoDB is a version-based, append-only database system (with garbage collection) in which deleting all customer data simply causes the system to note that the data should not be visible to transactions running after the moment of deletion. “Time travel” is not a feature that is scheduled for V1 but it is how the system works, and you could expect that we will expose it to users. In this case recovering your customer data after deletion would be a case of going back 5 minutes and restoring it from there.

    Arbitrary redundancy, rolling upgrades, dynamic schemas and a list of other features enable NuoDB’s “100% Uptime”, bounded by quite tolerant requirements for earth to exist, computers to exist etc.

    NuoDB is a peer to peer system. So here’s a pertinent question relating to peer to peer systems:

    How much downtime has BitTorrent had in the last 5 years?

    Barry Morris

    4 Jul 12 at 8:50 am

  6. > How much downtime has BitTorrent had in the last 5 years?

    when it comes to accessing the data i’m actually looking for: a lot …

    hartmut

    4 Jul 12 at 12:19 pm

  7. Being unable to get what you want is no fun. It sounds like the BitTorrent service was up at the time but delivering disappointing service. You really want both, of course.

    The best of all worlds might be to have the uptime characteristics of a peer to peer system with the service and data guarantees of a full featured database system. That’s the way to think about NuoDB.

    Barry Morris

    4 Jul 12 at 1:49 pm

  8. Peer-to-peer doesn’t mean a thing. Skype is peer-to-peer, too. You can build bad systems on peer-to-peer technology, and you can build good systems on other kinds of technologies.

    What I’m most interested in is understanding how NuoDB actually works. So far all the explanations I’ve seen stop short of explaining what it does. They go as far as “emergent behavior like a flock of birds” and leave us to guess what actually happens when, for example, a transaction is committed somewhere at the same time that someone else on another node has updated a row included in the transaction, while yet another node has a few different transactions reading the same row, and elsewhere there is a really long-running idle transaction with an uncommitted modification.

    Richard Agnori

    5 Jul 12 at 9:46 am

  9. The availability formula was of course supposed to read

    Availability = MTBF/(MTBF+MTTR)

    and not the other way round.

    Thanks to Axel @MPAB for spotting and pointing it out.

    hartmut

    7 Jul 12 at 6:14 am

  10. One can safely promise a 100% SLA and simply refund for the time of outage, like a lot of companies out there (so yeah, a 1% refund for 99% uptime). Make sure you always read the small print…

    Maciej Wiercinski

    17 Jul 12 at 6:56 pm

  11. “If there is any chance at all of downtime, then in the time period extending from now to infinity, the chance becomes a certainty.”

    Not to be super-nerdy-math-guy here but the probability infinitely approaches certainty, but doesn’t actually reach it. If the probability isn’t 100% to begin with, then it’s impossible to reach 100% regardless of timeline. Mathematically, it is wholly possible to flip a coin an infinite number of times and always come up with heads. Of course, that’s just me being nerdy and your practical point is valid (99.99999…to infinity percent is practically a certainty).

    mr.stobbe

    31 Jul 12 at 6:24 pm

  12. mr. stobbe,

    Not quite. You mean to say that as time approaches infinity, probability approaches certainty. You are slightly misquoting me: I said that in the time extending from now to infinity, chance becomes certainty. (I should have said it IS certainty, not BECOMES certainty.)

    I am treating now-to-infinity as something that exists, not something that unfolds. The time from now to infinity is infinity, not “approaching” infinity. Make sense?

    Xaprb

    31 Jul 12 at 7:42 pm

  13. By the way, some diligent readers of this blog might have noticed several hours of downtime a couple weeks ago while I messed around with DNS. Thank heavens this blog isn’t a critical resource, or I would have learned what I was doing before doing it…

    Xaprb

    31 Jul 12 at 7:43 pm
