Archive for the ‘Scalability’ Category
Most of the time, when people say “scalability” they mean any of dozens of things. Most of the time, when I say it I mean exactly one precisely defined thing. However, I don’t claim that’s the only correct use of “scalability.” There is another, in particular, that I think is very important to understand: the inherent limitations of the system. This second one doesn’t have a single mathematical definition, but it’s vital nonetheless.
I’ll frame the discussion by asking this: how scalable is your database?
Using the two definitions I find most useful, I answer the question this way:
- Scalability in terms of the Universal Scalability Law is the degree to which you can add more workers (or units of hardware) and get equal returns in terms of system throughput.
- Scalability in terms of inherent limitations is how big you can actually make the system.
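To make the first definition concrete, here's a minimal sketch of the USL in Python. The coefficient names sigma (contention) and kappa (coherency delay) are the conventional ones from the model; the sample numbers are mine, invented for illustration:

```python
def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)).

    n     -- concurrency (workers, or units of hardware)
    sigma -- contention coefficient (the serialized fraction of the work)
    kappa -- coherency-delay coefficient (the cost of crosstalk between workers)
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With zero contention and zero coherency delay, returns are exactly equal:
# doubling the workers doubles the capacity.
print(usl_capacity(8, 0.0, 0.0))     # 8.0

# With even small nonzero coefficients, returns diminish.
print(usl_capacity(8, 0.05, 0.001))  # about 5.69
```

When both coefficients are zero, the model reduces to perfectly linear scalability; contention bends the curve over, and coherency delay eventually bends it back down.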
These are very different things. For example, the Universal Scalability Law doesn’t say anything about the amount of data your database stores. But I think we all know that a MySQL server can hold only so much data. True, it’s a lot of data — there are lots of multi-terabyte MySQL servers out there. But if you need to put, say, 20 petabytes of data into MySQL, you just can’t do it.
Similarly, if you need to write 40 million values per second into your MySQL server, you just can’t do it. Nor can you support 10 million concurrent client connections. These things are impossible with MySQL.
I hear some people saying “of course you can! You shard it, dummy!”
Ah. But do you then have a database, or do you have many? You have many, of course. If you build your own sharding layer on top of lots of MySQL instances, one could argue that you then have a single very large database. But it isn’t a “MySQL database” anymore. MySQL has been relegated to a component of this sharded DBMS. As a supporting crew member, MySQL can play a role in a 20PB database, but MySQL per se can’t do it.
When you’re designing a system of any kind, it’s smart to keep in mind that a lot of technologies have practical limits that can’t be exceeded. They may grow with time and Moore’s Law, but they represent a cap you can’t get around without doing something differently.
I’m reading a little bit about Riak, and was curious about performance and scalability. The only benchmark I found that allowed me to assess scalability was this one from Joyent. Of course, they say scalability is linear (everyone says that without knowing what it means) but the results are clearly not a straight line. So how scalable is it, really?
The Universal Scalability Law is such a powerful tool for thinking about scalability. I fit the model to the benchmark’s data points, and a few seconds later I had my answer.
Of course, this is to be taken with a spoonful of salt, because the modeling is based on only four measurements. But it’s an incredibly quick way to get an idea of what we’re looking at, just as a level-set.
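For the curious, here's roughly what that quick modeling exercise looks like. The data points below are invented stand-ins (I'm not reproducing the Joyent numbers here), and a crude grid search stands in for a proper nonlinear regression:

```python
def usl(n, lam, sigma, kappa):
    """USL throughput: lam is single-worker throughput, sigma is
    contention, kappa is coherency delay."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# (concurrency, observed throughput) -- hypothetical measurements,
# NOT the Joyent benchmark results.
samples = [(1, 955), (2, 1878), (4, 3548), (8, 6112)]

def fit_usl(samples):
    """Least-squares fit of sigma and kappa by brute-force grid search."""
    lam = samples[0][1] / samples[0][0]  # estimate lam from the first point
    best = None
    for i in range(101):
        for j in range(101):
            sigma, kappa = i / 1000.0, j / 10000.0
            err = sum((t - usl(n, lam, sigma, kappa)) ** 2
                      for n, t in samples)
            if best is None or err < best[0]:
                best = (err, sigma, kappa)
    return lam, best[1], best[2]

lam, sigma, kappa = fit_usl(samples)
print(lam, sigma, kappa)
```

Once you have the coefficients, the model also tells you where throughput peaks: at a concurrency of roughly sqrt((1 − sigma) / kappa), a standard result of the USL. Four points is a thin basis for a fit, which is exactly why the spoonful of salt is warranted.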
I’ve written a lot about modeling MySQL with the USL, and I like it best of all the scalability models I’ve seen, but it’s not the only way to think about scalability. I was aware that New Relic supports a scalability chart, so I decided to take a peek at that. Here’s a screenshot of the chart, from their blog:
Here’s how it works. It plots response time (or database time, or CPU) as the dependent variable, versus throughput as the independent variable. Samples are charted as points in a scatter plot, color-coded by the time of day, and a line through them indicates the general shape. Outliers are automatically removed.
The focus on response time is really good. That’s one of the things I like about New Relic. While most systems show people status counters and imply that those counters hold some deep insight and meaning (there’s usually no meaning to be found in status counters!), New Relic is educating people about the importance of response time, or latency.
But as I read through the blog posts about this chart, it struck me that there’s something a little odd about it. The problem, I realized, is that it plots throughput as the independent variable on the chart. But throughput isn’t an independent variable. Throughput is the system’s output under load, and depends on a) the load on the system and b) the system’s scalability. It’s a dependent variable.
In a chart like this, it would be even better to show the independent variable as the variable that one can really control: the concurrency or load on the system. By “load” I mean the usual definition: the amount of work waiting to be completed, i.e. the backlog; this is what a Unix load average measures.
To explain a little more what I mean about throughput being dependent, not independent, here are a few ways to think about it:
- An independent variable should range from zero to infinity (negative numbers are unphysical in a situation like this, so we exclude them). Throughput has a very finite theoretical and practical upper bound, but concurrency can theoretically go to infinity as work arrives and doesn’t complete.
- An independent variable is the variable you can control as an input parameter of a system under test. It’s dead-easy to achieve the desired concurrency for a benchmark or other test. It’s amazingly difficult to manufacture a desired throughput for a benchmark, even in “easy” conditions. Computers are unruly beasts — they are queueing systems, and random variations and dependencies cause throughput to fluctuate greatly. That’s because throughput is measured at the output end of the system, after the queues inside the system have had their way with the input and introduced statistical fluctuations into it. It’s quite easy to generate a desired arrival rate for a system under test, provided that you have an unbounded number of workers ready to keep submitting more requests as the system queues up and stalls existing workers, but arrivals are not the same as throughput :-) Any way you look at it, you can pick your concurrency and your arrival rate, but you really can’t pick your throughput reliably. Throughput is an effect, not a cause.
- An independent variable in a function must map to one and only one value of the dependent variable. But we know that as load increases, a system’s throughput rises, peaks, and then falls again as retrograde scalability manifests itself. Suppose a system’s throughput goes from 10,000 queries per second at 16 threads, to 20,000 at 32 threads, and back to 10,000 at 64 threads. Now if we flip the chart’s axes around and treat throughput as an input, we’ll find that a throughput of 10,000 queries per second would map to either 16 or 64 threads. That doesn’t describe a real function.
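That last bullet is easy to demonstrate numerically. Here's a sketch using the USL as the throughput curve; the coefficients are invented, not measured from any real system:

```python
def throughput(n, lam=1000.0, sigma=0.02, kappa=0.0005):
    """USL throughput at concurrency n; coefficients are made up."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Throughput rises, peaks, then retrogrades as concurrency keeps growing.
curve = {n: throughput(n) for n in range(1, 201)}
peak = max(curve, key=curve.get)

# Two very different concurrency levels, one on each side of the peak,
# yield nearly the same throughput -- so mapping a throughput value back
# to a single concurrency is impossible. Throughput is not invertible.
print(peak)                                  # somewhere in the mid-40s
print(round(curve[16]), round(curve[122]))   # nearly equal
```

Flip the axes and a horizontal line at that throughput level crosses the curve twice, which is exactly the one-to-many mapping described above.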
So although the New Relic scalability chart shows some of the effects of the system’s scalability, and it’s great to visualize the variation in response time as throughput varies, it doesn’t strike me as quite the right angle of approach.
I’m curious to hear from people who may have used this feature. What did you use it for? Were you successful in gaining insight into scalability bottlenecks? How did it help you?