Archive for the ‘Scalability’ Category
Last time I said that you can set a starting value for the USL’s coefficient of performance and let your modeling software (R, gnuplot, etc) manipulate this as part of the regression to find the best fit. However, there is a subtlety in the USL model that you need to be aware of. Here is a picture of the low-end of the curve:
The graph shows the USL model as the blue curve and linear scalability as the black line. Notice that at concurrencies less than 1, the value of the USL function is actually greater than the linear scalability function. This deserves some thought and explanation, because it can cause problems.
If you think about it, concurrency between one and zero is impossible. In fact, concurrency is not a smooth function, it is a step function. There can be zero requests resident in the system, one request, two requests, and so on — but not 0.7 requests or 3.14159 requests. However, the USL is defined in terms of a continuous function, not a step function.
The trouble with the MySQL systems I usually model is that I generally observe them in the wild, which means that I get a large number of samples of throughput-and-concurrency, and I aggregate them. For example, I’ll usually observe concurrency once per second, and average these samples over a minute or more for each point I want to feed into the USL model. This approach generates concurrency values that are real numbers, not just integers — so it’s entirely possible that during a given minute, the “average concurrency” on the system comes out to 0.7 or 3.14159. What’s to be done with this?
In the perfect world, I’d like to delete “empty space” during which zero queries were executing, and determine the actual throughput at each integral value of concurrency. But it’s a lot less convenient to do this, at best; and it’s usually impractical or impossible. So I work with the data I have. In practice I find it’s good enough.
Back to the funny anomaly where the USL predicts better-than-linear scalability between concurrency=0 and =1. The outcome is that the regression to the USL model can potentially skew the values of the USL coefficients, if you have any samples that lie between 0 and 1. Thus, it may be a good idea to discard these samples. This should not be a significant portion of your sample dataset anyway. If you don’t have a lot of samples at higher concurrencies, you probably don’t have enough data to model the system accurately, and you should act accordingly.
Last time I said that the USL has a forgotten third coefficient, the coefficient of performance. This is the same thing as the system’s throughput at concurrency=1, or C(1). How do you determine this coefficient? There are at least three ways.
Neil Gunther’s writings, or at least those that I’ve read and remember, say that you should set it equal to your measurement of C(1). Most of his writing discusses a handful of measurements of the system: one at concurrency 1, and at least 4 to 6 at higher concurrencies. I can’t remember a time when he’s discussed taking more than one measurement of throughput at each level of concurrency, so I think the assumption is that you’re going to take a single measurement at various concurrencies (or, in the case of hardware scalability, units of hardware), and you’re done.
This tends to work quite well. I’ve blogged before about this: well-designed systems, measured in a carefully controlled test, tend to match the Universal Scalability Law model quite well. Here are two examples.
Most systems I model aren’t like that. I don’t do my modeling in a lab. I get thousands, if not tens or hundreds of thousands, of measurements of throughput and concurrency from a MySQL server’s real production traffic. How do you determine the system’s throughput at concurrency=1 in this kind of situation? You may have hundreds or thousands of samples at or near concurrency=1, and here’s the interesting thing: they aren’t tightly clustered. This leads to the two additional techniques I’ve used.
Method 2 is fairly obvious: you can take an aggregate measure of the throughput at N=1. You can simply average, or you can use the median. In my experience, the latter tends to be a little more accurate, because the median essentially discards outliers. Given enough samples, it is very likely that the median is truly representative of the system’s real behavior.
Finally, method 3 is to treat C(1) as one of the parameters to fit in the regression to the USL model. Instead of holding it as a fixed quantity, go ahead and let the regression find the best fit for it along with the other coefficients.
In practice, I tend to combine methods 2 and 3. I use method 2 to find a starting point for the coefficient, and then I let the regression tweak it as needed. In my experience, this usually produces good results. Sometimes the software doing the regression gets a little confused, or stuck at a local maximum, but otherwise it works well.
What if you don’t have measurements at N=1? The best approach, in my experience, is to take the slope of the line from the first data point you have, and use that. N=1 will almost always be higher than this, because real systems are rarely linearly scalable. That’s okay. If you let the regression adjust the coefficient as needed for the best fit, you’ll end up with a good answer anyway.
If you’re familiar with Neil Gunther’s Universal Scalability Law, you may have heard it said that there are two coefficients, variously called alpha and beta or sigma and kappa. There are actually three coefficients, though. See?
No, you don’t see it — but it’s actually there, as a hidden “1″ multiplied by N in the numerator on the right-hand side. When you’re using the USL to model a system’s scalability, you need to use the C(1), the “capacity at one,” as a multiplier. I call this the coefficient of performance. It’s rarely 1; it’s usually thousands.
To illustrate why this matters, consider two systems’ throughput as load increases:
The green line and the blue line are both linearly scalable systems. Add twice the concurrency, get twice the throughput. But the slope of the lines is different. The green system can do three times as much work as the blue system, even though it’s no more scalable.
To model the USL, you need to determine C(1) by measuring the system under test. In my experience with real systems running in production, mostly MySQL servers, this is not simple. You can’t just say “let’s quiet the web app down, I want to load it with exactly one user for a few minutes and measure how fast it runs.” Instead, you get a bunch of samples from production traffic, and you derive the throughput at concurrency=1 from that.
The result goes into the numerator as a multiplier of N, although it’s usually omitted when the USL formula is shown.