The Square Root Staffing Law

The square root staffing law is a rule of thumb derived from M/M/m queueing theory, useful for estimating the capacity you might need to serve an increased amount of traffic.

The square root staffing law is designed to help with capacity planning in what’s called the QED regime, which tries to balance efficiency with quality of service. Capacity planning is a set of tradeoffs: for best quality of service, you must provision lots of spare capacity (headroom), but that’s wasteful. For best efficiency, you minimize idle capacity, but then quality of service becomes terrible.
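For a concrete feel, here's a minimal sketch; the function name and the headroom factor c are my choices for illustration, not from the post. The rule says that for an offered load R = λ/μ, you need roughly R + c·√R servers, so proportional headroom shrinks as the system grows.

```python
import math

def sqrt_staffing(arrival_rate, service_rate, c=2.0):
    """Estimate server count under the square root staffing rule.

    arrival_rate: lambda, requests per second offered to the system
    service_rate: mu, requests per second one server can handle
    c: headroom factor; larger c favors quality of service over efficiency
    """
    offered_load = arrival_rate / service_rate            # R = lambda / mu
    return math.ceil(offered_load + c * math.sqrt(offered_load))

# Example: traffic doubles from 100 to 200 req/s; each server handles 10 req/s.
print(sqrt_staffing(100, 10))   # R = 10 -> 17 servers (70% headroom)
print(sqrt_staffing(200, 10))   # R = 20 -> 29 servers (45% headroom)
# Headroom grows only as sqrt(R), so bigger systems run more efficiently
# at the same quality-of-service target.
```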

» Continue Reading (about 400 words)

How scalable is your database?

Most of the time, when people say “scalability,” they mean any of dozens of things; when I say it, I mean exactly one precisely defined thing. However, I don’t claim that’s the only correct use of the word. There is another sense, in particular, that I think is very important to understand: the inherent limitations of the system. This second sense doesn’t have a single mathematical definition, but it’s vital nonetheless.

» Continue Reading (about 500 words)

How scalable is Riak?

I’ve been reading a little bit about Riak, and was curious about its performance and scalability. The only benchmark I found that allowed me to assess scalability was this one from Joyent. Of course, they say scalability is linear (everyone says that, without knowing what it means), but the results are clearly not a straight line. So how scalable is it, really? The Universal Scalability Law is such a powerful tool for thinking about scalability.
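For reference, the USL models throughput at N nodes as C(N) = λN / (1 + σ(N−1) + κN(N−1)). Here's a minimal sketch; the coefficients below are invented for illustration, not fitted to the Joyent benchmark:

```python
def usl(n, lam, sigma, kappa):
    """Universal Scalability Law: predicted throughput at n nodes.

    lam:   throughput of a single node (coefficient of performance)
    sigma: contention (serialization) coefficient
    kappa: coherency (crosstalk) coefficient
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative coefficients only; compare the model against linear scaling.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(usl(n, lam=1000, sigma=0.03, kappa=0.0005)), 1000 * n)
```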

» Continue Reading (about 200 words)

A close look at New Relic's scalability chart

I’ve written a lot about modeling MySQL with the USL, and I like it best of all the scalability models I’ve seen, but it’s not the only way to think about scalability. I was aware that New Relic supports a scalability chart, so I decided to take a peek at it. [Screenshot of New Relic’s scalability chart, from their blog.] Here’s how it works: it plots response time (or database time, or CPU) as the dependent variable versus throughput as the independent variable.
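To make that concrete, here's a minimal sketch of such a chart; the data points are synthetic, invented purely to show the shape:

```python
import matplotlib.pyplot as plt

# Synthetic data: throughput (requests/sec) and response time (ms).
throughput  = [100, 200, 400, 600, 800, 900, 950]
response_ms = [12, 13, 15, 20, 35, 70, 140]

# Throughput is the independent variable (x-axis); response time is the
# dependent variable (y-axis), so you can watch latency degrade under load.
plt.scatter(throughput, response_ms)
plt.xlabel("Throughput (requests/sec)")
plt.ylabel("Response time (ms)")
plt.title("Scalability chart: response time vs. throughput")
plt.show()
```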

» Continue Reading (about 800 words)

Modeling scalability with the USL at concurrencies less than 1

Last time I said that you can set a starting value for the USL’s coefficient of performance and let your modeling software (R, gnuplot, etc.) manipulate it as part of the regression to find the best fit. However, there is a subtlety in the USL model that you need to be aware of. [Plot of the low end of the curve.] The graph shows the USL model as the blue curve and linear scalability as the black line.
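Here's a minimal sketch of that kind of fit in Python with SciPy; the measurements are synthetic, and R or gnuplot work the same way in spirit:

```python
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    # Universal Scalability Law: throughput at concurrency n.
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Synthetic (concurrency, throughput) measurements for illustration.
n = np.array([1, 2, 4, 8, 16, 32])
x = np.array([950, 1850, 3400, 5900, 8800, 10500])

# p0 supplies a starting value for each coefficient, including lam, the
# coefficient of performance; the regression is free to adjust all three.
(lam, sigma, kappa), _ = curve_fit(usl, n, x, p0=[x[0], 0.02, 0.0001])
print(f"lam={lam:.1f} sigma={sigma:.4f} kappa={kappa:.6f}")
```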

» Continue Reading (about 500 words)

Determining the USL's coefficient of performance, part 2

Last time I said that the USL has a forgotten third coefficient, the coefficient of performance. This is the same thing as the system’s throughput at concurrency=1, or C(1). How do you determine this coefficient? There are at least three ways. Neil Gunther’s writings, or at least those that I’ve read and remember, say that you should set it equal to your measurement of C(1). Most of his writing discusses a handful of measurements of the system: one at concurrency 1, and at least 4 to 6 at higher concurrencies.
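In code, the approach I attribute to Gunther here amounts to pinning the coefficient to the measured C(1) and fitting only the other two coefficients. A sketch with synthetic data, continuing the SciPy example from the previous post:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic measurements; the first point is the system at concurrency 1.
n = np.array([1, 4, 8, 16, 32])
x = np.array([950, 3400, 5900, 8800, 10500])

c1 = x[0]  # pin the coefficient of performance to the measured C(1)

def usl_fixed(n, sigma, kappa):
    return c1 * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

(sigma, kappa), _ = curve_fit(usl_fixed, n, x, p0=[0.02, 0.0001])
print(f"lam fixed at C(1)={c1}: sigma={sigma:.4f} kappa={kappa:.6f}")
```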

» Continue Reading (about 600 words)

Black-Box Performance Analysis with TCP Traffic

This is a cross-post from the MySQL Performance Blog; I thought it would be interesting to users of PostgreSQL, Redis, Memcached, and $system-of-interest as well. (Note: VividCortex is the startup I founded in 2012. It’s the easiest way to monitor what your servers are doing in production, and it performs this kind of TCP network traffic analysis, offering MySQL and PostgreSQL performance monitoring among many other features.) For about the past year I’ve been formulating a series of tools and practices that can provide deep insight into system performance simply by looking at TCP packet headers and the times at which they arrive at and depart from a system.
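To illustrate the core idea (a simplified sketch, not VividCortex's implementation, with invented timestamps): on a connection with strict request/response alternation, the gap between an inbound packet and the next outbound packet approximates the server's response time, and you never have to parse the payload.

```python
# Packet timestamps in seconds, as seen at the server's network interface.
inbound  = [0.000, 0.120, 0.250]   # request packets arriving
outbound = [0.030, 0.180, 0.420]   # response packets departing

# Response time for each exchange: departure minus the matching arrival.
response_times = [out - req for req, out in zip(inbound, outbound)]
print(response_times)              # [0.03, 0.06, 0.17]

# Throughput falls out of the same data: requests per unit of elapsed time.
elapsed = outbound[-1] - inbound[0]
print(len(inbound) / elapsed, "requests/sec")
```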

» Continue Reading (about 300 words)

Fundamental performance and scalability instrumentation

This post is a follow-up to some promises I made at Postgres Open. Instrumentation can be a lot of work to add to a server, and it can add overhead too. The bits of instrumentation I’ll advocate in this post are few and trivial, but disproportionately powerful. Note: VividCortex is the startup I founded in 2012. It’s the easiest way to monitor what your servers are doing in production.
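As a guess at the spirit of it (my sketch, not necessarily the post's exact recommendations): two monotonically increasing counters per server, requests completed and total busy time, are enough to derive throughput, average response time, and concurrency between any two observations.

```python
# Two hypothetical snapshots of the counters, taken 10 seconds apart.
t0 = {"requests": 1_000_000, "busy_seconds": 52_000.0}
t1 = {"requests": 1_004_500, "busy_seconds": 52_090.0}
interval = 10.0

requests = t1["requests"] - t0["requests"]        # 4500 completed
busy = t1["busy_seconds"] - t0["busy_seconds"]    # 90.0 seconds of work

throughput = requests / interval                  # 450 requests/sec
avg_response = busy / requests                    # 0.02 sec per request
concurrency = busy / interval                     # Little's law: N = X * R = 9
print(throughput, avg_response, concurrency)
```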

» Continue Reading (about 500 words)

Surge 2011 slides, recap

This year’s Surge conference was a great sophomore event to follow up last year’s inaugural conference. A lot of very smart people were there, and the hallway track was great. I participated in three ways: I gave a lightning talk about causes of MySQL downtime, I chaired a panel on Big Data and the Cloud, and I showed how to derive scalability and performance metrics from TCP traffic. I’ve sent my slides to the Surge organizers, and I understand that they will be posting them as well as integrating them into the video of my session.

» Continue Reading (about 200 words)

When systems scale better than linearly

I’ve been seeing a few occasions where Neil J. Gunther’s Universal Scalability Law doesn’t seem to model all of the important factors in a system as it scales. Models are only models, and they’re not the whole truth, so they never match reality perfectly. But there appear to be a small number of cases where systems can actually scale a bit better than linearly over a portion of the domain, due to what I’ve been calling an “economy of scale.”
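One way to see this in the USL's own terms (my illustration, not from the post): if the fitted contention coefficient σ comes out negative, the model rises above the linear line over part of the domain before the coherency term pulls it back under.

```python
def usl(n, lam, sigma, kappa):
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative coefficients: a negative sigma yields a superlinear region.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(usl(n, lam=100, sigma=-0.05, kappa=0.002)), 100 * n)
# At n = 2..16 the model exceeds linear scaling (100 * n); by n = 32 the
# coherency cost dominates and it drops back below the linear line.
```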

» Continue Reading (about 300 words)

I'll be presenting at Postgres Open 2011

I’ve been accepted to present at the brand-new and very exciting Postgres Open 2011 about system scaling, TCP traffic, and mathematical modeling. I’m really looking forward to it – it will be my first PostgreSQL conference in a couple of years! See you there.

» Continue Reading (about 100 words)

I'm speaking at Surge 2011

I’ll be speaking at Surge again this year. This time, unlike last year’s talk, I’m tackling a very concrete topic: extracting scalability and performance metrics from TCP network traffic. It turns out that most things that communicate over TCP can be analyzed very elegantly just by capturing arrival and departure timestamps of packets, nothing more. I’ll show examples where different views on the same data pull out completely different insights about the application, even though we have no information about the application itself. (Okay, I actually know that it’s a MySQL database, and a lot about the actual database and workload, but I don’t need that in order to do what I’ll show you.)
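As a taste of what I mean by different views on the same data (a sketch with invented timestamps): the same arrival/departure pairs that give per-request response times also give concurrency, just by sweeping over the events in time order.

```python
# Each request i arrives at arrivals[i] and its response departs at
# departures[i]; timestamps are invented for illustration.
arrivals   = [0.00, 0.01, 0.05, 0.12]
departures = [0.08, 0.03, 0.20, 0.15]

# Sweep over +1/-1 events in time order to track requests in flight.
events = [(t, +1) for t in arrivals] + [(t, -1) for t in departures]
in_flight, peak = 0, 0
for _, delta in sorted(events):
    in_flight += delta
    peak = max(peak, in_flight)
print("peak concurrency:", peak)    # 2 requests in flight at once
```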

» Continue Reading (about 200 words)

When can I have a big server in the cloud?

I was at a conference recently, talking with a Major Cloud Hosting Provider, and mentioned that for database servers I really want large instances, quite a bit larger than the largest I can get now. The lack of cloud servers with lots of memory, many fast cores, and fast I/O and network performance leads to premature sharding, which is costly. Many applications that can currently run on a single real server would require sharding to run in any of the popular cloud providers’ environments.

» Continue Reading (about 800 words)

Subtleties in the Universal Scalability Law

Those of you who’ve been following my recent work on modeling system scalability might be interested in this. (It’s not my work, by the way. I’m just trying to ski in the wake of Neil Gunther.) I’ve measured quite a few systems that have some strange bubbles in the scalability curve. As I explained in my talk on Thursday, systems don’t always follow the model precisely, because of their internal architecture.

» Continue Reading (about 300 words)