The Erlang Response Time Stretch Factor For 3 And 4 Servers

In a previous post I explored a few variations of equations that express the M/M/m queueing theory response time “stretch factor,” and tried to indicate some areas where I wanted to dig into the relationships between these formulas a bit more. In this post I discuss the divergence between the official Erlang C formula and Neil Gunther’s heuristic approximation to it. I introduced the question this way:

At \(m=3\) and above, the heuristic is only approximate. What does the Erlang form reduce to for the first of those cases? Does it result in the missing term that will extend to 4 and beyond too?
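To make the divergence concrete, here is a minimal Python sketch that compares the two forms numerically. It assumes the standard Erlang C formula for the exact result and takes the heuristic to be \(1/(1-\rho^m)\), with service time normalized to 1. The function names are illustrative, not taken from the post.

```python
from math import factorial


def erlang_c(m, rho):
    """Probability that an arriving request must queue in an M/M/m system."""
    a = m * rho  # offered load in Erlangs
    tail = a**m / factorial(m) / (1.0 - rho)
    head = sum(a**k / factorial(k) for k in range(m))
    return tail / (head + tail)


def erlang_stretch(m, rho):
    """Exact M/M/m stretch factor R/S, with service time S normalized to 1."""
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))


def heuristic_stretch(m, rho):
    """Heuristic stretch factor, assumed here to be 1 / (1 - rho^m)."""
    return 1.0 / (1.0 - rho**m)


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        for rho in (0.5, 0.75, 0.9):
            exact = erlang_stretch(m, rho)
            approx = heuristic_stretch(m, rho)
            print(f"m={m} rho={rho:.2f} exact={exact:.4f} heuristic={approx:.4f}")
```

Under these assumptions the two functions agree for 1 and 2 servers and start to differ at 3, which is the gap the questions above are asking about.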

» Continue Reading (about 500 words)

The Queueing Knee, Part 2

Last week I wrote about the so-called “knee” in the M/M/m queueing theory response time curve. In that post I examined one definition of the knee; here is my analysis of the others, including the idea that there is no such thing as the knee.

There are potentially several ways to think about the “knee” in the queueing curve. In the previous post I dug into Cary Millsap’s definition: the knee is the point where a line tangent to the queueing curve passes through the origin.


Here are a few others to consider:

» Continue Reading (about 1400 words)

The Queueing Knee, Part 1

The “knee” in the M/M/m queueing theory response time curve is a topic of some debate in the performance community. Some say “the knee is at 75% utilization; everyone knows that.” Others say “it depends.” Others say “there is no knee.”

Depending on the definition, there is a knee, but there are several definitions and you may choose the one you want. In this post I’ll use a definition proposed by Cary Millsap: the knee is where a line from the origin is tangent to the queueing response time curve. The result is a function of the number of service channels, and although we may argue about the topics in the preceding paragraph and whether this is the right definition, it still serves to illustrate important concepts.
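As a rough illustration of that definition, the sketch below finds the tangent point numerically. It relies on the observation that a line from the origin touches the curve where the ratio \(R(\rho)/\rho\) is smallest, and it assumes the Erlang C form of the M/M/m stretch factor. The function names and the ternary search are choices made for this example only.

```python
from math import factorial


def stretch(m, rho):
    """M/M/m response time stretch factor R/S from the Erlang C formula."""
    a = m * rho  # offered load in Erlangs
    tail = a**m / factorial(m) / (1.0 - rho)
    head = sum(a**k / factorial(k) for k in range(m))
    return 1.0 + (tail / (head + tail)) / (m * (1.0 - rho))


def knee(m, iterations=200):
    """Utilization where a line from the origin is tangent to the curve.

    The tangent point is where R(rho) / rho is smallest, so a ternary
    search over (0, 1) finds it (assuming that ratio has a single minimum).
    """
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if stretch(m, left) / left < stretch(m, right) / right:
            hi = right
        else:
            lo = left
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for m in (1, 2, 4, 8, 16):
        print(f"m={m:2d} knee at utilization {knee(m):.4f}")
```

For a single server this lands at 50% utilization, and the value rises as service channels are added, which is the dependence on the number of channels mentioned above.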


» Continue Reading (about 500 words)

Excel Hacks To Ignore Missing Data

I’ve done quite a bit of work with Excel over the last few years, and I’ve found a couple of recurring problems when there’s missing or error data in ranges. I’ve had to work around this enough times that I thought it was worth sharing the solutions I’ve used.


» Continue Reading (about 700 words)

The Response Time Stretch Factor

Computer systems, and for that matter all types of systems that receive requests and process them, have a response time that includes some time waiting in queue if the server is busy when a request arrives. The wait time increases sharply as the server gets busier. For simple M/M/m systems there is a simple equation that describes this exactly, but for more complicated systems this equation is only approximate. This has rattled around in my brain for a long time, and rather than keeping my notes private I’m sharing them here (although since I’m still trying to learn this stuff I may just be putting my ignorance on full display).

[Figure: hockey-stick curve]

» Continue Reading (about 2200 words)

The Square Root Staffing Law

The square root staffing law is a rule of thumb derived from M/M/m queueing theory, useful for getting an estimate of the capacity you might need to serve an increased amount of traffic.

[Figure: square root staffing law]

The square root staffing law is designed to help with capacity planning in what’s called the QED regime, which tries to balance efficiency with quality of service. Capacity planning is a set of tradeoffs: for best quality of service, you must provision lots of spare capacity (headroom), but that’s wasteful. For best efficiency, you minimize idle capacity, but then quality of service becomes terrible.
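In its usual textbook form the rule says the number of servers should be roughly the offered load plus a safety margin proportional to the square root of that load. A minimal sketch, assuming offered load is measured in Erlangs (arrival rate times mean service time) and that `beta` is a quality-of-service knob you pick yourself; both parameter names and the default value are illustrative:

```python
from math import ceil, sqrt


def servers_needed(offered_load, beta=1.0):
    """Square root staffing estimate: load plus beta * sqrt(load) of headroom."""
    return ceil(offered_load + beta * sqrt(offered_load))


if __name__ == "__main__":
    for load in (100, 200, 400):
        print(f"offered load {load}: about {servers_needed(load)} servers")
```

With `beta=1.0`, a load of 100 suggests 110 servers and a load of 200 suggests 215, so doubling traffic calls for less than double the capacity. A larger `beta` buys more headroom and better quality of service at the cost of efficiency.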

» Continue Reading (about 400 words)

How to Extract Data Points From a Chart

I often see benchmark reports that show charts but don’t provide tables of numeric results. Some people will make the actual measurements available if asked, but I’ve been interested in analyzing many systems for which I can’t get numbers. Fortunately, it’s usually possible to get approximate results without too much trouble. In this blog post I’ll show several ways to extract estimates of values from a chart image.


» Continue Reading (about 1000 words)

Setting Thresholds With Quantiles

I was talking with someone the other day about a visualization I remembered seeing some years ago, that could help set a reasonable value for a threshold on a metric. As I’ve written, thresholds are basically a broken way to monitor systems, but if you’re going to use them, I think there are simple things you can do to avoid making threshold values completely arbitrary.

I couldn’t find the place I’d seen the visualization (if you know prior art for the below, please comment!) so I decided to just blog about it. Suppose you start off with a time series:

[Figure: time series]

» Continue Reading (about 800 words)

New O'Reilly Book, Anomaly Detection For Monitoring

UPDATE: the book is now available from

Together with Preetam Jinka, I’m writing a book for O’Reilly called Anomaly Detection for Monitoring (working title).

I’d like your help with this. Would you please comment, tweet, or email me examples of anomaly detection used for monitoring, as well as monitoring problems that frustrate you and that you think anomaly detection might help solve?

Thanks in advance.


» Continue Reading (about 100 words)

Can Anomaly Detection Solve Alert Spam?

Anomaly detection is all the buzz these days in the “#monitoringlove” community. The conversation usually goes something like the following: Alerts are spammy and often generate false positives. What you really want to know is when something anomalous is happening. Anomaly detection can replace static thresholds and heuristics. The result will be better accuracy and lower noise. I’m going to give a webinar about the science of statistical anomaly detection on June 17th.

» Continue Reading (about 100 words)

Thinking clearly about fitting a model to data

I have often seen people fitting curves to sets of data without first understanding whether that is appropriate. I once even used this blog to criticize someone for doing that. I was trying to explain that it’s wrong to fit a model to a set of measurements, unless the model actually describes the process that produced the measurements. All of my explanations (and rants) have fallen far short of the clarity and simplicity of this curve-fitting guide.

» Continue Reading (about 400 words)

Determining the Universal Scalability Law's coefficient of performance

If you’re familiar with Neil Gunther’s Universal Scalability Law, you may have heard it said that there are two coefficients, variously called alpha and beta or sigma and kappa. There are actually three coefficients, though. See? \[ C(N) = \frac{N}{1 + \sigma(N-1) + \kappa N (N-1)} \] No, you don’t see it – but it’s actually there, as a hidden 1 multiplied by N in the numerator on the right-hand side.
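One way to see the hidden coefficient is to write it out as an explicit parameter. The sketch below does that in Python; the name `gamma` is an assumption made for this example, as are the sample coefficient values.

```python
def usl(n, gamma, sigma, kappa):
    """Universal Scalability Law with the numerator coefficient made explicit.

    With gamma = 1 this is the relative capacity C(N) from the formula above;
    with gamma set to the throughput at N = 1 it predicts absolute throughput.
    """
    return gamma * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only to show the shape of the curve.
    for n in (1, 2, 4, 8, 16, 32):
        print(f"N={n:2d} throughput={usl(n, 250.0, 0.05, 0.001):.1f}")
```

At `n = 1` both penalty terms vanish, so the function returns `gamma` itself, which is why the coefficient is easy to overlook when capacity is normalized to 1.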

» Continue Reading (about 300 words)

Trending data with a moving average

In my recent talk at Surge and Percona Live about adaptive fault detection (slides), I claimed that hardcoded thresholds for alerting about error conditions are usually best avoided in favor of dynamic or adaptive thresholds. (I actually went much further than that and said that it’s possible to detect faults with great confidence in many systems like MySQL, without setting any thresholds at all.) In this post I want to explain a little more about the moving averages I used for determining “normal” behavior in the examples I gave.
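For readers who want something to experiment with, here is a minimal sketch of two common moving averages in Python. This excerpt does not say which variant the talk used, so both the simple windowed mean and the exponentially weighted form are shown as generic illustrations; the function and parameter names are hypothetical.

```python
from collections import deque


def simple_moving_average(values, window):
    """Yield the mean of the most recent `window` values at each step."""
    recent = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # the oldest value is about to be evicted
        recent.append(v)
        total += v
        yield total / len(recent)


def exponential_moving_average(values, alpha):
    """Yield an exponentially weighted average; larger alpha reacts faster."""
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1.0 - alpha) * avg
        yield avg


if __name__ == "__main__":
    samples = [10, 12, 11, 13, 40, 12, 11]
    print(list(simple_moving_average(samples, 3)))
    print(list(exponential_moving_average(samples, 0.3)))
```

The windowed mean forgets old data abruptly once it leaves the window, while the exponential form lets its influence decay gradually and needs only one stored value per metric.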

» Continue Reading (about 600 words)