A close look at New Relic's scalability chart

I’ve written a lot about modeling MySQL with the Universal Scalability Law (USL), and I like it best of all the scalability models I’ve seen, but it’s not the only way to think about scalability. I was aware that New Relic offers a scalability chart, so I decided to take a closer look. Here’s a screenshot of the chart, from their blog:

[Screenshot: New Relic’s scalability chart, from their blog]

Here’s how it works. Samples are charted as points in a scatter plot, with response time (or database time, or CPU) as the dependent variable and throughput as the independent variable. The points are color-coded by time of day, outliers are automatically removed, and a line is drawn through the points to indicate the general shape.
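To make the mechanics concrete, here’s a minimal sketch of a chart in the same spirit. It is not New Relic’s implementation: the data is synthetic, and the trend line is a plain least-squares fit rather than whatever smoothing and outlier removal they actually use.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# One synthetic sample per minute over a day: hour of day drives the traffic level.
hours = np.repeat(np.arange(24), 60)
throughput = 200 + 150 * np.sin((hours - 6) / 24 * 2 * np.pi)
throughput = np.clip(throughput + rng.normal(0, 20, hours.size), 10, None)   # requests/min
response_ms = 50 + 0.3 * throughput + rng.normal(0, 10, hours.size)          # grows with load

fig, ax = plt.subplots()
points = ax.scatter(throughput, response_ms, c=hours, cmap="viridis", s=8)
fig.colorbar(points, ax=ax, label="hour of day")

# A simple least-squares line to indicate the general shape.
slope, intercept = np.polyfit(throughput, response_ms, 1)
xs = np.linspace(throughput.min(), throughput.max(), 100)
ax.plot(xs, slope * xs + intercept, color="red")

ax.set_xlabel("throughput (requests/min)")
ax.set_ylabel("response time (ms)")
ax.set_title("Response time vs. throughput, colored by time of day")
plt.show()
```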

The focus on response time is really good. That’s one of the things I like about New Relic. While most systems show people status counters and imply that those counters hold some deep insight and meaning (there’s usually no meaning to be found in status counters!), New Relic is educating people about the importance of response time, or latency.

But as I read through the blog posts about this chart, it struck me that there’s something a little odd about it. The problem, I realized, is that it plots throughput as the independent variable. But throughput isn’t an independent variable. Throughput is the system’s output under load, and it depends on a) the load placed on the system, and b) the system’s scalability. It’s a dependent variable.

In a chart like this, it would be even better to use as the independent variable the thing one can really control: the concurrency, or load, on the system. By “load” I mean the usual definition: the amount of work waiting to be completed, i.e. the backlog; this is what a Unix load average measures.
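As a small aside on that definition, you can read the Unix load average directly from Python’s standard library; this is just an illustration of what “load” means here, nothing specific to New Relic:

```python
import os

# The 1-, 5-, and 15-minute load averages: roughly, how much work is queued
# up (plus what is currently running) on the system. Unix-like platforms only.
one_min, five_min, fifteen_min = os.getloadavg()
print(f"load average: {one_min:.2f} (1m) {five_min:.2f} (5m) {fifteen_min:.2f} (15m)")
```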

To explain a little more what I mean about throughput being dependent, not independent, here are a few ways to think about it:

- You can’t turn a knob labeled “throughput.” What you can control is how much work you offer the system: the concurrency, or load.
- Given that load, the system delivers whatever throughput its scalability permits. Throughput is a measurement of the outcome, not an input you get to choose.
- Little’s Law ties the quantities together: concurrency equals throughput times response time (N = X·R). Once you fix the load you apply and the system determines the response time, the throughput follows.
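To make those points concrete, here’s a small sketch using the USL I mentioned earlier, with made-up coefficients (the single-thread throughput, sigma, and kappa below are assumptions, not measurements of any real system). Concurrency N is the input you choose; throughput and response time fall out of it:

```python
def throughput(n, lam=100.0, sigma=0.05, kappa=0.001):
    """USL throughput: X(N) = lam * N / (1 + sigma*(N-1) + kappa*N*(N-1)).

    lam   -- throughput at N=1 (assumed 100 req/s here)
    sigma -- contention (serialization) coefficient, assumed
    kappa -- coherency (crosstalk) coefficient, assumed
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    x = throughput(n)
    r = n / x          # Little's Law: N = X * R, so R = N / X
    print(f"N={n:4d}  throughput={x:8.1f} req/s  response time={r*1000:7.2f} ms")
```

Notice that you never set the throughput anywhere: you pick N, and X(N) emerges from the load and the scalability coefficients, eventually flattening out and even declining as N grows.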

So although the New Relic scalability chart shows some of the effects of the system’s scalability, and it’s great to visualize the variation in response time as throughput varies, it doesn’t strike me as quite the right angle of approach.

I’m curious to hear from people who may have used this feature. What did you use it for? Were you successful in gaining insight into scalability bottlenecks? How did it help you?

