What’s a good benchmark?
Vadim has taught me that valid benchmarks are both simple and complex. Simple, because the basic principles are few; complex, because the devil is in the details and it’s a lot of work to satisfy the basic requirements. I’ll give the simple version here.
- Benchmarks must be appropriate. The workload, sample dataset, distribution of work and data, and so on must be relevant and meaningful for the intended purpose. Running the wrong benchmark rarely teaches anything.
- Benchmarks must be fully documented. Another researcher must be able to determine exactly how you ran your benchmark, on what hardware, under what workload, what operating system, kernel version, all MySQL tuning parameters, and so on.
- Benchmarks must be repeatable. Another researcher must be able to reproduce your results. Documentation is part of this, but you need to ensure that you can reproduce your own results. If you can’t, no one else can either.
- Benchmarks must be measured fully. All relevant data, such as I/O and CPU usage, memory usage, and so on must be collected for later analysis and to help respond to questions that other researchers may have.
- Benchmarks must be analyzed. The results must be compared against what is expected, and verified against common sense. I recently ran an iozone benchmark on a 15k RPM disk in O_DIRECT mode and got 14k random reads per second. I was under the gun in an emergency, so I didn’t notice this in the flood of other data at the time, but Peter caught it right away when I called for a second opinion. What this tells me is that iozone isn’t really doing O_DIRECT, because that’s physically impossible performance. Benchmark results must be put under a microscope and examined with a critical eye.
- Benchmarks must be interpreted and explained. Any deviations must be examined and explained. Many benchmarks produce inexplicable results. They must not be published before understanding the results. There are rare exceptions, but that’s the general principle to follow. It is quite likely that some unexpected result means the benchmark was set up wrong and needs to be redone or thrown away.
I suppose it should go without saying, but it’s worth saying anyway: benchmarks must also not be deliberately manipulated, e.g. artificial throttling to make scalability appear to be a perfectly straight line. I’ve seen more than one benchmark that is simply too perfect to be true (each throughput measurement is an exact multiple of ten thousand, for example), and the original data is mysteriously not available anymore.



Hi Baron,
When it comes to benchmarking, do you believe it is important that structure between the two alternatives be exactly the same -or- the setup be optimized for the respective application? I’d like to get your opinion.
I would argue that each system should be calibrated to its optimal performance (assuming you know what optimal for each application is). As long as the raw dataset is the exact same, the organization/structure/layout of the data within the application can be different.
Some tend to argue that benchmarks should consist entirely of controls and exactly 1 test.
Thanks,
Jeff
Jeff Kibler
11 May 11 at 3:56 pm
One of my favourite comments on benchmarking comes from Larry McVoy:
http://lwn.net/Articles/368908/
Morgan TOcker
11 May 11 at 10:48 pm
Jeff,
I think “what is the maximum capacity of two very different systems” is a useful and valid question to ask. Not everything has to be apples-to-apples.
Xaprb
12 May 11 at 5:58 am