Detecting MySQL server problems automatically
I previously blogged about work I was doing on automatically finding problems in a MySQL server, with no hardcoded thresholds or predetermined indicators of what is “bad behavior.” I had to pause my studies on that for a while, due to time constraints. I’ve recently been able to resume and I’m happy to report that I’m making good progress.
One of the things I’ve done is a survey of existing literature on this subject. It turns out that the abnormality-detection techniques I’ve developed over the years are well-known in the operations research field. I reinvented some classic techniques used in Statistical Process Control (SPC). These include Shewhart Control Charts, exponentially weighted moving averages, and Holt-Winters forecasting. However, I was never satisfied with these approaches. They are simultaneously overly simplistic and too sensitive, so they produce false positives and false negatives when applied to MySQL status metrics. I suspect the same thing would happen in most server systems.
Given that the existing techniques I’ve found are inadequate, I’ve developed several more that satisfy me. I am currently testing them on a wide variety of real-world data to fine-tune them.
I can’t claim to understand fully what I’m doing, although I think I grasp it intuitively. When I read what expert researchers have written on some of the simpler methods I decided aren’t good enough, the math quickly overwhelms me, so I imagine that I am working with much more complex math in my new algorithms. (I also assume that my new algorithms are also reinvented wheels, and I will probably find out what they’re really called at some future date.)
I’ll present some of my work at Percona Live in a few weeks.



I look forward to hearing what you’ve come up with.
Maybe put it on github? :-)
Jeremy Zawodny
6 Sep 12 at 7:17 pm
I think a big problem here is that statistical methods are usually about analyzing data to identify outliers (either from forecasts done from the same data, or from comparison against expected values).
For MySQL, in my experience, the most reliable indicator of trouble is query response time, and it’s impossible to get this out of MySQL directly.
So while we can apply the statistical methods to other indicators we can get out of MySQL, I can’t help to feel it would be like looking for your lost keys not where you dropped them, but where the light is better.
Now, if you want to do this without any client-side instrumentation (i.e new relic), I think the only hope right now would be a version of pt-query-digest that perpetually analyzes a slow log (probably a rate limited slow log) and calculating the response time for the 95th percentile of samples.
Finally, I second the request for making this available on github :)
Fernando ipar
7 Sep 12 at 1:25 pm
Little’s Law gives us a means to get average query response times over intervals of time using SHOW STATUS.
Xaprb
7 Sep 12 at 2:13 pm