Xaprb

Stay curious!

Automatically detecting abnormal behavior in MySQL

with 14 comments

Over the course of years, I have observed that the three most sensitive indicators of MySQL having a server lockup are the queries per second, number of connections, and number of queries running. Here is a chart of those three on a production system. Find the bad spot:

I am currently working on developing an automated system that detects abnormal behavior in these three metrics, but doesn’t require any a priori inputs or thresholds, e.g. you don’t have to tell it “more than X is bad.” (It could be that during a low period of the day, X is different than during the peak load.)

It turns out that this is hard to do reliably, without a lot of false positives and without false negatives (not triggering during an incident). If there is existing literature on the mathematical techniques to do this, I’d be interested in not reinventing the wheel. Does anyone have references to share?

Written by Baron Schwartz

December 1st, 2011 at 5:43 pm

Posted in SQL

14 Responses to 'Automatically detecting abnormal behavior in MySQL'

Subscribe to comments with RSS

  1. Maybe you could find something useful in the last rrdtool version with the trend and predict operations.

    na

    1 Dec 11 at 5:54 pm

  2. You might want to take a look at Holt-Winters forecasting. We used this at IMVU to automatically find anomalies. Evan Miller did some work on this while he was with IMVU, and submitted a patch to rrdtool to improve its support. He has a write up online:
    http://www.evanmiller.org/poisson.pdf

    Chris Hondl

    1 Dec 11 at 6:25 pm

  3. Thanks, I will look at that. Two votes for it is compelling.

    Xaprb

    1 Dec 11 at 8:23 pm

  4. “Detection of abrupt changes”, the whole book. http://www.irisa.fr/sisthem/kniga/kniga.pdf

    Wlad

    1 Dec 11 at 8:34 pm

  5. Hey, I’m working on something like this, too. I called it “cepmon” — it sends your graphite monitoring metrics (steals a copy from an amqp topic) through an open-source CEP engine (Esper; via jruby glue). I’m also trying to cobble together a working holt-winters timeseries prediction algorithm integrated with Esper, and example CEP rules to find “unexpected” data. Also useful if you need to do an SLA-type monitor (90% percentile response time, network latency, etc).

    https://github.com/fetep/cepmon

    Feel free to drop me an email if you want to talk more about this.

    Pete Fritchman

    2 Dec 11 at 12:09 am

  6. Baron,

    Perhaps you can get some insight by looking at how Anomaly detection is implemented in CFEngine.

    Fortxun

    2 Dec 11 at 4:46 am

  7. Baron, you are reaching for the holy grail here. But yes, this is what needs to be done.

    This may not be the easiest way to do it, but the methodologies are sound. In my university Neural Networks were the trendy thing that we were taught. An emeritus professor there is the inventor of self organizing maps – again, not the simplest algorithm out there but could be used for this purpose.

    Today’s machine learning, and what have you, work in similar ways, but is mostly easier than the academic literature on neural networks: it turns out you can get pretty far with sums and averages to calculate some kind of rank.

    So for example, to use a self organizing map for this purpose:
    – Take a series of “states”, such as the above three variables. It should include “bad” states.
    – Train the self organizing map with that data.
    – In this case you’d probably use some moving window as input, not just 3 values. Like values over the past minute.
    – The output is a 2 dimensional “map” of cells, where similar states are close to each other. You should find that all of the “bad” states are clustered together, in one corner or a few areas.
    – Your job as a human being is to look at the map and identify what each area means (these are the bad cells, here a backup is running, here the db is idle…)
    – Note that this process has to be done again for each system being monitored. Even for the same system the map could look different on 2 consecutive runs.
    – You now feed live data into trained map. Whenever current state is classified into a “bad” cell, you call Ghostbusters. I mean Nagios.

    Henrik Ingo

    2 Dec 11 at 10:14 am

  8. So as a demo: 80 million usenet articles organized into a SOM: http://websom.hut.fi/websom/milliondemo/html/root.html

    Henrik Ingo

    2 Dec 11 at 2:40 pm

  9. I’m reading through a few of the links given now; very interesting and useful. I have already re-invented some of these wheels. For example, it turns out that I tried a Shewhart control chart about 18 months ago, and found that it is inadequate.

    Xaprb

    4 Dec 11 at 6:06 am

  10. … and my most recent efforts have been in Geometric Moving Average Control Charts :-)

    Xaprb

    4 Dec 11 at 6:10 am

  11. The biggest meta-lesson I’ve learned so far is that the branch of science that deals with all of this is called Statistical Process Control. Knowing that would have enabled me to find a huge variety of fascinating research on this topic.

    Xaprb

    4 Dec 11 at 6:25 am

  12. After reading through a bunch of stuff I believe that the geometric average I’ve recently been using is likely to be the best bet, along with run tests. There is an important twist, though, which I’ll follow up on in another blog post.

    Xaprb

    4 Dec 11 at 6:50 am

  13. I was wondering if there were any updates on this. I was looking at simply setting up your recommended mysql metrics “queries per second, number of connections, and number of queries running” to be collected and stored in Graphite and then (naively??) setting up some comparison function to indicate problems.

    However, I see that in the notes for your upcoming Velocity talk “Quantifying Abnormal Behavior” (http://velocityconf.com/velocity2013/public/schedule/detail/28118) you state “Why control charts, Holt-Winters, trending, etc are NOT the right approach”.

    I’m curious where you are at on this. Any thoughts or do we have to wait for your Velocity talk?

    Great Stuff all around!

    lreed

    11 Mar 13 at 11:09 am

  14. Check out my talk about this on the food fight show: http://foodfightshow.org/2013/03/adaptive-fault-detection.html

    Xaprb

    12 Mar 13 at 9:57 am

Leave a Reply