Archive for May, 2010
Reacting to small variations in response time
I wrote recently about early detection for MySQL performance problems. If your server is having micro-fluctuations in performance, it’s important to know, because very soon they will turn much worse. What can you do about this?
The most important thing is not to guess at what’s happening, but to measure instead. I have seen these problems from DNS, the binary log, failing hardware, the query cache, the table cache, the thread cache, and a variety of InnoDB edge cases. Guessing at the problem is very dangerous; you need diagnostic data. But it is often quite hard to catch a problem in action when you can only observe it in hindsight, and it happens only for a few seconds once or twice a week. This blog post is about how to detect small variations in performance, especially when it is most difficult to observe them.
Sometimes it’s actually quite easy, so let’s look at the easy cases first. Over time I have built up a collection of tricks and tools for catching a problem in action. The process of catching and diagnosing a lightning-fast performance problem looks like the following:
- Determine the symptoms of the problem.
- Determine how to observe the symptoms reliably and quickly.
- Determine how to gather diagnostic data for later.
- Set up tools or processes to do the above.
- Sift the collected data and diagnose.
This is 95% about figuring out how to observe the problem and gathering the data, and 5% about actually diagnosing. If you don’t get the 95% right, you’ll gather too little or too much data, or you’ll capture it at the wrong time. Your job is hard enough; you won’t be successful if you simply gather gigs of data for weeks at a time. You need to be as precise as you can. Here are some examples:
- There are normally very few connections to the server, but sometimes I start getting “Error: max_connections exceeded” or similar. Solution: observe Threads_connected from SHOW GLOBAL STATUS and react when it grows too large, or when you cannot log in to query Threads_connected.
- There are normally very few queries running, but we have a connection pool (and thus Threads_connected is constant). During the freezes, hundreds of queries show up in SHOW PROCESSLIST. Solution: observe Threads_running from SHOW GLOBAL STATUS and react when it grows too large.
You get the point. Find a simple metric and figure out how to capture it — usually this is possible with a little bit of bash, awk, and grep. You might need to look for something specific in SHOW INNODB STATUS, for example, such as a large number of transactions in LOCK WAIT status.
But sometimes it’s much harder. What if you simply can’t observe the problem internally to MySQL? This does happen, especially when nothing changes except for response time. This was the case in the customer’s system that I discussed in the “predicting performance problems” blog post linked above. Every single metric provided by MySQL itself stayed constant during the mini-freezes. The problem is that you can’t get information on response time from within MySQL.
I ended up writing tools to help with this, of course. I’ll show the results below.
If fluctuations in response time are the problem, then the way to observe it is to measure response time. This requires some care, because you don’t want false positives, and a lot of my ideas were obviously vulnerable to false positives. I could cross them off right away. I can’t trigger on unusually large or small numbers of queries, for example, because those just happen as the workload fluctuates through the day, and random user behavior naturally introduces variations too.
I ended up writing a tool to tail the slow query log file, which I set to zero so it captured all queries with microsecond precision. Once per second, MySQL writes out the current timestamp to the log file, so when I see that marker, I know that a second’s worth of queries has passed by. I aggregate the last second’s worth of queries (count, total time, average time) and print out a line.
This in itself does not provide a good way to know when something unusual is happening, but it gives the foundation for it. I took it slightly further: I kept a sliding window of the last 60 1-second averages, and took the standard deviation of those. If the current second’s average response time deviates significantly from the average response time over the last 60 seconds, then something is wrong. “Significant” is pretty easy to measure with standard deviations, so that’s where the real magic comes in. Let’s see some samples of this.
First, here’s a bit of the slow query log chopped into 1-second segments and aggregated:
Time Total Count Avg 1-Min Avg 1-Min StDev Sigma
100519 18:05:17 0.477078 2084 0.000229 0.000795 0.063251 0.008954
100519 18:05:18 0.264729 1823 0.000145 0.000756 0.061334 0.009960
100519 18:05:19 0.287641 1936 0.000149 0.000720 0.059481 0.009599
100519 18:05:20 0.213181 1619 0.000132 0.000691 0.058050 0.009641
100519 18:05:21 0.276063 1520 0.000182 0.000669 0.056806 0.008587
100519 18:05:22 0.289921 1963 0.000148 0.000642 0.055310 0.008936
100519 18:05:23 0.277754 1882 0.000148 0.000618 0.053983 0.008717
100519 18:05:24 0.337821 1900 0.000178 0.000598 0.052745 0.007963
100519 18:05:25 0.236592 1727 0.000137 0.000579 0.051682 0.008556
100519 18:05:26 0.257150 1488 0.000173 0.000566 0.050821 0.007727
100519 18:05:27 0.303697 1672 0.000182 0.000552 0.049908 0.007412
100519 18:05:28 0.182106 1416 0.000129 0.000539 0.049163 0.008346
100519 18:05:29 0.211202 1631 0.000129 0.000525 0.048347 0.008186
The columns mean the following:
- Time is the timestamp of this second’s stats.
- Total is the total response time within the sample, in seconds.
- Count is how many queries were in that sample.
- Avg is just the mean response time (Total / Count).
- 1-Min Avg is the one-minute moving average of response time.
- 1-Min StDev is the standard deviation of the average response times for each of the previous 60 seconds.
- Sigma is the difference between this second’s average response time and the 1-Min Avg, in standard deviations.
As you can see, most of the time the deviation between this second’s average and the last minute’s average is quite low. But when there’s a meaningful fluctuation in performance, that changes pretty clearly. Here’s a sample with a blip at 18:09:30:
Time Total Count Avg 1-Min Avg 1-Min StDev Sigma
100519 18:09:26 0.187038 1245 0.000150 0.000159 0.006844 0.001343
100519 18:09:27 0.269272 1697 0.000159 0.000160 0.006862 0.000178
100519 18:09:28 0.329386 1901 0.000173 0.000160 0.006895 0.001865
100519 18:09:29 0.350918 2017 0.000174 0.000161 0.006929 0.001881
100519 18:09:30 0.016610 73 0.000228 0.000161 0.006943 0.009537
100519 18:09:31 0.590175 2905 0.000203 0.000162 0.007074 0.005815
100519 18:09:32 0.384193 1879 0.000204 0.000163 0.007133 0.005783
100519 18:09:33 0.345033 2044 0.000169 0.000163 0.007133 0.000815
100519 18:09:34 0.289663 1793 0.000162 0.000163 0.007148 0.000255
That was a fast one! It flew by too quickly to do much about. But it was also not a very large deviation, and could have been a false positive. In any case, I highly doubt that we would have caught anything meaningful by triggering a stats-collection process just then. Let’s keep looking.
Time Total Count Avg 1-Min Avg 1-Min StDev Sigma
100519 18:10:05 0.209619 1578 0.000133 0.000181 0.009542 0.005044
100519 18:10:06 0.279070 1849 0.000151 0.000181 0.009546 0.003167
100519 18:10:07 1.152811 1624 0.000710 0.000189 0.010152 0.051257
100519 18:10:08 0.342763 1450 0.000236 0.000191 0.010188 0.004478
100519 18:10:09 0.200373 1452 0.000138 0.000190 0.010182 0.005155
100519 18:10:10 0.231888 1577 0.000147 0.000191 0.010190 0.004289
100519 18:10:13 0.000459 10 0.000046 0.000191 0.010241 0.014203
100519 18:10:14 0.001999 4 0.000500 0.000189 0.010139 0.030678
100519 18:10:15 0.165705 582 0.000285 0.000189 0.010151 0.009423
100519 18:10:16 7.129640 5104 0.001397 0.000251 0.024463 0.046854
100519 18:10:17 1.140011 2859 0.000399 0.000256 0.024555 0.005817
100519 18:10:18 0.325617 2240 0.000145 0.000255 0.024491 0.004460
100519 18:10:19 0.243101 1538 0.000158 0.000255 0.024510 0.003966
Another relatively short blip but a bit longer. The mean response time really didn’t deviate as much as my client was complaining about — they were showing me New Relic transaction traces with 50-second waits. Maybe I could have caught something here, but I doubt that it’d be enough to separate the signal from the noise. Still, at this point you can clearly see how sensitive this technique is. The deviation in average response varies from a few thousandths of a sigma to a few hundredths. Let’s keep looking for something more dramatic to use as a trigger:
Time Total Count Avg 1-Min Avg 1-Min StDev Sigma
100519 18:10:57 0.352336 2038 0.000173 0.000282 0.026701 0.004092
100519 18:10:58 0.260373 1692 0.000154 0.000283 0.026725 0.004817
100519 18:10:59 1.453306 1834 0.000792 0.000294 0.027073 0.018400
100519 18:11:00 0.264517 1658 0.000160 0.000295 0.027084 0.004989
100519 18:11:01 0.093991 953 0.000099 0.000294 0.027148 0.007203
100519 18:11:02 1.119373 469 0.002387 0.000306 0.027513 0.075629
100519 18:11:03 8.609779 291 0.029587 0.000395 0.038954 0.749383
100519 18:11:04 3.474422 103 0.033732 0.000435 0.040854 0.815026
100519 18:11:05 4.095386 111 0.036895 0.000483 0.043286 0.841211
100519 18:11:06 14.951602 131 0.114134 0.000647 0.065451 1.733932
100519 18:11:07 5.954177 52 0.114503 0.000720 0.068831 1.653074
100519 18:11:08 19.979373 53 0.376969 0.000952 0.096110 3.912385
100519 18:11:09 8.056343 16 0.503521 0.001047 0.100590 4.995285
100519 18:11:10 28.035963 7 4.005138 0.001380 0.138789 28.847777
100519 18:11:11 0.017634 69 0.000256 0.001400 0.139897 0.008182
100519 18:11:12 10.516826 20 0.525841 0.001548 0.145732 3.597660
100519 18:11:13 11.889159 50 0.237783 0.001687 0.151319 1.560253
100519 18:11:14 0.032239 138 0.000234 0.001685 0.151199 0.009599
100519 18:11:15 39.607576 38 1.042305 0.002164 0.204041 5.097713
100519 18:11:16 14.577523 40 0.364438 0.002397 0.215261 1.681870
100519 18:11:17 47.602524 34 1.400074 0.003094 0.278776 5.011118
100519 18:11:18 0.016022 84 0.000191 0.003180 0.282795 0.010570
We totally hit pay dirt here. This period in the log corresponded exactly to one of the visible spikes in New Relic. There were extremely long queries in the log, and throughput dropped to the floor — for an extended time. In the far right-hand column, Sigma is in the double digits. More experience showed me that on this particular client’s workload, anything above 0.3 Sigma is a reliable indicator of a real performance problem. If that condition becomes true, then it’s time to gather diagnostic data for a while. This is resistant to false positives from things like the occasional one-off long-running query.
After building this tool — maybe 30 minutes of work or so — I can see that I could have used other metrics instead. The number of queries per second (throughput) varies, just as response time does. And I probably could go back to the database and start watching Handler_ counters, or similar things like Innodb_rows_read, with the same technique. I wasn’t able to see those things as possibilities because of the overwhelming amount of information to sift through before (and I still don’t really know that they’re going to show spikes and notches the same way, I’m just speculating; they might be really noisy and unreliable). However, focusing on response time is an accurate metric, because response time is what actually matters. Handler counters and rows-read counters are secondary effects that can lie, and there is never anything wrong with focusing on primary sources. Looking at secondary things is far too likely to present you with unreliable information, and you end up on wild goose chases that consume huge amounts of your client’s time.
The tool I wrote for this task is crude, and not formally tested, but it’s a great proof of concept. I think the next step is probably going to be something like revamping mk-loadavg (and probably renaming it!) to be able to capture load metrics and variations in a more flexible and meaningful way.
The end result on this case is at least two problems, by the way (we’re still working on it). One was DNS flakiness. The server was not configured with skip_name_resolve, and when DNS stopped working for a short period, everything stopped working. After clearing that up, many but not all of the spikes in response time went away, and permitted me to see that InnoDB is also having trouble. It is actually quite common for multiple things to be going badly on a server, which makes a disciplined approach all the more important. Trial and error is a disaster in cases like these. Peter and I wrote a brief whitepaper about our approach, by the way. You might find it helpful if you are also facing complex performance problems.
Dirty pages, fast shutdown, and write combining
One of the things that makes a traditional transactional database hard to make highly available is a relatively slow shutdown and start-up time. Applications typically delegate most or all writes to the database, which tends to run with a lot of “dirty” data in its (often large) memory. At shutdown time, the dirty memory needs to be written to disk, so the recovery routine doesn’t have to run at startup. And even upon a clean startup, the database probably has to warm up, which can also take a very long time.
Some databases let the operating system handle most of their memory management needs. This has its own challenges, especially if the operating system’s design doesn’t align exactly with the database’s goals. Other databases take matters into their own hands. InnoDB (the de facto transactional MySQL storage engine) falls into this category; when properly configured to take advantage of modern hardware, it will use basically all of the server’s memory in a huge buffer pool, with files opened in O_DIRECT mode, bypassing the operating system for I/O operations.
The design choices, and the results, are worth thinking about. Assuming you shut down and restart infrequently, the choice to hold a lot of dirty memory has huge performance benefits, which has to be balanced against the desire for fast shutdown and recovery. In InnoDB, there are a few things you can configure that change the startup and shutdown behaviors, but you should understand the performance effects during normal operation.
First, let’s look at why it’s nice to run with lots of dirty data in memory.
Write combining
Most databases have a concept called a page, buffer, or block. This is a physical unit of data, which can typically store many logical units (rows). InnoDB defaults to 16kb pages of data. Imagine that your typical row is only 80 bytes long. A lot of rows can fit into 16kb in most uses.
Suppose you insert, update, or delete a row. Should InnoDB write the result to disk? If it does, it has to write the entire 16kb page, and any other index pages and so forth, which can add up to a lot of pages. That’s a lot of work for a little bitty 80-byte row! InnoDB leaves the pages dirty in its memory. When you commit the transaction, the write-ahead log ensures that if there’s a crash, the change is still permanent. (The log has very compact entries and is not page-oriented.)
Now suppose you make another little change. In many cases, there’s a decent probability that both of the changes touched the same page(s). In fact, if you had the statistics to prove it, you would probably see that the vast majority of your changes focus on a small fraction of the total pages, or even a small fraction of the rows. Most workloads have a very tall head and a very long tail. Tens, hundreds, even thousands of times more changes go to those same pages and rows, as compared to the less-active ones.
Eventually, our favorite “hot page” gets written so a checkpoint can complete. Tons of changes were written in a single write. This is write combining, and it’s a huge efficiency. Huge! Servers can accept many tens of thousands of writes per second, and guarantee ACID properties, because of write combining. If they didn’t combine writes, they’d be asked to do many more I/O operations per second.
Dirty pages and the long tail
The downside to this is the amount of dirty pages in memory, which have to be written out during shutdown. Shutdown is equivalent to a forced checkpoint. The server has been lazily delaying lots of work, because it knows it’s going to be able to combine writes. Suddenly, all the bills come due at once — time to write tons of data to disk! And the problem here is that the server’s memory can actually be mostly dirty data. By default, InnoDB lets the buffer pool get up to 90% dirty before it starts to get worried and work hard to flush pages.
If most writes go to the hottest pages, why should there be so many dirty pages? The answer is the long tail. The few writes that don’t go to the tall head go to a very scattered long tail. Again this is hard to prove, but many of those one-off writes are dirtying entire pages just for themselves, and those pages will not be dirtied by any other writes. So the long tail is full of 16kb pages that had only 80 bytes written to them. This ends up being a lot of pages of data.
Fast shutdown on demand
If you want your database to be able to shut down quickly if needed, what can you do about this? This is a tough question to answer. There are a few different strategies you might take.
- You can configure InnoDB to keep the dirty pages to a minimum. The problem is, it starts to do a lot less write combining. Take an average web application’s database and lower the dirty page percent, and watch the disk activity. It will go through the roof. It starts furiously flushing pages, only to turn around and flush the same pages again an instant later. InnoDB isn’t particularly designed or optimized for this, by the way. Things will suffer. However, this is actually a useful technique for a planned fast shutdown.
- You can lower the page size. If you make the pages smaller, then in theory you’ll do less work flushing those long-tail pages. Be careful with this! There is research (actual math, mind you) indicating that InnoDB’s default page size is already too small, and there isn’t a lot of real-world experience with non-default page sizes. The Tokutek folks know a lot about the math, by the way.
- You can configure InnoDB not to flush dirty pages before a shutdown. This is essentially the same thing as shutting down without a checkpoint, which is the same as crashing. The recovery routine will have to run at startup before the database becomes available. That is likely to be much slower than a clean shutdown, due to the mechanics of crash recovery.
- You might be able to make InnoDB capable of doing a lot more flushing by upgrading to a version that has separate threads for this purpose, and/or using native asynchronous I/O. This might not really help in shutdown; to tell the truth, I haven’t checked it.
No free lunch
InnoDB is a complex system that is trying to balance a lot of different factors for efficiency, while giving nice ACID properties. And it’s actually doing a pretty decent job of it by default. When you say you’d like more or less of such-and-such performance characteristic, then something else gets traded off. This is a really hard problem, and I’m not aware of anyone who has a brilliant solution to it, although I am far from a database research specialist.
Even the question of how much data to write, and how quickly, is a hard one. It’s hard and expensive to really answer accurately because the real answer requires knowledge of things such the frequency and distribution of page dirtying. Therefore, InnoDB kind of avoids this and lets you configure its “I/O capacity” and “dirty page percent” and maybe a few other things, depending on which version you use. These are just models that approximate the true answers to the real questions. All models are wrong. Some models are useful. InnoDB employs useful models that work a lot of the time.
A better way to build Cacti templates
The traditional way to build Cacti templates is through the Cacti web interface. This is an enormous amount of work, and the result is generally not very consistent or good quality. The process is too error-prone. You can export the templates as XML, but they tend to have problems such as version incompatibilities with other Cacti installations, and it’s hard to adapt them for user preferences such as different graph image sizes and polling intervals.
The way I build Cacti templates is exactly the opposite. I create a data structure in a file, which looks like many configuration file syntaxes you’ve probably worked with. It represents the graphs, templates, scripts, and so on. From this, a tool generates the XML template file, which is a universal template definition, and is a breeze to import into Cacti. It is completely consistent and has zero cruft in it. This process prevents errors, and the results are perfect every time. (There’s a test suite, by the way.)
All the tools, documentation, how-tos, examples, and pre-fabricated scripts and templates you need are at the Better Cacti Templates open-source project. If you want to build your own templates, pay special attention to the documentation on creating graphs.





