Archive for the ‘Tools’ Category
How Linux iostat computes its results
iostat is one of the most important tools for measuring disk performance, which of course is very relevant for database administrators, whether your chosen database is Postgres, MySQL, Oracle, or anything else that runs on GNU/Linux. Have you ever wondered where statistics like await (average wait for the request to complete) come from? If you look at the disk statistics the Linux kernel makes available through files such as /proc/diskstats, you won’t see await there. How does iostat compute await? For that matter, how does it compute the average queue size, service time, and utilization? This blog post will show you how that’s computed.
First, let’s look at the fields in /proc/diskstats. The order and location varies between kernels, but the following applies to 2.6 kernels. For reads and writes, the file contains the number of operations, number of operations merged because they were adjacent, number of sectors, and number of milliseconds spent. Those are available separately for reads and writes, although iostat groups them together in some cases. Additionally, you can find the number of operations in progress, total number of milliseconds during which I/Os were in progress, and the weighted number of milliseconds spent doing I/Os. Those are not available separately for reads and writes.
The last one is very important. The field showing the number of operations in progress is transient — it shows you the instantaneous value, and this “memoryless” property means you can’t use it to infer the number of I/O operations that are in progress on average. But the last field has memory, because it is defined as follows:
Field 11 — weighted # of milliseconds spent doing I/Os This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure of both I/O completion time and the backlog that may be accumulating.
So the field indicates the total number of milliseconds that all requests have been in progress. If two requests have been waiting 100ms, then 200ms is added to the field. And thus it records what happened over the duration of the sampling interval, not just what’s happening at the instant you look at the file. We’ll come back to that later.
Now, given two samples of I/O statistics and the time elapsed between them, we can easily compute everything iostat outputs in -dx mode. I’ll take them slightly out of order to reflect how the computations are done internally.
- rrqm/s is merely the incremental merges divided by the number of seconds elapsed.
- wrqm/s is similarly simple, and r/s, w/s, rsec/s, and wsec/s are too.
- avgrq-sz is the number of sectors divided by the number of I/O operations.
- avgqu-sz is computed from the last field in the file — the one that has “memory” — divided by the milliseconds elapsed. Hence the units cancel out and you just get the average number of operations in progress during the time period. The name (short for “average queue size”) is a little bit ambiguous. This value doesn’t show how many operations were queued but not yet being serviced — it shows how many were either in the queue waiting, or being serviced. The exact wording of the kernel documentation is “…as requests are given to appropriate struct request_queue and decremented as they finish.”
- %util is the total time spent doing I/Os, divided by the sampling interval. This tells you how much of the time the device was busy, but it doesn’t really tell you whether it’s reaching its limit of throughput, because the device could be backed by many disks and hence capable of multiple I/O operations simultaneously.
- await is the total time for all I/O operations summed, divided by the number of I/O operations completed.
- svctm is the most complex to derive. It is the utilization divided by the throughput. You saw utilization above; the throughput is the number of I/O operations in the time interval.
Although the computations and their results seem both simple and cryptic, it turns out that you can derive a lot of information from the relationship between these various numbers. This is one of those tools where a few lines of code have a surprising amount of meaning, which is left for the reader to understand. I’ll get more into that in the future.
How to capture debugging information with mk-loadavg
Maatkit’s mk-loadavg tool is a helpful way to gather information about infrequent conditions on your database server (or any other server, really). We wrote it at Percona to help with those repeated cases of things like “every two weeks, my database stops processing queries for 30 seconds, but it’s not locked up and during this time there is nothing happening.” That’s pretty much impossible to catch in action, and these conditions can take months to resolve without the aid of good tools.
In this blog post I’ll illustrate a very simple usage of mk-loadavg to help in solving a much smaller problem: find out what is happening on the database server during periods of CPU spikes that happen every so often.
First, set everything up.
- Start a screen session:
screen -S loadmonitoring. If you don’t have screen, you can run mk-loadavg as a daemon, but it’s much better to use screen in my opinion. - Get
mk-loadavg. For purposes of this blog post, I’m going to get the latest trunk code, because I know a bug or two has been fixed since the last release.wget http://www.maatkit.org/trunk/mk-loadavg - Create a directory to hold the collected information in files.
mkdir collected
Next let’s set up a script that mk-loadavg can use to gather some information when it detects a high CPU condition. Save the contents of this script as “collect-stats.sh”. The script will gather about 30 seconds worth of statistics. It uses a simple sentinel file /tmp/gatherinfo to prevent multiple occurrences from gathering statistics at the same time. (This is intentionally simple for demo purposes.)
#!/bin/bash
if [ -f /tmp/gatherinfo ]; then exit 0; fi
touch /tmp/gatherinfo
d=`date +%F-%T`
echo "gathering info for $d"
ps -eaf >> collected/$d-ps 2>&1 &
top -bn1 > collected/$d-top 2>&1 &
mysql -e 'show innodb status\G show full processlist\G' >> collected/$d-innodbstatus 2>&1 &
vmstat 1 30 >collected/$d-vmstat 2>&1 &
iostat -dx 1 30 >collected/$d-iostat 2>&1 &
mysqladmin ext -i1 -c30 > collected/$d-mysqladmin 2>&1 &
sleep 30
ps -eaf >> collected/$d-ps 2>&1 &
mysql -e 'show innodb status\G show full processlist\G' >> collected/$d-innodbstatus 2>&1 &
rm /tmp/gatherinfo
Now make the script executable: chmod +x collect-stats.sh. At this point we’re ready to start working. Let’s fire the stats-collection script when the system’s user CPU goes above 40%.
perl mk-loadavg --watch "Server:vmstat:us:>:40" --interval 1 --execute collect-stats.sh
If the CPU goes over 40%, you'll get a bunch of files in the collected directory, with helpful information to diagnose the problem. This example usage is pretty similar to a real-life one I set up recently. It enabled me to take a methodical approach to the problem:
- From the
topoutput I was able to identify that MySQL was causing the spike. - I then looked at the
SHOW STATUSoutput to see what the database server was doing, using mext as a helper. - From
Select_full_scanandHandler_read_rnd_nextI isolated table scans as a problem. - From the saved
SHOW PROCESSLISTI found problem queries and optimized them.
You would be right if you said there are much better tools for finding problem queries -- but remember two things: 1) sometimes clients ask for the lightweight, short-term solution that can be set up in about 5 minutes and checked the next day; and 2) when it is unclear that queries are the problem, setting up only a query monitor is stabbing in the dark and will not get results.
In addition to watching vmstat to measure system CPU usage, mk-loadavg can watch many other things, such as the MySQL server's SHOW PROCESSLIST, parsing values from SHOW INNODB STATUS, and so on.
A script snippet to relative-ize numbers embedded in text
A lot of times I’m looking at several time-series samples of numbers embedded in free-form text, and I want to know how the numbers change over time. For example, two samples of SHOW INNODB STATUS piped through grep wait might contain the following:
Mutex spin waits 0, rounds 143359179688, OS waits 634106844 RW-shared spins 1224152309, OS waits 38278807; RW-excl spins 2432166425, OS waits 35264871 Mutex spin waits 0, rounds 143386303439, OS waits 634292093 RW-shared spins 1224197048, OS waits 38281423; RW-excl spins 2432347936, OS waits 35271423
How much have the numbers changed in the second sample? My head is too lazy to do that math. So Daniel Nichter and I whipped up Yet Another Snippet to self-discover patterns of text and numbers, and compare each line against the previous line that matches the same pattern. Let’s fetch it:
wget http://maatkit.googlecode.com/svn/trunk/util/rel
Now give it the above input, and it’ll print out something useful (emphasis mine):
Mutex spin waits 0, rounds 143359179688, OS waits 634106844 RW-shared spins 1224152309, OS waits 38278807; RW-excl spins 2432166425, OS waits 35264871 Mutex spin waits 0, rounds 27123751, OS waits 185249 RW-shared spins 44739, OS waits 2616; RW-excl spins 181511, OS waits 6552
My lazy brain likes that much better.
