Archive for October, 2009
A review of The Art of Capacity Planning by John Allspaw
The Art of Capacity Planning. By John Allspaw, O’Reilly 2008. Page count: 130 pages. (Here’s a link to the publisher’s site.)
This is an outstanding book. As far as I know Ewen Fortune was the first Perconian to read it, and it’s been spreading amongst us since then. I got my copy last week, and read it last night when I couldn’t sleep for some reason. It took me about 90 minutes to read.
This book doesn’t teach in generalities — it shows you exactly what to do. Rather than outlining the process of capacity planning (and it is a process!) and then letting you figure out how to apply it, the book shows you the process and then walks you through it several times with real examples.
The book is also intensely practical, focusing on what makes the application and the business successful. It doesn’t get any more straightforward than this: “You don’t want to be caught unprepared when growth takes place… Conversely, the company financial officers will not hold you in high regard either when you’ve purchased a lot of equipment that lay idle, only to see its price drop a few months later.”
There are several discussions of special cases, such as when database and web server reside on the same hardware. These side trips serve two purposes: they help you see how to apply the process of capacity planning in more complex situations, and they cement the importance of the process even in the straightforward cases, so you learn it better.
Let me summarize the process:
- Define your goals, so you can measure whether performance is acceptable.
- Measure and graph everything, especially metrics that show whether you’re meeting the goal.
- Inspect and correlate historical metrics to learn the limiting factor (I/O, CPU, network bandwidth, etc).
- Use real-world load (not lab tests) to discover the ceiling of that factor. Measure this by increasing the load and observing when performance stops meeting the goal.
- Use curve-fitting on historical metrics to derive an equation that describes your growth.
- Project the curve into the future and find out how long it’s going to be until you hit the capacity ceiling, and therefore when performance will become unacceptable.
- Given the knowledge of how long it’ll take you to deploy more capacity, work backwards and see when you need to start the procurement process.
This isn’t the whole story. For example, some things aren’t about performance, they’re about literal capacity, such as available disk storage. But I’m summarizing. The point is to figure out what resource limits you, and predict when you’re going to run out of it. This is so much simpler than I’ve seen this done before. Queuing theory impresses me too, but I think this is much more practical, and likely to be more accurate in my opinion.
The book ends with a chapter on deployment, and a few useful appendixes. I thought the chapter on deployment was a little less useful than the rest of the book, because it’s not specific and actionable enough. However, there’s still a lot to learn from it.
I highly recommend this book. Everyone on the operations team should probably have their own copy.
Version 1.1.3 of improved Cacti graphs for MySQL released
I’ve just released version 1.1.3 of the Cacti templates I wrote for MySQL. This is a bug-fix release only, and affects only ss_get_mysql_stats.php. To upgrade from the previous release, upgrade ss_get_mysql_stats.php. Don’t forget to save and restore your configuration options, if any. (Note that there is a feature to help with this: you can keep configuration options in ss_get_mysql_stats.php.cnf to avoid making them in ss_get_mysql_stats.php.)
Next up: actual template changes! More graphs!
The changelog follows.
2009-10-24: version 1.1.3
* This is a bug-fix release only, and contains no template changes.
* To upgrade from the previous release, upgrade ss_get_mysql_stats.php.
* MySQL 5.1 broke backwards compatibility with table_cache (issue 63).
* Added a version number to the script (partial fix for issue 79).
* Added a test suite (issue 76, issue 59).
* Math operations were done in MySQL instead of PHP (issue 25).
* SHOW STATUS values didn't override SHOW INNODB STATUS parsing (issue 24).
* Long error messages were not appearing in the Cacti log.
* SHOW INNODB STATUS parsing for unpurged_txns was broken.
* SHOW INNODB STATUS parsing for innodb_lock_structs was broken.
* SHOW INNODB STATUS parsing for pending_log_flushes was broken (issue 62).
* SHOW INNODB STATUS parsing for pending_buf_pool_flushes was broken.
* SHOW INNODB STATUS parsing for pending_ibuf_aio_reads was broken.
* SHOW INNODB STATUS parsing for pending_aio_log_ios was broken.
* SHOW INNODB STATUS parsing for pending_aio_sync_ios was broken.
* Made SHOW INNODB STATUS parsing less sensitive to false positive matches.
How to capture debugging information with mk-loadavg
Maatkit’s mk-loadavg tool is a helpful way to gather information about infrequent conditions on your database server (or any other server, really). We wrote it at Percona to help with those repeated cases of things like “every two weeks, my database stops processing queries for 30 seconds, but it’s not locked up and during this time there is nothing happening.” That’s pretty much impossible to catch in action, and these conditions can take months to resolve without the aid of good tools.
In this blog post I’ll illustrate a very simple usage of mk-loadavg to help in solving a much smaller problem: find out what is happening on the database server during periods of CPU spikes that happen every so often.
First, set everything up.
- Start a screen session:
screen -S loadmonitoring. If you don’t have screen, you can run mk-loadavg as a daemon, but it’s much better to use screen in my opinion. - Get
mk-loadavg. For purposes of this blog post, I’m going to get the latest trunk code, because I know a bug or two has been fixed since the last release.wget http://www.maatkit.org/trunk/mk-loadavg - Create a directory to hold the collected information in files.
mkdir collected
Next let’s set up a script that mk-loadavg can use to gather some information when it detects a high CPU condition. Save the contents of this script as “collect-stats.sh”. The script will gather about 30 seconds worth of statistics. It uses a simple sentinel file /tmp/gatherinfo to prevent multiple occurrences from gathering statistics at the same time. (This is intentionally simple for demo purposes.)
#!/bin/bash
if [ -f /tmp/gatherinfo ]; then exit 0; fi
touch /tmp/gatherinfo
d=`date +%F-%T`
echo "gathering info for $d"
ps -eaf >> collected/$d-ps 2>&1 &
top -bn1 > collected/$d-top 2>&1 &
mysql -e 'show innodb status\G show full processlist\G' >> collected/$d-innodbstatus 2>&1 &
vmstat 1 30 >collected/$d-vmstat 2>&1 &
iostat -dx 1 30 >collected/$d-iostat 2>&1 &
mysqladmin ext -i1 -c30 > collected/$d-mysqladmin 2>&1 &
sleep 30
ps -eaf >> collected/$d-ps 2>&1 &
mysql -e 'show innodb status\G show full processlist\G' >> collected/$d-innodbstatus 2>&1 &
rm /tmp/gatherinfo
Now make the script executable: chmod +x collect-stats.sh. At this point we’re ready to start working. Let’s fire the stats-collection script when the system’s user CPU goes above 40%.
perl mk-loadavg --watch "Server:vmstat:us:>:40" --interval 1 --execute collect-stats.sh
If the CPU goes over 40%, you'll get a bunch of files in the collected directory, with helpful information to diagnose the problem. This example usage is pretty similar to a real-life one I set up recently. It enabled me to take a methodical approach to the problem:
- From the
topoutput I was able to identify that MySQL was causing the spike. - I then looked at the
SHOW STATUSoutput to see what the database server was doing, using mext as a helper. - From
Select_full_scanandHandler_read_rnd_nextI isolated table scans as a problem. - From the saved
SHOW PROCESSLISTI found problem queries and optimized them.
You would be right if you said there are much better tools for finding problem queries -- but remember two things: 1) sometimes clients ask for the lightweight, short-term solution that can be set up in about 5 minutes and checked the next day; and 2) when it is unclear that queries are the problem, setting up only a query monitor is stabbing in the dark and will not get results.
In addition to watching vmstat to measure system CPU usage, mk-loadavg can watch many other things, such as the MySQL server's SHOW PROCESSLIST, parsing values from SHOW INNODB STATUS, and so on.






