Archive for the ‘gdb’ tag
I frequently encounter MySQL servers with intermittent problems that don’t happen when I’m watching the server. Gathering good diagnostic data when the problem happens is a must. Aspersa includes two utilities to make this easier.
The first is called ‘stalk’. It would be called ‘watch’, but that’s already the name of a standard Unix utility. It simply watches for a condition to occur and fires off the second utility.
This second utility does most of the work. It is called ‘collect’, and by default it gathers statistics on a number of things for 30 seconds. It names the resulting files after the time it was started and places them into a directory for later analysis.
Here’s a sample of how to use the tools. In summary: get them and make them executable, then configure them; then start a screen session and run the ‘stalk’ utility as root. Go do something else and come back later to check! A code sample follows.
$ wget http://aspersa.googlecode.com/svn/trunk/stalk
$ wget http://aspersa.googlecode.com/svn/trunk/collect
$ chmod +x stalk collect
$ mkdir -p ~/bin
$ mv stalk collect ~/bin
$ vim ~/bin/stalk   # Configure it
$ screen -S stalking.the.server
$ sudo ~/bin/stalk
Inside the ‘stalk’ tool, you’ll see a few things you can configure. By default, it connects to mysqld via mysqladmin and checks how many threads are connected to the server. If that count rises above 100 (a sample number you should almost certainly change), or if it can’t connect to mysqld at all, it fires off the ‘collect’ tool, or whatever else you configure it to execute.
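To make the idea concrete, here is a rough sketch of what such a watcher does. This is an illustration only, not the actual contents of the ‘stalk’ script; the threshold, the polling interval, and the path to ‘collect’ are assumptions you would adjust:

#!/bin/sh
# Sketch of a stalk-style watcher: poll the connected-thread count and
# trigger a collector when it exceeds a threshold or mysqld is unreachable.
THRESHOLD=100                 # sample value; tune it for your server
COLLECT="$HOME/bin/collect"   # assumed location of the collect script

while true; do
    # 'mysqladmin status' prints "Uptime: ... Threads: N ..."; field 4 is N.
    THREADS=$(mysqladmin status 2>/dev/null | awk '{print $4}')
    if [ -z "$THREADS" ] || [ "$THREADS" -gt "$THRESHOLD" ]; then
        "$COLLECT"            # gather diagnostics, then keep watching
    fi
    sleep 30
done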
The ‘collect’ tool, by default, captures a variety of things, including disk usage, CPU usage, internal status from mysqld, and even oprofile data (which it saves using the standard oprofile save feature; you must use opreport to get your report later). There is also a commented-out section to run GDB if you want stack traces. This is not enabled by default, because attaching GDB freezes mysqld briefly. Usually that is OK, since mysqld is already unresponsive during the problem anyway!
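If you do enable it, the underlying idea is simple: attach GDB in batch mode and dump every thread’s stack to a file. A minimal standalone version looks something like the following; this is a sketch, not the exact commands in ‘collect’, and it assumes mysqld is running locally and you have permission to attach to it:

# Dump a backtrace of every mysqld thread to a timestamped file.
# Note: attaching pauses the process for a moment.
gdb --batch \
    -ex "set pagination 0" \
    -ex "thread apply all bt" \
    -p "$(pidof mysqld)" \
    > "/tmp/mysqld-stacks-$(date +%F-%H-%M-%S).txt" 2>&1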
I’ve been noticing an undeniable trend in my consulting engagements in the last year or so, and when I vocalized this today, heads nodded all around me. Everyone sees a growth in the number of cases where otherwise well-optimized systems are artificially limited by InnoDB contention problems.
A year ago, I simply wasn’t seeing the need for analysis of GDB backtraces en masse. These days, I’m writing custom tools to gather and analyze backtraces. A year ago, I simply looked at the SEMAPHORES section of SHOW INNODB STATUS. These days I’m writing custom tools to aggregate and reformat that data so I can interpret it more easily. And I’m actually seeing cases of this type of problem multiple times every week.

I remember the first time I ran into a server that was literally optimized to the limit, but struggling under the load. It was something new for me, not that long ago. Oh, I’d seen it before, plenty, but I was always able to point out where something could be improved without changing InnoDB itself. Now it’s commonplace: schemas are fine — check. Queries are all well-indexed — check. Everything else — check. InnoDB is bottlenecked and absolutely nothing can be improved — check.
Part of the difference is the rapidly improving hardware. It’s getting hard to buy a server with fewer than 8 or even 16 cores, and 16GB of RAM feels like something I’d install in a wristwatch. But I also suspect that if I’d been characterizing the workload of servers over time in a way that was easy to compare, I’d see a clear trend towards bigger data and more queries per second. We’re just pushing MySQL + InnoDB harder today than we ever have before.
What can be done? Well, InnoDB needs to be improved, that’s all. Oracle, Percona, Google, Facebook and others are working on it, and in many cases these efforts have yielded dramatic results. But there is still much room for improvement.
Note: the bt-aggregate tool has been deprecated and replaced by the pmp tool, which can do all that and more.
A short time ago in a galaxy nearby, Domas Mituzas wrote about contention profiling with GDB stack traces. Mark Callaghan found the technique useful, and contributed an awk script (in the comments) to aggregate stack traces and identify which things are blocking most threads. I’ve used it myself a time or five. But I’ve found myself wanting it to be fancier, for various reasons. So I wrote a little utility that can aggregate and pretty-print backtraces. It can handle unresolved symbols, and aggregate by only the first N lines of the stack trace. Here’s an example of a mysqld instance that’s really, really frozen up:
bt-aggregate -4 samples/backtrace.txt | head -n12

2396 threads with the following stack trace:
#0 0x00000035e7c0a4b6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000005f2bd8 in open_table ()
#2 0x00000000005f3fb4 in open_tables ()
#3 0x00000000005f4247 in open_and_lock_tables_derived ()

4 threads with the following stack trace:
#0 0x00000035e7c0a4b6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000780099 in os_event_wait_low ()
#2 0x000000000077de42 in os_aio_simulated_handle ()
#3 0x000000000074a261 in fil_aio_wait ()
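For comparison, the heart of the technique fits in a few lines of awk. The sketch below is a simplified illustration in the spirit of Mark Callaghan’s script, not his exact code: it collapses each thread’s backtrace into a comma-separated list of function names, then counts how many threads share each stack.

# Group GDB backtraces by function names and count identical stacks.
awk '
  /^Thread/ { if (stack != "") print stack; stack = "" }
  /^#/      { stack = stack (stack == "" ? "" : ",") $4 }
  END       { if (stack != "") print stack }
' backtrace.txt | sort | uniq -c | sort -rn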