Archive for the ‘Method R’ tag
New in mk-query-digest: variance-to-mean ratio
This isn’t actually new — it has been out for a few releases. The mk-query-digest tool from Maatkit now outputs information about each class of queries’ variance-to-mean ratio. The new output goes in a couple of places, including perhaps most usefully the “profile” report. Here’s an example from a real MySQL system:
# Profile # Rank Query ID Response time Calls R/Call Apdx V/M Item # ==== ================== ================ ===== ====== ==== ===== ======= # 1 0xBFCF8E3F293F6466 11256.3618 68.1% 78069 0.1442 1.00 0.21 SELECT [redacted] # 2 0x620B8CAB2B1C76EC 2029.4730 12.3% 14415 0.1408 1.00 0.21 SELECT [redacted] # 3 0xB90978440CC11CC7 1345.3445 8.1% 3520 0.3822 1.00 0.00 SHOW STATUS # 4 0xCB73D6B5B031B4CF 1341.6432 8.1% 3509 0.3823 1.00 0.00 SHOW STATUS # MISC 0xMISC 560.7556 3.4% 23930 0.0234 NS 0.0 <17 ITEMS>
The variance-to-mean ratio is placed in the V/M column. It is the ratio of the query response time’s variance to the mean, for that class of queries. It also appears in the detailed output for the queries in the rest of the report.
What is this useful for? It is a dimensionless number that shows how variable a query’s response time is. The dimensionless number is better than a number such as the standard deviation of response time, because it places fast and slow queries on equal footing; when looking at standard deviation, you really need to compare it to typical execution time to see if there’s a problem. (A fast query that varies by a tenth of a second is highly variable. A query that usually runs hours and varies only by a tenth of a second is unbelievably consistent.)
A query with a highly variable response time is interesting not only because it is providing unpredictable performance, but because it often means that the query is either a perpetrator or victim of bad interactions with other queries, and possibly that it accesses a larger working set of data than fits in the server’s caches, so it makes unpredictable random disk accesses. That’s a fancy way of saying that this query might have a high potential for improvement.
To see what I mean, let’s look at the detailed report for one of the queries whose V/M ratio was 0.21:
# Query 1: 24.28 QPS, 3.50x concurrency, ID 0xBFCF8E3F293F6466 at byte 5590079 # This item is included in the report because it matches --limit. # Scores: Apdex = 1.00 [1.0], V/M = 0.21 # Query_time sparkline: | _^_.^_ | # Time range: 2008-09-13 21:51:55 to 22:45:30 # Attribute pct total min max avg 95% stddev median # ============ === ======= ======= ======= ======= ======= ======= ======= # Count 63 78069 # Exec time 68 11256s 37us 1s 144ms 501ms 175ms 68ms # Lock time 85 134s 0 650ms 2ms 176us 20ms 57us # Rows sent 8 70.18k 0 1 0.92 0.99 0.27 0.99 # Rows examine 8 70.84k 0 3 0.93 0.99 0.28 0.99 # Query size 84 10.43M 135 141 140.13 136.99 0.10 136.99 # Query_time distribution # 1us # 10us # # 100us #################################################### # 1ms ### # 10ms ################ # 100ms ################################################################ # 1s # # 10s+ SELECT ... FROM ... WHERE (col1 = 87041469) AND (col2 = 1138714082) LIMIT 1\G
You can see from the Query_time distribution that this query often executes in the hundreds of microseconds, but also frequently in the hundreds of milliseconds. I redacted some details to protect client data, but this query is a primary-key lookup on an extremely large table. I’ll hazard a guess here: when the data is in memory, it runs in hundreds of microseconds; and when it has to hid the disk, it takes tens to hundreds of milliseconds.
More of the math and theory behind this useful metric of query response time variability is available from Robyn Sands’ article via Method R corporation. Thanks to Cary Millsap, who directed my attention to the V/M ratio in the first place.
Response-time optimization in systems that are queued
The best overall method of performance optimization is optimization for response time. Users care about response time, not load average or cache hit ratios. The job of a system is to accept some request and do the required work, and deliver a result. The time elapsed between the request and the result is the response time.
Methods of Response Time Optimization
Not all optimization methods are created equal. Here are a few I see commonly.
- No method. Most people simply have no method of performance optimization at all. They just look for things that look “bad” and try to make them look “better.” In the MySQL database world, the classic example is trying to improve a cache hit ratio. This is utter folly, and doesn’t become any less stupid no matter how many times it is taught and repeated.
- Server Load Reduction. One step up from that is to try to understand what work the database is performing and discover what part of that work consumes the most response time, then improve that to lower the load on the database server. This is better, but still not a logical way to optimize response time for the end user. Imagine that you’ve built your scaling strategy around archiving and purging unnecessary data — a very sensible strategy. Most well-designed archiving and purging jobs are terribly inefficient, for a reason: they are designed to consume resources that are not needed by anything else, so they don’t interfere with anything else. Archiving a billion rows from a table is best done in nibbles, not in billion-row chunks. The nibbles are going to be slow. If you measure the entire system and find out where the response time goes, you’re almost guaranteed to find these jobs are a top offender. And yet they don’t matter at all, because they have no impact on the user’s response time. Server load reduction is a shotgun approach that sometimes yields results, because it’s easy to aim a shotgun.
- Method R, or Goal-Driven Performance Optimization. Two methods I know of that are based in sound thinking are Cary Millsap’s Method R and Peter Zaitsev’s Goal-Driven Performance Optimization. Cary wrote an excellent book about his method, and I recommend buying that book and reading it at least twice. These methods are guaranteed to truly optimize the system in question: they will produce the best possible performance improvements with the least possible cost, and they have a termination condition that is satisfied when further improvements are either not possible or cost more than they are worth. A system that has been subjected to one of these methods can be confidently called “fully optimized.”
Response time in queued systems
Method R looks at where a system consumes time and sorts the biggest consumers to the top in a profile, then works on those first. Amdahl’s Law guarantees that this is the best way to improve the system’s performance.
Although the approach is correct, it doesn’t mean it’s easy. It might be easy if a system is stable. But unstable systems, those suffering from queueing delay (usually characterized by response time with a high standard deviation, a.k.a. “spikes” or “blips”), are much harder to optimize. The queries that are performing badly can no longer be assumed to be the source of the performance problem. Instead, they might be “good” queries that are the victim of something else happening.
Unstable systems suffer from a) dependencies between requests, and b) statistical variations in request arrival time, which causes queueing. The classic case is lock contention. Suppose someone goes to your OLTP database and runs an ad-hoc query against the table of invoice line items, and locks the table. Normally that table has specific, fast, well-indexed queries against it, but as soon as the ad-hoc query locks it, the queries instantly pile up and “look bad.” The system becomes an unstable train wreck. Alas, database servers such as MySQL typically don’t give the DBA enough information to blame the problem on a source.
Internal contention inside the database software itself is another potential cause of queueing. I can no longer remember the number of times I’ve disabled the query cache in MySQL and solved a system’s random freezes. Typically, the only way to prove that the query cache mutex is the source of the problem is to take a backtrace in GDB. A complex piece of software without good instrumentation is pretty difficult to troubleshoot in conditions like this.
And that brings me back to Method R. I can see that the queries are suffering from unstable performance, but I cannot see directly how to optimize the system because it is un-instrumented. Unfortunately, falling back to system load optimization is often the best that can be done, in terms of maximum optimization with minimal cost. An expert can do this with as much rigor as possible, and hopefully with good knowledge of the system’s internals can find the source of the problem quickly, but it’s still a much harder problem.
And this is why it is a crime to write un-instrumented software: because when an un-instrumented system starts to get overloaded, it is very hard to determine the source of performance problems.
Recap of Enterprise LAMP Summit and Camp
Last week I attended the Enterprise LAMP Summit and Camp in Nashville, Tennessee. I enjoyed the event and met or reconnected with a lot of great people. I was glad to be able to spend time with some folks from the Postgres community. My own sessions focused on MySQL.
During the Summit I tried to help people understand how to think about performance, and made the case that the Percona versions of the MySQL server are not only the highest-performance available, but uniquely provide the instrumentation necessary to follow a disciplined performance optimization process such as Method R or Goal-Driven Performance Optimization.
At the Camp the next day, there were several sessions on MySQL. My talk was later in the day, so I elected to skip slides and design a talk by taking questions from the audience, then answering them. I thought the attendees had heard enough generic “advice in a vacuum” kind of content by that point in the day. Again I tried to focus on understanding performance and taking a methodical approach.
During the Summit I counted about 65 people at one point, so I suspect there were really about 100 people really in attendance during the day. I think there were slightly more at the Camp. It was a good networking event, and I not only made some good connections, I found opportunities to connect some mutual friends too. The speakers were great quality by and large. There was little to no marketing or sales content, which was welcome. Overall I thought the event was very well done, with only slight glitches that you’d expect at a first-time event. I hope there is a repeat next year and that I am invited to speak again!


