Xaprb

Stay curious!

Response-time optimization in systems that are queued

with 6 comments

The best overall method of performance optimization is optimization for response time. Users care about response time, not load average or cache hit ratios. The job of a system is to accept some request and do the required work, and deliver a result. The time elapsed between the request and the result is the response time.

Methods of Response Time Optimization

Not all optimization methods are created equal. Here are a few I see commonly.

  • No method. Most people simply have no method of performance optimization at all. They just look for things that look “bad” and try to make them look “better.” In the MySQL database world, the classic example is trying to improve a cache hit ratio. This is utter folly, and doesn’t become any less stupid no matter how many times it is taught and repeated.
  • Server Load Reduction. One step up from that is to try to understand what work the database is performing and discover what part of that work consumes the most response time, then improve that to lower the load on the database server. This is better, but still not a logical way to optimize response time for the end user. Imagine that you’ve built your scaling strategy around archiving and purging unnecessary data — a very sensible strategy. Most well-designed archiving and purging jobs are terribly inefficient, for a reason: they are designed to consume resources that are not needed by anything else, so they don’t interfere with anything else. Archiving a billion rows from a table is best done in nibbles, not in billion-row chunks. The nibbles are going to be slow. If you measure the entire system and find out where the response time goes, you’re almost guaranteed to find these jobs are a top offender. And yet they don’t matter at all, because they have no impact on the user’s response time. Server load reduction is a shotgun approach that sometimes yields results, because it’s easy to aim a shotgun.
  • Method R, or Goal-Driven Performance Optimization. Two methods I know of that are based in sound thinking are Cary Millsap’s Method R and Peter Zaitsev’s Goal-Driven Performance Optimization. Cary wrote an excellent book about his method, and I recommend buying that book and reading it at least twice. These methods are guaranteed to truly optimize the system in question: they will produce the best possible performance improvements with the least possible cost, and they have a termination condition that is satisfied when further improvements are either not possible or cost more than they are worth. A system that has been subjected to one of these methods can be confidently called “fully optimized.”

Response time in queued systems

Method R looks at where a system consumes time and sorts the biggest consumers to the top in a profile, then works on those first. Amdahl’s Law guarantees that this is the best way to improve the system’s performance.

Although the approach is correct, it doesn’t mean it’s easy. It might be easy if a system is stable. But unstable systems, those suffering from queueing delay (usually characterized by response time with a high standard deviation, a.k.a. “spikes” or “blips”), are much harder to optimize. The queries that are performing badly can no longer be assumed to be the source of the performance problem. Instead, they might be “good” queries that are the victim of something else happening.

Unstable systems suffer from a) dependencies between requests, and b) statistical variations in request arrival time, which causes queueing. The classic case is lock contention. Suppose someone goes to your OLTP database and runs an ad-hoc query against the table of invoice line items, and locks the table. Normally that table has specific, fast, well-indexed queries against it, but as soon as the ad-hoc query locks it, the queries instantly pile up and “look bad.” The system becomes an unstable train wreck. Alas, database servers such as MySQL typically don’t give the DBA enough information to blame the problem on a source.

Internal contention inside the database software itself is another potential cause of queueing. I can no longer remember the number of times I’ve disabled the query cache in MySQL and solved a system’s random freezes. Typically, the only way to prove that the query cache mutex is the source of the problem is to take a backtrace in GDB. A complex piece of software without good instrumentation is pretty difficult to troubleshoot in conditions like this.

And that brings me back to Method R. I can see that the queries are suffering from unstable performance, but I cannot see directly how to optimize the system because it is un-instrumented. Unfortunately, falling back to system load optimization is often the best that can be done, in terms of maximum optimization with minimal cost. An expert can do this with as much rigor as possible, and hopefully with good knowledge of the system’s internals can find the source of the problem quickly, but it’s still a much harder problem.

And this is why it is a crime to write un-instrumented software: because when an un-instrumented system starts to get overloaded, it is very hard to determine the source of performance problems.

Written by Xaprb

December 9th, 2009 at 11:28 pm

6 Responses to 'Response-time optimization in systems that are queued'

Subscribe to comments with RSS

  1. I’ve recently had a discussion with someone on some technologies comparison, including MySQL. We eventually came to the conclusion that if it runs fast, then who cares about how it does so.

    Having said that, I wish to argue (and i have not yet read Optimizing Oracle Performance – though on my bookshelf). There is clear relationship between some metrics and performance.
    Say I was to review a database which performs well. But I notice full scans are at about 50% of all SELECTs. I argue that this is not so good. Perhaps the DB performs well right now, but will perform poorly when, say, number of users doubles, or traffic increases 10 times more.
    So some issues do not affect performance on the small scale, but can have great impact on large scale; which is where metrics do come handy.

    And, again, perhaps you’re only referring to some smaller part of the whole picture.

    Shlomi Noach

    11 Dec 09 at 8:03 am

  2. Shlomi:

    I think that, besides focusing on response time, another issue raised by Method R is that one has:

    - proper collected diagnostic info
    - scientific projections to support estimations

    So in your example, that would translate into questions like:

    - What projections made you arrive at the conclusion that full scans being present in about 50% of all SELECTS is a bad metric that will cause performance problems in the future?
    - How did you arrive at the number ‘users double’ or ‘traffic increases 10 times more’

    If these are hunches, I think we should be growing beyond those by now.

    For instance, percentages can be deceiving. Perhaps even 80% of SELECTS in a server are full scans, but what if they only account for 10% of all active queries, and more, they are just batch jobs? In that case it may be fine that they aren’t optimized in the traditional sense.

    Does that make more sense to you?

    Fernando Ipar

    3 Jan 10 at 1:07 pm

  3. Hi Baron,
    I’m rather enjoying reading through the first few chapters of the book, though I suspect I won’t be reading it all due to its Oracle nature, which I’m unfamiliar with.
    Yes, your point makes sense, and I agree that user response times are the most important factor, etc.
    But consider: you’re auditing a MySQL server at the request of a customer. He isn’t having trouble with the server, but he hired you to do a quick review.
    You recognize 50% of the queries are full scan. There are no performance problems! Wouldn’t you take a look at the EXPLAIN plan to see *why* so many queries are slow?
    And assuming you did take a look, to find out some JOINs are incorrect (using join buffer, etc.); would you not recommend adding an index?
    Would it be wrong to *assume* that as workload grows, these queries will become a real issue?
    That was the thing I was trying to point out.

    I don’t see the above as ‘hunches’. I see them as proper programming methods. If queries are built correctly in the first place (again, just for the sake of the example), your app+db will scale better than they would if queries were badly planned. I think this is common sense.

    I’d like to hear your thoughts

    Regards

    Shlomi Noach

    3 Jan 10 at 1:57 pm

  4. I’m very tired with a newborn; Fernando, the above is directed at you, of course.

    Shlomi Noach

    3 Jan 10 at 2:25 pm

  5. Shlomi:

    I have a 3 year old and I’m still tired. I’ve been told it improves once they’re married :)

    I understand your point, my point was more in line with what’s on chapter 4 – ‘Targeting the right improvement activity’

    So, back to my original message, 50% of SELECTS being full scans is something bad if you’ve identified this to be bad (now or in the future) for the business.

    If this particular server has a mostly write workload, and/or this 50% of identified selects are batches, or even better, if these 50% of selects are full scans over tables with very little data, which are cached for long periods of time after they’re selected (imagine full table scans over things like SELECT country_id, country_name from countries to populate a country cache).

    So, back to your new example, I’m hired to audit a server with no current performance issues, and identify this full scan selects in 50% of the selects, first, I need to know what % of the total DB activity this represents.

    If the server has 90% read activity and >50%selects are full scans, I still need to check over what data the full scans are done. If it’s like the countries example below, (i.e., not too often, little data, cached, and probably not going to grow too much with time) it’s not that bad.
    Otherwise this would indeed be a good target for a performance improvement activity.

    I think the point is a metric alone, whatever it is, doesn’t give you much info without context.

    I hope I’m making more sense now :)

    Regards

    Fernando Ipar

    3 Jan 10 at 3:29 pm

  6. Fernando,

    Yes, I think we agree here.

    Regards

    Shlomi Noach

    3 Jan 10 at 3:44 pm

Leave a Reply