Xaprb

Stay curious!

Archive for January, 2011

New Ruby Conference: Ruby on Ales 2011

with 3 comments

This looks like fun: Ruby on Ales. It’s March 24-25, 2011 in Bend Oregon (USA), and the tag line is Ruby, snow, and beer. Does it get any better than this?

Written by Xaprb

January 12th, 2011 at 10:19 am

Posted in Conferences,SQL

Tagged with

New in mk-query-digest: variance-to-mean ratio

without comments

This isn’t actually new — it has been out for a few releases. The mk-query-digest tool from Maatkit now outputs information about each class of queries’ variance-to-mean ratio. The new output goes in a couple of places, including perhaps most usefully the “profile” report. Here’s an example from a real MySQL system:

# Profile
# Rank Query ID           Response time    Calls R/Call Apdx V/M   Item
# ==== ================== ================ ===== ====== ==== ===== =======
#    1 0xBFCF8E3F293F6466 11256.3618 68.1% 78069 0.1442 1.00  0.21 SELECT [redacted]
#    2 0x620B8CAB2B1C76EC  2029.4730 12.3% 14415 0.1408 1.00  0.21 SELECT [redacted]
#    3 0xB90978440CC11CC7  1345.3445  8.1%  3520 0.3822 1.00  0.00 SHOW STATUS
#    4 0xCB73D6B5B031B4CF  1341.6432  8.1%  3509 0.3823 1.00  0.00 SHOW STATUS
# MISC 0xMISC               560.7556  3.4% 23930 0.0234   NS   0.0 <17 ITEMS>

The variance-to-mean ratio is placed in the V/M column. It is the ratio of the query response time’s variance to the mean, for that class of queries. It also appears in the detailed output for the queries in the rest of the report.

What is this useful for? It is a dimensionless number that shows how variable a query’s response time is. The dimensionless number is better than a number such as the standard deviation of response time, because it places fast and slow queries on equal footing; when looking at standard deviation, you really need to compare it to typical execution time to see if there’s a problem. (A fast query that varies by a tenth of a second is highly variable. A query that usually runs hours and varies only by a tenth of a second is unbelievably consistent.)

A query with a highly variable response time is interesting not only because it is providing unpredictable performance, but because it often means that the query is either a perpetrator or victim of bad interactions with other queries, and possibly that it accesses a larger working set of data than fits in the server’s caches, so it makes unpredictable random disk accesses. That’s a fancy way of saying that this query might have a high potential for improvement.

To see what I mean, let’s look at the detailed report for one of the queries whose V/M ratio was 0.21:

# Query 1: 24.28 QPS, 3.50x concurrency, ID 0xBFCF8E3F293F6466 at byte 5590079
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.21
# Query_time sparkline: | _^_.^_ |
# Time range: 2008-09-13 21:51:55 to 22:45:30
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         63   78069
# Exec time     68  11256s    37us      1s   144ms   501ms   175ms    68ms
# Lock time     85    134s       0   650ms     2ms   176us    20ms    57us
# Rows sent      8  70.18k       0       1    0.92    0.99    0.27    0.99
# Rows examine   8  70.84k       0       3    0.93    0.99    0.28    0.99
# Query size    84  10.43M     135     141  140.13  136.99    0.10  136.99
# Query_time distribution
#   1us
#  10us  #
# 100us  ####################################################
#   1ms  ###
#  10ms  ################
# 100ms  ################################################################
#    1s  #
#  10s+
SELECT ... FROM ... WHERE (col1 = 87041469) AND (col2 = 1138714082) LIMIT 1\G

You can see from the Query_time distribution that this query often executes in the hundreds of microseconds, but also frequently in the hundreds of milliseconds. I redacted some details to protect client data, but this query is a primary-key lookup on an extremely large table. I’ll hazard a guess here: when the data is in memory, it runs in hundreds of microseconds; and when it has to hid the disk, it takes tens to hundreds of milliseconds.

More of the math and theory behind this useful metric of query response time variability is available from Robyn Sands’ article via Method R corporation. Thanks to Cary Millsap, who directed my attention to the V/M ratio in the first place.

Written by Xaprb

January 11th, 2011 at 12:05 pm

Posted in Maatkit,SQL

Tagged with , ,

Finding things within some distance in SQL

with 14 comments

One of the query optimization scenarios I’ve seen a lot over the years is finding something within some distance from a point. For example, finding people within some distance of yourself, apartments in a radius from a postal code, and so on.

These queries usually use the great-circle formula. That might be because Google finds lots of pages claiming that this is the right way to do a radius search. “The earth is not flat!”, they all say. That’s true, but it doesn’t mean that the great-circle formula is a good approach. It’s usually a really bad approach, in fact. It’s needlessly precise for most things, not precise enough for others, and it’s an expensive query to execute; all the trig functions tend to eat a bunch of CPU, and make it impossible to use ordinary indexes. This is true for all of the databases I’ve used — MySQL, Postgres, and SQL Server.

The great-circle formula is needlessly precise for a few reasons:

  1. Within the radiuses I’ve usually seen, the earth is flat, or close enough that it doesn’t make a difference. Looking for an apartment within 25 miles of downtown? The error introduced by pretending that the earth is flat on such a small scale doesn’t matter. The Pythagorean theorem would work just as well.
  2. “Downtown” is not a point, it’s an area. Nobody is going to argue if you return search results that vary by a few miles, or even more.
  3. Nobody drives in a straight line from downtown to their apartment. People usually search within a physical radius as a proxy for “find me something conveniently close.” They don’t really expect the miles as-the-crow-flies to be a good proxy — it’s just one they’re used to. In reality, that apartment just across the river might be too far away from work, because you’d have to drive a long way to get to a bridge. (Unless you want to swim to the office every day, that is.)

In cases where you really do need precision, there’s a reasonable chance that the great-circle formula still isn’t right for you, because not only is the earth not flat, the earth isn’t a sphere either.

What’s the optimization I usually suggest? It’s usually perfectly acceptable to just return results within a square centered on the point of interest. In most cases, the results will be just as satisfactory to the users. The remainder are usually very special cases.

Written by Xaprb

January 2nd, 2011 at 4:59 pm