Xaprb

Stay curious!

Speaking at Oracle Open World

with 4 comments

I’ll be presenting at Oracle Open World on the causes of downtime in MySQL, and how to prevent it. This is a research-based session that presents an easy-to-digest post-mortem of hundreds of emergency issues filed by Percona customers. The real causes and types of downtime surprised me quite a bit, and the preventions run counter to a lot of conventional wisdom. I’ll just give a preview by saying that you should consider it a top priority to monitor how full your disks are! On the other hand, despite the fact that every monitoring tool in existence shows the binary log cache hit rate, not a single emergency in Percona history has ever been attributed to that.

The agenda at OOW is mind-bogglingly huge (see Dave Stokes’s blog post, so here are the full official details of my session:

  • Session ID: 09304
  • Session Title: What Causes Downtime in MySQL, and How Can You Prevent It?
  • Venue / Room: Marriott Marquis – Golden Gate C2
  • Date and Time: 10/6/11, 9:00

Written by Xaprb

August 13th, 2011 at 12:10 pm

Posted in Conferences,SQL

Tagged with ,

4 Responses to 'Speaking at Oracle Open World'

Subscribe to comments with RSS

  1. I have no idea how the binlog cache hit rate is measured, but there is a serious performance problem when replication slaves read old binlogs from the master. They do so while holding LOCK_log which blocks anything else from writing to the binlog until the read completes — http://bugs.mysql.com/bug.php?id=61545. So I think there should be a useful alert for this topic, but either monitoring is insufficient or we don’t know yet what to look for. Fortunately, this bug is easy to fix.

    Mark Callaghan

    13 Aug 11 at 4:27 pm

  2. Mark,

    I don’t think there is enough instrumentation to alert on this. The best general-purpose way I can think of is to look at the relay log position on replicas and compare it to the current log and position on the master.

    The binlog cache hit rate I’m referring to is computed from SHOW GLOBAL STATUS as Binlog_cache_disk_use/Binlog_cache_use. I’ve never seen a case where that was a useful diagnostic, much less useful to alert on. Binlog_cache_disk_use per second could be something useful to compute manually if troubleshooting points to that as a source of problems, but it’s not something I’d bake into a monitoring tool a priori. I should note that my opinions on this have changed. I built Binlog_cache_disk_use/Binlog_cache_use into innotop three or four years ago. But that was a younger me, who also thought that the MyISAM key cache hit rate was important. I can only plead that I’ve learned better ways by spending time doing things that didn’t really produce results.

    Xaprb

    14 Aug 11 at 2:10 pm

  3. In the general web space, I agree that monitoring things like the binlog cache usage is generally not very useful – that’s purely down to the average profile of transactions in the web space (lots of short transactions with small statement sizes). The odd case does come up where lots of large statements are grouped together in single transactions, and it can find be useful to tune then (but in those transactions it is such a small percentage of the time, there are often better things to tune for them as well.. But it all adds up if the transaction structure really needs to be that way).

    As for alerting on LOCK_log waits, the instrumentation is available for all of that in performance_schema in 5.5 which a) monitors LOCK_log (mutex) waits down to picosecond resolution and b) monitors all threads, including connected slave IO threads.

    Something like this would give the same kind of info (less the full stack of course) as PMP:

    SELECT IF(pps.name = ‘thread/sql/one_connection’,
    CONCAT(ips.user, ‘@’, ips.host),
    REPLACE(name, ‘thread/’, ”)) user,
    db,
    command,
    state,
    time,
    event_name AS last_wait,
    IF(timer_wait IS NULL,
    ‘Still Waiting’,
    timer_wait/1000000) last_wait_usec,
    source
    FROM performance_schema.events_waits_current
    JOIN performance_schema.threads pps USING (thread_id)
    LEFT JOIN information_schema.processlist ips ON pps.processlist_id = ips.id;

    An example with a normal thread and slave IO thread:

    *************************** 9. row ***************************
    user: msandbox@localhost
    db: NULL
    command: Query
    state: executing
    time: 0
    last_wait: wait/io/file/myisam/dfile
    last_wait_usec: 19.4230
    source: mi_dynrec.c:258
    *************************** 10. row ***************************
    user: msandbox@localhost:54550
    db: NULL
    command: Binlog Dump
    state: Master has sent all binlog to slave; waiting for binlog to be up
    time: 983
    last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond
    last_wait_usec: Still Waiting
    source: log.cc:5533

    You would just see a pile up on the “wait/synch/mutex/sql/LOG::LOCK_log” event for threads “Still Waiting”.

    Mark Leith

    15 Aug 11 at 7:09 am

  4. Mark, thanks for pointing this out. I felt very silly when I saw your reply. Obviously this is instrumented in 5.5 and newer. I was stuck in the 5.1 and older mentality.

    Xaprb

    15 Aug 11 at 8:22 am

Leave a Reply