Xaprb

Stay curious!

How often should you use OPTIMIZE TABLE?

with 12 comments

Many times I’ve heard people advise on “best practices” for a MySQL database. This often includes routine maintenance, such as “you should run OPTIMIZE TABLE on all of your InnoDB tables once a week to defragment them for better performance.”

But this advice is unsubstantiated and could even be detrimental. Here are some of the obvious problems that I can think of:

  • The optimized table compacts the primary key (clustered index) to its default 15/16ths fill factor per page. But other indexes will be built in pseudo-random order and are likely to end up just as fragmented afterwards as before. Which indexes are more important for performance? Maybe the primary key is just a dummy value that’s not even used, and the secondary indexes are the ones that would benefit from compacting.
  • Suppose the primary key is the important one, and SELECT queries will perform more quickly if it’s defragmented. Why does it get fragmented? Because of changes to the table. Now these changes could suddenly slow down dramatically as they are forced to split pages at a much higher rate due to the more compact data layout.

Why do people make a blanket “you should defragment” statement without supporting it with hard facts? It sounds like something you’d hear from a naive Windows user who buys a $99 piece of software to make his PC “boot faster” or “fix his registry” or something. Maybe it ain’t broke and should not be fixed.

I believe we hear advice like this because there isn’t easy-to-get data that can tell us the truth. To make decisions about defragmenting tables responsibly, we need either performance data on that table (hard to get in most cases), or failing that, information about cost and frequency of page splits in general (not available from InnoDB at present). It would help to have these metrics, and I think it might not be very hard to add page-split instrumentation to InnoDB.

Written by Xaprb

February 7th, 2010 at 6:39 pm

12 Responses to 'How often should you use OPTIMIZE TABLE?'

Subscribe to comments with RSS

  1. Baron, I agree with you: The main “promise” of this technique is not always delivered since secondary indexes start fragmented. It’s worth pointing out that this is better in the Plugin version (and hence XtraDB):

    http://www.innodb.com/doc/innodb_plugin-1.0/innodb-create-index.html#innodb-create-index-implementation

    “To add a secondary index to an existing table, InnoDB scans the table, and sorts the rows using memory buffers and temporary files in order by the value(s) of the secondary index key column(s). The B-tree is then built in key-value order, which is more efficient than inserting rows into an index in random order with respect to the key values.”

    Morgan Tocker

    7 Feb 10 at 7:24 pm

  2. It’s like having a rule that you take your car in for maintenance every six months. What if you work from a home office, and you put relatively few miles on your car? It’d be better to take your car in every X thousand miles. But on a car, it’s easy to read the odometer. In a database, it’s harder to check how many changes have been made to indexed columns in a given table.

    Bill Karwin

    7 Feb 10 at 8:37 pm

  3. Any DBA worth his or her salt should shy away from issuing such blanket statements.

    Another thing to consider is that the reorg uses system resources that can better be used elsewhere. Unless you have a very generous maintenance window you’ll likely be performing these reorgs online and usually that means taking a performance hit whilst the DBMS does its cleanup (obviously a massively over-provisioned server or very light usage etc. notwithstanding).

    Other DBMS products *do* provide statistics on index fragmentation, page splits, etc. and thus can give you some very meaningful feedback on whether or not you should do a table defragment (using a generic term like “defragment” since each DBMS calls and accomplishes it differently).

    Further, most DBMS products allow you to specify fill-factors and such for index creation along with logical/physical table partitioning & materialized views giving the DBA more fine-grained control over physical performance.

    MySQL lacks most/all of these things and really needs to step up to give DBAs the info they need to appropriately physically place data on disk.

    I wonder how many people waste time chasing down performance problems (or, say, switching to NoSQL products) that would be fixed if they had better tools and/or knew how to use em?

    Matt Rogish

    7 Feb 10 at 9:42 pm

  4. Morgan, right — in the Plugin, you could defragment a secondary index by dropping and re-adding it, which can be done online. But whether that’s actually helping or hurting performance can only be shown by measuring it.

    Matt, for a long time now I’ve had a long-term TODO to make InnoDB accept a fill-factor like other databases can. 15/16ths is an arbitrary constant, and I’ve never thought there was empirical evidence it’s the best one-size-fits-all. For read-only or append-only tables, of course 100% full is the best. For tables that have active data modifications, who knows what’s right? It would be nice to be able to find out.

    Xaprb

    7 Feb 10 at 10:05 pm

  5. With regard to point #2: at least in one (very common) case, it is easy to explain: when the PRIMARY KEY is an AUTO_INCREMENT.

    Defragmentation, then, only occurs on DELETE, and since values are automatically generated, no one expects these values to re-appear.

    While even more values may be expected to be DELETEd in the future, compacting the index delays the point in time where page merges occur, and less of these (statistically speaking, really depends on the exact numbers) are expected to occur within a given timeframe in comparison to the number of merges when no OPTIMIZE occurs.

    Furthermore, those never-to-be-reused values will now capture less disk space.

    With regard to second indexes: they me just as fragmented with relation to the primary key; but within themselves they are very organized. So queries which utilize “Using index” are likely to benefit from the OPTIMIZE.

    Shlomi Noach

    8 Feb 10 at 1:56 am

  6. Shlomi, you should try out xtrabackup’s index statistics feature and see if you change your mind about secondary indexes :-)

    Xaprb

    8 Feb 10 at 10:21 am

  7. @Baron
    I will

    Shlomi Noach

    8 Feb 10 at 12:23 pm

  8. Baron,

    This is all true. The best advice is “optimize when you will benefit”. The problem is that it involves a lot of testing — if there are frequent deletes or fragmentation-causing updates (not all updates cause fragmentation), you want to test to see what the “sweet spot” is.

    Defragmenting every week is definitely excessive for 99.95% of companies out there, even for our clients at Pythian (and your clients at Percona, I’d bet).

    As with everything, “how often should I do foo?” should first be changed to “Would I benefit from doing foo?” and then be followed up with testing to see how often is “good”.

  9. Sheeri, I think you’re taking as a given that

    a) OPTIMIZE TABLE results in defragmentation, which might not be the case (there is only one primary key, but there can be N secondary indexes, so the table could be only 1/(N+1)th defragmented, leaving aside the plugin’s features which aren’t hooked into OPTIMIZE TABLE); and

    b) a defragmented index is optimal, which might not be even remotely the case — an insert into a fully defragmented index could cause worst-case page splitting and tree rebalancing, so maybe the right answer is that the indexes reach an optimal degree of fragmentation on their own and should not be “fixed.”

    But I think you agree implicitly that testing is harder than it should be, because in the absence of good instrumentation, the only way to test is to benchmark the server’s actual workload on the server’s actual data.

    Xaprb

    8 Feb 10 at 1:52 pm

  10. Baron,

    “…an insert into a fully defragmented index could cause worst-case page splitting and tree rebalancing…”

    1. Worst case tree splitting involves N page splits, when N is the depth of the tree; so that’s usually up to 3-4 on numeric primary key. There is no re-balancing of the tree. A B/B+ tree is balanced by design; the splitting is among the mechanism which keeps it balanced.

    2. And, InnoDB saves 1/16 space free. Which allows for such “last minute changes” to work out without splitting.

    Disclaimer: I did not read the InnoDB B+ Tree implementation source code.

    Shlomi Noach

    8 Feb 10 at 2:12 pm

  11. (cotinuing my last comment)
    Of course, the number of page splits assumption above only relates to one index; when there are multiple keys there’s more work to be done. I was referring to a single tree structure changes.

    Shlomi Noach

    8 Feb 10 at 2:16 pm

  12. Does anyone know what happens to a table of randomly inserted PKs as it grows over time in InnoDB? I’m not worried about the secondary indexes, just trying to understand the clustered index. Here is one possible interpretation… where does it go awry?

    - as pages in the buffer pool fill up, they are split in memory
    - split pages are flushed to the WAL and/or doublewrite buffer
    - when the buffer pool fills up, some pages are flushed. Which ones?
    - existing pages are written back to where they already reside?
    - new pages are sorted and written out sequentially in new 1MB “extents”?

    Is that even close? Seems like you’d get some extremely fragmented tables, which has actually been my experience. I do actually try to defragment every few months and see orders of magnitude speed-ups on sequential scans.

    I’ve always wondered why there’s no background defragmentation process. That seems extremely valuable compared to all the TPS optimizations i see published. Do other databases have that? It would be great if it vacuumed data into 10-100MB sequences when IO capacity is available.

    Matt Corgan

    10 Feb 10 at 12:42 am

Leave a Reply