Xaprb

Stay curious!

Deleting millions of rows in small chunks with common_schema

with 4 comments

I wrote pt-archiver for jobs like deleting or archiving rows from a big table in small chunks. These days, that’s the kind of task I like doing inside the database, and Shlomi’s magical common_schema feels a lot more suited for this than an external Perl script.

When I say it’s magical, it really does feel magical. It’s amazing how he’s created an entire expressive scripting language that runs in MySQL and feels just right for the job.

Right now I’m watching this kind of stuff scroll by in my terminal:

+---------------------+
| rows_deleted_so_far |
+---------------------+
|             2871119 |
+---------------------+
1 row in set (7 min 42.67 sec)

+---------------------+
| rows_deleted_so_far |
+---------------------+
|             2872119 |
+---------------------+
1 row in set (7 min 42.75 sec)

Notice that the execution time is steadily increasing. Each chunk of 50,000 rows takes a small amount of time to delete, but the time shown is since the beginning of the job’s execution.

Written by Xaprb

January 28th, 2013 at 4:21 pm

Posted in SQL

4 Responses to 'Deleting millions of rows in small chunks with common_schema'

Subscribe to comments with RSS

  1. I’m pretty interested in seeing that script. I often write procedures to chunk through large updates and deletes. But they are always self-contained in a sql script that creates the procedure, executes it, and then drops the procedure. While the methodology is known (often the most difficult part of any code) and there is plenty of existing examples to copy from, there still is a bit of manual work each time.

    The chunking by key part is easy enough to handle, but how do you manage passing in query or filter criteria?

    For example, I recently had to clean duplicate records from a very large table, which was accomplished via:

    delete frst
    from
    table_name frst,
    table_name scnd
    where
    scnd.fact_column = frst.fact_column and
    scnd.id > frst.id and
    frst.id between lower_bound and upper_bound
    ;

    In the above example, lower_bound and upper_bound are recalculated in each loop iteraction by adding chunksize.

    It seems to me it would be quite difficult to build a framework that would work with almost any query passed in.

    And yes, I realize that another approach to such a massive de-dupe would be to
    1. create a new table
    2. Put a trigger on the old table to copy incoming records over to the new table
    3. Copy distinct records from old table to new table from before when the trigger was created
    4. Atomic rename
    5. drop old table

    I had sufficient I/O to handle the bandwidth and determined that the going for the delete was an easier approach.

    Gregory Haase

    31 Jan 13 at 1:25 pm

  2. Oh, I’m guessing you are just using repeat_exec()? For some reason I thought you were going to add something to common_schema.

    Gregory Haase

    31 Jan 13 at 1:31 pm

  3. I’m just using common_schema.run(‘split(delete from…..)’);

    There’s an example in the split() docs that I pretty much copy-pasted.

    Xaprb

    31 Jan 13 at 6:27 pm

  4. @Gregory,

    “The chunking by key part is easy enough to handle, but how do you manage passing in query or filter criteria?”

    So the chunking key is easy to handle when it’s a single AUTO_INCREMENT column, and much harder when it’s a composite (datetime, varchar(64)).
    No problem with filter criteria. What QueryScript does is inject the chunk range condition into your query.

    “It seems to me it would be quite difficult to build a framework that would work with almost any query passed in.”
    What QueryScript requires from you on multi-table operations is that you tell it which is the “main” table in the query. This is the table that would be split in chunks (the other tables are just joined to the chunks).
    This “main” table is the table that makes more sense query-evaluation-plan-wise. The one that would typically be used first in an EXPLAIN output.

    Baron commented he used “split()”. “repeat_exec()” is an older way of doing it, and not as smart. It’s still good, but QueryScript and split() are so much more complex and magical.

    Shlomi Noach

    18 Feb 13 at 9:08 am

Leave a Reply