Comments on: Deleting millions of rows in small chunks with common_schema
http://www.xaprb.com/blog/2013/01/28/deleting-millions-of-rows-in-small-chunks-with-common_schema/
Stay curious!
Thu, 02 May 2013 12:36:53 +0000

By: Shlomi Noach, Mon, 18 Feb 2013 13:08:05 +0000
http://www.xaprb.com/blog/2013/01/28/deleting-millions-of-rows-in-small-chunks-with-common_schema/#comment-22034

@Gregory,

“The chunking by key part is easy enough to handle, but how do you manage passing in query or filter criteria?”

So the chunking key is easy to handle when it’s a single AUTO_INCREMENT column, and much harder when it’s a composite (datetime, varchar(64)).
No problem with filter criteria. What QueryScript does is inject the chunk range condition into your query.

“It seems to me it would be quite difficult to build a framework that would work with almost any query passed in.”
What QueryScript requires from you on multi-table operations is that you tell it which is the “main” table in the query. This is the table that gets split into chunks (the other tables are just joined to the chunks).
This “main” table is the table that makes more sense query-evaluation-plan-wise. The one that would typically be used first in an EXPLAIN output.
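
To make the idea of an injected chunk range concrete, here is a minimal sketch in Python with the standard-library sqlite3 module, not MySQL and not QueryScript. It does not show how split() is implemented. The helper `run_in_chunks`, the `{chunk}` placeholder, and the `orders` and `customers` tables are all hypothetical. The caller supplies the statement with its own filter criteria and names the main table and its key; the helper only adds the range on that key.

```python
import sqlite3

def run_in_chunks(conn, statement, table, key, chunk_size):
    """Run `statement` once per key range of `table`; return total rows affected.

    `statement` must contain the text {chunk}, which is replaced by a range
    predicate on the main table's key. Hypothetical helper, not QueryScript.
    """
    lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
    if lo is None:  # empty table, nothing to do
        return 0
    chunked = statement.replace("{chunk}", f"{table}.{key} BETWEEN ? AND ?")
    total = 0
    start = lo
    while start <= hi:
        cur = conn.execute(chunked, (start, start + chunk_size - 1))
        conn.commit()  # one short transaction per chunk
        total += cur.rowcount
        start += chunk_size
    return total

# Assumed demo schema: "orders" is the main table, "customers" is only joined to it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, active INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 1), (2, 0)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, 1 + i % 2) for i in range(1, 101)])
conn.commit()

# The filter criteria stay in the caller's statement; only the range is injected.
deleted = run_in_chunks(
    conn,
    "DELETE FROM orders WHERE {chunk} "
    "AND customer_id IN (SELECT id FROM customers WHERE active = 0)",
    table="orders", key="id", chunk_size=10)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Chunking on `orders.id` keeps every statement bounded by the main table's key, whatever the filter on the joined table happens to select.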

Baron commented he used “split()”. “repeat_exec()” is an older way of doing it, and not as smart. It’s still good, but QueryScript and split() are so much more complex and magical.

By: Xaprb, Thu, 31 Jan 2013 22:27:19 +0000
http://www.xaprb.com/blog/2013/01/28/deleting-millions-of-rows-in-small-chunks-with-common_schema/#comment-20466

I’m just using common_schema.run(‘split(delete from…..)’);

There’s an example in the split() docs that I pretty much copy-pasted.

By: Gregory Haase, Thu, 31 Jan 2013 17:31:18 +0000
http://www.xaprb.com/blog/2013/01/28/deleting-millions-of-rows-in-small-chunks-with-common_schema/#comment-20465

Oh, I’m guessing you are just using repeat_exec()? For some reason I thought you were going to add something to common_schema.

By: Gregory Haase, Thu, 31 Jan 2013 17:25:13 +0000
http://www.xaprb.com/blog/2013/01/28/deleting-millions-of-rows-in-small-chunks-with-common_schema/#comment-20464

I’m pretty interested in seeing that script. I often write procedures to chunk through large updates and deletes, but they are always self-contained in a SQL script that creates the procedure, executes it, and then drops it. While the methodology is known (often the most difficult part of any code) and there are plenty of existing examples to copy from, there is still a bit of manual work each time.

The chunking by key part is easy enough to handle, but how do you manage passing in query or filter criteria?

For example, I recently had to clean duplicate records from a very large table, which was accomplished via:

delete frst
from
    table_name frst,
    table_name scnd
where
    scnd.fact_column = frst.fact_column and
    scnd.id > frst.id and
    frst.id between lower_bound and upper_bound
;

In the above example, lower_bound and upper_bound are recalculated in each loop iteration by adding the chunk size.
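
The loop described above can be sketched as follows, using Python's standard-library sqlite3 module in place of a MySQL stored procedure. SQLite has no multi-table DELETE, so the self-join is rewritten here as a correlated EXISTS; the function name `dedupe_in_chunks` and the sample data are assumptions made for illustration.

```python
import sqlite3

def dedupe_in_chunks(conn, chunk_size):
    """Delete each row that has a same-valued row with a higher id, one id range at a time."""
    lower_bound, max_id = conn.execute(
        "SELECT MIN(id), MAX(id) FROM table_name").fetchone()
    if lower_bound is None:  # empty table
        return 0
    deleted = 0
    while lower_bound <= max_id:
        upper_bound = lower_bound + chunk_size - 1
        cur = conn.execute(
            """
            DELETE FROM table_name
            WHERE id BETWEEN ? AND ?
              AND EXISTS (SELECT 1 FROM table_name AS scnd
                          WHERE scnd.fact_column = table_name.fact_column
                            AND scnd.id > table_name.id)
            """,
            (lower_bound, upper_bound))
        conn.commit()  # commit per chunk so each transaction stays small
        deleted += cur.rowcount
        lower_bound = upper_bound + 1  # advance by the chunk size
    return deleted

# Assumed sample data: ids 1..12 with four distinct values, each stored three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER PRIMARY KEY, fact_column TEXT)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)",
                 [(i, f"fact-{i % 4}") for i in range(1, 13)])
conn.commit()

deleted = dedupe_in_chunks(conn, chunk_size=5)
survivors = [row[0] for row in conn.execute("SELECT id FROM table_name ORDER BY id")]
```

Because a row is removed only when a higher id with the same value still exists, the row with the highest id in each group survives no matter how the chunk boundaries fall.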

It seems to me it would be quite difficult to build a framework that would work with almost any query passed in.

And yes, I realize that another approach to such a massive de-dupe would be to
1. Create a new table
2. Put a trigger on the old table to copy incoming records over to the new table
3. Copy the distinct records that existed before the trigger was created from the old table to the new table
4. Atomic rename
5. Drop the old table
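
A rough sketch of those five steps, again in Python with the standard-library sqlite3 module and not MySQL. The names `table_name_new`, `table_name_old` and `copy_incoming`, the UNIQUE constraint, and the choice to keep the highest id per value are assumptions added for the example. In MySQL the swap in step 4 would normally be a single RENAME TABLE statement; here it is a pair of renames inside one transaction.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT inside the script below control the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE table_name (id INTEGER PRIMARY KEY, fact_column TEXT);
    INSERT INTO table_name (fact_column) VALUES ('a'), ('b'), ('a'), ('c'), ('b');

    -- 1. Create the new table (the UNIQUE constraint is an added assumption).
    CREATE TABLE table_name_new (id INTEGER PRIMARY KEY, fact_column TEXT UNIQUE);

    -- 2. Copy rows that arrive while the migration is running.
    CREATE TRIGGER copy_incoming AFTER INSERT ON table_name
    BEGIN
        INSERT OR IGNORE INTO table_name_new (id, fact_column)
        VALUES (NEW.id, NEW.fact_column);
    END;
""")

# A write that arrives mid-migration; the trigger copies it to the new table.
conn.execute("INSERT INTO table_name (fact_column) VALUES ('d')")

conn.executescript("""
    -- 3. Copy one row per value; rows the trigger already copied are ignored.
    INSERT OR IGNORE INTO table_name_new (id, fact_column)
    SELECT MAX(id), fact_column FROM table_name GROUP BY fact_column;

    -- 4. Swap the tables in one transaction, then 5. drop the old table.
    BEGIN;
    DROP TRIGGER copy_incoming;
    ALTER TABLE table_name RENAME TO table_name_old;
    ALTER TABLE table_name_new RENAME TO table_name;
    DROP TABLE table_name_old;
    COMMIT;
""")

rows = conn.execute("SELECT id, fact_column FROM table_name ORDER BY id").fetchall()
```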

I had sufficient I/O bandwidth to handle it, and determined that going for the delete was the easier approach.
