Introducing MySQL Parallel Dump
A while ago Peter Zaitsev wrote about his wishes for mysqldump. These included multi-threaded dumps and “safe” dumps that would wait for a server to restart if it crashed, then keep dumping other tables. I’ve had sketches of this done for a while, but during this week I fleshed it out while writing about backup and recovery for our upcoming book. I had my own list of features I wanted to add:
- Support for backup sets, with the backup configuration stored in the database itself.
- Emphasis on tab-delimited dumps.
- Sane defaults, focused on ease of use and ease of recovery.
- Support for compression by default.
The resulting script is satisfactory to me. If you just run it without arguments, it connects to the server mentioned in your .my.cnf file and dumps all databases and tables, one file per table, gzipped, in parallel (at least two, but by default it detects the number of CPUs and runs that many in parallel).
baron@kanga $ mysql-parallel-dump SET DATABASE TABLE TIME STATUS THREADS default mysql columns_priv 0 0 2 default mysql db 0 0 2 default mysql help_category 0 0 2 default mysql func 0 0 2 default mysql help_keyword 0 0 2 ...snip... default test t1 0 0 2 default test t2 0 0 1
You can tell it to dump elsewhere, and it’s easy to dump all tables in tab-delimited format. Here it’s reading its configuration from the database, writing to /tmp, and not backing up tables that have been dumped in the last 5 minutes:
baron@kanga $ mysql-parallel-dump --basedir /tmp --tab --sets set1 \
--settable test.backupset --age 5m
Nothing to do for set set1
baron@kanga $ ls -lR /tmp/set1
/tmp/set1:
total 8
-rw-rw-rw- 1 baron baron 40 2007-09-30 21:43 00_master_data.sql
drwxrwxrwx 2 baron baron 4096 2007-09-30 21:43 test
/tmp/set1/test:
total 16
-rw-rw-rw- 1 baron baron 549 2007-09-30 21:43 t1.sql.gz
-rw-rw-rw- 1 baron baron 31 2007-09-30 21:43 t1.txt.gz
-rw-rw-rw- 1 baron baron 550 2007-09-30 21:43 t2.sql.gz
-rw-rw-rw- 1 baron baron 29 2007-09-30 21:43 t2.txt.gz
And as you can see, it knows I’ve dumped those tables recently and didn’t do them again. Pretty handy for scheduling and resuming backups, no? It makes it easy to keep going if something happens in the middle of the backup and you want to restart.
I’m aware of the similar mysqlpdump script, and I generally don’t like duplicating other people’s efforts, but I decided to go ahead and finish what I’d started. To tell you the truth, neither script is complicated. It’s just a matter of providing a sensible wrapper around existing functionality (in my case, that’s mysqldump and SELECT INTO OUTFILE, which I do directly rather than asking mysqldump to do it with -T, which just makes mysqldump into the same kind of wrapper). I also wanted to provide it as part of the MySQL Toolkit, so it’s all in the same place. Frankly, I also built in a lot more functionality than mysqlpdump has, and I consider the defaults to be more useful. I’d love for mysqldump itself to have better defaults — especially for dumping large datasets, which it’s frankly pretty poor at right now. One of these days MySQL AB will make me obsolete, I just know it…
Oh, and in keeping with my tradition, it’s sort of ridiculously sophisticated and overly generic. It has a little macro language that you can use to basically turn it into a loop iterator over the selected databases and tables, and run any command you wish. Here’s an example:
mysql-parallel-dump -- mysqldump --skip-lock-tables '%D' '%N' \| gzip --fast -c - \> '%D.%N.gz'
That basically duplicates the built-in defaults (except the defaults are actually a lot more complicated than that). But it illustrates how you could use this as a shell to select which tables to dump and fork off sub-processes, handling all the locking, error checking, and so forth for them. Here I’m spawning off mysqldump, but it would be just as easy to execute a custom script.
There’s one more wish Peter and I both have, but which is impossible for right now as far as we know. That’s to do parallel SELECT INTO OUTFILE dumps for a bunch of tables in one transaction. This will not be possible until more than one connection can participate in a single transaction. Ask the MySQL guys about that one!
This script is part of MySQL Toolkit and will be released as soon as I have time. There are a few other bug fixes I want to include. In the meantime, if you’re dying to get it, you can grab it from the MySQL Toolkit subversion repository.

Hi Baron,
Very good job. In my laptop, it saves 30% time. In my desktop (4 processors) it saves almost 50%.
Do you plan to add any provision for restoring tables connected by referential integrity constraints?
Cheers
Giuseppe
Giuseppe Maxia
1 Oct 07 at 7:12 am
Have you looked at restores? They’re a bit more tricky if you’ve got foreign keys since you either have to import in the right order or you’ll have to use SET FOREIGN KEY CHECKS=0, if you can import them in the right order you can use mysql’s `mysqlimport` tool, but it doesn’t know about not respecting foreign keys.
How do you restore with multiple threads? Can you take a huge InnoDB table and split up the import into several segments with commits in between? That would increase the import speed by, well, a lot :)
This looks real cool, at what point in the process do you do the compression? If you do the complete backup first and then compress, you’ll waste a bit of storage space but you can start using the server again a lot faster if you wait with compression.
But you’ve probably thought about all of this already.
Johan
1 Oct 07 at 10:32 am
MySQL Parallel Restore is next :-) As Brian Aker has pointed out, mysqlimport can be multi-threaded in 5.1, but it is only a wrapper around LOAD DATA INFILE. There’s also a need for a multi-threaded wrapper around “cat file | mysql” for restoring normal SQL dumps.
Compression is a mixed bag. If you have highly compressible data, you might be *worse* off writing it all to disk uncompressed than piping directly through gzip, in my benchmarks.
By default, when you do SQL dumps this tool pipes right into gzip. When you do tab-delimited dumps, the only option is to dump to raw files and gzip afterwards.
Xaprb
1 Oct 07 at 10:37 am
Oh — I meant to say, dumping and/or restoring a single table in parallel is another interesting option I’ve mulled over. I’ve done a lot of work on divide-and-conquer algorithms in some of the other tools, such as MySQL Table Checksum, and that could carry over to this too. But at the moment I decided not to do that. I don’t even know if people need it.
Xaprb
1 Oct 07 at 10:40 am
When it comes to restoring an InnoDB table, you don’t have to do it in parallel, just chunk it up into manageable pieces with commits every here and there and serialize it. Otherwise you’ll run into problems when InnoDB wants to make sure that it can rollback your 50gb insert. This is mostly a problem with LOAD DATA INFILE though, since a 50gb insert is kinda hard to in plain SQL.
Johan
1 Oct 07 at 11:10 am
I see what you mean now. So this is something that needs to be supported in the dump — dump the table into several files, that is — otherwise there’ll be no good way to stop and commit in the middle of a huge LOAD DATA INFILE.
Divide and conquer would be a good approach for this.
Xaprb
1 Oct 07 at 12:49 pm
Whoops, sorry Giuseppe. Akismet decided your comment was spam. I think foreign keys can be handled on restore.
Xaprb
1 Oct 07 at 2:15 pm
[...] wrote an introduction to MySQL Parallel Dump yesterday. It’s a much smarter way to dump your data if you have a lot of it, and it’s [...]
MySQL Toolkit version 946 released at Xaprb
1 Oct 07 at 5:30 pm
Baron-
Thanks for the excellent tool. Benchmarks on my system show a very large speed up. I look forward to seeing your restore tool, as I’m running into the same foreign key constraint issues that others have mentioned.
A patch was submitted to mysqlimport to handle disabling of foreign keys, but I think it’s not for the 5.1 version that handles parallel import threads: http://bugs.mysql.com/bug.php?id=19996
Thinking about how multiple threads cannot be part of a single transaction, doesn’t that mean the tables created by the threads could potentially be modified during the dumps? That might make snapshot consistency a moving target on a live system.
Thanks for all your hard work,
Dan
Dan
2 Oct 07 at 3:29 pm
Hi Dan,
How large a speedup? If you can share your benchmarks (data size, speed of mysqldump, speed with mysql-parallel-dump, number/speed of CPUs, number/speed/RAID configuration of disks) it would be great to have some representative numbers for the documentation. I’m trying to give some general ideas in the documentation of when and how much this is likely to be helpful. I’ve found one system where it’s not all that helpful.
Multiple transactions shouldn’t be a problem. Good old LOCK TABLES keeps it consistent.
Xaprb
2 Oct 07 at 3:45 pm
Here are some benchmarks:
Performed on a 3.6GHz Xeon with 4 processors. The filesystem is a RAID 10 array of 15k speed disks, directly attached to the server with a fibre channel.
Note that the larger db had most of the space allocated for a single large table, so a lot of the time was spent on dumping that table. Parallel threads were not that helpful in that case. They were very helpful where there were lots of tables of roughly equivalent sizes.
I can probably run some more thorough tests, if you’d like, since my methods were pretty informal.
Thanks,
Dan
Dan
2 Oct 07 at 4:24 pm
Thanks, that’s great.
The next version will perform even better on the single big table: it’ll dump it into chunks in parallel (this is to help address the issue of loading in parallel and/or loading with commits between chunks).
Xaprb
2 Oct 07 at 4:27 pm
Baron -
Could you be a little more explicit about how you lock tables? Do you lock all tables in a database at the start of the dump and only release the locks when all the threads are done with every table? Or are you locking tables as needed and releasing the lock when the dump for that table is done?
Thanks again,
Dan
Dan
3 Oct 07 at 10:18 am
By default, it does FLUSH TABLES WITH READ LOCK once at the beginning of the whole process, then passes the –skip-lock-tables option to mysqldump so it doesn’t try to get its own lock.
If you give the –locktables option, it locks only tables it uses. This is enabled by default when –sets is given. Each set gets and releases locks on the tables it uses, so tables in each set will be consistent, but may not be consistent with tables from other sets.
You can disable the global lock with –noflushlock.
Xaprb
3 Oct 07 at 10:32 am
Just for information, the new version is released, and can do chunked dumps as promised. Testing and benchmarks welcome!
Xaprb
3 Oct 07 at 10:01 pm
Baron,
Great work, but would you mind confirming if this does a consistent snapshot as in single-transaction for mysqldump for InnoDB tables or is this simply not possible via a parallel method?
Cheers
RV
27 Feb 08 at 10:05 am
It’s not possible. Locking tables is the only way to do it. A single transaction would require many connections to be able to participate in the transaction, which is something I wouldn’t assume MySQL will build anytime soon :-)
I suppose another way would be to lock tables briefly and allow all child processes to do BEGIN TRANSACTION, but that isn’t the way it’s built at the moment.
Xaprb
27 Feb 08 at 11:42 am
If you had multiple schema’s, within one MySQL instance, and each one had no dependency/connection to the other, could one not spawn multiple (single-transaction, master-data) mysqldumps against each one and therefore reduce the time to perform full consistent InnoDB hot backups from a master? If not, then I believe this type of tool is only usable for large InnoDB instances on slaves. I know that performing dumps on masters is not recommended, but our replication confidence is low, so at the moment it’s a must. Hopefully that’s where Maatkit will again come to the rescue.
RV
28 Feb 08 at 6:02 am
Hi,
There’s a current forum conversation on the same topic. It would be good to move this there so it’s all in the same place:
http://sourceforge.net/forum/forum.php?thread_id=1952777&forum_id=664350
Xaprb
28 Feb 08 at 9:12 am