Maatkit learns how to map-reduce
The May release of Maatkit included a new feature in mk-query-digest: you can now process queries in many pieces, write out intermediate results, and then combine the pieces in a separate step. Maybe it’s not exactly map-reduce, but it makes a good headline.
The purpose is to enable query analysis across an arbitrarily large group of servers. Process queries on all of them, ship the results to a central place, and then combine them. Pre-processing the results has some nice benefits: it reduces bandwidth requirements, speeds up processing by doing it in parallel, and reduces the workload on the central aggregator. One Percona customer with many MySQL instances is trying this out.
The --save-results option on mk-query-digest saves the digested results to a file, stopping just before the final stages of the query event pipeline. There is a tool in Subversion trunk, tentatively called mk-merge-mqd-results, which reads these saved files, aggregates them, and then finishes the process of computing statistics and making a report.
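To make the split-then-combine idea concrete, here is a minimal sketch in Python. It is not the Maatkit code (which is Perl) and does not use the real saved-results file format; the function names `digest`, `merge`, and `report`, the per-query statistics, and the JSON round trip standing in for the saved file are all assumptions made for illustration.

```python
import json


def _empty_stats():
    # Hypothetical per-query statistics; the real tool tracks far more.
    return {"count": 0, "total_time": 0.0, "max_time": 0.0}


def digest(events):
    """Per-server step: aggregate (fingerprint, query_time) pairs."""
    partial = {}
    for fingerprint, query_time in events:
        stats = partial.setdefault(fingerprint, _empty_stats())
        stats["count"] += 1
        stats["total_time"] += query_time
        stats["max_time"] = max(stats["max_time"], query_time)
    return partial


def merge(partials):
    """Central step: combine the intermediate results from every server."""
    merged = {}
    for partial in partials:
        for fingerprint, stats in partial.items():
            target = merged.setdefault(fingerprint, _empty_stats())
            target["count"] += stats["count"]
            target["total_time"] += stats["total_time"]
            target["max_time"] = max(target["max_time"], stats["max_time"])
    return merged


def report(merged):
    """Final step: compute derived statistics, worst total time first."""
    rows = [
        (fp, s["count"], s["total_time"], s["total_time"] / s["count"], s["max_time"])
        for fp, s in merged.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


if __name__ == "__main__":
    server_a = [("select * from t where id=?", 0.5),
                ("select * from t where id=?", 1.5)]
    server_b = [("select * from t where id=?", 1.0),
                ("update t set c=? where id=?", 0.75)]
    # Serializing to JSON stands in for saving results and shipping them.
    saved = [json.dumps(digest(events)) for events in (server_a, server_b)]
    for row in report(merge(json.loads(text) for text in saved)):
        print(row)
```

The point of the sketch is that only the small per-fingerprint summaries travel between machines, and derived values such as averages are computed once, after everything has been merged.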



Will this eliminate the need for heavy memory usage when parsing large log files? Currently I have to ship the logs to a centralized server anyway, because I can’t afford to use 500M (or more) of RAM to parse a 1G (or larger) file…
Sheeri K. Cabral
6 May 10 at 10:31 am
Daniel has figured out where the memory went. He’s a total rock star. Perl likes to use a lot of memory. The fix is non-trivial, but we know what needs to be done — check the mailing list. Sparse arrays end up being the problem.
Xaprb
6 May 10 at 10:56 am
OK, so the answer is “this doesn’t fix that memory issue”. That was my question. Your answer of “we’re working on that problem” is useful, but doesn’t answer my question directly.
Also, misusing technical terms adds to the problem buzzwords have. That you do it on purpose is really frustrating. I’m sure you get more web hits, so it’s a great SEO strategy, but it’s really annoying for those of us who are trying to fix misconceptions. Now I will have to deal with folks asking me about whether or not I’m using the new map-reduce features in mk-query-digest when I am doing or explaining a query review.
Sheeri K. Cabral
6 May 10 at 12:59 pm
This will be extremely useful for me. I will try it real soon.
Mark Callaghan
6 May 10 at 3:47 pm
heh, map/reduce is a correct term.
map a workload to a server, analyze it, get reduced output.
I’ve been calling pretty much every distributed workload analysis I’ve been doing “map/reduce” – I’m pretty sure it matches the concept quite well ;-)
Domas
6 May 10 at 11:50 pm
If someone is unhappy with the tool, feel free to not use it at all. There are many tools out there such as this… Oh wait, what other tools? Maybe someone can create a tool that will not take a lot of memory but will take a long time to finish.
I’ve looked into the code and it is extremely well organized. I’m very impressed by how the code is written.
Bryan Ryan
7 May 10 at 12:31 am
Bryan Ryan, I’ll pass your compliment along to Daniel.
Xaprb
7 May 10 at 8:00 am
Daniel — we <3 Maatkit
Mark Callaghan
7 May 10 at 9:22 am
Thanks. :-)
Daniel Nichter
7 May 10 at 10:07 am
Bryan – I wasn’t complaining about the memory usage, I have a workaround and am using it happily.
All I was doing was asking if this particular “divide and conquer” strategy addresses the memory usage issue.
My complaint was that I didn’t get a straight answer. I use mk-query-digest and in fact have asked for (and my company has paid for) features in it.
In fact, I completely understand about the memory usage because there’s a lot of aggregation and comparison, so I understand the need to use that much memory.
Sheeri Cabral
7 May 10 at 2:07 pm
Here is a straight answer: no, this doesn’t reduce memory consumption. I didn’t mean to avoid the question. My reply was off-topic because I wanted to praise Daniel for his good work on figuring out what uses memory.
Xaprb
7 May 10 at 2:44 pm