Xaprb

Stay curious!

The cache-oblivious algorithms inside Tokutek’s TokuDB

with 2 comments

Tokutek have said they are working towards explaining their indexing algorithms. I spoke to some of the Tokutek people over the last 14 months or so about this, although I didn’t really start to pay attention until the beginning of the year. While Vadim, Peter and I were writing our blog post on TokuDB, I asked them to provide scholarly references, and they did, but warned me it would be dense reading, in part because it’s so academic. Mark Callaghan also told me he had gotten them to walk him through the math behind their indexing algorithm and found it hard.

Here’s a blog post with links to the research behind their work. I’m happy to say that after working through one of the papers at a superficial level, I agree — it’s pretty dense, though I think the concepts can be made understandable to mortals. It took me an hour and a half, but I didn’t take the time to convince myself of the validity of the proofs; that would probably take a long time. Moreover, I think these papers only describe parts of the foundation, not the actual implementation. I look forward to the layman’s version, which I’m sure will be more accessible to Bears of Very Small Brain like myself.

Disclaimer: I work for Percona, and Percona was paid to do some benchmarking and analysis for Tokutek. However, I am not paid to say this: I think TokuDB is one of the more interesting technologies I’ve seen in a while. By that, I mean it’s actually something new, a real advancement in applied computer science. Not just the same old B-Trees with a different twist of lemon.

Further Reading:

Written by Xaprb

June 7th, 2009 at 3:02 pm

Posted in SQL

Tagged with ,

2 Responses to 'The cache-oblivious algorithms inside Tokutek’s TokuDB'

Subscribe to comments with RSS

  1. This is exactly why benchmarks are critical. Any theory about a faster technology can be ultimately proved or disproved with benchmarks.

    I believe Tokutek has published good numbers for inserts, but I have not seen numbers for queries.

    Also last I heard they were not supporting multi-user transactions, so that would limit this DB’s potential usage.

    Ivan Novick

    7 Jun 09 at 11:12 pm

  2. Ivan, exactly. The numbers they published were from benchmarks we did for them on a consulting basis, and it was only single-threaded inserts. There are a lot of limitations. I don’t know anything about their code, because it is closed source, so I can only wait and see whether they remove those limitations; but they’ve told me they plan to. That’s why I call this an “interesting” technology.

    Xaprb

    8 Jun 09 at 6:51 am

Leave a Reply