I mentioned Bloom Filters in my talk today at Strata. Afterwards, someone told me it was the first time he’d heard of Bloom Filters, so I thought I’d write a little explanation of what they are, what they do, and how they work.
But then I found that Jason Davies already wrote a great article about it. Play with his live demo. I was able to get a false positive through luck in a few keystrokes: add alice, bob, and carol to the filter, then test the filter for candiceaklda.
Why would you use a Bloom filter instead of, say…
- Searching the data for the value? Searching the data directly is too slow, especially if there’s a lot of data.
- An index? Indexes are more efficient than searching the whole dataset, but still too costly. Indexes are designed to minimize the number of times some data needs to be fetched into memory, but in high-performance applications, especially over huge datasets, that’s still bad. It typically represents random-access to disk, which is catastrophically slow and doesn’t scale.