Comments on: Estimating column cardinality the damn cool way http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/ Stay curious! Thu, 02 May 2013 12:36:53 +0000

By: Ralph Corderoy http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/#comment-20293 Ralph Corderoy Sat, 29 Sep 2012 12:58:18 +0000 http://www.xaprb.com/blog/?p=2873#comment-20293 Hi Arjen, interesting on the lack of suitability of CRC. Perhaps http://en.wikipedia.org/wiki/Crc32 would benefit from stating this, e.g. under Application. There could even be a reference to your paper. :-)

I see http://dev.mysql.com/doc/refman/5.6/en/mathematical-functions.html#function_crc32 doesn’t state which of the many CRC-32s it uses. Given the example outputs match crc32(1) from the libarchive-zip-perl package, I guess it’s CRC-32-Adler, which is *not* a CRC but a checksum!

Using `openssl rand 128' 10,000 times, the last four bits of crc32(1)'s output have 0x6 occurring the least, at 595 times, and 0x4 the most, at 656 times (10,000/16 = 625). But then sha1sum on the same files ranges from 0x7 = 584 to 0xe = 671.
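Ralph's nibble tally is easy to reproduce. The sketch below (mine, not Ralph's exact commands) hashes random 128-byte blobs with Python's `zlib.crc32` and `hashlib.sha1` — assumed here to stand in for the crc32(1) and sha1sum tools — and tallies the low four bits of each digest; with a uniform hash, each of the 16 buckets should land near 10,000/16 = 625.

```python
import hashlib
import os
import zlib
from collections import Counter

def nibble_histogram(hash_fn, trials=10_000, blob_size=128):
    """Tally the low 4 bits of hash_fn's output over random blobs."""
    counts = Counter()
    for _ in range(trials):
        counts[hash_fn(os.urandom(blob_size)) & 0xF] += 1
    return counts

crc_hist = nibble_histogram(zlib.crc32)
sha_hist = nibble_histogram(
    lambda d: int.from_bytes(hashlib.sha1(d).digest(), "big")
)
# With a uniform hash, each bucket should hold roughly 625 of 10,000 samples.
print(sorted(crc_hist.values()))
print(sorted(sha_hist.values()))
```

As Ralph observes, on random input the low bits of CRC-32 spread about as evenly as SHA-1's; the trouble shows up on structured input, not random blobs.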

By: Arjen Lentz http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/#comment-20287 Arjen Lentz Tue, 25 Sep 2012 22:56:54 +0000 http://www.xaprb.com/blog/?p=2873#comment-20287 Hi Baron

Using a CRC for a hash is indeed rather bad. A CRC is purely intended to detect certain bit errors in a data block; it was never intended for comparing different blocks of data in any way.
Also, it's not designed to produce any particular spread of result values across different blocks of data: it operates on bit streams, not on the kind of overall value you need.

I wrote an item on this over 20 years ago, based on seeing incorrect applications of CRC causing heaps of grief. I still groan every time I see CRC used as a hash. It’s not a hash.

I figure that in order to get sane estimates it's rather important that you get a decent spread. You're vastly reducing the dataset and using probabilistic methods, so if you're off early on, things will only get worse.

You should be able to validate the spread using an evenly distributed dataset, just by looking at the 1024 buckets. Do that with a real hash function suitable for the purpose, and then I think this idea definitely has potential.
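Arjen's validation suggestion can be sketched directly: hash an evenly distributed dataset (sequential integers, my choice for illustration), drop each hash into one of 1024 buckets, and inspect the spread. SHA-1 stands in for "a real hash function suitable for the purpose".

```python
import hashlib

def bucket_spread(values, num_buckets=1024):
    """Hash each value and count how many land in each bucket."""
    counts = [0] * num_buckets
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        counts[h % num_buckets] += 1
    return counts

counts = bucket_spread(range(100_000))
# An even spread puts every bucket near 100,000 / 1024, i.e. about 98.
print(min(counts), max(counts))
```

A min/max close to the 98 average (up to ordinary Poisson noise) is the "decent spread" the comment asks for; a lopsided histogram here means the estimator's probabilistic assumptions are already broken.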

Note that even with an unsuitable hash function, or a non-evenly distributed dataset, you may still observe a relatively even distribution across the buckets. However, the actual values contributing to each bucket will be “wrong”, thus throwing the calculations out of whack anyway.

Cheers,
Arjen.

By: Xaprb http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/#comment-20285 Xaprb Mon, 24 Sep 2012 17:26:41 +0000 http://www.xaprb.com/blog/?p=2873#comment-20285 The original dataset is gone now; the 36M is on a different dataset (which has grown to the point that I don’t want to run a DISTINCT against it). My gut feeling is 36M is credible on this table.

By: Shlomi Noach http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/#comment-20284 Shlomi Noach Mon, 24 Sep 2012 16:43:28 +0000 http://www.xaprb.com/blog/?p=2873#comment-20284 1. LOL on the POW(…). And I’m writing scrolls…
2. 36M is far off from your real 10M distinct values. The paper suggests a small margin of error, around 5%. How does 300% compare?

By: Xaprb http://www.xaprb.com/blog/2012/09/22/estimating-column-cardinality-the-damn-cool-way/#comment-20283 Xaprb Mon, 24 Sep 2012 15:30:22 +0000 http://www.xaprb.com/blog/?p=2873#comment-20283 I re-ran the query on another server, this one with lots more rows; it took 1h27m. With the correct order of operations to POW(), I get 36449500, which is a very reasonable answer. I’m pretty sure a SELECT(DISTINCT) would have taken about twice as long on this server.

To respond to Shlomi’s comment, I added the rowcount column so I could debug the buckets table better, and get an idea of selectivity. In this case, I have 85495403 rows in the table, so the column’s selectivity is about 43%. Also, I did check that the variables are initialized and used in the correct order, though anyone who uses this code should double-check that on their own data.
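The selectivity figure checks out directly (numbers copied from the comment):

```python
distinct_estimate = 36_449_500   # cardinality estimate from the query
total_rows = 85_495_403          # rows in the table
selectivity = distinct_estimate / total_rows
print(f"{selectivity:.1%}")      # prints "42.6%", i.e. about 43%
```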
