Xaprb

Stay curious!

Estimating column cardinality the damn cool way

Have you seen Damn Cool Algorithms: Cardinality Estimation yet? If not, take a few minutes and read through it. Now, what if we try using that approach instead of COUNT(DISTINCT) in MySQL to see how many distinct values there are in a column?
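For readers who skipped the link, the idea in a nutshell: hash each value, use a few low-order bits of the hash to pick a bucket, track the maximum number of trailing zero bits seen in each bucket, and combine the bucket maxima into an estimate. Here's a minimal Python sketch of that scheme — my paraphrase of the linked post, not its exact code, with md5 standing in for any well-mixed 32-bit hash:

```
import hashlib

NUM_BUCKETS = 1024  # must be a power of two; matches the 1024 buckets used below

def trailing_zeroes(n):
    # position of the lowest set bit, i.e. count of trailing zero bits (n > 0)
    return (n & -n).bit_length() - 1

def estimate_cardinality(values):
    # bucketed max-trailing-zeroes estimator, per the linked post
    max_zeroes = [0] * NUM_BUCKETS
    for v in values:
        # take 32 bits of md5 as a stand-in for any well-mixed 32-bit hash
        h = int(hashlib.md5(v.encode()).hexdigest()[:8], 16)
        bucket = h & (NUM_BUCKETS - 1)   # low 10 bits pick the bucket
        rest = h >> 10                   # remaining 22 bits
        z = 22 if rest == 0 else trailing_zeroes(rest)
        max_zeroes[bucket] = max(max_zeroes[bucket], z)
    # 2 ** (average max zeroes), scaled by bucket count and the magic constant
    return 2 ** (sum(max_zeroes) / NUM_BUCKETS) * NUM_BUCKETS * 0.79402

# 100,000 distinct values should estimate to roughly 100,000
est = estimate_cardinality(str(i) for i in range(100000))
```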

I recently needed this information in real life. The table is large, and the column holds many duplicate values: a 32-character hex string, a hash that represents a session ID. I’ll call the column `sess_id`. I wanted to know how many distinct values it had, but I thought it would be cool (damn cool, really) to try this approach and see what happened.

I read the blog post, convinced myself that it made sense, and tried to code it. Here’s my rough translation of the algorithm into MySQL-speak. Note that I’m using `crc32()`, which may not be a great choice for a hash. (The value in the column is already a hash, so in theory I could just use the value, but I wanted to see how this would work if implemented in a way that can be applied generically.)

I decided to use a hard-coded 1024 “buckets”, and instead of an array, I used a temporary table.

```
create temporary table buckets (
  id         int not null primary key,
  max_zeroes int not null default 0,
  rowcount   int not null default 0
) engine=memory;
```

Here’s the single pass over the values. I’m using a user-defined variable `@c` to avoid recomputing the hash. Let me know if you see any mistakes in my code.

```
insert into buckets(id, max_zeroes, rowcount)
select
  (@c := crc32(sess_id)) & 1023,
  if(@c < 1024, 22, instr(reverse(bin(@c >> 10)), '1') - 1),
  1
from tbl1
on duplicate key update
  max_zeroes = greatest(max_zeroes, values(max_zeroes)),
  rowcount = rowcount + 1;
```
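As an aside, the `instr(reverse(bin(@c >> 10)), '1') - 1` expression is just a trailing-zero count on the hash with its bucket bits stripped off. That's easy to sanity-check outside MySQL; here's a quick Python check that the string trick agrees with direct bit arithmetic for positive inputs:

```
def mysql_trailing_zeroes(x):
    # mirrors MySQL's INSTR(REVERSE(BIN(x)), '1') - 1 for x > 0
    return format(x, 'b')[::-1].index('1')

def bitwise_trailing_zeroes(x):
    # direct bit arithmetic: isolate the lowest set bit, take its position
    return (x & -x).bit_length() - 1

# the two agree for every positive value we throw at them
mismatches = [x for x in range(1, 1 << 16)
              if mysql_trailing_zeroes(x) != bitwise_trailing_zeroes(x)]
```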

That query took 32 minutes; by comparison, a similar query using COUNT(DISTINCT) took 46 minutes. So far, so good. Now here’s the final bit, to combine the “buckets” and get the cardinality estimate. Again, flag errors if you see any:

```
select pow(sum(max_zeroes) / 1024, 2) * 1024 * 0.79402 from buckets;
+-------------------------------------------------+
| pow(sum(max_zeroes) / 1024, 2) * 1024 * 0.79402 |
+-------------------------------------------------+
|                              151741.65811039635 |
+-------------------------------------------------+
```

That’s not even close. I happen to know that there are more than 10,122,796 distinct values, and 21,669,591 rows overall.

Did I do something wrong? I’m trying to decide which error is most likely: that CRC32 is a poor choice of hash, a little coding mistake, the way MySQL handles numbers internally… something else? I’m not sure. I ran this one step at a time when I wrote the code, and all of the hashing and bit-counting seems to be correct, but the aggregate result is just way off.

Written by Baron Schwartz

September 22nd, 2012 at 3:55 pm

Posted in SQL

15 Responses to 'Estimating column cardinality the damn cool way'

1. I think this is wrong:
select pow(sum(max_zeroes) / 1024, 2) * 1024 * 0.79402 from buckets;

From my reading of the original code, I think you want this instead:
select pow(2, sum(max_zeroes) / 1024) * 1024 * 0.79402 from buckets;

Ernie S

22 Sep 12 at 4:46 pm

2. At first glance everything looks right to me. Maybe the fudge factor (.79402) needs adjusting? I haven’t read the papers, but perhaps the number needs adjustment based on the usage of crc32.

Justin Swanhart

22 Sep 12 at 11:02 pm

3. I think the arguments in pow() need to be swapped. It looks like you are calculating avg_of_max_zeroes ^ 2 while it should be 2 ^ avg_of_max_zeroes.

select pow(2,sum(max_zeroes) / 1024) * 1024 * 0.79402 from buckets;

Damodharan

23 Sep 12 at 12:02 am

4. An error on your calculation is to assume that:
(@c := crc32(sess_id)) & 1023
is always computed prior to:
if(@c > 10)), ’1′) -1)

It might be, and it often is. I don’t know if that skewed your calculation, but you might want to check it out.
You can force order of calculation as presented in Programmatics Queries: things you can code in SQL.

It may look something like this:

insert into buckets(id, max_zeroes, rowcount)
select
case
when (@c := crc32(sess_id)) & 1023 is not null then @c
end AS bucket_id,
case
when (@c := crc32(sess_id)) & 1023 is null then null
else if(@c > 10)), ’1′) -1)
end AS bucket_max_zeroes,
1
from tbl1
on duplicate key update
max_zeroes=greatest(max_zeroes, values(max_zeroes)),
rowcount = rowcount + 1;

I’m doing the crc32 calculation twice, unfortunately, as I do not know which of the two columns is evaluated first; but I’m making sure that by the time you access @c for evaluation, it is definitely well defined.

Hope this does not add much to computation time. Will continue to examine the logic.

Shlomi Noach

23 Sep 12 at 4:02 am

5. Ummm, no need for the first “case” statement (this is leftover from trying to work out all calculation within one case)

Shlomi Noach

23 Sep 12 at 4:05 am

6. Darn, error in the above code (of mine). Here’s the fix:
```
insert into buckets(id, max_zeroes, rowcount)
select
  (@c := crc32(sess_id)) & 1023
  case
    when (@c := crc32(sess_id)) is null then null
    else if(@c > 10)), '1') -1)
  end AS bucket_max_zeroes,
  1
from tbl1
on duplicate key update
  max_zeroes=greatest(max_zeroes, values(max_zeroes)),
  rowcount = rowcount + 1;
```

Shlomi Noach

23 Sep 12 at 4:15 am

7. DARN!! HTML escaping the “less-than”…

Feel free to edit/remove the above two comments, edit this one, and hopefully this one works:

insert into buckets(id, max_zeroes, rowcount)
select
(@c := crc32(sess_id)) & 1023,
case
when (@c := crc32(sess_id)) is null then null
else if(@c < 1024, 22, instr(reverse(bin(@c >> 10)), '1') -1)
end AS bucket_max_zeroes,
1
from tbl1
on duplicate key update
max_zeroes=greatest(max_zeroes, values(max_zeroes)),
rowcount = rowcount + 1;

Shlomi Noach

23 Sep 12 at 4:17 am

8. Some further notes:

- Your algorithm (pending the variable error) is faithful to the original.

- I don’t think the crc32 should make for trouble. But to test it, check out the variance of `rowcount` in `buckets`.
Now, make up a `buckets2` table, iterate through the same count of rows in `tbl1`, but this time choose bucket_id using RAND()*1024, or via SHA1 etc. If the variance is similar to that of crc32, you’re good.

- The overall speed boost is disappointing, I must say. But maybe this can be sped up by parallelization, as suggested in the original (though no MEMORY table then). Also, the “rowcount” calculation is not required; not sure if dropping it would help.

Shlomi Noach

23 Sep 12 at 4:50 am
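Shlomi's variance check from comment 8 can be sketched in Python, with `zlib.crc32` standing in for MySQL's CRC32() and a purely random assignment as the baseline; the synthetic 32-character hex values here are illustrative:

```
import random
import statistics
import zlib

random.seed(42)
NUM_BUCKETS = 1024

# 100,000 random 32-character hex strings, like the sess_id column
values = [format(random.getrandbits(128), '032x') for _ in range(100000)]

def bucket_counts(assign):
    counts = [0] * NUM_BUCKETS
    for v in values:
        counts[assign(v)] += 1
    return counts

# bucket by the low bits of crc32, as the article's query does
crc_counts = bucket_counts(lambda v: zlib.crc32(v.encode()) & (NUM_BUCKETS - 1))
# baseline: purely random bucket assignment
rnd_counts = bucket_counts(lambda v: random.randrange(NUM_BUCKETS))

crc_var = statistics.pvariance(crc_counts)
rnd_var = statistics.pvariance(rnd_counts)
```

If the two variances are of the same order, crc32's bucket assignment is no lumpier than chance on this data.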

9. Ernie is right~~

select pow(2, sum(max_zeroes) / 1024) * 1024 * 0.79402 from buckets;

mysql> select @a := sqrt((151741.65811039635 / 0.79402) / 1024);
+---------------------------------------------------+
| @a := sqrt((151741.65811039635 / 0.79402) / 1024) |
+---------------------------------------------------+
|                                      13.661132812 |
+---------------------------------------------------+
1 row in set (0.00 sec)

mysql> select pow(2, @a) * 1024 * 0.79402;
+-----------------------------+
| pow(2, @a) * 1024 * 0.79402 |
+-----------------------------+
|          10532759.055516426 |
+-----------------------------+
1 row in set (0.00 sec)

huarong

24 Sep 12 at 2:13 am
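huarong's arithmetic in comment 9 is easy to verify: invert the mistaken formula to recover the average max-trailing-zeroes the data produced, then feed that through the corrected formula. A quick Python check:

```
import math

wrong = 151741.65811039635           # output of pow(avg, 2) * 1024 * 0.79402
# invert the wrong formula to recover avg = sum(max_zeroes) / 1024
avg = math.sqrt((wrong / 0.79402) / 1024)
# apply the correct formula: 2 ** avg, not avg ** 2
corrected = 2 ** avg * 1024 * 0.79402
```

The corrected figure lands right near the known cardinality of more than ten million distinct values.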

10. Facepalm.

Michael Bolton: I don’t know what happened, I must have missed a decimal point or something…

Xaprb

24 Sep 12 at 9:57 am

11. I re-ran the query on another server, this one with lots more rows; it took 1h27m. With the correct order of operations to POW(), I get 36449500, which is a very reasonable answer. I’m pretty sure a COUNT(DISTINCT) would have taken about twice as long on this server.

To respond to Shlomi’s comment, I added the rowcount column so I could debug the buckets table better, and get an idea of selectivity. In this case, I have 85495403 rows in the table, so the column’s selectivity is about 43%. Also, I did check that the variables are initialized and used in the correct order, though anyone who uses this code should double-check that on their own data.

Xaprb

24 Sep 12 at 11:30 am

12. 1. LOL on the POW(…). And I’m writing scrolls…
2. 36M is far off from your real 10M distinct values. The paper suggests a small margin of error, some 5%. How does 300% compare?

Shlomi Noach

24 Sep 12 at 12:43 pm

13. The original dataset is gone now; the 36M is on a different dataset (which has grown to the point that I don’t want to run a DISTINCT against it). My gut feeling is 36M is credible on this table.

Xaprb

24 Sep 12 at 1:26 pm

14. Hi Baron

Using a CRC for a hash is indeed rather bad. CRC is purely intended to detect certain bit-errors in a data block. It’s never intended for comparing different blocks of data in any way.
Also, it’s not designed to create any particular spread of result values across different blocks of data. It operates on bit streams, not the overall value distribution that you need.

I wrote an item on this over 20 years ago, based on seeing incorrect applications of CRC causing heaps of grief. I still groan every time I see CRC used as a hash. It’s not a hash.

I figure that in order to get sane estimates it’s rather important that you get a decent spread. You’re vastly reducing the dataset and using probabilistic methods, so if you’re off early on things will only get worse.

You should be able to validate spread using an evenly distributed dataset, and just looking at the 1024 buckets. Do that with a real hash function suitable for the purpose, then I think this idea definitely has potential.

Note that even if you use an unsuitable hash function, or a non-evenly distributed dataset, you may still observe a relatively even distribution across the buckets. However, the actual values contributing to each bucket will be “wrong”, thus throwing the calculations out of whack anyway.

Cheers,
Arjen.

Arjen Lentz

25 Sep 12 at 6:56 pm

15. Hi Arjen, interesting on the lack of suitability of CRC. Perhaps http://en.wikipedia.org/wiki/Crc32 would benefit from stating this, e.g. under Application. There could even be a reference to your paper. :-)

I see http://dev.mysql.com/doc/refman/5.6/en/mathematical-functions.html#function_crc32 doesn’t state which of the many CRC-32s it uses. Given the example outputs match crc32(1) from the package libarchive-zip-perl, I guess it’s CRC-32-Adler, which is *not* a CRC but a checksum!

Using `openssl rand 128` 10,000 times, the last four bits of crc32(1)’s output have 0x6 occurring the least at 595 times and 0x4 the most with 656 times. (10,000/16 = 625.) But then sha1sum on the same files ranges from 0x7=584 to 0xe=671.

Ralph Corderoy

29 Sep 12 at 8:58 am