Stay Curious!

How to find data distributions with SQL

In an earlier article I wrote about grouping data into ranks with a catch-all bucket. In this article I’ll show you how to group the data into variable-sized buckets any way you please.

This query came up when a business partner asked me to send over the distribution of some hierarchical data. It’s the same category/subcategory/item data as in my article about optimizing joins and subqueries. The partner wanted to know, broadly speaking, “how many subcategories have very small and very large numbers of items”.

I could have done a simple query:

select category, count(*) as num
from item group by category

That would have resulted in one row for every category, which would have been thousands of rows – not very useful for answering the question. I needed just a few rows, showing how many subcategories are large and how many are small. I started by filling a temporary table with my desired size ranges:

create temporary table ranges (
   s int not null, -- start of range
   e int not null  -- end of range
);

insert into ranges (s, e) values
   (1, 1),
   (2, 10),
   (11, 50),
   (51, 100),
   (101, 200),
   (201, 500),
   (501, 1000),
   (1000, 9999);

Then I grouped the data by subcategory, joined it against the ranges by size, and grouped again by range, counting and summing the sizes of each of the subcategories to get totals. In the query below, I analyze the distribution of items in category 14:

set @category := 14;

select concat(s, '-', e) as range, sum(num) as total, count(*) as num
from ranges
inner join (
   select s.id, count(*) as num
   from subcategory as s
      inner join item as i on i.subcategory = s.id
   where s.category = @category
   group by s.id
) as x on x.num between ranges.s and ranges.e
group by ranges.s, ranges.e

The results look roughly like this:

The distribution is clearly biased towards single-item categories, with the occasional huge category. Part of the goal was to rewrite our grouping algorithm to chunk things together in groups of 20 to 80 (depending on a variety of complex factors I won’t explain here). This query helped us get a realistic picture before and after the algorithm change.

Posted on Sat, Jun 10, 2006. Approximately 400 Words.

Databases