Xaprb

Stay curious!

How to find data distributions with SQL

with one comment

In an earlier article I wrote about grouping data into ranks with a catch-all bucket. In this article I’ll show you how to group the data into variable-sized buckets any way you please.

This query came up when a business partner asked me to send over the distribution of some hierarchical data. It’s the same category/subcategory/item data as in my article about optimizing joins and subqueries. The partner wanted to know, broadly speaking, “how many subcategories have very small and very large numbers of items”.

I could have done a simple query:

select category, count(*) as num
from item group by category

That would have resulted in one row for every category, which would have been thousands of rows — not very useful for answering the question. I needed just a few rows, showing how many subcategories are large and how many are small. I started by filling a temporary table with my desired size ranges:

create temporary table ranges (
   s int not null, -- start of range
   e int not null  -- end of range
);

insert into ranges (s, e) values
   (1, 1),
   (2, 10),
   (11, 50),
   (51, 100),
   (101, 200),
   (201, 500),
   (501, 1000),
   (1000, 9999);

Then I grouped the data by subcategory, joined it against the ranges by size, and grouped again by range, counting and summing the sizes of each of the subcategories to get totals. In the query below, I analyze the distribution of items in category 14:

set @category := 14;

select concat(s, '-', e) as range, sum(num) as total, count(*) as num
from ranges
inner join (
   select s.id, count(*) as num
   from subcategory as s
      inner join item as i on i.subcategory = s.id
   where s.category = @category
   group by s.id
) as x on x.num between ranges.s and ranges.e
group by ranges.s, ranges.e

The results look roughly like this:

The distribution is clearly biased towards single-item categories, with the occasional huge category. Part of the goal was to rewrite our grouping algorithm to chunk things together in groups of 20 to 80 (depending on a variety of complex factors I won’t explain here). This query helped us get a realistic picture before and after the algorithm change.

Written by Xaprb

June 10th, 2006 at 8:47 pm

Posted in Uncategorized

One Response to 'How to find data distributions with SQL'

Subscribe to comments with RSS

  1. SO many thanks.. that was really useful for me… I was trying to do a report of salaries grouped by ranges and I had no idea how to do it… and at the end the result was: SELECT concat( start, ‘-’, end ) AS range, count( salary ) AS total FROM employees INNER JOIN ranges ON employees.salary BETWEEN ranges.start AND ranges.end GROUP BY ranges.start, ranges.end

    Flamander

    11 Jul 08 at 1:42 pm

Leave a Reply