How to find contiguous ranges with SQL
In an earlier article I discussed how to find missing members in a sequence with SQL. In this article I’ll do the reverse: demonstrate how to find the start and end point of each contiguous range.
Motivation
Someone posted a comment on the article linked above, asking how to do this. At least, that’s what I think the question was; I might be misinterpreting it. I considered replying in the comments on that article, but decided it should go in its own article instead.
I’ll use the same sample data as in the earlier article: a sequence of integers from 1 to 20, with the numbers 5, 11, 12, 13, and 14 missing. I’ll also delete the value 7, so 6 is a range of length 1. The desired answer is
| start | end |
|---|---|
| 1 | 4 |
| 6 | 6 |
| 8 | 10 |
| 15 | 20 |
The solution
This isn’t as easy as I thought it would be at first. I stared at it for a while, then it came to me: I want to find the start and end of each contiguous range, so I need to define “start” and “end.” The start of a range is defined by the absence of the preceding number. I initially thought “has a next but no previous,” but that’s incorrect because a single number is a valid range; if I require the start to have a “next,” that eliminates 6 (I initially wrote the whole thing wrong, then thought about single-number ranges and re-wrote everything). So the definition of “start” is a number that has no “previous.”
The end of a range is almost the reverse: it has no “next” but might have a “previous.” Additionally, it should be the smallest such number that’s greater than or equal to the start. The “or equal to” is again necessary to include ranges that are just one number.
Each of these queries is fairly simple by itself, using exclusion joins. Here’s one that will find the start of every range:
select l.id
from sequence as l
left outer join sequence as r on r.id = l.id - 1
where r.id is null;
| id |
|---|
| 1 |
| 6 |
| 8 |
| 15 |
I’m referring to the left-hand table as “l” and the right-hand table as “r.” Here’s a query that will find the end of every range. It’s almost the same:
select l.id
from sequence as l
left outer join sequence as r on r.id = l.id + 1
where r.id is null;
| id |
|---|
| 4 |
| 6 |
| 10 |
| 20 |
Bringing the two together, and meeting the “smallest value greater than or equal to” requirement, is a more complex query. Here I solve it with a correlated subquery:
select l.id as start,
(
select min(a.id) as id
from sequence as a
left outer join sequence as b on a.id = b.id - 1
where b.id is null
and a.id >= l.id
) as end
from sequence as l
left outer join sequence as r on r.id = l.id - 1
where r.id is null;
I’ve re-aliased the subquery’s tables as “a” and “b” to avoid confusion with “r” and “l.”



Shouldn’t it be “a.id = b.id 1″ instead of “a.id = b.id – 1″ in the full statement?
DrDebian
14 Aug 07 at 3:04 pm
Any thoughts on how to write a SQL query that works with address ranges? I have a scenario where I need to find overlapping or gaps between ranges of addresses for a given street. For instance I have
SegmentID Street_name Left_from addr left_to_addr
1 Main 10 99
2 Main 10 199
There is an overlap of addresses in the table above since the Main1 address range is inculded in Main2 address range.
I would greatly appreciate your help here as I desperately need a way out of this. Thank you.
oumar
28 Aug 07 at 5:41 pm
That’s a beautiful thing, man. Very nicely done. Thanks for publishing it!!
SteveW
26 Nov 07 at 4:18 pm
Thanks for a great article. You’ve helped me a lot!
Dragoljub ĆurÄić
25 Feb 08 at 12:50 pm
This is pretty sweet!
I have a question though… How would one restrict this query to an id range?
For example if you only wanted to find only the ranges between 3 and 12. Further if you wanted to restrict it on other (non id) factors?
(My data set is more like the numbers 1 to 20 set to true, with the numbers 5, 11, 12, 13, and 14 set to false or missing and I only want the ranges between 3 and 12 of the true ids)
This has given me a great start, thanks!
Kyle
5 May 08 at 8:19 pm
How do I query to indicate missing sequence numbers with an “*”, (like the banks to on your bank statement when a check number is missing).
For example:
1
2
3
6*
7
8
10*
11
12
15*
Nonie
17 Jun 08 at 4:14 pm
Same question as Oumar!
Joseph
29 Oct 08 at 6:40 pm
This is beautiful, its probably the most clean, but extremly powerful piece of code i have ever seen.
Cheers mate! Your earn it!
Gustav
13 Nov 08 at 7:16 am
It’s easy enough to extend to work rows representing ranges.
I just did this for an ipv4 address range table with the structure
customer integer
fromAddress integer
toAddress integer
select l.customer, l.fromAddress as start,
(
select min(a.toAddress) as id
from ipRanges as a
left outer join ipRanges as b on a.toAddress = b.fromAddress – 1
and a.customer = b.customer
where b.fromAddress is null
and a.toAddress >= l.toAddress
) as end
from ipRanges as l
left outer join ipRanges as r on r.toAddress = l.fromAddress – 1 and r.customer = l.customer
where r.fromAddress is null;
Note that this query also groups by the customer field – just remove the references to “customer” in the sql and it’ll work on just the ip ranges
Royce
11 Jan 09 at 8:45 pm
How would I do the same as above based on a DateTime field as the unique entry.
For example I have a table as follows
RecordTime Tension
2009-01-01 00:00:00.000 1
2009-01-01 00:00:01.000 -8888
2009-01-01 00:00:02.000 3
2009-01-01 00:00:03.000 4
2009-01-01 00:00:04.000 2
2009-01-01 00:00:05.000 1
2009-01-01 00:00:06.000 -8888
2009-01-01 00:00:07.000 -8888
2009-01-01 00:00:08.000 4
2009-01-01 00:00:09.000 2
2009-01-01 00:00:10.000 1
2009-01-01 00:00:11.000 -8888
2009-01-01 00:00:12.000 -8888
2009-01-01 00:00:13.000 -8888
2009-01-01 00:00:14.000 -8888
2009-01-01 00:00:15.000 1
2009-01-01 00:00:16.000 -8888
2009-01-01 00:00:17.000 3
2009-01-01 00:00:18.000 4
2009-01-01 00:00:19.000 2
The above Table is defined as RecordTime DATETIME, Tesnion INTEGER
Any assistance would be appreciated.
Andrew Lindsay
16 Jul 09 at 11:11 am
Thanks for this, it really gave me a great starting point. The solution works well for small tables, unfortunately it gets really inefficient when used on large tables.
I wonder if someone has tried to solve the same problem in PL/SQL. It should be possible to do things a lot more efficiently there (no need to use a left outer join, just iterate over the sorted sequence and find gaps).
Hannes Koller
22 Sep 09 at 11:40 am
Thanks a lot!
I have a question.
Based on the example above, is it possible to retrieve the range, say, where id >= 2 and <= 18?
Initially I tried with the start range only, but I'm not able to get it correctly. The first start range will be 6, instead of 2.
Any ideas? Thanks
Jade
24 Sep 09 at 12:03 am
Hi,
after my last post I tried to solve the problem in PL/SQL. Although my problem is a little bit different, the principle should essentially be the same.
I have a table (sensordata_raw) with data from a number of detectors. Each detector is supposed to produce a record every minute. My task was to find the gaps (timeframes where a detector has not written a record).
I first tried the queries presented at this page but for a larger table ( about 20 Mio Records) this proved too inefficient. The PL/pgSQL solution I found works quite well for a table of this size
Here is my code if anyone is interested….
CREATE OR REPLACE FUNCTION find_gaps() RETURNS integer AS $$
DECLARE
sensordata_record RECORD;
current_detector_id INTEGER := -1;
last_timestamp TIMESTAMP;
expected_next TIMESTAMP;
real_next TIMESTAMP;
gap_start TIMESTAMP;
gap_end TIMESTAMP;
gap_count INTEGER := 0;
last_gap_end TIMESTAMP := NULL;
distance_to_last_gap INTERVAL;
BEGIN
RAISE NOTICE ‘Starting at %’, timeofday();
BEGIN
EXECUTE ‘DROP TABLE gaps ‘;
EXCEPTION
WHEN OTHERS THEN
RAISE NOTICE ‘ Table gaps did not exist ‘;
END;
EXECUTE ‘CREATE TABLE gaps (detector_id INTEGER, gap_start TIMESTAMP, gap_end TIMESTAMP, distance_to_last_gap INTERVAL);’;
FOR sensordata_record IN SELECT * FROM sensordata_raw ORDER BY detector_id,timestamp ASC LOOP
IF sensordata_record.detector_id != current_detector_id THEN –new detector
last_timestamp := NULL;
last_gap_end := NULL;
current_detector_id := sensordata_record.detector_id;
RAISE NOTICE ‘% Next Detector: %’, timeofday(), current_detector_id;
END IF;
IF (last_timestamp IS NOT NULL) AND (date_trunc(‘minute’, last_timestamp) != date_trunc(‘minute’,sensordata_record.timestamp) ) THEN — skip checking for first
expected_next := date_trunc(‘minute’,last_timestamp) + interval ’1 minute’;
real_next := date_trunc(‘minute’,sensordata_record.timestamp);
IF real_next != expected_next THEN — gap found
gap_start := expected_next;
gap_end := real_next – interval ’1 minute’;
distance_to_last_gap := NULL;
IF last_gap_end IS NOT NULL THEN
distance_to_last_gap := gap_start – last_gap_end;
END IF;
INSERT INTO gaps values (current_detector_id,gap_start,gap_end,distance_to_last_gap);
gap_count := gap_count + 1;
last_gap_end := gap_end;
– RAISE NOTICE ‘ GAP FOUND % to % ‘,gap_start, gap_end ;
END IF;
END IF;
last_timestamp := sensordata_record.timestamp;
END LOOP;
RAISE NOTICE ‘Finished at %’, timeofday();
RETURN gap_count;
END;
$$ LANGUAGE plpgsql;
Hannes Koller
24 Sep 09 at 3:10 am
Hi Works Great I Need this within but only when 3 other fields meet required criteria, Is there an easy way to do this (Access ?)
NickR
30 Jun 10 at 10:16 am
Given a range(given starting_numer and ending_number) I need to find if the numbers in a table are contiguious between the staring_number and ending_numer. Is there an easy way to do this? Could you please help?
Vans
7 Oct 10 at 7:49 pm
It looks like this query has very poor performance for large datasets. I found the inner select can be substituted by:
(select a.id as id from sequence a where a.id > l.id ORDER BY a.id ASC LIMIT 1) as end
This will greatly improve performance
EZboy
17 Nov 10 at 11:58 am
This is so close to the solution I need, but I’m afraid a dreaded cursor is my only recourse. I have to identify groups of tickets in a range. Problem is I have multiple ranges within a given row. e.g.
Section Row Seat
112 G 1
112 G 2
112 G 3
112 G 4
112 G 9
112 G 10
112 G 11
Your sql correctly identifies 1 and 9 as the starting points for ranges and 4 and 11 but how do I then join back to the above table and correctly identify seat 3 in group 1 vs group 2? I need to be both less than the end of the range, and greater than the beginning.
Desired output
Section Row StarSeat EndSeat GroupID
112 G 1 4 1
112 G 9 11 2
or
Section Row Seat GroupID
112 G 1 1
112 G 2 1
112 G 3 1
112 G 4 1
112 G 9 2
112 G 10 2
112 G 11 2
Thanks,
John
John
12 Apr 11 at 11:01 am
Ezboy,
Yep that’s a TON faster… but it also doesn’t appear to give the range, just the next item in the list (unless I’m doing something wrong).
Anybody have any suggestions to make this more efficient on large data sets?
glitch
17 Dec 11 at 8:53 pm
I have described an approach that, instead of using “min()” within a subquery to join the “range start” and “range end” tables as done in this post, creates a synthetic index for each and joins on that.
http://gcbenison.wordpress.com/2011/09/26/queries-that-group-tables-by-contiguous-blocks/
I believe that for many large tables, the synthetic index approach will be faster but imposes an additional space requirement due to the added temporary table.
Greg Benison
13 May 12 at 10:08 pm
love this query, do not understand the logic, and I’ve adapted it, and it works form. So much power from such few lines.
nathanm
25 Sep 12 at 9:30 am