Xaprb

How to subtract in SQL over samples that wrap

This article explains how to do subtraction in SQL over samples that wrap back to zero when they exceed a boundary.

A reader wrote in with a question about how to find how much traffic has passed through a network interface. The reader has a script that samples the interface’s statistics and stores them in a database. The statistics wrap back around when they exceed the maximum size of an integer, so it’s not a strictly increasing sequence. The question, paraphrased, is “how can I find out how much traffic has gone through the interface in any given time period?”

A key assumption is that the counter never wraps back to zero more than once between samples. If it does, all hope is lost.

Setup

To simplify the math, pretend the counter wraps at 1,000 and you have the following table:

```
create table samples(
    num int not null auto_increment primary key,
    bytes int not null
);

insert into samples(bytes) values
    (100), (900),
    (230), (700), (982),
    (163), (600);

select * from samples;
+-----+-------+
| num | bytes |
+-----+-------+
|   1 |   100 |
|   2 |   900 |
|   3 |   230 |
|   4 |   700 |
|   5 |   982 |
|   6 |   163 |
|   7 |   600 |
+-----+-------+
```

How much traffic?

A manual calculation is easier than it looks, and solving this by hand is the key to solving it in SQL. You don’t have to do a bunch of nasty math, like subtracting 982 from 163 (that’s already too hard for me). You just have to notice where the counter wraps. You can find these places by seeing where the number decreases from one sample to the next. In the example, the counter wraps twice: from 900 to 230, and from 982 to 163. Here’s the data, graphed with wraps “unrolled.”

There are several ways to proceed from here. One way is to calculate the traffic as 1,000 times the number of wraps. Then you just do a little math to “clean up the edges:” subtract the first number in the sequence, and then add the last number. This gives (2 * 1000) – 100 + 600, which is 2500.

Another approach is to go row by row, summing the difference between each row and the previous one. When the counter wraps, you add 1,000 before taking the difference. This gives the same answer, but it's a lot harder to do by hand.
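If it helps to see the arithmetic concretely, here's a small sketch of both manual methods in Python (the sample values are the ones from the table above):

```python
# Illustrative sketch of both manual methods, using the article's sample data.
samples = [100, 900, 230, 700, 982, 163, 600]
LIMIT = 1000  # the counter wraps back to zero at this value

# Method 1: count the wraps, then clean up the edges.
wraps = sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)
total_1 = LIMIT * wraps - samples[0] + samples[-1]

# Method 2: sum row-by-row differences, adding LIMIT whenever the counter wraps.
total_2 = sum(
    cur - prev + (LIMIT if cur < prev else 0)
    for prev, cur in zip(samples, samples[1:])
)

print(total_1, total_2)  # both give 2500
```

Both methods agree, which is a handy sanity check before translating either one into SQL.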

Either technique works given an arbitrary start and end point in the sequence. Now let’s see how to do these in SQL.

Problem: find the “previous” row

While these methods seem easy to humans, they resist many relational solutions because of the notion of “previous row.” SQL is set-oriented, and doesn’t do iterative row-by-row data manipulation. If you try to do this by grouping each strictly increasing set of data together and using aggregate functions like `SUM`, you’ll have trouble. You need the values from the “previous set” to do that, and that doesn’t work like you might want it to.

However, it’s not that hard to get the current and last row matched up side-by-side so you can operate upon them within the context of a single row:

```
select cur.num, cur.bytes, prev.bytes as prev_bytes
from samples as cur
left outer join samples as prev on cur.num = prev.num + 1;
+-----+-------+------------+
| num | bytes | prev_bytes |
+-----+-------+------------+
|   1 |   100 |       NULL |
|   2 |   900 |        100 |
|   3 |   230 |        900 |
|   4 |   700 |        230 |
|   5 |   982 |        700 |
|   6 |   163 |        982 |
|   7 |   600 |        163 |
+-----+-------+------------+
```

Once you think of “previous” in SQL terms, it becomes easy. Armed with this tool, we are ready to take on the queries.

Technique 1: count wraps and clean up the edges

Now that we’ve figured out how to find the “previous row,” how can we express the “count wraps and clean up edges” algorithm in SQL? Brace yourself:

```
select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
inner join (
    select cur.num, count(prev.num) as wraps
    from samples as cur
    left outer join samples as prev on cur.num = prev.num + 1
        and cur.bytes < prev.bytes
    group by cur.num
) as t1 on t1.num <= o.num
cross join (
    select bytes as start from samples order by num limit 1
) as t2
where o.num = 7
group by o.num;
```

Anything I say about that query will probably make it harder to understand, so I'll just count on you reading it carefully. It may help to remove some of it so you can see the intermediate results:

```
select sum(t1.wraps) as wraps, t2.start, o.bytes
from samples as o
inner join (
    select cur.num, count(prev.num) as wraps
    from samples as cur
    left outer join samples as prev on cur.num = prev.num + 1
        and cur.bytes < prev.bytes
    group by cur.num
) as t1 on t1.num <= o.num
cross join (
    select bytes as start from samples order by num limit 1
) as t2
group by o.num;
+-------+-------+-------+
| wraps | start | bytes |
+-------+-------+-------+
|     0 |   100 |   100 |
|     0 |   100 |   900 |
|     1 |   100 |   230 |
|     1 |   100 |   700 |
|     1 |   100 |   982 |
|     2 |   100 |   163 |
|     2 |   100 |   600 |
+-------+-------+-------+
```

You can also write it as a correlated subquery, instead of a subquery in the `FROM` clause:

```
select 1000 * (
    select count(*) from samples as cur
    inner join samples as prev on cur.num = prev.num + 1
        and cur.bytes < prev.bytes
    where cur.num <= samples.num
)
- (select bytes from samples order by num limit 1)
+ samples.bytes as total
from samples
where num = 7;
```

Both queries need `WHERE` clauses in multiple places to make them behave if you want anything other than the full range of data summed up. For example, if you want to sum over rows 3 through 6, the first query becomes

```
select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
inner join (
    select cur.num, count(prev.num) as wraps
    from samples as cur
    left outer join samples as prev on cur.num = prev.num + 1
        and cur.bytes < prev.bytes
    where cur.num > 3
    group by cur.num
) as t1 on t1.num <= o.num
cross join (
    select bytes as start from samples where num >= 3 order by num limit 1
) as t2
where o.num = 6
group by o.num;
```

The problem with both queries is the `<=` predicate, which turns them into O(n²) algorithms; they're essentially a cross join. Plus, they're hard to understand. It turns out that the method that's simplest by hand is complicated in SQL.

Method 2: Adjust when there's a wrap

The second method I showed above is more complex for humans, but it's actually simpler to do in SQL:

```
select sum(
    if(cur.bytes >= prev.bytes,
       cur.bytes - prev.bytes,
       cur.bytes - prev.bytes + 1000)) as total
from samples as cur
inner join samples as prev on cur.num = prev.num + 1
-- optional WHERE clause for choosing start/end:
-- where cur.num > 3 and cur.num <= 6
```

A slightly more compact way to write this is

```
select sum(
    cur.bytes - prev.bytes + if(cur.bytes >= prev.bytes, 0, 1000)) as total
from samples as cur
inner join samples as prev on cur.num = prev.num + 1
-- where cur.num > 3 and cur.num <= 6
```

This query is both simpler and more efficient than the first method I showed. If your platform doesn't support `IF()`, use a `CASE` expression instead.
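As a rough illustration of the portable `CASE` form, here's the same calculation run against SQLite from Python (SQLite has no `IF()`, so `CASE` is exactly what you'd reach for; the table mirrors the example above):

```python
# Sketch: the CASE form of method 2, run against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table samples(num integer primary key, bytes int not null)")
conn.executemany("insert into samples(num, bytes) values (?, ?)",
                 list(enumerate([100, 900, 230, 700, 982, 163, 600], start=1)))

# Self-join each row to its predecessor; add 1000 whenever the counter wrapped.
total = conn.execute("""
    select sum(cur.bytes - prev.bytes +
               case when cur.bytes >= prev.bytes then 0 else 1000 end)
    from samples as cur
    join samples as prev on cur.num = prev.num + 1
""").fetchone()[0]
print(total)  # 2500
```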

Method 3: do it with user-variables in MySQL

It's possible to do even better than the simple join technique on MySQL. Using some MySQL-specific tricks, you can make this query a once-through, low-cost algorithm, much the way you might do it by hand or in a programming language that supports iteration. If you want to know how this works, and why the query has to be written in such a contorted way, read my article on advanced user-variable techniques in MySQL.

```
set @last_bytes := null;

select sum(greatest(
    if(bytes >= @last_bytes,
       bytes - @last_bytes,
       coalesce(bytes + 1000 - @last_bytes, 0)),
    least(0, @last_bytes := bytes)
)) as bytes
from samples order by num;
```

This is a bit trickier to write than some of the other user-variable examples I've shown, because you can't use `@last_bytes is null` in any `IF()` or `CASE` statement. If you do, the query optimizer will look at `@last_bytes` at compile time, see that the statement can be optimized out and replaced with a constant, and your query will not work as you expect it to.
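To see what the contorted query is actually computing, here's the same once-through algorithm written as plain iteration (an illustrative Python sketch, with an ordinary variable playing the role of `@last_bytes`):

```python
# Sketch of the once-through, wrap-aware sum the user-variable query performs.
def wrap_aware_total(samples, limit=1000):
    total, last = 0, None
    for cur in samples:
        if last is not None:  # the first row has no predecessor, like @last_bytes = NULL
            total += cur - last if cur >= last else cur + limit - last
        last = cur  # plays the role of @last_bytes := bytes
    return total

print(wrap_aware_total([100, 900, 230, 700, 982, 163, 600]))  # 2500
```

The SQL version has to smuggle the `last = cur` assignment into the same statement, which is why it looks so strange.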

Summary

In this article I've shown you three methods to do SQL math on a counter that wraps around to zero when it reaches a limit. I showed them to you in order of increasing efficiency, but the second method is both the simplest and the most platform-independent (and probably the most sane).

Can you think of any other ways to do this, or any other uses for these kinds of techniques? Write in the comments!

Written by Baron Schwartz

February 19th, 2007 at 8:42 am

Posted in Uncategorized

How to delete duplicate rows with SQL, Part 2

By reader request, this article explains ways to remove duplicate rows when they are completely identical, and you don’t have a primary key or other criterion to identify which rows to “save.”

This is a special case of deleting duplicates. I’ve written another article about the more general case, so I assume you have the background it gives. If not, you should probably go read my article about how to delete duplicate rows in SQL.

Introduction

In general, this is a hard problem. Suppose you have a `fruits` table in which the value "apple" appears three times and "pear" twice, and you want to delete everything but one row of each type (you don't care which, because the duplicate rows are completely identical).

When you're done, you want just two rows in the table: one apple and one pear.

Why this is hard

This is hard because there is no way to do this in standard SQL (correct me if I’m wrong). SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set. That’s why SQL doesn’t give you tools to solve this problem.

No database product is truly relational, so in real life it’s possible for duplicates to occur. When it happens, you will have to resort to platform-specific methods to solve it. There should always be a way to do it, because there is always a difference between apparently identical rows. It might be an internal row ID, for example (as in Oracle). If nothing else, the rows have different memory and disk locations in the computer.

The easy way

The easiest thing to do is add a column with a unique number. This is called something different on every platform: it’s an `IDENTITY` column in SQL Server, an `AUTO_INCREMENT` column in MySQL, a `SERIAL` in PostgreSQL, and so on. Look at your platform’s documentation for instructions how to do it.

Once you’ve done that, you’re on easy street. Now go read my previous article to do the actual deleting.
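As a sketch of the idea: in SQLite, the implicit `rowid` already provides exactly such a unique number, so the delete becomes a one-liner (the single-column `fruits` table here is made up for illustration):

```python
# Sketch: using SQLite's implicit rowid as the "added unique column"
# to keep one row per duplicate group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table fruits(fruit text not null)")
conn.executemany("insert into fruits(fruit) values (?)",
                 [("apple",), ("apple",), ("pear",), ("pear",), ("pear",)])

# Keep only the row with the smallest rowid in each group of duplicates.
conn.execute("""
    delete from fruits
    where rowid not in (select min(rowid) from fruits group by fruit)
""")
rows = [r[0] for r in conn.execute("select fruit from fruits order by fruit")]
print(rows)  # ['apple', 'pear']
```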

If that won’t do…

Build a new table with distinct values from the old table, then drop and rename:

```
CREATE TABLE new_fruits ...;

INSERT INTO new_fruits(fruit)
SELECT DISTINCT fruit FROM fruits;

DROP TABLE fruits;

RENAME TABLE new_fruits TO fruits;
```

If you can’t do that…

Perhaps you simply can’t do either of the above. Maybe your table is too large, for example. In that case you’re going to have to use some sort of iterative technique to do it; loop through the rows one at a time and delete every row you see more than once. This is also going to be a platform-specific solution; you may need to use a `WHILE` loop or server-side cursor. Consult your platform’s documentation for more; I can’t possibly cover all the bases here.

Two examples for MySQL

Here’s a quick technique that uses advanced user-variable techniques on MySQL to delete the rows. MySQL’s server-side cursors are read-only, so some other technique has to be used. User-variables can do the trick, if you write the statement just right — it’s very touchy.

```
set @num := 0, @type := '';

delete from fruits
where greatest(0,
    @num := if(type = @type, @num + 1, 1),
    least(0, length(@type := type))) > 1
order by type;
```

If you don’t understand that, go read the article :-) This can be very efficient because it doesn’t require any `GROUP BY` clause. If your rows are “naturally ordered” with all the duplicates adjacent to each other, you can even omit the `ORDER BY` clause (if your rows aren’t “sorted naturally,” you will miss some duplicate rows).

The other obvious option is to repeatedly identify a duplicated row, find how many times it’s duplicated, and delete one less than that many rows. You will need to either do this in a stored routine, or get help from some programming language. For example, in pseudo-code:

```
set @num := 0;

select @type := type, @num := count(*)
from fruits
group by type
having count(*) > 1
limit 1;

while @num > 0

    delete from fruits where type = @type
    limit @num - 1;

    set @num := 0;
    select @type := type, @num := count(*)
    from fruits
    group by type
    having count(*) > 1
    limit 1;

end while
```

That is pseudo-code, by the way; if you’re doing this in a stored procedure, you’re going to have to concatenate strings together to make an executable statement and execute it. If you’re using an external programming language, you’ll need to fetch the values that are duplicated and dynamically build a statement that deletes all but one row.
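For concreteness, here's one way that loop might look from an external language, using Python and SQLite (a sketch under assumptions: stock SQLite doesn't support `DELETE ... LIMIT`, so this deletes specific rowids instead of deleting `@num - 1` rows with a `LIMIT`):

```python
# Sketch: repeatedly find a duplicated value and delete all but one copy.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table fruits(type text not null)")
conn.executemany("insert into fruits(type) values (?)",
                 [("apple",)] * 3 + [("pear",)] * 2 + [("fig",)])

while True:
    # Find one duplicated value and how many times it occurs.
    row = conn.execute("""
        select type, count(*) from fruits
        group by type having count(*) > 1 limit 1
    """).fetchone()
    if row is None:
        break
    dup_type, n = row
    # Delete one less than that many rows, targeting rowids explicitly.
    rowids = conn.execute(
        "select rowid from fruits where type = ? limit ?",
        (dup_type, n - 1)).fetchall()
    conn.executemany("delete from fruits where rowid = ?", rowids)

remaining = sorted(r[0] for r in conn.execute("select type from fruits"))
print(remaining)  # ['apple', 'fig', 'pear']
```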

Summary

In this article I explained how to solve the special-case problem of removing duplicate rows with no distinguishing columns at all. It’s a harder case of the general problem, and SQL has no built-in way to solve it, so you have to learn your platform’s tricks to solve it. I showed you how to add a unique column so you can use the “easy” techniques I explained in an earlier article. You might also be able to put the rows into another table and drop the original table. Failing that, you have to use something like cursors. As a bonus, I explained two ways to do this in MySQL, one of them sneaky and the other not.

Written by Baron Schwartz

February 6th, 2007 at 9:03 am

Posted in Uncategorized

How to calculate table checksums in MySQL

MySQL has no built-in functionality to calculate a table's checksum in any storage engine but MyISAM. (That turns out not to be true; I read the manual wrong, but it doesn't eliminate the usefulness of the technique in this article.) Table checksums can confirm that two tables are identical, which is useful to verify a slave server is in sync with its master (see my article on reliable MySQL replication for more). Fortunately, it's easy to calculate a checksum on non-MyISAM tables with user variables. This technique works on any storage engine with any version of MySQL, doesn't require the `BLACKHOLE` storage engine, and avoids locks caused by `INSERT... SELECT` on InnoDB tables.

Update: I've coded this method into a Perl script for you to use. See MySQL Table Checksum for more details.

MyISAM checksums

If you want to verify that two MyISAM tables are identical, you can use `CHECKSUM TABLE`. If you need to do this frequently, you might want to make sure your MyISAM tables have the `CHECKSUM = 1` option on so checksums are maintained continuously and don't have to be calculated.

Foundation

I derived this technique from Giuseppe Maxia’s article on how to use the `BLACKHOLE` storage engine to avoid cursors. Giuseppe uses a two-step process:

1. Concatenate the values of all columns in the first row, and get a checksum of it with a cryptographic hash function.
2. Concatenate this checksum with a global checksum, and take the checksum of the result, then move to the next row and repeat.

After processing each row in the table this way, you have a checksum of the entire table’s contents.
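The chaining is easy to sketch outside of SQL. This Python version mirrors the two steps (the `'#'` separator matches the `CONCAT_WS('#', ...)` call used later; the rows here are made up):

```python
# Sketch of the chained table checksum: hash each row, then fold each
# row hash into a running global hash.
import hashlib

def table_checksum(rows):
    crc = ""
    for row in rows:
        # Step 1: concatenate the row's columns and hash them.
        row_hash = hashlib.sha1(
            "#".join(str(v) for v in row).encode()).hexdigest()
        # Step 2: concatenate with the running checksum and re-hash.
        crc = hashlib.sha1((crc + row_hash).encode()).hexdigest()
    return crc

rows = [("apple", "gala", 2.79), ("pear", "bartlett", 1.99)]
print(table_checksum(rows))  # a 40-character hex digest
```

Note that the result depends on row order, which is why a deterministic scan order matters so much later in this article.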

Two ways to iterate over rows

Giuseppe showed two ways to iterate over every row in the table: one with cursors, and another with user variables. Cursors in MySQL can’t be used with dynamically prepared statements, so they are much less practical for a generic method of checksumming any table; that would require creating a different routine for each table. So cursors are out, and user-variables are in. Here’s the essential code:

```
set @crc := '', @cnt := 0;
select @crc := sha1(concat(@crc,
        sha1(concat_ws('#', col1, col2... colN)))),
    @cnt := @cnt + 1
from tbl order by ID;
select @crc, @cnt;
```

(This code both CRCs the table and counts the number of rows in it).

The problem with this method, as Giuseppe points out, is the result set. If you run this on a table with a million rows, you’re going to get a million rows of CRCs and counts back from the first `SELECT`. This is not good; you really want the server to throw the intermediate values away so they don’t consume resources needlessly.

Giuseppe’s solution to this is to change the `SELECT` to an `INSERT... SELECT`. The `INSERT` goes into a `BLACKHOLE` table, which is like dumping it into `/dev/null`. It is simply discarded. This avoids some overhead, and Giuseppe found that it’s faster than using a cursor.

Limitations and performance concerns

There are still some undesirable effects of this approach. One is the `BLACKHOLE` storage engine. It’s not included in MySQL’s official binaries, so you’ll have to compile your own server to use it. It’s also not available before version 4.1.11. Those limitations rule this method out for many users.

The second is InnoDB. If you use InnoDB (odds are you do, or you wouldn’t be reading this) the `INSERT... SELECT` will place shared locks on every row in the table. If you’re doing this on an otherwise quiet server, it may not matter, but if you’re taking live checksums on a production machine, it’s much better to avoid those locks.

I’ll show you how to avoid both problems.

Wrap the user-variables up and discard the results

If you’ve read my articles about advanced user-variable techniques in MySQL, you might see where I’m headed. You don’t have to generate a large result set from that SELECT statement. In fact, you can hide the variable assignments, which you care about, inside a function whose result you don’t care about, and use an aggregate function to eliminate all but one row. Study the following code, which uses the same `fruits` table as in the user-variables article:

```
set @crc := '';

select min(
    length(@crc := sha1(concat(
        @crc,
        sha1(concat_ws('#', type, variety, price))))))
from fruits use index(PRIMARY);
+---------+
|      40 |
+---------+

select @crc;
+------------------------------------------+
| @crc                                     |
+------------------------------------------+
| 3be9117fff37bcdd3f422e6ce4d24ee2a6642566 |
+------------------------------------------+
```

Notice a couple of things: there’s only one row, and the `MIN()` calculation that collapses all those rows into one is very efficient. Maybe a MySQL developer can comment on exactly how much memory this will take, but I think it should be really cheap, since it processes a row at a time and then throws it away.

I omitted the row count calculation for clarity. If you want to count rows as well, the following code will do both in one statement:

```
set @crc := '', @cnt := 0;

select min(least(
    length(@crc := sha1(concat(
        @crc,
        sha1(concat_ws('#', type, variety, price))))),
    @cnt := @cnt + 1))
from fruits use index(PRIMARY);

select @crc, @cnt;
```

You should always reset `@crc` and `@cnt` between runs so you get repeatable results.

Further considerations

It’s important to order the `SELECT` by something predictable, or the results will be non-deterministic. However, an `ORDER BY` clause won’t do it — that orders the final result, not the table scan. Forcing a certain index to be used will do the trick, hence the `USE INDEX` clause above. If you don’t have a primary key on your table, use a `UNIQUE` key if that’s available; otherwise, you’re probably not going to be able to get a dependable checksum.

This method is easy to use inside a stored procedure or routine on MySQL 5.0. You can easily build a column list from `INFORMATION_SCHEMA` and generate the dynamic SQL to checksum a table.

This technique is CPU-bound on our servers. I used the BENCHMARK() function to time different hash functions to try to improve the speed. In my tests, `SHA1()` took about 85% as long as `MD5()`. This surprised me; based on some cryptographic function benchmarks I found on the web, I thought `MD5()` would be faster. It may be system-dependent.

I’m also a little worried about using `CONCAT_WS()` to stringify each row for the hash function. This function skips `NULL` values, so it’s easy to come up with an edge case where two rows with different columns hash to the same thing. Although this is a very unlikely problem, I’d still rather not have it. If you know of a way to fix this, please let me know.
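To make the edge case concrete, here's a Python sketch of `CONCAT_WS`'s NULL-skipping behavior: two rows that differ only in where the `NULL` sits stringify identically, so they'd contribute the same row hash:

```python
# Sketch: CONCAT_WS skips NULL arguments, so different rows can
# produce the same concatenated string (and hence the same hash).
def concat_ws(sep, *vals):
    return sep.join(str(v) for v in vals if v is not None)  # skips NULLs

row_a = ("apple", None, "gala")
row_b = ("apple", "gala", None)
print(concat_ws("#", *row_a) == concat_ws("#", *row_b))  # True: same string
```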

Finally, a comment on doing this on running servers: if you're comparing a master and a slave, you can use `LOCK TABLES` on the master, find the checksum there, and then find the checksum on the slave before releasing the lock on the master. If your slave isn't far behind the master, it ought to have plenty of time to catch up while the checksum runs on the master, at which point the table on the slave should be identical to the master's (because you have the table locked on the master, no changes to it are replicating to the slave). This makes it practical to verify that a slave is in sync without stopping the whole server.

Perl snippet to generate a SQL statement

For the Perl programmers out there, the following subroutine accepts a database handle, database name and table name, and returns the table’s storage engine and a ready-to-run query.

```
sub checksum_query {
    my ( $dbh, $db, $tbl ) = @_;
    my $ddl = ($dbh->selectrow_array("show create table $db.$tbl"))[1];
    my ( $type ) = $ddl =~ m/^\) (?:ENGINE|TYPE)=(\S+)/m;
    my $cols  = join(', ', $ddl =~ m/^\s+(`[^`]+`)/gm);
    my $index = $ddl =~ m/PRIMARY KEY/ ? ' USE INDEX(PRIMARY)' : '';
    my $query = 'SELECT MIN(LEAST(0, LENGTH(@crc := SHA1('
        . 'CONCAT_WS("#", @crc, SHA1(CONCAT_WS("#", ' . $cols . '))))),'
        . '@cnt := @cnt + 1)) AS len FROM ' . "`$db`.`$tbl` $index";
    return ( $type, $query );
}
```

I prefer to use `SHOW CREATE TABLE` instead of `DESCRIBE`, because it gives me all information about the table, such as the storage engine and index types. I have also found it to be faster, and of course it works on pre-5.0 versions.

Summary

In this article I showed you how to calculate a checksum for an entire table’s contents in MySQL. I used my favorite technique of hiding user-variable assignments inside a function to reduce a table scan’s result set to a single integer. This avoids a lot of overhead, and eliminates the need to insert the result set into a `BLACKHOLE` table, which means you can use the technique on a standard MySQL-supplied server. It also avoids InnoDB row locks. And it works on all storage engines, with all versions of MySQL that support user variables (3.23.6 and up).

I also touched a bit on the finer points of `NULL` values and `ORDER BY` for consistent results. And I gave you some ready-to-use Perl code to generate the SQL you need to execute against a given table.

This is the fastest, easiest, most efficient way I know to compare the contents of two or more tables, which is necessary to verify that a replicated master and slave server are still in sync.

Written by Baron Schwartz

January 25th, 2007 at 4:01 pm

Posted in Uncategorized
