Xaprb

Stay curious!

Archive for the ‘C. J. Date’ tag

The dangerous subtleties of LEFT JOIN and COUNT() in SQL

with 8 comments

I want to show you two very expensive SQL idioms that almost everyone uses. These are not query optimization problems — they are financially expensive patterns of SQL.

These queries don’t do what the user thinks they do, and thus they create subtle bugs. And the bugs cause a lot of trouble, and they’re hard to track down, and much time is spent on them. I’ve been meaning to write about this for a while, but a book I read recently brought this topic up again, so I dug this draft out and finished it.

Pattern One: LEFT JOIN

There is a big problem with LEFT JOIN. Can you think what it is? If not, ask yourself what’s the most confusing thing about SQL? That’s right, it’s NULL. People don’t understand NULL, and even people who do understand it have trouble with it, because it’s ugly and nasty and illogical. And LEFT JOIN introduces NULL into your queries, even when there aren’t any NULLs in your data.

What’s wrong with this query?

select tweedle, dee, dum
from table1
   left join table2 on table1.foo = table2.foo
where bar = 5;

If you don’t see a problem, you’ve just discovered why this is an expensive query pattern. Suppose I rewrote it like this?

select tweedle, dee, dum
from table1
   left join table2 on table1.foo = table2.foo
where table2.bar = 5;

Consider what happens when there is no matching row in table2. SQL fills in the “missing” row with NULL. And what happens then? table2.bar = 5 is unknown (it’s neither true nor false), so that entire row is eliminated from the result.

And as a result, the query is exactly equivalent to an INNER JOIN. Subtle but true; and I see this query pattern over and over again in the real world. A significant amount of the time that I’m asked to do query optimizations (20% of the time? maybe more?), I end up explaining that a LEFT JOIN query doesn’t do what the user thinks it does.

Pattern Two: COUNT()

Ironically, COUNT() is one of the few SQL aggregate functions that actually has correct behavior according to the relational model some of the time (but not all of the time). Unfortunately, it has two problems: 1) all the other functions behave incorrectly all of the time according to the relational model, so it’s different, and 2) it has a special-case behavior that many people don’t understand.

To recap briefly, COUNT() has two behaviors. If you say COUNT(expression), then it counts the number of rows in the result for which that expression is not NULL. In other words, it counts the number of values in the result. The other behavior is COUNT(*), which counts the cardinality (number of rows) of the result.

There’s a lot to discuss here, but I want to look at just one thing. Let’s look at the following query:

select user.userid, count(email.subject)
from user
   inner join email on user.userid = email.userid
group by user.userid;

What’s happening in that query? There are a few possibilities I see. The most obvious way to translate this query into English is “get the number of conversations for each user.” Except that’s not what it does. It counts the number of emails each user has. If I wanted the number of conversations, I’d need to say “count(distinct email.subject)”. You need the DISTINCT keyword there.

I said it counts the number of emails each user has, but it really doesn’t. It actually counts the number of times there is a value in email.subject (e.g. it is NOT NULL). If the column is defined to allow NULL, there might be a difference between the number of values in email.subject and the number of emails! So, if the query’s author really wanted to know the number of emails, the query should be this:

select user.userid, count(*)
from user
   inner join email on user.userid = email.userid
group by user.userid;

But what if that’s not what the author of the query meant? There’s no way to really know. There are several possible intended meanings for the query, and there are several different ways to write the query to express those meanings more clearly. But the original query is ambiguous, for a few reasons. And everyone who reads this query afterwards will end up guessing what the original author meant. “I think I can safely change this to…”

This example is meant to be simple, so it doesn’t illustrate perfectly, but I’ve seen lots of cases where many different interpretations are equally reasonable. I often end up being asked to optimize performance for such queries, and again I can only write back to the client “someone needs to tell me what this query is supposed to do, because I can’t figure it out.”

Summary

So now you’ve seen two query idioms to avoid, and why to avoid them. And you’ve seen that the problems are really caused by NULL! If you’re interested in exploring this train of thought further, you should get a copy of SQL and Relational Theory by C. J. Date.

Written by Xaprb

April 8th, 2009 at 4:13 am

Posted in Coding,SQL

Tagged with

A review of SQL and Relational Theory by C. J. Date

with 10 comments

SQL and Relational Theory

SQL and Relational Theory

SQL and Relational Theory How to Write Accurate SQL Code by C. J. Date, O’Reilly 2009. Page count: 266 pages of “real” text, plus hefty appendixes. (Here’s a link to the publisher’s site: SQL and Relational Theory How to Write Accurate SQL Code).

This is a very important book for anyone involved with databases. Before I say why, I need to apologize to Mr. Date. I tech-reviewed part of the book and did not care for it. I am afraid I was quite a curmudgeon in my review comments. So, Mr. Date, if you’re reading this — I want to say I enjoyed the book very much after all.

Back to the topic. This book is a follow-on and replacement to Database in Depth: Relational Theory for Practitioners, which I also own along with some of Mr. Date’s earlier books. However, I found it a much more profitable read than Database in Depth. Mr. Date’s writing is still recognizably his, with lots of digressions, notes, asides, and footnotes. But something’s different. It is much more direct and readable, and I think readers will find it practical and applicable to their real-world problems. (This is Mr. Date’s expressed hope.)

I had a lot of moments while reading the book where I thought yes! He’s said it so clearly! It always bugs me to see those mistakes, and finally here’s a book that can teach people how not to make them.

The book will help the student develop understanding of the relational model, and learn how SQL diverges from the relational model. As the book says, to the extent that SQL is correct and useful, it’s largely because it adheres to the relational model; in the places where it takes a side trip, there be dragons. It’s universal in every discipline, but I believe Mr. Date is the only authority I’ve read who says this about SQL: learn the rules, and when you know them, you know when and why it’s safe to break them. Otherwise you’re a ship without a rudder and compass.

The general format is this: learn the theory of some topic; learn about a relational approach; learn how that maps onto SQL; and learn some rules of thumb for staying away from the dangers caused by SQL’s approach. Here are some of the recommendations, which he has highlighted in bold for easy reference at the appropriate places in the book:

  • Since the phrase “null value” is a contradiction in terms, don’t use it; always say just “null” instead.
  • Never write SQL code that relies on ordinal positioning.
  • Never lie to the system by defining as a key some column combination that you know isn’t irreducible.

Anyone who’s tried to explain NULL to a beginner to SQL, or explain the difference between a key and an index, or fixed a system that broke because it relied on blind inserts and the table definition changed, will see the value of these recommendations. There’s one such recommendation every few pages. This book is packed with them, and every one is well justified and rigorously explained.

That’s not all. Mr. Date teaches the connection between logic and the relational model. Let me tell my own story here for a moment. I didn’t know anything about databases (except a course on relational theory, which I was unable to connect to the “real world” at all) until after I’d left university with a degree in Computer Science. And then I learned SQL, and then sometime later, I put two and two together: all those discrete math courses, all those proofs, all those predicates and propositions, and so on — it was the foundation for the sensible part of relational databases. So eventually it all did come together in my head, and I got a huge kick out of that. Now I cheer whenever I see someone teaching people to think logically about databases.

So I was definitely cheering as I read things like Types are sets of things we can talk about; relations are (true) statements we can make about those things. And then there’s a whole chapter on how to write questions you’d like to ask of a database, transform those into logical expressions, and then transform those in turn into SQL that is correct and will give the right answer.

I don’t expect this book will work for everyone; some people will not have the foundation in either SQL or logic they need to understand it, and so it will be a steep learning curve unless they’re willing to go study some other things as needed. And I think there is room for different approaches to teaching this subject, depending on the reader’s mental orientation (visual, logical, etc). But for those who like the formal logic-based approach, and/or are willing to invest some effort into really understanding, I think the payoffs will be great.

Written by Xaprb

March 29th, 2009 at 6:18 pm

Posted in Coding,Review,SQL

Tagged with