Archive for March, 2006

SSH public-key forwarding

SSH is one of the most important tools I have. I use it every day to communicate securely between many different computers, and consider it indespensable. In this article I’ll show you how to forward your SSH agent to connect from any remote server to any other remote server without putting your private key on either of them.

Introduction

First, the concept: SSH is a secure protocol for setting up a communications channel between two computers on a network. This communications channel can act as the “carrier” for any number of uses. All traffic is encrypted, so you can get a “secure line” across an insecure network.

Probably the most familiar usage of SSH is connecting and logging in remotely to another computer. Once logged in, you can use the command line just as though you were sitting at the other computer’s terminal. Other uses include secure FTP, secure file copies, secure CVS access, running GUI applications on another computer and forwarding the display to the computer you’re working on so it appears to be running on your own computer, and on and on.

If there’s a UNIX or GNU/Linux distribution that doesn’t include SSH by default, I’ve never heard of it. Even Mac OS X includes it. Microsoft Windows does not, but there’s an excellent free implementation that runs under Windows: PuTTY.

No more passwords!

One of the really nice things about SSH is that it supports multiple authentication methods. One is passwords. Passwords, though, are one of the least secure and convenient methods of authenticating. Enter public-key authentication: SSH can authenticate me with an encryption key. If I put my public key on someserver, present my private key to the SSH program and tell it to connect to someserver, the SSH server on the other end uses my public key to log me in. I never have to type a password.

All I have to do is generate an encryption key pair. The keys are matched to each other in such a way that anything I encrypt with my private key can only be decrypted with my public key, and vice versa. That’s how I can log in without sending my private key to the other server — my SSH program just encrypts the traffic and sends it, and the server at the other end of the connection can either decrypt it or it can’t. If it has my public key available, it can.

Now, to make it really convenient, I need to load my private key into memory. I use the handy keychain program to do this. One done, my key is ready for use until I reboot the computer. At my previous employer I used PuTTY on Windows, so I used the Pageant program to load my private key. I also used the excellent free CVS client TortoiseCVS (there’s a similar one for Subversion, too — TortoiseSVN).

Guess what? No more passwords! Any computer I connect to will just magically log me in, as long as it has my public key. For more information on this, please read the keychain documentation, especially the IBM DeveloperWorks articles to which it links.

One thing to watch out for: don’t let anyone have your private key. Keep it private and secure. Make sure you know which of your keys is the public key and which is the private key. Nobody needs your private key, and you shouldn’t have it on any other computer — you put your public key on the computers to which you need to connect.

More convenience: forwarding keys

OK, so I’ve set everything up. I have put my public key on the two computers I connect to, server1 and server2. I open a terminal and type ssh server1. Boom! I’m logged in. Then I realize I put a file on server2 and I need it. I’m on server1 right now. No problem; I connect via SSH. But wait, server2 asks me for my password! What’s this?

What happened is, I logged into server1 with my private key, but when I tried to connect to server2, my private key wasn’t available to authenticate me. At this point, I might copy my private key to server1, but that’s not the best way to do it. In fact, it’s possible to tell SSH to forward my private key to servers when I connect to them. This is in-memory only; the key never gets stored on the disk, so it stays secure. But it’s available if it’s needed to connect to other servers. This chain of forwarding can go on and on. All I need to do is tell SSH to forward the keys (more strictly, forward the agent, which has loaded my keys into memory). This is a configuration option, set in the .ssh/config file on my local computer — the one I’m sitting at:

Host *
ForwardAgent yes

That configuration directive tells SSH to forward my SSH agent to all hosts. Voila!

Next steps

I’ve covered a lot of ground here, and not really gone into much depth with any of it, but I hope this gives you an idea of how much there is to SSH — one of those great tools that can be as simple or as complex as you need it to be. I’m not even pretending to be an expert; I know I don’t know that much about it myself. I only know as much as I’m motivated to learn :-)

While I’m on the subject, I want to point out that probably the most widely used implementation of SSH is OpenSSH. It’s from the same people that make OpenBSD, a famously secure operating system. And though it is nearly ubiquitous and indispensable, even for Microsoft folks, very little attention goes to the fact that it’s free but nobody really contributes money to it. The OpenSSH project is painfully underfunded, and there is widespread discussion about the problems this may cause. I encourage you to donate something to it, even if it’s just ten dollars. It is easily the most underfunded project I can think of, especially considering how much of the world depends on it.

Technorati Tags:No Tags

You might also like:

  1. Ubuntu on Dell Inspiron 1501

Temporary table subtleties in MySQL

Temporary tables behave very differently across various database servers. If you’re not familiar with MySQL, some things might catch you off guard. In this article I explain some subtleties of temporary tables in MySQL and explain when you might encounter problems with them. I also show you how the platform-specific features can sometimes be very useful indeed.

Creating temporary tables

In MySQL, creating a temporary table has exactly the same syntax as creating a regular table, with a few restrictions (no foreign keys, for example). For almost all scenarios though, you can create a temporary table anywhere you’d use a regular table. The table is visible only to the connection by which it is created. Even the same user won’t see the table on another connection. When the connection is closed, the table is deleted.

Once per query

Temporary tables can only appear once in every query. Sometimes this can bite unexpectedly. For example, sometimes I use temporary tables to assist in writing queries, and I might write a subquery against the temporary table, then join it back to itself:

select... from temp_table
inner join (
   select ... from temp_table
);

This will fail. It’s fairly obvious in this example, but I’ve run into this a few times when it caught me off guard and took a while to figure out what was going on. The error message is ERROR 1137 (HY000): Can't reopen table: 'temp_table'. Again that’s fairly obvious, but for whatever reason — I can’t recall the circumstances — it was actually hard to root out the problem.

One place I wished this would work was in a query against the integers table. Alas!

Temporary table limitations are documented in the manual.

Masking regular tables

A MySQL temporary table can have any legal name, even the same name as an existing regular table. In this case it will mask the regular table.

This makes some people cringe, but it actually creates some very interesting possibilities, especially for testing — where testing against non-production data is impossible. Yes, I know some people say that’s never necessary, but in real life it sometimes is. For example, testing communications against a system with no way to run anything but real live transactions (many companies, such as a very large one with both A and Z in its name, provide no fully isolated staging systems to test against). In a case such as this, I might run a transaction after masking my real table with a temporary table, examine the temporary table, and when I’m done fetch the changes, drop the temporary table and apply the changes to the (un-masked) regular table.

In my current employment I’ve found this technique useful for automated testing because we’re pretty small and don’t have the resources yet to have a full suite of development servers (we’re almost there…). Until the development servers are completely ready, it’s important to be able to test database interactions just like everything else. One way to do this is begin a transaction, do something, then roll back the transaction. In a table with millions of rows, this is a terrible idea. It’s much better for an object to expose its database handle, which the test harness can use to create a temporary table masking the huge regular table. That way the automated unit tests can stay very low-impact — a requirement if they’re to be run constantly.

Query cache

MySQL has a genuine bug that affects temporary tables. If the query cache is enabled for the current connection and I create a temporary table that masks a regular table, I can INSERT into the temporary table but SELECT from the regular table! What’s happening here? MySQL is incorrectly looking in the query cache, which contains data from the regular table. There’s a bug report that describes this behavior: SELECT on temporary table fails when query cache is on.

The bug is fixed in newer versions of MySQL, and the MySQL developers consider the affected versions (4.1.7, for example) to be extremely old, but many shared hosting providers consider those versions “stable” and probably won’t upgrade for years.

It’s easy to work around the problem; disabling the query cache will fix it.

CREATE and DROP privileges

The privilege to create temporary tables is separate from the privilege to create regular tables, and implies the privilege to drop them. One thing I’ve seen happen is someone has permission to create temporary tables, creates some, and tries to drop them, but gets denied. The gotcha here is the drop statement must also have the word TEMPORARY, e.g.

drop temporary table temp_table;

If the word TEMPORARY is omitted, the statement tries to drop a regular table. It is a very good idea to always include the word TEMPORARY to avoid accidentally dropping a regular table! It is way too easy to get confused, especially when working with temporary tables that mask real tables. When working across databases, things get even worse; imagine I’m in database A, create temporary table B.orders, and then drop table orders, forgetting to qualify it with the database name. Oops! If there’s a table named orders in database A, I just dropped it! If I’d included the word TEMPORARY in my statement, nothing bad would have happened. I won’t say who but, ahem, “somebody I know well” has done this on a production system.

Technorati Tags:No Tags

You might also like:

  1. How to eliminate temporary tables in MySQL
  2. How to do efficient forward-only SQL maintenance jobs

How to find contiguous ranges with SQL

In an earlier article I discussed how to find missing members in a sequence with SQL. In this article I’ll do the reverse: demonstrate how to find the start and end point of each contiguous range.

Motivation

Someone posted a comment on the article linked above, asking how to do this. At least, that’s what I think the question was; I might be misinterpreting it. I considered replying in the comments on that article, but decided it should go in its own article instead.

I’ll use the same sample data as in the earlier article: a sequence of integers from 1 to 20, with the numbers 5, 11, 12, 13, and 14 missing. I’ll also delete the value 7, so 6 is a range of length 1. The desired answer is

The solution

This isn’t as easy as I thought it would be at first. I stared at it for a while, then it came to me: I want to find the start and end of each contiguous range, so I need to define “start” and “end.” The start of a range is defined by the absence of the preceding number. I initially thought “has a next but no previous,” but that’s incorrect because a single number is a valid range; if I require the start to have a “next,” that eliminates 6 (I initially wrote the whole thing wrong, then thought about single-number ranges and re-wrote everything). So the definition of “start” is a number that has no “previous.”

The end of a range is almost the reverse: it has no “next” but might have a “previous.” Additionally, it should be the smallest such number that’s greater than or equal to the start. The “or equal to” is again necessary to include ranges that are just one number.

Each of these queries is fairly simple by itself, using exclusion joins. Here’s one that will find the start of every range:

select l.id
from sequence as l
    left outer join sequence as r on r.id = l.id - 1
where r.id is null;

I’m referring to the left-hand table as “l” and the right-hand table as “r.” Here’s a query that will find the end of every range. It’s almost the same:

select l.id
from sequence as l
    left outer join sequence as r on r.id = l.id + 1
where r.id is null;

Bringing the two together, and meeting the “smallest value greater than or equal to” requirement, is a more complex query. Here I solve it with a correlated subquery:

select l.id as start,
    (
        select min(a.id) as id
        from sequence as a
            left outer join sequence as b on a.id = b.id - 1
        where b.id is null
            and a.id >= l.id
    ) as end
from sequence as l
    left outer join sequence as r on r.id = l.id - 1
where r.id is null;

I’ve re-aliased the subquery’s tables as “a” and “b” to avoid confusion with “r” and “l.”

Technorati Tags:No Tags

You might also like:

  1. How to find missing values in a sequence with SQL
  2. How to find data distributions with SQL
  3. How to subtract in SQL over samples that wrap
  4. How to write a SQL exclusion join
  5. How to simulate FULL OUTER JOIN in MySQL

Thoughts on ReiserFS

I recently watched a video of Hans Reiser’s presentation on ReiserFS, a journalling filesystem. It piqued my curiosity, so I visited the ReiserFS website to read more about the filesystem. I ended up un-learning some things I thought I knew, and had some thoughts to share on ReiserFS in general.

  • The Reiser team believes equal access to source code is a civil right. I agree.
  • I’ve been using version 3 for a long time. As far as I’m concerned, it’s been the best choice for a journalling filesystem since it was created. I’ve never lost any data whatsoever from a ReiserFS filesystem (though if my hard drive ever truly failed, the filesystem wouldn’t matter). Version 4 is out of the oven now (ok, so I’m a few years behind…), and it looks absolutely amazing. I think I will upgrade going forward.
  • The performance enhancements in version 4 are stunning. Consider this: for the first time in history, a compressed filesystem is faster than an uncompressed one. What does that mean? It means the real bottleneck is reading/writing the data on the disk, and the compression and decompression is so fast that it’s less expensive to do a bunch of computation and transfer less data, than to do less computation and transfer more data. That is remarkable.
  • ReiserFS is faster, scales better, and is more space-efficient than any other filesystem, according to the benchmarks on their website.

ReiserFS has a lot of implications not only for ordinary file-storage, but for special purpose systems too. It’ll be interesting to see what the next few years bring.

Technorati Tags:No Tags

You might also like:

  1. How to make file names cross-platform
  2. Review of the iRiver HD340

Simple and complex types in XML Schema

If you’ve worked much with XML Schema, or tried to read a schema, you’ve probably run into markup that refers to simple and complex types. These terms can be confusing. In this article I’ll explain what they mean in simple terminology, and point you to resources that can help you learn more.

Motivation

Suppose I’m writing code to talk to a web service, and I’ve been getting error messages complaining about something called “email” being malformed when I try to invoke the getPreferences operation. I want to figure out exactly what the service expects me to send it. I open up the WSDL and search until I find the relevant definitions:

<element name="email">
  <simpleType>
    <restriction base="xsd:string" />
  </simpleType>
</element>

<element name="getPreferences">
  <complexType>
    <sequence>
      <element name="email" type="email" />
    </sequence>
  </complexType>
</element>

WSDL is written with XML Schema, so I’m looking at an XML Schema document. But I’ve forgotten exactly what it means. What are simpleType and complexType again? Unless I work with schemas fairly frequently, I get confused about this (even though I’ve worked with schemas for many years). I always have to refresh my memory.

The simple answer

The most succinct answer is as follows:

In XML Schema, there is a basic difference between complex types which allow elements in their content and may carry attributes, and simple types which cannot have element content and cannot carry attributes.

That’s from the XML Schema Primer, which I highly recommend. I think it’s probably the best introduction to XML Schema.

To summarize,

  • simple types can only have content directly contained between the element’s opening and closing tags. They cannot have attributes or child elements.
  • complex types can have attributes, can contain other elements, can contain a mixture of elements and text, etc etc.

Now I know the web service is expecting an element that looks like the following:

<getPreferences>
  <email>john@nowhere.com</email>
</getPreferences>

So why am I confused?

This is pretty simple. Why do I have to refresh my memory whenever I haven’t worked with schemas for a few months? The answer is, the above refers to element type. There’s also something called content type, and some of the names are similar (and therefore confusing). Each element’s content is one of several content types:

  1. A simple type’s content can be one of:
    1. atomic types, which have indivisible values, such as #000 and #AACCDD
    2. list types, which have whitespace-separated lists of indivisible values, such as blue green red
    3. union types, which have either atomic or list values, but they can be the union of other types, such as blue #000 red for a set of colors
  2. Complex types have a “content model,” which refers to how the content (the data between the element’s opening and closing tags) is arranged:
    1. simple content is only character data, no child elements allowed
    2. element-only content is only children, no data allowed
    3. mixed content means character data and child elements can be intermingled
    4. empty content means the element is empty (<foo/>) and either conveys information by just existing, or has attributes but no content.

Just to clarify: elements have an element type, and their content has a content type. By the way, attributes can only have simple types, because they cannot themselves have attributes or children.

For further reading, I again heartily recommend the Primer linked above. Another good resource is Priscilla Walmsley’s Definitive XML Schema. She not only knows her stuff (she’s part of the W3C XML Schema Working Group), but she writes very well.

Technorati Tags:No Tags

You might also like:

  1. Why not to use CSS for columnar layouts
  2. How to display an HTML table as a folder tree
  3. Advanced HTML table features, Part 1
  4. Seldom-used HTML form elements
  5. Don’t change a constant variable

GnuCash to MySQL export script

Download gnucash2mysql and queries

It’s tax season, and this year I decided to export my GnuCash data into a MySQL database for analysis. This article is about that process, including source code for the export script and a couple of simple queries.

The script

GnuCash is a powerful double-entry accounting system for managing personal and/or business finances. I’ve used it since about 2002, and find it adequate for my needs. However, the reporting is widely acknowledged to be one of its weakest points. Reporting is implemented in the Scheme programming language, and if the included reports aren’t enough, most people probably won’t be able to write their own. Personally, I find it much easier (for certain reports) to write SQL queries instead. With that in mind, I set out to write a little script that will insert my GnuCash data into a MySQL database.

It’s actually a fairly simple task; the GnuCash developers chose XML as the file format, so the data is easily accessible from other programs. I decided to export only the account structure and transactions, leaving alone the business features such as customers and invoices.

Before I started, though, I spent some time investigating the built-in PostgreSQL backend. The default data format is XML, but the GnuCash developers also have support for storing the data in a PostgreSQL database. Unfortunately, I couldn’t get it to work. At least in version 1.8.11, it seems to be broken. In fact, the GnuCash website and developer mailing lists indicate this is a low priority for right now, as they are working towards a new version which uses an updated graphical toolkit (GTK+). Apparently the source is high-quality, but a bit out of date since the developers plan to revisit that once the new version is ready.

Satisfied that I’m not reinventing the wheel, I went ahead with a script to export the data myself. Since the XML format is really straightforward, I decided to do the simplest thing I could get to work. Perl seemed like a good choice. I am familiar with expat, and Perl has an expat module, so that was also a natural choice for the XML parsing. The basic idea is to push the data through the script like drinking from a firehose, and when it detects certain things — the start of an element, some character data, the end of an element — call a function to handle the data. Depending on what the data is (what element I’m currently examining, etc) I either store it for later reference, or push it into the database. I also automated creating the necessary table structure in the database.

Access to a MySQL database is required. The script uses tables called account, transaction and split. The DDL used to create the tables is at the end of the Perl script file. I used InnoDB tables so I’d have transaction support.

The script requires these Perl modules: DBI, Term::ProgressBar, and XML::Parser::Expat. You can run it with the --help option to see how to run it.

The queries

These queries assume precision decimal math. Versions of MySQL less than 5.0 use imprecise math. I wrote an article on MySQL and decimal math. If you have any issues with these queries, it may help to consult that article.

This query finds all unbalanced non-equity transactions by summing the splits:

select
    s.amount,
    a.name,
    t.description,
    t.posted
from account as a
    inner join (
        select transaction, sum(amount) as amount, max(account) as account
        from split
        group by transaction
        having sum(amount) <> 0
    ) as s on s.account = a.id
    inner join transaction as t on t.id = s.transaction
where a.type <> 'EQUITY'

This query sums all expenses for 2005 by month and account:

select date_format(posted, '%Y-%m') as month, name, sum(amount) as amount
from transaction as t
    inner join split as s on s.transaction = t.id
    inner join (
        select id, name from account
            where type='EXPENSE'
    ) as a on a.id = s.account
where year(posted) = 2005
group by date_format(posted, '%Y-%m'), name
order by date_format(posted, '%Y-%m'), name;

This query finds average monthly expenditures by account since January 2005:

select @num_months := count(distinct date_format(posted, '%Y-%m'))
    from transaction
    where posted >= '2005-01-01';

select cast(sum(amount) / @num_months as decimal(8,2)) as 'Average monthly amount',
    concat(coalesce(grandparent_name, ''),
        if(grandparent_name is null, '', ' > '),
        coalesce(parent_name, ''),
        if(parent_name is null, '', ' > '),
        name) as name
from (
    select date_format(posted, '%Y-%m') as month,
        a.name,
        aa.name as parent_name,
        aaa.name as grandparent_name,
        sum(amount) as amount
    from transaction as t
        inner join split as s on s.transaction = t.id
        inner join (
            select id, name, parent from account
            where type='EXPENSE'
        ) as a on a.id = s.account
        left outer join account as aa on aa.id = a.parent
        left outer join account as aaa on aaa.id = aa.parent
    where posted >= '2005-01-01'
    group by date_format(posted, '%Y-%m'), a.name
) as x
group by name
order by name;
Technorati Tags:No Tags

You might also like:

  1. More GnuCash to MySQL tools and queries
  2. How to use foreign key cascades in MySQL

How to avoid many-to-one problems in SQL

It’s been a while since I’ve posted an abstract, theoretical article on a fine point of SQL. Today I want to bring your attention to two ways in which an RDBMS can allow you to do something that has no well-defined result. These involve queries where several values are eligible, but only one is chosen — and chosen in an undefined manner. I’ll show you the two blunders, tell you when they might occur, and explain how to avoid them.

Selecting an un-grouped column in a grouped query

As far as I know, this bad behavior only applies to MySQL. As the manual explains, MySQL “extends the use of GROUP BY” to allow selecting columns that do not appear in the GROUP BY clause. What does this mean? Well, suppose I have the following data:[1]

MySQL lets me write the following query against the data:

select Fruit, Price, count(*) as c
from Fruits
group by Fruit;

The results will look like this:

Here’s the problem: the query groups the tuples (rows) into two groups, one group containing two Apples and one with two Oranges. I can’t logically get “the price” from two tuples in a group, because there is no one “the” price. In the formal mathematics upon which SQL is based, this query is nonsense. MySQL’s documentation admits as much, and tells me not to do this unless all the tuples in the group have the same value in that column — or I’ll risk getting unpredictable behavior. In my example, it’s pretty easy to see MySQL chooses the value from the “first” tuple in the group (a funny notion, given that there is no first tuple because sets are theoretically unordered).

This is a bad behavior introduced solely for the sake of optimization, and the documentation admits that. Grouping requires sorting, which requires comparing values, so grouping literally by as few columns as possible — even when the logical grouping may be by more columns — is an (ill-gotten) efficiency gain.

As with any other non-standard technique, the benefit is offset by lack of portability. Plus, it might cause evil glares from colleagues ;-).

To avoid this problem on MySQL, use standard SQL (sorry for stating the obvious). To make my query standard SQL, I’d either have to use an aggregate function on that column, such as SUM, MIN, AVG, MAX or similar — or, I could add the column to the GROUP BY clause, which would separate the results into more groups and change the output. In other words, every column must either be in the grouping clause or an aggregate function.

Some platforms, such as SQL Server 2000, will not allow the query. MySQL can be made to throw an error too, if ONLY_FULL_GROUP_BY mode is enabled.

It’s important to group the query by the data, not by the expressions used to transform the data. Programmers who don’t really understand what’s going on sometimes just change the code to make the errors stop, often making an even worse mess out of the query. Watch out for this. Here’s a good article on how to group queries properly.

Updates from a join

Unlike the first gotcha above, this one can happen systems other than MySQL. The issue is an UPDATE in a join with a FROM, where tuples from the base table being updated appear multiple times. Here is a query:

update t1
   inner join t2 on t1.id = t2.id
set t1.col1 = t2.col1;

/* Equivalent syntax for SQL Server 2000
update t1
   set t1.col1 = t2.col1
   from t1 inner join t2 on t1.id = t2.id
*/

If you’re used to seeing it, it may look like there’s nothing wrong with that query[2]. Suppose, though, that my data looks like this:

(Ignore for a moment that this table has pretty much the same data as the Fruits table…)

I’ll re-write the query to show how I might unwittingly update a FruitPrices tuple from multiple Fruits tuples:

update FruitPrices as fp
   inner join Fruits as f on f.Fruit = fp.Fruit
set fp.Price = f.Price;

What does this statement actually do? Well, logically it first joins the base tables together:

Next it updates each Price value in the left-hand side from the column on the right-hand side. But wait, the value appears twice — that means logically, Apples are being assigned $5.00 twice, and Oranges are being assigned both $4.00 and $6.00 prices. Danger, Will Robinson! Which one wins? As it turns out, in MySQL again the “first” value wins. Not in SQL Server 2000, though — the “last” one wins on that platform, if memory serves. It doesn’t really matter the particulars of which value wins; it would be more legitimate if the database server threw an error, in my opinion.

I can think of a few ways to avoid this situation.

Method 1: Avoid non-standard syntax

Neither syntax above is standard, and neither makes any sense from a true relational standpoint, which is why they have undefined, vendor-specific behavior. A standard UPDATE statement does not have a FROM clause. Joe Celko has written extensively about this:

The correct syntax for a searched update statement is

<update statement> ::=
  UPDATE <table name>
     SET <set clause list>
  [WHERE <search condition>]

<set clause list> ::=
  <set clause> [{ , <set clause> }...]

<set clause> ::= <object column> = <update source>

<update source> ::= <value expression> | NULL | DEFAULT

<object column> ::= <column name>

The UPDATE clause simply gives the name of the base table or updatable view to be changed.

That’s not terribly enlightening to most people, especially those not used to reading BNF! Let me try to correct the query:

update FruitPrices as fp
   set fp.Price =  (
      select f.Price from Fruits as f
      where f.Fruit = fp.Fruit);
ERROR 1242 (21000): Subquery returns more than 1 row

Oops! It looks like MySQL is now complaining about me trying to update a single value from a whole set of values! Very good. This shows me that my query is wrong, instead of silently doing something bad. Here’s a query that works:

update FruitPrices as fp
   set fp.Price =  (
      select max(f.Price) from Fruits as f
      where f.Fruit = fp.Fruit);

Method 2: Join one-to-one

The second way, if you must use non-standard, mathematically invalid syntaxes, is to make sure the join is based on indexes and primary keys in such a way that the many-to-one problem doesn’t happen. For example, if the columns used in the join criterion are the primary key in the right-hand base table, it’s safe.

Method 3: Group the right-hand side

The last is to follow the advice of the article linked above and group the right-hand table appropriately. This is effectively the same thing as my second suggestion.

[1] You can create the tables I’m using with the following scripts:

create table Fruits(
   Fruit varchar(50),
   Price decimal(3,2));

create table FruitPrices(
   Fruit varchar(50) not null primary key,
   Price decimal(3,2));

insert into Fruits values 
   ("Apples", 5.00),
   ("Apples", 5.00),
   ("Oranges", 6),
   ("Oranges", 4);

insert into FruitPrices(Fruit)
   values("Apples"), ("Oranges");

What’s really wrong with these queries?

The relational model, which SQL doesn’t follow exactly, is all about functions in the mathematical sense. Recall a function is just a mapping from the domain to the range, and one input value from the domain must produce exactly one output in the range. A given input value may not map to two output values. This is why a lot of functions can’t be turned around backwards and still be functions. For example, sin(0) is 0, but so is sin(2*pi), and sin is a function; but the inverse isn’t a function. If you turn sin around and try to put 0 into the back end, what do you get out? You get 0, and 2*pi, and … infinitely many other values.

The incorrect statements I’ve shown above make no sense because they’re trying to shove data into a function backwards, and there can be more than one result on the output. I’ve shown how RDBMSs often just pick one of the outputs, and it’s fine to know that’s going to happen, but it’s also important to know what is really going on.

This really does matter. Two days ago at work, my boss brought up a situation where a production query on our main database server had created a bad situation because of updates in a join. Bogus!

[2] If you’re not familiar with either of these syntaxes, I feel your pain. I wasn’t either until I got out of database-theory classes. These syntaxes are confusing because they are meaningless, not because you are inexperienced. And every DB vendor implements them differently, yet another reason to avoid them.

Technorati Tags:No Tags

You might also like:

  1. How to write a SQL exclusion join
  2. What is a SQL blind insert?
  3. How to select from an update target in MySQL
  4. How to write multi-table, cross-database deletes with aliases in MySQL
  5. How to delete duplicate rows with SQL, Part 2

How to avoid imprecise DECIMAL math in MySQL

MySQL versions 4.1 and below use imprecise math in operations with DECIMAL data, which is supposed to be precise (that’s the whole point). There is no real solution to the problem, though there are workarounds. There is also at least one genuine bug in MySQL related to this problem. In this article I’ll explain the problems, demonstrate them in action, and show you how to work around them.

The problem

Many fractional values cannot be represented exactly as a floating-point number in computers. For example, the value one-tenth, which we represent as 0.1 in base ten, is impossible to represent exactly in base two. SQL databases provide fixed-point data types to support precision math where it is needed. Currency values are a typical use.

MySQL supports DECIMAL data types, which store numbers as strings instead of as numbers so there are no issues representing them exactly, but versions prior to 5.0 perform operations on the values with floating-point math. For example, the SUM function converts the string representation to floats before operating on them. This is documented in the MySQL manual.

Demonstration

I first encountered this problem while building a system to import my financial data into a database so I could query the transactions with SQL. I ran a query to find unbalanced transactions caused by splits that had been deleted:

select transaction, sum(amount) as amount
from split
group by transaction
having sum(amount) <> 0;

Here’s the result:

The comparison sum(amount) <> 0 should have eliminated about half those tuples. The fact that it didn’t, combined with the presence of -0.00 (negative zero), made me suspect floating-point values were being used behind the scenes. The numbers were being displayed as fixed-point, but if I could display them as floating-point, I could verify my theory. It’s not possible to use CAST to cast a value to floating-point in MySQL 4.1, but I accomplished the same thing by multiplying the amount column by 1e1. When I did this, I found the numbers weren’t exactly zero; they were just close, for example, 3.1086244689504e-13.

After hunting around for a while without luck, I entered a bug report, which was changed to not-a-bug status with a friendly pointer to the documentation (oops!).

The bug

There’s still at least one real bug, though. The following query adds an ORDER BY clause to the query above:

select...
order by amount;

The results are interesting indeed!

What happened to the spurious results? They disappeared! An ORDER BY clause is never supposed to do anything but order the results; it certainly should not eliminate tuples. I think this is kind of weird, creepy and cool at the same time.

If I had to guess, I’d say the results are being ordered before they are filtered by the HAVING clause, and in the process, converted back to DECIMAL from float, then eliminated by the HAVING. That may not be what’s really happening, but it seems likely.

This strange behavior raises the possibility of improving the code, too. Ordering may not be as efficient as it could be; it should be the last operation in a SELECT so it operates on as few tuples as possible. And an implicit cast forced by the ordering operation doesn’t seem right, either; it means the ordering clause changes the values in the tuples as well as changing which tuples are present in the output. Both should be of concern.

Workarounds

There is no way to force precision math in these types of operations. It cannot be done. No amount of casting or rounding will fix the problem reliably on every platform (math is different on every platform, and a “solution” may not work everywhere). There are a couple of workarounds, though.

  • Compare to an acceptable tolerance. For example, instead of having sum(amount) <> 0, use having abs(sum(amount)) < .001.
  • Add an ORDER BY clause! Just kidding.
  • Save the results in temporary tables, then perform further operations on them. Once they’re stored in temporary tables, they will be cast back to their string representation and the close-but-not-quite values will become exact again.
  • Upgrade to version 5.0 or above, where exact math is implemented with 64-bit integer operations. Yay!

With a little imagination, it’s probably possible to work around most situations. I hope this article helps you avoid possible problems with imprecise math.

Technorati Tags:No Tags

You might also like:

  1. GnuCash to MySQL export script
  2. Why I use explicit date math in SQL

How to flatten hierarchies with awk

Suppose you have a spreadsheet with columns of product category names and numbers. The product hierarchy has two levels, indicated by writing top-level categories without numbers. How do you flatten the hierarchy quickly for insertion into a relational database? My solution is awk. In this article I’ll show you some sample data, demonstrate how to process and reformat it with awk, and explain how to avoid some basic pitfalls.

Sample data

Here is the sample data in the spreadsheet, with header columns removed:

The top-level entries have no identity themselves; it is not possible to place a product into a top-level category. Since I want to insert the categories into a database for an application to use, I don’t want these entries at all. I want to flatten everything out, separating the levels with >. Here is my desired result:

I could do this any number of ways, but since it lends itself well to line-by-line processing, I elected to use awk. Perl would have worked just as well.

Flattening the categories

The basic idea is to examine each line and see if it has a category number. If it doesn’t, it’s a parent category, and I save its name to a variable. If it does, it’s a child category, and I print out the (saved) parent’s name, the angle bracket, and its own name, followed by the category number. Here’s some code to do that:

/\t$/ {
    current = $1 " > ";
}

/\t.+$/ {
    printf("%s %s\t%d\n", current, $1, $2);
}

The first block matches anything with a tab at the end of the line, and saves the value of the first column to the variable current. The second block prints out current, the first column, and the second column. I saved it to transform.awk and invoked it like so:

awk -f transform.awk original.csv

Since the CSV file’s fields are surrounded by double quotes, I piped the result through sed to nuke them:

awk -f transform.awk original.csv | sed -e 's/"//g'

More ideas

I can use this general idea in a number of ways. Unfortunately, the original CSV format is slightly hare-brained, so this doesn’t generalize to hierarchies deeper than two levels. One file I transformed did have several levels of hierarchy. The top-level categories were bolded, intermediate were not, and “leaf nodes” had a number. As an Excel spreadsheet, the bolding is fine. Once it’s saved to a CSV file, the bolding disappears. I tried the following script to get me partway there:

/\t$/ {
        level = level + 1;
        if (level == 1) {
                level1 = $1;
                current = $1 " > ";
        }
        if (level == 2) {
                level2 = $1;
                current = level1 " > " $1 " > ";
        }
}

/\t.+$/ {
        level = 0;
        printf("%s %s\t%d\n", current, $1, $2);
}

That’s fine as far as it goes, but it’s not a complete solution. A quick Vim macro solved the rest of the problem for me. If automating is harder than doing it with a macro, and I won’t be doing it a lot, I’ll just use a macro. If I do it often, I’ll automate (three strikes and you automate!).

Pitfalls

  • Watch out for awk printing lines that don’t have a number. If it’s expecting two columns and thinks the second column should be a number, it’ll print a zero. That’s why the second code block doesn’t just print every line.
  • Beware spurious spaces: printf("%s %s \t %d\n", current, $1, $2); will cause every number to have leading, and every category name to have trailing spaces.

Summary

I hope this gives you some ideas on using awk for processing files line-by-line. It is built specifically for the job; when you have a file that needs this type of processing, there’s no better tool. For a quick one-off job when you don’t need complex logic saved in a file, you can easily write an awk program right on the command line. For example, to find all non-top-level categories and swap the category and id:

$ awk '/\t.+$/{print $2 "\t" $1}' original.csv
1       Novels
2       Biographies
3       Self-Help
4       Rock
5       Jazz
6       Classical
Technorati Tags:No Tags

You might also like:

  1. More GnuCash to MySQL tools and queries
  2. RAISERROR severity levels demystified
  3. How to convert MySQL output to HTML tables
  4. How to find data distributions with SQL
  5. Review of the iRiver HD340

Blogs as glamour magazines

I wrote previously about IE blog’s excellent quality. Unfortunately, there’s bad news to report, too. I’ve been noticing some trends in web design blogs. Many have a similar style to glamour magazines. While many elements of glamour magazines are not echoed in these blogs, to the extent they are, I think it detracts from them quite a bit.

Top X lists

Many blogs posts try to “bite-size” a topic. A common theme is Top n Lists. Some are good, but many are completely substanceless. Anyone can write one in a few minutes.

Compare this to the cover of a glamour magazine:

  • Ten sex tips that will drive him wild
  • Is your diet working? Take our 15-point quiz
  • 7 easy moves to firmer thighs

I think the appeal is the number itself. It subtly implies the list is complete — if the number isn’t there, the title doesn’t assert anything about completeness. Who wants to read Easy moves to firmer thighs? There’s subconscious psychology at work here. Of course, the list often falls short of a strong start, much less completeness!

Repetitive, non-original content

The Internet is becoming an echo chamber. Someone writes a post, then someone else writes a post about it, then… it’s not uncommon to find thousands of links to a post. Many of these posts appear to be original content, until you read carefully and realize it’s a formulaic, substanceless copy-and-paste (partial or complete) with little or no original content — sometimes not even a short intro sentence.

For an example, try a Google search for “top 10 custom javascript functions of all time” and see how many results you get. I’m getting about 142,000 results. MSN finds thousands of pages that link directly to the post on the author’s site. One such post, on a very popular blog, misspelled two words in the title until it was corrected a few minutes later! That’s an indication of how little time goes into these echo-chamber posts.

It’s becoming harder and harder to find original content in the soup of copy-and-paste posts. Some blog networks, such as 9rules (of which this site is a member) are acting as trusted filters, but even then a lot of the content is derived.

Glamour magazines are often similarly unoriginal. At first glance, many articles look like real journalism, but then it becomes apparent much of the material is secondary, often just copied from some “authority” on the subject with a few sentences framing the quotes. I know people who’ve been exploited by such “article-writing” practices. There are many vultures waiting for someone else to make something they can use.

How glamour magazines are better

I wish it weren’t true, but glamour magazines are often written better than blogs. For example, let me pick on one particular blog which claims to be well written, and is therefore a fair target for criticism: Good Copywriting. The second post contained this text urging people to clean up their spelling:

Spelling, grammar, what? This is an all too common mistake I find on the web. Web sites, even professional ones, are riddled with simple spelling and gramatical mistakes… Run your site through a spell checker and grammar checker…

Amusingly, the author misspelled “definitely” as “definately” a few paragraphs later. I corrected this in a comment, which never got approved by the moderator — but minutes later, the spelling was fixed. Subsequent posts have been just as bad, with jewels like “truely” (twice in a couple of sentences!), “guarenteed,” “an equal’s sign,” “brining” (should be “bringing”), “effected” (instead of “affected” — there is a difference), run-on sentences, and so forth. Glamour magazines, for all their faults, generally seem to be proofread!

The future

I don’t have a crystal ball, but I have a feeling people will tire of repetitive, non-original content. I think the current blogging fad will fade quite a bit, and the folks doing it for the better reasons will probably continue writing quality material. I welcome such a change.

One final note: please feel free to leave comments telling me how I can improve my own writing, too!

Technorati Tags:No Tags

You might also like:

  1. IE blog is a great experience
  2. How to install and maintain multiple WordPress blogs easily
  3. Copyright statement, privacy policy and terms of use