Xaprb

Stay curious!

Implementing SQL with Unix utilities

with 4 comments

As I’ve become a better shell programmer over the last year or two, I’ve been surprised to discover some tools I didn’t know about. It eventually dawned on me, as I did more and more brute-force processing of large datasets, as well as some of the more delicate things that went into Aspersa -> Percona Toolkit, that many tasks I used to do with SQL and spreadsheets can be accomplished easily with well-structured text files and Unix utilities. And they don’t require loading data into a database or spreadsheet (the latter of which almost always performs terribly).

To give an idea, here are some of the relational operations (in SQL speak) you can perform:

  1. SELECT col1, col2… can be implemented with several variants of Unix utilities: cut and awk are the two most obvious. I tend to use awk only when needed, or when it’s more convenient to combine operations into a single tool.
  2. JOIN can be implemented with the… wait for it… join utility. You’ll need to sort its input first, though.
  3. Many GROUP BY operations can be performed with combinations of grep -c, sort with or without the -urnk options (look at the man page — you can apply options to individual sort keys), and uniq with or without the -c option. Many more can be done with 20 or 30 characters of awk.
  4. Output formatting is easy with column, especially with the -t option.

In addition to the above, Bash’s subshell input operator syntaxes can help avoid a lot of temporary files. For example, if you want to join two unsorted files, you can do it like this:

$ join <(sort file1) <(sort file2)

That’s kind of an overview — I end up hacking together a bunch of things, and I’m sure I’m forgetting something. But pipe-and-filter programming with whitespace-delimited files is generally a much more powerful (and performant) paradigm than I realized a few years ago, and that’s the point I wanted to share overall.

As a concrete example, I remember a mailing list thread that began with “I have a 500GB file of 600 billion strings, max length 2000 characters, unsorted, non-unique, and I need a list of the unique strings.” Suggestions included Hadoop, custom programs, Gearman, more Hadoop, and so on — and the ultimate solution was sort -u and sort --merge, trivially parallelized with Bash. (By the way, an easy way to parallelize things is xargs -P.)

What are your favorite “low-level” power programming techniques?

Written by Xaprb

October 12th, 2012 at 11:58 am

Posted in SQL

4 Responses to 'Implementing SQL with Unix utilities'

Subscribe to comments with RSS

  1. I’m a big fan of log file and other text processing with bash, and it’s amazing what you can achieve with a carefully crafted terrifying one-liner.

    Hadn’t seen the -P (note uppercase) option for xargs before, I’ll definitely have to check that one out.

    AussieDan

    12 Oct 12 at 12:35 pm

  2. Thanks for catching the lowercase -p. I fixed it.

    Xaprb

    12 Oct 12 at 12:43 pm

  3. Great post!

    I’ve been doing something similar in my work, but my use-case was around re-merging large (many column) CSV files that had been split up into one-file-per-column CSV files.

    To reconstruct the original CSV, my “join” operation doesn’t need to compare fields like yours does. It’s literally just stuffing rows from each file together into one output CSV (with a comma between each column).

    For example, with files column1.csv, column2.csv, and column3.csv, I use the ‘pr’ command like so:

    pr –merge –omit-header –separator=, column1.csv column2.csv column3.csv

    An obvious motivation for using this solution was that I could store event log attributes in separate column CSV’s and only project out the columns needed for a particular process. That way if something like referral URLs (which tend to be large) isn’t needed for a piece of analysis, I don’t even bother to pull it off disk.

    Also, with this approach I get some of the compression advantage that column-stores get, because each file has only one “datatype” and is frequently of low-cardinality. Based on some initial measurements, I found a savings of about 30% when comparing the compressed individual column files against the original compressed row-oriented CSV.

    I use the sub-shell trick, too, so generating the original CSV is as simple as:

    pr –merge –omit-header –separator=, <(zcat column1.csv.gz) <(zcat column2.csv.gz) <(zcat column3.csv.gz)

    Lastly, notice that 'pr' can merge together any number of column files… pretty cool utility.

    Cheers!

    jmarch

    12 Oct 12 at 7:48 pm

  4. Someone on Twitter pointed me to this article that’s much better than mine: http://matt.might.net/articles/sql-in-the-shell/

    Xaprb

    22 Oct 12 at 11:49 am

Leave a Reply