Comments on: Implementing SQL with Unix utilities
http://www.xaprb.com/blog/2012/10/12/implementing-sql-with-unix-utilities/
Thu, 02 May 2013 12:36:53 +0000

By: Xaprb, Mon, 22 Oct 2012 15:49:11 +0000
http://www.xaprb.com/blog/2012/10/12/implementing-sql-with-unix-utilities/#comment-20355

Someone on Twitter pointed me to this article that’s much better than mine: http://matt.might.net/articles/sql-in-the-shell/

By: jmarch, Fri, 12 Oct 2012 23:48:25 +0000
http://www.xaprb.com/blog/2012/10/12/implementing-sql-with-unix-utilities/#comment-20336

Great post!

I’ve been doing something similar in my work, but my use case was re-merging large (many-column) CSV files that had been split up into one-file-per-column CSV files.

To reconstruct the original CSV, my “join” operation doesn’t need to compare fields like yours does. It’s literally just stuffing rows from each file together into one output CSV (with a comma between each column).

For example, with files column1.csv, column2.csv, and column3.csv, I use the ‘pr’ command like so:

pr --merge --omit-header --separator=, column1.csv column2.csv column3.csv
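To make the row-stuffing concrete, here is a minimal sketch, assuming GNU coreutils 'pr' (the long options are GNU-specific). The temporary directory and the sample values are made up purely for illustration; only the file names come from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: three single-column files with the same number of rows.
workdir=$(mktemp -d)
printf '1\n2\n3\n'                > "$workdir/column1.csv"
printf 'alice\nbob\ncarol\n'      > "$workdir/column2.csv"
printf '/home\n/about\n/pricing\n' > "$workdir/column3.csv"

# Row N of the output is row N of each input file, joined with commas.
# No key comparison happens; rows are paired purely by position.
merged=$(pr --merge --omit-header --separator=, \
    "$workdir/column1.csv" "$workdir/column2.csv" "$workdir/column3.csv")
printf '%s\n' "$merged"
# Prints:
# 1,alice,/home
# 2,bob,/about
# 3,carol,/pricing

rm -rf "$workdir"
```

Because the pairing is positional, this only reconstructs the original CSV if every column file has the same row count and ordering.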

An obvious motivation for using this solution was that I could store event log attributes in separate per-column CSVs and project out only the columns needed for a particular process. That way, if something like referral URLs (which tend to be large) isn’t needed for a piece of analysis, I don’t even bother to pull it off disk.

Also, with this approach I get some of the compression advantage that column-stores get, because each file holds only one “datatype” and is frequently low-cardinality. Based on some initial measurements, I found a savings of about 30% when comparing the compressed individual column files against the original compressed row-oriented CSV.

I use the sub-shell trick, too, so generating the original CSV is as simple as:

pr --merge --omit-header --separator=, <(zcat column1.csv.gz) <(zcat column2.csv.gz) <(zcat column3.csv.gz)
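The same idea covers the projection case: leave out the columns you don't need and their files never get read. A rough sketch, assuming bash (for the <(...) syntax), GNU 'pr', and gzip's 'zcat'; the sample data and temporary directory are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: three gzip-compressed single-column files.
workdir=$(mktemp -d)
printf '1\n2\n'          | gzip > "$workdir/column1.csv.gz"
printf 'alice\nbob\n'    | gzip > "$workdir/column2.csv.gz"
printf '/home\n/about\n' | gzip > "$workdir/column3.csv.gz"

# Project only columns 1 and 3. column2.csv.gz is never opened or decompressed.
# Each <(...) runs zcat in the background and hands pr a readable file name.
projected=$(pr --merge --omit-header --separator=, \
    <(zcat "$workdir/column1.csv.gz") \
    <(zcat "$workdir/column3.csv.gz"))
printf '%s\n' "$projected"
# Prints:
# 1,/home
# 2,/about

rm -rf "$workdir"
```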

Lastly, notice that 'pr' can merge together any number of column files… pretty cool utility.

Cheers!

By: Xaprb, Fri, 12 Oct 2012 16:43:41 +0000
http://www.xaprb.com/blog/2012/10/12/implementing-sql-with-unix-utilities/#comment-20335

Thanks for catching the lowercase -p. I fixed it.

By: AussieDan, Fri, 12 Oct 2012 16:35:51 +0000
http://www.xaprb.com/blog/2012/10/12/implementing-sql-with-unix-utilities/#comment-20334

I’m a big fan of log file and other text processing with bash, and it’s amazing what you can achieve with a carefully crafted terrifying one-liner.

I hadn’t seen the -P (note uppercase) option for xargs before; I’ll definitely have to check that one out.
