<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xaprb &#187; data warehousing</title>
	<atom:link href="http://www.xaprb.com/blog/tag/data-warehousing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xaprb.com/blog</link>
	<description>Stay curious!</description>
	<lastBuildDate>Thu, 09 Feb 2012 10:55:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>A review of Pentaho Solutions by Roland Bouman and Jos van Dongen</title>
		<link>http://www.xaprb.com/blog/2009/12/13/review-pentaho-solutions-bouman-dongen/</link>
		<comments>http://www.xaprb.com/blog/2009/12/13/review-pentaho-solutions-bouman-dongen/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 03:11:07 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[Review]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[Jaspersoft]]></category>
		<category><![CDATA[Jos van Dongen]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[olap]]></category>
		<category><![CDATA[Pentaho]]></category>
		<category><![CDATA[Roland Bouman]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=1476</guid>
		<description><![CDATA[Pentaho Solutions, Business Intelligence and Data Warehousing with Pentaho and MySQL. By Roland Bouman and Jos van Dongen, Wiley 2009. Page count: about 570 pages. (Here&#8217;s a link to the publisher&#8217;s site.) The book is big in part because it&#8217;s about a GUI tool, so there are the requisite number of screenshots (but not too [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-solutions-in-master-master-replication/' rel='bookmark' title='Permanent Link: Failure scenarios and solutions in master-master replication'>Failure scenarios and solutions in master-master replication</a></li>
<li><a href='http://www.xaprb.com/blog/2010/02/19/a-review-of-understanding-mysql-internals-by-sasha-pachev/' rel='bookmark' title='Permanent Link: A review of Understanding MySQL Internals by Sasha Pachev'>A review of Understanding MySQL Internals by Sasha Pachev</a></li>
<li><a href='http://www.xaprb.com/blog/2010/01/15/review-get-it-done-with-mysql-peter-brawley-arthur-fuller/' rel='bookmark' title='Permanent Link: A review of Get it Done with MySQL 5&#038;6 by Peter Brawley and Arthur Fuller'>A review of Get it Done with MySQL 5&#038;6 by Peter Brawley and Arthur Fuller</a></li>
<li><a href='http://www.xaprb.com/blog/2010/12/14/a-review-of-mongodb-the-definitive-guide-by-chodorow-and-dirolf/' rel='bookmark' title='Permanent Link: A review of MongoDB, the Definitive Guide by Chodorow and Dirolf'>A review of MongoDB, the Definitive Guide by Chodorow and Dirolf</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/21/review-of-scalable-internet-architectures-by-theo-schlossnagle/' rel='bookmark' title='Permanent Link: Review of Scalable Internet Architectures by Theo Schlossnagle'>Review of Scalable Internet Architectures by Theo Schlossnagle</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<div id="attachment_1477" class="wp-caption alignleft" style="width: 250px"><a href="http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322?tag=xaprb-20"><img src="http://www.xaprb.com/blog/wp-content/uploads/2009/12/pentaho-solutions.jpg" alt="Pentaho Solutions" title="Pentaho Solutions" width="240" height="240" class="size-full wp-image-1477" /></a><p class="wp-caption-text">Pentaho Solutions</p></div>

<p><a href="http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322?tag=xaprb-20">Pentaho Solutions</a>, Business Intelligence and Data Warehousing with Pentaho and MySQL.  By Roland Bouman and Jos van Dongen, Wiley 2009.  Page count: about 570 pages.   (Here&#8217;s <a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html">a link to the publisher&#8217;s site</a>.)</p>

<p>The book is big in part because it&#8217;s about a GUI tool, so there are the requisite number of screenshots (but not too many).  It  is structured into four parts, each on a different topic.</p>

<p>The first part is 4 chapters on getting started with <a href="http://www.pentaho.com/">Pentaho</a>: from a quick-start through installing, configuring, and understanding the Pentaho BI Stack.  Pentaho is a complex suite of tools, and there&#8217;s a handy architecture diagram to help you grok what the parts are and how they fit together.  You&#8217;ll learn about topics such as Mondrian and configuring the database connection pool.</p>

<p>The second part is a primer on dimensional modeling and DW design.  It uses a sample database that the authors developed for the book, which you can download from the publisher&#8217;s website.  You&#8217;ll learn about star schemas and data marts.  You can skip this part if you&#8217;re familiar with BI concepts in general, and just want to learn how Pentaho implements them.</p>

<p>Part three is about Pentaho data integration.  The first chapter is a primer on integration, which again you&#8217;ll be able to skip if you know your stuff.  Then you&#8217;ll walk through topics such as generating dimensions, designing, and deploying data integration solutions with Kettle and Spoon.</p>

<p>Part four is about designing and building BI applications with Pentaho: learning about its metadata layer; using the reporting tools; scheduling, subscriptions, and bursting; OLAP; data mining; and building dashboards.  This is about half the book, really.  There&#8217;s a lot to it &#8212; it&#8217;s all about how to take a generic and flexible suite of tools and get something specific and useful out of it.  If you&#8217;ve ever done that, you&#8217;ll know why this could occupy half a book.  This isn&#8217;t a simple suite of tools that only does one thing well.</p>

<p>In the end, this is a good beginner-to-intermediate book for people who want to learn about data warehousing, business intelligence, Pentaho, or all of the above. If you don&#8217;t know anything about these topics, you&#8217;ll find the entire book quite useful.  If you know a lot about BI and DW, you&#8217;ll probably get the most out of the Pentaho-specific bits.  On the other hand, people who already have an advanced level of proficiency with Pentaho will probably know much of what&#8217;s in this book.  Those seeking to build advanced solutions presumably also know a lot about the general BI concepts.  So this is probably not the book for you if you already know what you&#8217;re doing with Pentaho.</p>

<p>Proprietary BI systems cost at least an arm and a leg, and possibly more.  That&#8217;s why open-source BI is such a hot topic.  If you&#8217;re looking to get acquainted with Pentaho, I think this is an excellent book &#8212; that&#8217;s what I got it for, and I wasn&#8217;t disappointed.  Now if only I could find a similar book for <a href="http://www.jaspersoft.com/">Jaspersoft</a>.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-solutions-in-master-master-replication/' rel='bookmark' title='Permanent Link: Failure scenarios and solutions in master-master replication'>Failure scenarios and solutions in master-master replication</a></li>
<li><a href='http://www.xaprb.com/blog/2010/02/19/a-review-of-understanding-mysql-internals-by-sasha-pachev/' rel='bookmark' title='Permanent Link: A review of Understanding MySQL Internals by Sasha Pachev'>A review of Understanding MySQL Internals by Sasha Pachev</a></li>
<li><a href='http://www.xaprb.com/blog/2010/01/15/review-get-it-done-with-mysql-peter-brawley-arthur-fuller/' rel='bookmark' title='Permanent Link: A review of Get it Done with MySQL 5&#038;6 by Peter Brawley and Arthur Fuller'>A review of Get it Done with MySQL 5&#038;6 by Peter Brawley and Arthur Fuller</a></li>
<li><a href='http://www.xaprb.com/blog/2010/12/14/a-review-of-mongodb-the-definitive-guide-by-chodorow-and-dirolf/' rel='bookmark' title='Permanent Link: A review of MongoDB, the Definitive Guide by Chodorow and Dirolf'>A review of MongoDB, the Definitive Guide by Chodorow and Dirolf</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/21/review-of-scalable-internet-architectures-by-theo-schlossnagle/' rel='bookmark' title='Permanent Link: Review of Scalable Internet Architectures by Theo Schlossnagle'>Review of Scalable Internet Architectures by Theo Schlossnagle</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2009/12/13/review-pentaho-solutions-bouman-dongen/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kickfire: relational algebra in a chip</title>
		<link>http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/</link>
		<comments>http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 18:57:28 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[SQL]]></category>
		<category><![CDATA[ARIES]]></category>
		<category><![CDATA[column store]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Ravi Krishnamurthy]]></category>
		<category><![CDATA[stream processing]]></category>
		<category><![CDATA[TPC H]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/</guid>
		<description><![CDATA[I spent the day Thursday with some of Kickfire&#8217;s engineers at their headquarters. In this article, I&#8217;d like to go over a little of the system&#8217;s architecture and some other details. Everything in quotation marks in this article is a quote. (I don&#8217;t use quotes when I&#8217;m glossing over a technical point &#8212; at least, [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/' rel='bookmark' title='Permanent Link: Kickfire: stream-processing SQL queries'>Kickfire: stream-processing SQL queries</a></li>
<li><a href='http://www.xaprb.com/blog/2008/04/09/kickfire-is-not-ssd-based/' rel='bookmark' title='Permanent Link: Kickfire is not SSD-based'>Kickfire is not SSD-based</a></li>
<li><a href='http://www.xaprb.com/blog/2010/09/19/a-review-of-relational-database-design-and-the-optimizers-by-lahdenmaki-and-leach/' rel='bookmark' title='Permanent Link: A review of Relational Database Design and the Optimizers by Lahdenmaki and Leach'>A review of Relational Database Design and the Optimizers by Lahdenmaki and Leach</a></li>
<li><a href='http://www.xaprb.com/blog/2010/09/07/a-gentle-introduction-to-couchdb-for-relational-practitioners/' rel='bookmark' title='Permanent Link: A gentle introduction to CouchDB for relational practitioners'>A gentle introduction to CouchDB for relational practitioners</a></li>
<li><a href='http://www.xaprb.com/blog/2010/03/08/nosql-doesnt-mean-non-relational/' rel='bookmark' title='Permanent Link: NoSQL doesn&#8217;t mean non-relational'>NoSQL doesn&#8217;t mean non-relational</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>I spent the day Thursday with some of Kickfire&#8217;s engineers at their
headquarters.  In this article, I&#8217;d like to go over a little of the system&#8217;s
architecture and some other details.</p>

<p>Everything in quotation marks in this article is a quote.  (I don&#8217;t use
quotes when I&#8217;m glossing over a technical point &#8212; at least, not in this
article.)</p>

<p>Even though I saw one of Kickfire&#8217;s engineers running queries on the system,
they didn&#8217;t let me actually take the keyboard and type into it myself.  So
everything I&#8217;m writing here is <strong>still second-hand knowledge</strong>.
It&#8217;s an unreleased product that&#8217;s in very rapid development, so this is
understandable.</p>

<p><a
href="http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster&amp;version=2%&amp;currencyID=0">Kickfire&#8217;s
TPC-H benchmarks are now published</a>, so you can see the results of what I&#8217;ve
been seeing them work on.  They are now #1 in the world, in two categories.  Visit them at their booth in the exhibition area at the conference, and you will be able to see more for yourself.</p>

<h3>The big picture</h3>

<p>At a high level, Kickfire is an appliance consisting of two or more commodity
rack-mountable 1U pizza-box units.</p>

<p>One unit contains the Kickfire chip and a lot of standard, high-speed,
server-grade ECC memory.  This unit is what executes the queries at high
speed.</p>

<p>The other unit is connected to the Kickfire chip unit via a standard PCIe
interconnect.  It runs stock CentOS 5, with MySQL 5.1.  Kickfire has their own
storage engine, which uses fairly well-known techniques such as column storage
and compression.</p>

<p>To the outside world, the unit behaves just like an ordinary MySQL server.
You connect to it in the same manner, you issue the same kinds of queries, you
manage users and privileges the same way, and so on.  However, when you run a
query, it doesn&#8217;t get executed in the traditional MySQL manner (nested-loop
joins with calls to the storage engine via the storage engine API).  Instead,
the query goes to the Kickfire chip and executes there.  The chip is designed to
execute queries very fast, through a variety of techniques that a) I&#8217;m not
allowed to tell you about yet or b) are sometimes unclear to me because
Kickfire was being a little protective about some of my technical questions.</p>

<p>I met with quite a few people at Kickfire, but I&#8217;ll just mention one: Ravi
Krishnamurthy.  Before Kickfire approached me, I had not heard of him.  Anyway,
I&#8217;ll just link to <a
href="http://scholar.google.com/scholar?q=ravi+krishnamurthy">Ravi
Krishnamurthy</a> on Google Scholar, and let you read up on his papers if you
want.  It&#8217;s enough to say that I really enjoyed speaking with him and the other
people at Kickfire.</p>

<p>One of the overall impressions I got was that the Kickfire engineers aren&#8217;t
the type to do something halfway.  When complete, this is not intended to be a
system that has only some of the features you&#8217;d expect.</p>

<h3>I/O bottlenecks</h3>

<p>The Kickfire chip has no registers.  Instead, the Kickfire chip addresses a
very large amount of memory directly.  Remember, registers are a bottleneck.  As
I said in <a
href="http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/">my
first article on Kickfire</a>, using registers to process large amounts of data
is like using a paper cup to fill your bathtub.  Allowing the chip to address
this memory directly removes a huge bottleneck.</p>

<p>There is still on-disk storage, though. (And no, it&#8217;s not SSD.)  The
interconnect between the on-disk storage and the memory is a standard PCIe
connection.  Nothing exotic or proprietary.  But the system is apparently
capable of moving a very large amount of data at very high speed from the disks
to the Kickfire chip&#8217;s memory, where it can be addressed in O(1) time like an
array lookup.</p>

<p>Another interesting technique is that the system does not decompress the data
to operate on it.  According to the engineers, the queries run on the data in
its compressed form.  As Ravi told me, implementing this is &#8220;not for the faint
of heart.&#8221;</p>

<p>Kickfire seems to have really worked hard at removing bottlenecks wherever
possible.  For example, they&#8217;ve rewritten the out-of-the-box drivers for key
pieces of the commodity hardware they&#8217;re using.</p>

<h3>Souped-up MySQL</h3>

<p>If you know how MySQL executes queries, the statement &#8220;Kickfire executes
joins directly in the Kickfire chip&#8221; implies that the Kickfire system isn&#8217;t just
a storage engine, because MySQL currently processes many of the most costly
parts of queries at the server level, not in the storage engine.  Obviously
Kickfire is not going to perform well unless it changes that. 
Kickfire has in fact built their own optimizer, which replaces the
MySQL optimizer.  It compiles the incoming query into a series of
macro-operations, which apparently are very similar to the basic relational
operators (project, join, etc).  This is then sent to the chip for execution,
and as the chip produces results it injects them back into the stream
of bytes that the server normally uses to send results back to the client.</p>

<p>The Kickfire chip doesn&#8217;t implement everything in hardware.  For example,
there is no MD5() function in the chip.  When it encounters an operation it
can&#8217;t do in hardware, it makes a call back to the MySQL server to fill in the
gaps in its functionality.</p>

<p>The rewritten optimizer sounds like an interesting piece of engineering.
Ravi told me with pride that the optimizer is &#8220;world-class&#8221; and &#8220;can stand
toe-to-toe with the best optimizers in the database industry.&#8221; It is a
cost-based optimizer with rewrites (i.e. it transforms the operator tree into
the most efficient equivalent structure) and it is exhaustive (i.e. it tries all
possible combinations to find the best execution plan, unlike MySQL&#8217;s optimizer
which by default switches to a greedy search when the number of tables to be joined
becomes large [correction: as Timour pointed out to me today, I made it sound like MySQL's optimizer
isn't exhaustive; I neglected to mention that you can configure it]).</p>
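
<p>For reference, the configuration mentioned in the correction looks roughly like the following on a stock MySQL server.  This is only a sketch: <code>optimizer_prune_level</code> and <code>optimizer_search_depth</code> are standard MySQL system variables, but whether Kickfire&#8217;s replacement optimizer honors them is an open question that this article does not answer.</p>

<pre><code>-- Show the current optimizer settings, including the join-order search ones.
SHOW VARIABLES LIKE 'optimizer_%';

-- Turn off heuristic pruning so the optimizer does not discard
-- partial plans it considers unpromising.
SET SESSION optimizer_prune_level = 0;

-- Search up to 62 tables deep; a query that joins no more tables
-- than this gets an exhaustive search of join orders.
SET SESSION optimizer_search_depth = 62;</code></pre>

<p>The trade-off is optimization time, because an exhaustive search grows factorially with the number of joined tables.</p>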

<p>I asked whether they had benchmarked the optimizer&#8217;s performance.  (I mean
how fast it can find an optimal query plan, not the performance of its results.)
Of course, there is no standard benchmark for this, but I think it&#8217;s interesting
just to compare it against the MySQL optimizer.  They had not done this, but I
think they will now that I have mentioned it.  I think it&#8217;s relevant because if
you use Kickfire for short queries, a slow-performing optimizer could actually
become noticeable.</p>

<h3>Is it really stream processing?</h3>

<p>I wanted to know whether the chip really does stream processing, or whether
it is only conceptually stream processing that&#8217;s really implemented some other
way.  It sounds to me like it&#8217;s the genuine article.  I asked
some pointed questions to this effect, such as &#8220;is there a way to interrupt a
partially completed query.&#8221;  As it turns out there is, but only because
the stream processor apparently does time-slicing like a standard chip, and when
it comes up for air it can check to see if a query should be aborted.  In
general, I was told, there is no interruption once the data stream starts
flowing.  That lets the query literally &#8220;run at the speed of electrons.&#8221;</p>

<p>But what about subqueries, you ask?  That&#8217;s what I asked too.  Stream
processing is all very well for joins, but what about a correlated subquery, for
example?</p>

<p>It turns out that if you&#8217;re clever, you can figure out ways to decorrelate
them and then execute them in streaming fashion.  The same holds for aggregation
over data that&#8217;s not in the order needed for streamed aggregation.  Pretty
interesting ideas; I can&#8217;t go into them, because those are proprietary, but
Ravi and I talked about them for quite a while.</p>
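
<p>As an illustration only (Kickfire&#8217;s own techniques are proprietary and are not described here), the textbook way to decorrelate a subquery is to rewrite it as a join against a grouped derived table.  The <code>orders</code> table and its columns are hypothetical.</p>

<pre><code>-- Correlated form: conceptually, the inner query runs once per outer row.
SELECT o.order_id, o.amount
FROM orders AS o
WHERE o.amount = (SELECT MAX(i.amount)
                  FROM orders AS i
                  WHERE i.customer_id = o.customer_id);

-- Decorrelated form: compute every customer's maximum once,
-- then join the two row sets.
SELECT o.order_id, o.amount
FROM orders AS o
JOIN (SELECT customer_id, MAX(amount) AS max_amount
      FROM orders
      GROUP BY customer_id) AS m
  ON m.customer_id = o.customer_id
 AND o.amount = m.max_amount;</code></pre>

<p>Both statements return each customer&#8217;s largest orders.  The second has no dependency from the inner query on the outer row, so it can be evaluated as an aggregation followed by a join.</p>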

<p>And very large IN() lists can be turned into a relation and
treated like any other.</p>
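
<p>In plain SQL terms, that rewrite can be sketched as follows.  Again the <code>orders</code> table is hypothetical, and this shows the general idea, not Kickfire&#8217;s internal representation.</p>

<pre><code>-- Original form: a literal IN() list.
SELECT o.order_id, o.amount
FROM orders AS o
WHERE o.customer_id IN (3, 17, 42);

-- Same result with the list turned into a small relation and joined.
-- UNION removes duplicate values, so no order row is returned twice.
SELECT o.order_id, o.amount
FROM orders AS o
JOIN (SELECT 3 AS customer_id
      UNION SELECT 17
      UNION SELECT 42) AS ids
  ON ids.customer_id = o.customer_id;</code></pre>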

<h3>Storage</h3>

<p>Storage is obviously crucial to processing extremely large amounts of data
very fast.  A few of the things I noted about the storage:</p>

<ul>
<li>Each column is stored in a fixed width.  This is how Kickfire can look up a
row as though it&#8217;s doing an array access.</li>
<li>The internal representation is chosen automatically and may not match what
you think.  Kickfire can profile data as it&#8217;s loaded, and choose the type as
it goes.</li>
<li>If you tell Kickfire you&#8217;ll only store values that are X large in a column,
and it builds its column storage space to hold that large a value, what happens
when you then start adding larger values later?  Ravi explained how it works,
and it&#8217;s proprietary right now, but suffice it to say that Kickfire does not need
to rewrite all the data you&#8217;ve already stored if you suddenly start
storing values you didn&#8217;t anticipate.  Yet, it can still maintain O(1)
array-lookup performance on the compressed data.</li>
<li>You can pass the storage engine special comments in the CREATE TABLE
statement to tell it what kinds of data each column will get.  These comments
are part of MySQL&#8217;s standard syntax &#8212; Kickfire has not changed the MySQL
parser, so it should be 100% syntax-compatible with a standard MySQL
server.</li>
<li>Kickfire has a very Oracle-like set of features around tablespaces, extents,
and so on.  You can have multiple tablespaces, and you can add devices to
tablespaces, etc.</li>
<li>Storage is transactional and ACID-compliant, with logging and <a
href="http://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics">ARIES</a>
recovery much like Oracle, InnoDB, etc.  If it surprises you that a system built
for large data warehouses would be transactional and ACID-compliant, welcome to
the club.  I was expecting the usual special-case behavior,
you know, you can load data but you can&#8217;t update it, or something like
that.  But as I said, Kickfire isn&#8217;t doing this halfway.  Plus, TPC-H requires
ACID properties.</li>
</ul>
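
<p>The article does not show what the special comments look like.  As a sketch, assuming the hints travel in MySQL&#8217;s ordinary <code>COMMENT</code> clauses (an assumption; they could just as well be <code>/* ... */</code> comments), a table definition might look like the one below.  The table, the columns, and the comment text are all placeholders, not Kickfire&#8217;s actual vocabulary.</p>

<pre><code>CREATE TABLE sales_fact (
  sale_id   INT NOT NULL,
  -- The COMMENT strings are placeholders for whatever hint text the engine reads.
  sale_date DATE NOT NULL COMMENT 'placeholder: column hint text',
  amount    DECIMAL(10,2) NOT NULL COMMENT 'placeholder: column hint text',
  PRIMARY KEY (sale_id)
) COMMENT = 'placeholder: table-level hint text';</code></pre>

<p>A statement of this shape parses on an unmodified MySQL server, which is the point of the syntax-compatibility claim: the hints ride along in syntax the parser already accepts.</p>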

<h3>Loading, ETL, and star schemas</h3>

<p>Loading data is also important to accelerate: executing queries on large
amounts of data isn&#8217;t good if it takes forever to get the data into the server.
Kickfire has their own suite of tools, including one for loading data that
accelerates the load process with the SQL chip itself.</p>

<p>Kickfire&#8217;s attitude towards star schemas is that you shouldn&#8217;t need to build
a special schema for your data warehouse.  They think their system will be so
fast that you can keep your data in the same schema you use for OLTP.  If that
turns out to be true, that will save a lot of work.  (How much effort have you
put into building a separate schema for your data warehouse?)</p>

<h3>Other notes of interest</h3>

<p>Here are some other tidbits I thought I&#8217;d share with you:</p>

<ul>
<li>The system has support for foreign keys.  It automatically creates indexes
on foreign keys and primary keys.</li>
<li>The standard types of indexes don&#8217;t really apply.  Instead, the indexes
are &#8220;hardware-friendly&#8221; (the other term they used was that the indexes are
&#8220;impedance-matched to the hardware&#8221;).  There are special features for indexing
ranges of dates and indexing words inside a string (but this is not a full-text
index; I&#8217;m unclear on how it really works, but it helps accelerate LIKE queries,
which is important for the TPC-H benchmarks).</li>
<li>The deadlock detection is via cycle detection in the waits-for graph, not
timeout-based. As a result, it should be fast.</li>
<li>The system I saw was running in debug mode, and wrote its optimized query
plan to a file for every query.  I talked with them about making this available
via SQL.  The plan is much more detailed and informative than MySQL&#8217;s EXPLAIN.
They asked me whether it would be a good idea to wedge this information into
EXPLAIN, and I told them I wouldn&#8217;t do that; <a
href="http://en.oreilly.com/mysql2008/public/schedule/detail/300">EXPLAIN is a
tabular output that doesn&#8217;t make much sense unless you really know how to read
it</a>.  When you&#8217;re trying to understand a query plan, which is generally a
tree of relational operators, you need a <a
href="http://www.xaprb.com/blog/2007/07/29/introducing-mysql-visual-explain/">hierarchical
view of it</a>.</li>
<li>They told me that they use the INFORMATION_SCHEMA extensively, but I did not
get a chance to look at it myself.</li>
<li>They also told me that they use UDFs extensively for system management, but
again I can&#8217;t confirm.</li>
</ul>
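
<p>For readers who have not looked at it closely, the tabular output discussed in the list above comes from a statement like this one on stock MySQL.  The <code>customers</code> and <code>orders</code> tables are hypothetical.</p>

<pre><code>-- Ask the stock MySQL optimizer how it would execute a two-table join.
EXPLAIN
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name;</code></pre>

<p>MySQL 5.1 returns one row per table, with the columns <code>id</code>, <code>select_type</code>, <code>table</code>, <code>type</code>, <code>possible_keys</code>, <code>key</code>, <code>key_len</code>, <code>ref</code>, <code>rows</code>, and <code>Extra</code>.  The join order and any nesting have to be inferred from the row order and the <code>id</code> column, which is why a tree view is easier to read.</p>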

<h3>Licensing</h3>

<p>As you probably know, I&#8217;m a strong believer in <a
href="http://www.fsf.org/">Free Software</a>.  I am not aware of any plans for
Kickfire to release the source code for their modified version of MySQL or their
storage engine or optimizer.  These are the satellite diamonds that surround the
crown jewels: open-sourcing them would make it easier to reverse engineer
the chip, which they don&#8217;t want.  However, they&#8217;ve promised me that they&#8217;re
going to open-source some of the migration tools, etc etc.  Not initially, but
as time permits; and later they&#8217;ll look at open-sourcing other parts.</p>

<p>I have made sure that they know where I stand on this: I think the ethical
thing to do is GPL all the code that they ship, and I think everyone I talked to
heard me say that at least once.  If you&#8217;re going to buy their magical hardware,
you deserve to have the source code for everything that runs on it, too.  And
they need to release the interface specs for their hardware so people can use it
in new and surprising ways.  Who knows &#8212; someone could use it to find a cure
for cancer.</p>

<h3>Summary</h3>

<p>My two days with Kickfire left me with a lot more questions, not
surprisingly, and I don&#8217;t think that will change until I actually get access to
a machine and start testing it myself.  I saw a lot of slideshows; I saw some
demos; I walked into the server rooms and saw the pretty blinking lights; but
I&#8217;m not going to tell you that Kickfire will do X or Y because I don&#8217;t know a
heck of a lot.  I was hoping for more hands-on experience and in-depth technical
details, but that wasn&#8217;t the way it really worked out.  However, based on what
I&#8217;ve seen, I have no reason to doubt that Kickfire&#8217;s system will do
what they claim: it will run large, complex queries on very large datasets
extremely quickly.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/' rel='bookmark' title='Permanent Link: Kickfire: stream-processing SQL queries'>Kickfire: stream-processing SQL queries</a></li>
<li><a href='http://www.xaprb.com/blog/2008/04/09/kickfire-is-not-ssd-based/' rel='bookmark' title='Permanent Link: Kickfire is not SSD-based'>Kickfire is not SSD-based</a></li>
<li><a href='http://www.xaprb.com/blog/2010/09/19/a-review-of-relational-database-design-and-the-optimizers-by-lahdenmaki-and-leach/' rel='bookmark' title='Permanent Link: A review of Relational Database Design and the Optimizers by Lahdenmaki and Leach'>A review of Relational Database Design and the Optimizers by Lahdenmaki and Leach</a></li>
<li><a href='http://www.xaprb.com/blog/2010/09/07/a-gentle-introduction-to-couchdb-for-relational-practitioners/' rel='bookmark' title='Permanent Link: A gentle introduction to CouchDB for relational practitioners'>A gentle introduction to CouchDB for relational practitioners</a></li>
<li><a href='http://www.xaprb.com/blog/2010/03/08/nosql-doesnt-mean-non-relational/' rel='bookmark' title='Permanent Link: NoSQL doesn&#8217;t mean non-relational'>NoSQL doesn&#8217;t mean non-relational</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Kickfire: stream-processing SQL queries</title>
		<link>http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/</link>
		<comments>http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/#comments</comments>
		<pubDate>Fri, 04 Apr 2008 13:01:01 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[SQL]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[Cg]]></category>
		<category><![CDATA[column store]]></category>
		<category><![CDATA[CPUs]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Keith Murphy]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[MPP]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[mysqluc2008]]></category>
		<category><![CDATA[pluggable storage engine]]></category>
		<category><![CDATA[QPU]]></category>
		<category><![CDATA[Von Neumann bottleneck]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/</guid>
		<description><![CDATA[Some of you have noticed Kickfire, a new sponsor at this year&#8217;s MySQL Conference and Expo. Like Keith Murphy, I have been involved with them for a while now. This article explains the basics of how their technology is different from the current state of the art in complex queries on large amounts of data. [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/' rel='bookmark' title='Permanent Link: Kickfire: relational algebra in a chip'>Kickfire: relational algebra in a chip</a></li>
<li><a href='http://www.xaprb.com/blog/2008/04/09/kickfire-is-not-ssd-based/' rel='bookmark' title='Permanent Link: Kickfire is not SSD-based'>Kickfire is not SSD-based</a></li>
<li><a href='http://www.xaprb.com/blog/2009/08/18/how-to-find-un-indexed-queries-in-mysql-without-using-the-log/' rel='bookmark' title='Permanent Link: How to find un-indexed queries in MySQL, without using the log'>How to find un-indexed queries in MySQL, without using the log</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/31/a-simple-way-to-make-birthday-queries-easier-and-faster/' rel='bookmark' title='Permanent Link: A simple way to make birthday queries easier and faster'>A simple way to make birthday queries easier and faster</a></li>
<li><a href='http://www.xaprb.com/blog/2009/11/01/catching-erroneous-queries-without-mysql-proxy/' rel='bookmark' title='Permanent Link: Catching erroneous queries, without MySQL proxy'>Catching erroneous queries, without MySQL proxy</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>Some of you have noticed <a href="http://www.kickfire.com/">Kickfire</a>, a
new sponsor at this year&#8217;s <a href="http://www.mysqlconf.com/">MySQL Conference and
Expo</a>.  Like <a href="http://www.paragon-cs.com/wordpress/?p=132">Keith
Murphy</a>, I have been involved with them for a while now.  This article
explains the basics of how their technology is different from the current state
of the art in complex queries on large amounts of data.</p>

<p>Kickfire is developing a MySQL appliance that combines a pluggable
storage engine (for MySQL 5.1) with a new kind of chip.  On the surface, the
storage engine is not that revolutionary: it is a column-store engine with data
compression and some other techniques to reduce disk I/O, which is kind of par
for the course in data warehousing today.  The chip is the
really exciting part of the technology.</p>

<p>The simplest description of their chip is that it runs SQL natively.</p>

<p>OK, but now you need to do something: <em>get &#8220;SQL chip&#8221; out of your mind</em>.  It
doesn&#8217;t work the way you think it does, and your pre-conceived ideas may prevent
you from understanding how different this really is.  (Everyone says their
technology is a paradigm shift, so I expect you to be numb to this phrase.)</p>

<p>I can&#8217;t explain all of the technology in this post,
partially because of NDA, but I want to prepare you for when you do hear the
details.  If you&#8217;re like me, you&#8217;ll miss a lot of stuff because you have tunnel
vision, and then you&#8217;ll say &#8220;wait, I get it now!  Can you please repeat
everything you&#8217;ve been saying for the last hour so I can think about it all over
again?&#8221;</p>

<h3>An important note</h3>

<p><strong>Very important:</strong> I have not seen this technology, tasted it,
smelled it, or benchmarked it.  This information is based on discussions with
their engineering and other staff.  I will not pretend
to know anything I don&#8217;t. I will be spending two days in the lab with the engineers next
week, and then I will be able to write in greater detail with more 
confidence.</p>

<h3>How your computer currently works</h3>

<p>To understand how Kickfire&#8217;s chip works, you need to understand something you
probably take for granted: how most chips work.  Most computers today use the
same architecture they always have: there&#8217;s data that is held in the CPU, and
data that is not.  The CPU has registers, which hold a minuscule bit of data &#8211;
the data it is currently working with.  When the CPU processes an instruction
that asks for some more data it doesn&#8217;t have, the CPU has to go fetch it.  In
the meantime, the instruction can&#8217;t complete.</p>

<p>As you might imagine, this is not terribly efficient.  Fetching data that&#8217;s
not in the CPU can take hundreds of CPU cycles (or more).  To work around this,
computer architects have developed a memory hierarchy: the on-chip caches, the
main memory, and the hard drive, to name a few levels.  Each level acts as a
cache for the slower one below it, which makes it faster to get data when it&#8217;s
not already on hand.  And modern chips have a pipeline, too.  The chip looks at
the instructions as they flow through the pipeline, tries to predict which
branches they&#8217;ll take and which data they&#8217;re going to need, then pre-fetches it.</p>

<p>In the best case, this works okay, but not always.  For example, the Pentium 4
has a very long pipeline, so the cost of a wrong branch prediction is very high.
Another case is when you simply need a lot of data, such as tens of gigabytes.
Suppose for your 10GB operation, you&#8217;re only going to look at each byte once (a
common occurrence in data warehousing queries).  This renders your caches
useless, because caches work on the principle that you&#8217;re likely to look at
recently accessed data again soon.</p>
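
<p>Here&#8217;s a rough sketch of that last point.  It is a toy model, not a measurement of any real CPU: the block counts and cache size are numbers I made up, and the cache is a plain LRU written in Python.  It replays two access patterns through the same small cache and reports the hit rate for each.</p>

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Replay a sequence of block ids through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


# A "hot" workload: the same 50 blocks read over and over (assumed numbers).
hot = [i % 50 for i in range(10_000)]
# A scan workload: 10,000 distinct blocks, each read exactly once.
scan = list(range(10_000))

hot_rate = hit_rate(hot, capacity=100)
scan_rate = hit_rate(scan, capacity=100)
print(f"hot working set:  {hot_rate:.3f}")
print(f"single-pass scan: {scan_rate:.3f}")
```

<p>The hot workload hits the cache almost every time, while the single-pass scan never hits at all, because no block is ever requested twice.</p>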

<p>In these cases, the speed of the computation is constrained by the <a
href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">Von Neumann
bottleneck</a>: the inefficient fetch-compute-wait cycle of constantly
going to the memory (or disk) for more data, a teeny bit at a time.  Remember,
even in-memory data is very slow compared to data that&#8217;s in the registers.
Having a lot of fast memory is not a <strong>solution</strong> to the Von
Neumann bottleneck.  It&#8217;s a <strong>workaround</strong> to reduce the cost.</p>

<h3>Kickfire&#8217;s architecture</h3>

<p>Kickfire is designed to work well where today&#8217;s general-purpose computing
architectures run queries slowly because they&#8217;re sitting on their thumbs much of
the time.  Think data warehousing: complex queries with lots of data.</p>

<p>What is the industry&#8217;s answer to this?  So-called massively
parallel processing, or MPP.  Current MPP data-warehousing solutions are special-purpose
database software that runs queries on dozens or hundreds of CPUs, which occupy
a lot of rack space and require lots of power, hardware, and
cooling.  &#8220;If you throw enough Von Neumann machines at the problem
simultaneously, they can answer your questions faster,&#8221; or so the thinking goes.
In other words, the current state of the art is to arrange conventional
computers in new ways.</p>

<p>Kickfire takes the opposite approach: <em>stream processing</em>.  This is a
fundamentally different computing architecture.  Stream processing is to Von
Neumann machines as LISP is to C.</p>

<p>For those of you who aren&#8217;t LISP programmers, here&#8217;s another analogy: In
stream processing, you take a bunch of data and you shove it through the chip
without stopping.  Rather than the chip asking for data from the storage
subsystem as needed, the data actually gets pushed at the chip.  That is, it&#8217;s
push-processing instead of the conventional pull-processing.</p>

<p>Conventional processing is like trying to fill your bathtub
from the sink with a paper cup.  Stream processing is like putting your tub
under the sink and opening the drain.</p>
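
<p>To make push versus pull a little more concrete, here is a minimal software sketch.  It says nothing about how the Kickfire chip is actually built; the function and class names are my own inventions for illustration.  In the pull version the consumer asks for each value, and in the push version the storage side drives the flow while the processing stage stays put.</p>

```python
def pull_sum(storage):
    """Pull model: the consumer asks storage for one value at a time."""
    total = 0
    position = 0
    while position != len(storage):
        value = storage[position]   # each request stands in for a fetch-and-wait
        total += value
        position += 1
    return total


class SumStage:
    """Push model: the stage sits still and values are delivered to it."""

    def __init__(self):
        self.total = 0

    def receive(self, value):
        self.total += value


def push_all(storage, stage):
    """The storage side drives the flow, handing every value to the stage."""
    for value in storage:
        stage.receive(value)
    return stage.total


data = list(range(1, 101))
pulled = pull_sum(data)
pushed = push_all(data, SumStage())
print(pulled, pushed)
```

<p>Both versions compute the same answer, and in Python they cost about the same.  The sketch only shows who is in charge of moving the data; the performance argument is about hardware, where the puller has to wait on every request.</p>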

<p>I&#8217;m taking some liberties here, to illustrate the differences.  As I said, I
haven&#8217;t seen the wiring diagrams of the Kickfire chip.  But hopefully you get
the concept.</p>

<p>This is not a new idea.  If you&#8217;ve worked with modern graphics cards, you&#8217;ve
seen this in action.  Programming languages like <a
href="http://en.wikipedia.org/wiki/Cg_%28programming_language%29">Cg</a> express
the stream-processing concepts elegantly.  If you&#8217;ve ever been in a classroom
full of C++ programmers trying to learn Cg, you&#8217;ve seen how hard it is to grasp
this different approach.  Essentially, graphics programming on one of these
chips is a series of transformations, not a series of instructions.  You input
some vertices at one end of the processor, and you tell the chip to do some
matrix multiplies and so on.  Out pops the result at the other end.</p>

<p>If this doesn&#8217;t sound much different from instructions&#8230; well, meditate on
it.  It&#8217;s like an assembly line, but nobody leaves their station along the
conveyor belt.  In a traditional CPU, the &#8220;person&#8221; at the conveyor
<em>constantly</em> leaves to go get the materials he needs.</p>

<p>Kickfire runs on commodity hardware, and it is just one or two servers, not
racks full.  Like many other systems designed for large amounts of data, it uses
a column data store.  Unlike many other systems, it uses an industry standard
interconnect and a custom pluggable MySQL storage engine.</p>

<h3>What took so long?</h3>

<p>Stream processing is the obvious way to run SQL queries.  Some readers may
never have thought about it this way, but my guess is that a lot of you already
think of SQL in a stream-processing way, even though you might know that
computers today really implement it in conventional ways.  I have always tried
to think of it this way, and I <a
href="http://www.xaprb.com/blog/2005/10/03/understanding-sql-joins/">always try
to explain SQL as a stream</a>, too.</p>
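
<p>As an illustration of thinking about SQL as a stream, here is a small sketch in which rows flow through a chain of stages.  The table, column names, and stage functions are all hypothetical, and Python generators are only a stand-in for the idea; this is not how MySQL or Kickfire executes a query.</p>

```python
from collections import defaultdict


def scan(rows):
    """Source stage: emit rows one at a time."""
    for row in rows:
        yield row


def where(rows, predicate):
    """Filter stage: pass along only the rows that match."""
    for row in rows:
        if predicate(row):
            yield row


def group_sum(rows, key, value):
    """Aggregate stage: consume the stream and total `value` per `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical sample data.
orders = [
    {"customer": "a", "amount": 5},
    {"customer": "a", "amount": 20},
    {"customer": "b", "amount": 15},
    {"customer": "b", "amount": 30},
    {"customer": "c", "amount": 1},
]

# Roughly: SELECT customer, SUM(amount) FROM orders
#          WHERE amount >= 10 GROUP BY customer
result = group_sum(where(scan(orders), lambda r: r["amount"] >= 10),
                   key="customer", value="amount")
print(result)
```

<p>Each stage sees a row once and hands it on, which is the mental picture I mean: the query is a set of transformations the data passes through, not a program that goes looking for data.</p>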

<p>So when I was on a call with the Kickfire engineers and it finally sank in, I
felt really silly.  Why didn&#8217;t I think of that?  It&#8217;s so obvious.</p>

<p>But then again, most breakthroughs are really obvious in hindsight.</p>

<h3>Performance</h3>

<p>I have seen initial benchmark results, but I&#8217;m under NDA about them.  I can&#8217;t
say any more yet.  And I haven&#8217;t run any benchmarks myself yet, nor have I had
access to the hardware.  So this is all theoretical until I get my hands on the
system.  Caveat emptor, your mileage may vary, etc.</p>

<p>One thing I&#8217;m interested in is how well the system performs for general-purpose
queries.  When you take it away from complex queries on lots of data, does it still have
an advantage?  I&#8217;ll be trying to get an answer to that question next week.</p>

<h3>About Kickfire</h3>

<p>They are still in stealth mode and my NDA prevents me from being able to
tell you a lot or answer all your questions yet.  But someday they will no
longer be in stealth mode, and then you&#8217;ll find out everything you want to know.</p>

<p>Hint: they are going to be giving a <a
href="http://en.oreilly.com/mysql2008/public/schedule/detail/3286">keynote
address</a> on their technology, but there&#8217;s not much detail in the description.
Come to the keynote and find out.</p>

<h3>Why am I writing this?</h3>

<p>Well, they promised me chocolate&#8230;</p>

<p>Seriously: I do have an agenda, but there are actually several motivations
here.  The first is that they initially contacted me because of my involvement
with the MySQL community.  Of course they&#8217;re hoping to gain publicity through
me, but they also wanted to let the community have some input.  I&#8217;ve been sort
of a secret liaison for you, representing your interests to Kickfire.  I&#8217;ve
advocated pretty strongly for certain things I&#8217;ll go into in a later post.</p>

<p>The other reason I&#8217;m working with them is that I&#8217;m excited about their
technology, even though I don&#8217;t have hard evidence about their claims and
benchmarks yet.  If what they&#8217;re saying is true, their product will be very good
for the environment.  It will let people save a lot of energy (power, cooling,
the need to build data centers) and it will help avoid the need to build a bunch
of servers.  Computers are extremely
toxic to manufacture.</p>

<p>I&#8217;m also interested in seeing them succeed because I anticipate that even if
this product isn&#8217;t what it claims to be, they&#8217;ll prove the concept and there
will be a competitive rush into this space.  That is guaranteed to produce a lot
of changes in how people build computers, probably in more areas than just data
warehousing.  So I&#8217;m happy that they&#8217;re starting this, because others will
finish it whether they do or not.  And that&#8217;s good news for the environment,
too.</p>

<p>Stay tuned.  More details are forthcoming.</p>

<p>PS: if you have questions you&#8217;d like me to look into while I&#8217;m onsite with the engineers, feel free to post them in the comments.  But I probably can&#8217;t answer them yet.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2008/04/04/kickfire-stream-processing-sql-queries/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>MySQL Archiver can now archive each row to a different table</title>
		<link>http://www.xaprb.com/blog/2007/11/05/mysql-archiver-can-now-archive-each-row-to-a-different-table/</link>
		<comments>http://www.xaprb.com/blog/2007/11/05/mysql-archiver-can-now-archive-each-row-to-a-different-table/#comments</comments>
		<pubDate>Mon, 05 Nov 2007 13:28:35 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[archiving]]></category>
		<category><![CDATA[data archiving]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[Hockey Stick Graph]]></category>
		<category><![CDATA[InnoDB]]></category>
		<category><![CDATA[Kettle]]></category>
		<category><![CDATA[Kevin Burton]]></category>
		<category><![CDATA[myisam]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[olap]]></category>
		<category><![CDATA[oltp]]></category>
		<category><![CDATA[Plugins]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/2007/11/05/mysql-archiver-can-now-archive-each-row-to-a-different-table/</guid>
		<description><![CDATA[<p>One of the enhancements I added to MySQL Archiver in the recent release was listed innocently in the changelog as "Destination plugins can now rewrite the INSERT statement."  Not very exciting or informative, huh?  Keep reading.</p>


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2007/06/09/mysql-archiver-092-released/' rel='bookmark' title='Permanent Link: MySQL Archiver 0.9.2 released'>MySQL Archiver 0.9.2 released</a></li>
<li><a href='http://www.xaprb.com/blog/2007/06/15/archive-strategies-for-oltp-servers-part-3/' rel='bookmark' title='Permanent Link: Archive strategies for OLTP servers, Part 3'>Archive strategies for OLTP servers, Part 3</a></li>
<li><a href='http://www.xaprb.com/blog/2007/06/06/mysql-archiver-091-released/' rel='bookmark' title='Permanent Link: MySQL Archiver 0.9.1 released'>MySQL Archiver 0.9.1 released</a></li>
<li><a href='http://www.xaprb.com/blog/2007/06/13/archive-strategies-for-oltp-servers-part-1/' rel='bookmark' title='Permanent Link: Archive strategies for OLTP servers, Part 1'>Archive strategies for OLTP servers, Part 1</a></li>
<li><a href='http://www.xaprb.com/blog/2007/06/14/archive-strategies-for-oltp-servers-part-2/' rel='bookmark' title='Permanent Link: Archive strategies for OLTP servers, Part 2'>Archive strategies for OLTP servers, Part 2</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>One of the enhancements I added to <a href="http://www.xaprb.com/blog/2007/11/04/mysql-toolkit-version-1204-released/">MySQL Archiver in the recent release</a> was listed innocently in the changelog as &#8220;Destination plugins can now rewrite the INSERT statement.&#8221;  Not very exciting or informative, huh?  Keep reading.</p>

<p>If you&#8217;ve used plugins with MySQL Archiver you know that I created a series of &#8220;hooks&#8221; where plugins can take some action: before beginning, before archiving each row, etc.  This lets plugins do things like create new destination tables, aggregate archived rows to summary tables during archiving (great for building data warehouses, though not as sophisticated as <a href="http://kettle.pentaho.org/">Kettle</a>), and so on.  Well, this release added a new hook for plugins: <code>custom_sth</code>.</p>

<p>This lets a plugin override the prepared statement the tool will use to insert rows into the archive.  By default the prepared statement just inserts into the destination table.  But the <code>custom_sth</code> hook lets the plugin inspect the row that&#8217;s about to be archived and decide what to do with it.  This lets it do interesting things like archive rows to different tables.</p>
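
<p>The idea can be sketched in a few lines.  This is not the MySQL Archiver plugin interface: the class, method, table, and column names below are hypothetical, and SQLite stands in for MySQL so the example runs on its own.  It only shows the shape of the technique, which is to inspect each row and pick the INSERT statement (and therefore the destination table) based on what the row contains.</p>

```python
import sqlite3


class RoutingInserter:
    """Chooses an INSERT statement per row, in the spirit of a custom_sth hook.

    Hypothetical sketch: the class, method, and table names are invented for
    illustration and are not MySQL Archiver's plugin interface.
    """

    def __init__(self, conn):
        self.conn = conn
        self.statements = {}   # table name -> INSERT statement text

    def statement_for(self, row):
        # Route by the year and month of the row's date, e.g. "2007-11".
        year, month = row["created"][:7].split("-")
        table = f"archive_{year}_{month}"
        if table not in self.statements:
            # Create the destination table the first time it is needed.
            self.conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table} "
                "(id INTEGER PRIMARY KEY, customer INTEGER, created TEXT)"
            )
            self.statements[table] = (
                f"INSERT INTO {table} (id, customer, created) VALUES (?, ?, ?)"
            )
        return self.statements[table]

    def archive(self, row):
        sql = self.statement_for(row)
        self.conn.execute(sql, (row["id"], row["customer"], row["created"]))


conn = sqlite3.connect(":memory:")
inserter = RoutingInserter(conn)
rows = [
    {"id": 1, "customer": 7, "created": "2007-10-30"},
    {"id": 2, "customer": 7, "created": "2007-11-01"},
    {"id": 3, "customer": 9, "created": "2007-11-05"},
]
for row in rows:
    inserter.archive(row)

october = conn.execute("SELECT COUNT(*) FROM archive_2007_10").fetchone()[0]
november = conn.execute("SELECT COUNT(*) FROM archive_2007_11").fetchone()[0]
print(october, november)
```

<p>Splitting by month keeps each destination table, and its indexes, small.  Routing by customer or by both dimensions would follow the same pattern with a different table-name rule.</p>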

<p>This came up because some of the tables I&#8217;m archiving to suddenly hit the bend in the <a href="http://en.wikipedia.org/wiki/Hockey_Stick_graph">hockey-stick curve</a>.  I diagnosed the problem very simply: inserts began taking most of the time during archiving.  As you might know, MySQL Archiver has a statistics mode where it profiles every operation and reports the stats at the end.  I&#8217;m archiving out of InnoDB into MyISAM; take a look at the stats:</p>

<pre>Action          Count       Time        Pct
inserting      800584 12722.8245      88.35
deleting       800584  1464.1040      10.17
print_file     800584    58.3453       0.41
commit           3204    29.4391       0.20
select           1602     8.5654       0.06
other               0   116.5321       0.81</pre>

<p>Inserting suddenly took 88% of the time spent archiving, when it had been taking a very small fraction of the time.  I&#8217;d been meaning to split the archived data out by date and/or customer, and this convinced me it was time to stop procrastinating.  There are columns in the archived rows for both of these dimensions in the data, so it shouldn&#8217;t be hard.  So I added the <code>custom_sth</code> hook, wrote a 40-line plugin, and did it.  Results:</p>

<pre>Action             Count       Time        Pct
deleting           51675   525.2777      87.62
inserting          51675    49.3903       8.24
print_file         51675     4.4639       0.74
commit               208     2.1553       0.36
custom_sth         51675     1.4575       0.24
select               104     0.6714       0.11
before_insert      51675     0.1135       0.02
before_begin           1     0.0001       0.00
plugin_start           1     0.0000       0.00
after_finish           1     0.0000       0.00
other                  0    15.9868       2.67</pre>

<p>(You can see the effect of having a plugin, because the time taken for all the hooks is listed in the stats.  There was no plugin previously.)</p>

<p>Now inserting takes only 8% of the time required to archive.  Put another way, it used to insert 63 rows per second; now it inserts 1046 rows per second.  These are single-row inserts.  (The tool is not intended to archive fast; it is intended to <a href="http://www.xaprb.com/blog/2006/05/02/how-to-write-efficient-archiving-and-purging-jobs-in-sql/">archive without disturbing the OLTP processes</a>.  Obviously this server can do a lot more inserts and deletes than this.)</p>
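
<p>For anyone who wants to check the arithmetic, the rates come straight from the Count and Time columns of the two &#8220;inserting&#8221; rows in the stats above.  The helper function name is just for illustration.</p>

```python
def rows_per_second(count, seconds):
    """Throughput of one profiled action: rows handled per second of time spent."""
    return count / seconds


# "inserting" row from the first run (single destination table).
before = rows_per_second(800584, 12722.8245)
# "inserting" row from the second run (rows split across tables by the plugin).
after = rows_per_second(51675, 49.3903)
speedup = after / before

print(round(before), round(after))
print(f"speedup: {speedup:.1f}x")
```

<p>That works out to 63 and 1046 rows per second, or somewhere between a 16-fold and 17-fold improvement in insert throughput.</p>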

<p>What had happened?  The MyISAM tables on the destination end had just gotten too big for their indexes to fit in memory, and the inserts had suddenly slowed dramatically.  I didn&#8217;t want to give them a lot more memory, because I want the memory to be used for the InnoDB data on that machine.  This is the same kind of thing, I&#8217;d guess, that <a href="http://feedblog.org/2007/11/04/mysql-and-disk-transfers-per-second/">Kevin Burton just wrote about</a>.</p>

<p>Oh yeah, while I was at it, I totally rewrote the archiver with unit-tested, test-driven, test-first, other-buzzword-compliant code.  I added a lot of other improvements, too.  For example, it can now efficiently archive tables whose keys are much harder to optimize, such as nullable non-unique non-primary keys.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2007/11/05/mysql-archiver-can-now-archive-each-row-to-a-different-table/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>High Performance MySQL, Second Edition: Backup and Recovery</title>
		<link>http://www.xaprb.com/blog/2007/09/19/high-performance-mysql-second-edition-backup-and-recovery/</link>
		<comments>http://www.xaprb.com/blog/2007/09/19/high-performance-mysql-second-edition-backup-and-recovery/#comments</comments>
		<pubDate>Wed, 19 Sep 2007 21:46:48 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[backup]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[High Performance MySQL]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[recovery]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/2007/09/19/high-performance-mysql-second-edition-backup-and-recovery/</guid>
		<description><![CDATA[<p>Progress on High Performance MySQL, Second Edition is coming along nicely.  You have probably noticed the lack of epic multi-part articles on this blog lately -- that's because I'm spending most of my spare time on the book.  At this point, we have significant work done on some of the hardest chapters, like Schema Optimization and Query Optimization.  I've been deep in the guts of those hard optimization chapters for a while now, so I decided to venture into lighter territory: Backup and Recovery, which is one of the few chapters we planned to "revise and expand" from the first edition, rather than completely writing from scratch.  I'd love to hear your thoughts and wishes -- click through to the full article for more details on the chapter and how it's shaping up.</p>


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2007/10/02/progress-on-high-performance-mysql-backup-and-recovery-chapter/' rel='bookmark' title='Permanent Link: Progress on High Performance MySQL Backup and Recovery chapter'>Progress on High Performance MySQL Backup and Recovery chapter</a></li>
<li><a href='http://www.xaprb.com/blog/2007/10/18/high-performance-mysql-second-edition-replication-scaling-and-high-availability/' rel='bookmark' title='Permanent Link: High Performance MySQL, Second Edition: Replication, Scaling and High Availability'>High Performance MySQL, Second Edition: Replication, Scaling and High Availability</a></li>
<li><a href='http://www.xaprb.com/blog/2007/10/07/high-performance-mysql-second-edition-query-performance-optimization/' rel='bookmark' title='Permanent Link: High Performance MySQL, Second Edition: Query Performance Optimization'>High Performance MySQL, Second Edition: Query Performance Optimization</a></li>
<li><a href='http://www.xaprb.com/blog/2007/10/05/high-performance-mysql-second-edition-advanced-sql-functionality/' rel='bookmark' title='Permanent Link: High Performance MySQL, Second Edition: Advanced SQL Functionality'>High Performance MySQL, Second Edition: Advanced SQL Functionality</a></li>
<li><a href='http://www.xaprb.com/blog/2007/08/30/coming-soon-high-performance-mysql-second-edition/' rel='bookmark' title='Permanent Link: Coming soon: High Performance MySQL, Second Edition'>Coming soon: High Performance MySQL, Second Edition</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>Progress on High Performance MySQL, Second Edition is coming along nicely.  You have probably noticed the lack of epic multi-part articles on this blog lately &#8212; that&#8217;s because I&#8217;m spending most of my spare time on the book.  At this point, we have significant work done on some of the hardest chapters, like Schema Optimization and Query Optimization.  I&#8217;ve been deep in the guts of those hard optimization chapters for a while now, so I decided to venture into lighter territory: Backup and Recovery, which is one of the few chapters we planned to &#8220;revise and expand&#8221; from the first edition, rather than completely writing from scratch.</p>

<p>Since we decided to take that approach, I began by following the outline from the first edition, and figured I&#8217;d re-read the first edition&#8217;s chapter and re-outline, then add more material as appropriate.  To my surprise, I found this chapter in the first edition is one of the most cursory (I don&#8217;t mean to criticize too much; you&#8217;ll see where I&#8217;m going with this in a second).  It&#8217;s quite short and doesn&#8217;t really discuss recovery at all, despite the chapter title.  There&#8217;s one sub-section titled &#8220;Recovery,&#8221; but it&#8217;s only a few paragraphs, and mostly discusses <em>dumping</em>, not recovery!  [<strong>Edit</strong>: whoops, I see each subsection in the "Tools and Techniques" section has a few words about how to restore backups created with that specific tool.  But there's still not much general advice about how to restore backups.]</p>

<p>The chapter devotes a lot of space to code listings and such, and too little to the question of how to do high-performance backups in a high-performance application, in my opinion.  I quickly decided it needs to be <em>significantly</em> expanded, not just updated, and I scrapped the original text and became more liberal with the outline.  I&#8217;m referring to the first edition as I write, but I&#8217;m not keeping any of the text.  Chalk it up to perfectionism.</p>

<p>The outline, as I have it so far, is as follows.  If you compare it to the first edition, you&#8217;ll see I&#8217;ve rearranged it quite a bit:</p>

<pre>1 Why Backups?
   (very brief, even more so than the first edition)
2 Considerations and Tradeoffs
   2.1 How Much Can You Afford to Lose?
   2.2 Online or Offline?
   2.3 Dump or Raw Backup?
   2.4 Onsite or Offsite?
   2.5 What to Back Up
   2.6 Storage Engines and Consistency
   2.7 Replication
3 Restoring from a Backup
   3.1 Copying Files Across the Network
   3.2 Starting MySQL
   3.3 Point-In-Time Recovery
4 Tools and Techniques
   4.1 mysqldump
   4.2 mysqlhotcopy
   4.3 Zmanda Recovery Manager
   4.4 InnoDB Hot Backup
   4.5 Offline Backups
   4.6 Filesystem Snapshots
   4.7 MySQL Global Hot Backup
   4.8 Automating and Scripting Backups
5 Rolling Your Own Backup Script</pre>

<p>At this point, I have written sections 1, 2 and 3, which are about 11 pages in OpenOffice.org (compare to 6 pages on paper in the first edition).  I&#8217;m sure this will only grow as other things occur to me.  The outline of section 4 is completely open to change, and section 5 might not even happen; if you can script, you can script.  Otherwise, you might want to use one of the tools listed in section 4.  All in all, I&#8217;d say we&#8217;re looking at about 25 to 30 pages, just based on what&#8217;s in my head and not yet written down.</p>

<p>Now, to come to my point: what would be helpful to you?  Are there any challenges you&#8217;d like me to cover, such as how you back up a data warehouse with terabytes of data?  (I&#8217;ve already done that, in What To Back Up, but feel free to ask anyway.)  Are there challenges <em>you</em> have had to solve, which you think would be very helpful to others?  This chapter is largely open to suggestion at this point.  If you tell me/us what you&#8217;d like to see, this is your opportunity to get at least four experts to solve your problems in-depth.</p>

<p>The usual disclaimers apply: no guarantees, this is all open to change, this is top-secret pre-production material anyway and you never saw this web page. What is the first rule of Fight Club, again?</p>

<p>I&#8217;m looking forward to your feedback.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2007/09/19/high-performance-mysql-second-edition-backup-and-recovery/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

