<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xaprb &#187; InnoDB</title>
	<atom:link href="http://www.xaprb.com/blog/tag/innodb/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xaprb.com/blog</link>
	<description>Stay curious!</description>
	<lastBuildDate>Thu, 09 Feb 2012 03:58:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>More on measuring IO latency</title>
		<link>http://www.xaprb.com/blog/2011/05/24/more-on-measuring-io-latency/</link>
		<comments>http://www.xaprb.com/blog/2011/05/24/more-on-measuring-io-latency/#comments</comments>
		<pubDate>Tue, 24 May 2011 22:17:10 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[SQL]]></category>
		<category><![CDATA[Brendan Gregg]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[InnoDB]]></category>
		<category><![CDATA[Mark Leith]]></category>
		<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=2337</guid>
		<description><![CDATA[To follow on to my earlier links to Brendan Gregg&#8217;s blog posts on measuring I/O latency, there is a third one discussing DTrace, and then a very detailed response from Mark Leith showing how to do it with the PERFORMANCE_SCHEMA in MySQL 5.5.


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2011/05/15/disk-latency-versus-filesystem-latency/' rel='bookmark' title='Permanent Link: Disk latency versus filesystem latency'>Disk latency versus filesystem latency</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/19/measuring-the-popularity-of-the-percona-mysql-build/' rel='bookmark' title='Permanent Link: Measuring the popularity of the Percona MySQL build'>Measuring the popularity of the Percona MySQL build</a></li>
<li><a href='http://www.xaprb.com/blog/2011/07/04/measuring-open-source-success-by-jobs/' rel='bookmark' title='Permanent Link: Measuring open-source success by jobs'>Measuring open-source success by jobs</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/08/thoughts-on-the-new-performance_schema-in-mysql/' rel='bookmark' title='Permanent Link: Thoughts on the new PERFORMANCE_SCHEMA in MySQL'>Thoughts on the new PERFORMANCE_SCHEMA in MySQL</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/12/what-do-you-know-about-oracles-innodb-plus-storage-engine/' rel='bookmark' title='Permanent Link: What do you know about Oracle&#8217;s InnoDB+ storage engine?'>What do you know about Oracle&#8217;s InnoDB+ storage engine?</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>To follow on to my earlier links to Brendan Gregg&#8217;s blog posts on measuring I/O latency, there is <a href="http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/">a third one discussing DTrace</a>, and then a very detailed response from Mark Leith showing <a href="http://www.markleith.co.uk/?p=656">how to do it with the PERFORMANCE_SCHEMA in MySQL 5.5</a>.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2011/05/15/disk-latency-versus-filesystem-latency/' rel='bookmark' title='Permanent Link: Disk latency versus filesystem latency'>Disk latency versus filesystem latency</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/19/measuring-the-popularity-of-the-percona-mysql-build/' rel='bookmark' title='Permanent Link: Measuring the popularity of the Percona MySQL build'>Measuring the popularity of the Percona MySQL build</a></li>
<li><a href='http://www.xaprb.com/blog/2011/07/04/measuring-open-source-success-by-jobs/' rel='bookmark' title='Permanent Link: Measuring open-source success by jobs'>Measuring open-source success by jobs</a></li>
<li><a href='http://www.xaprb.com/blog/2009/02/08/thoughts-on-the-new-performance_schema-in-mysql/' rel='bookmark' title='Permanent Link: Thoughts on the new PERFORMANCE_SCHEMA in MySQL'>Thoughts on the new PERFORMANCE_SCHEMA in MySQL</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/12/what-do-you-know-about-oracles-innodb-plus-storage-engine/' rel='bookmark' title='Permanent Link: What do you know about Oracle&#8217;s InnoDB+ storage engine?'>What do you know about Oracle&#8217;s InnoDB+ storage engine?</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2011/05/24/more-on-measuring-io-latency/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How InnoDB performs a checkpoint</title>
		<link>http://www.xaprb.com/blog/2011/01/29/how-innodb-performs-a-checkpoint/</link>
		<comments>http://www.xaprb.com/blog/2011/01/29/how-innodb-performs-a-checkpoint/#comments</comments>
		<pubDate>Sat, 29 Jan 2011 14:19:03 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Adaptive Flushing]]></category>
		<category><![CDATA[checkpoint]]></category>
		<category><![CDATA[InnoDB]]></category>
		<category><![CDATA[Mark Callaghan]]></category>
		<category><![CDATA[Vadim Tkachenko]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=2174</guid>
		<description><![CDATA[InnoDB&#8217;s checkpoint algorithm is not well documented. It is too complex to explain in even a long blog post, because to understand checkpoints, you need to understand a lot of other things that InnoDB does. I hope that explaining how InnoDB does checkpoints in high-level terms, with simplifications, will be helpful. A lot of the [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2007/09/18/how-to-debug-innodb-lock-waits/' rel='bookmark' title='Permanent Link: How to debug InnoDB lock waits'>How to debug InnoDB lock waits</a></li>
<li><a href='http://www.xaprb.com/blog/2007/12/21/how-i-patched-innodb-to-show-locks-held/' rel='bookmark' title='Permanent Link: How I patched InnoDB to show locks held'>How I patched InnoDB to show locks held</a></li>
<li><a href='http://www.xaprb.com/blog/2009/10/25/what-do-the-innodb-insert-buffer-statistics-mean/' rel='bookmark' title='Permanent Link: What do the InnoDB insert buffer statistics mean?'>What do the InnoDB insert buffer statistics mean?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/13/innodb-is-a-nosql-database/' rel='bookmark' title='Permanent Link: InnoDB is a NoSQL database'>InnoDB is a NoSQL database</a></li>
<li><a href='http://www.xaprb.com/blog/2010/03/04/a-growing-trend-innodb-mutex-contention/' rel='bookmark' title='Permanent Link: A growing trend: InnoDB mutex contention'>A growing trend: InnoDB mutex contention</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>InnoDB&#8217;s checkpoint algorithm is not well documented.  It is too complex to explain in even a long blog post, because to understand checkpoints, you need to understand a lot of other things that InnoDB does.  I hope that explaining how InnoDB does checkpoints in high-level terms, with simplifications, will be helpful.  A lot of the simplifications are because I do not want to explain the complexities of how the simple rules can be tweaked for optimization purposes, while not violating the ACID guarantees they enforce.</p>

<p>A bit of background: <a href="http://www.amazon.com/dp/1558601902?tag=xaprb-20">Gray and Reuter&#8217;s classic text on transaction processing</a> introduced two types of checkpoints beginning on page 605.  There is a <strong>sharp checkpoint</strong>, and there is a <strong>fuzzy checkpoint</strong>.</p>

<p>A sharp checkpoint is accomplished by flushing all modified pages for committed transactions to disk, and writing down the log sequence number (LSN) of the most recent committed transaction.  Modified pages for uncommitted transactions should not be flushed &#8212; that would violate the rule of write-ahead logging.  (This is a deliberate and gross over-simplification; I will not draw attention to further simplifications I make.)  Upon recovery, log REDO can start from the LSN at which the checkpoint took place.  A sharp checkpoint is called &#8220;sharp&#8221; because everything that is flushed to disk for the checkpoint is consistent as of a single point in time &#8212; the checkpoint LSN.</p>

<p>A fuzzy checkpoint is more complex. It flushes pages as time passes, until it has flushed all pages that a sharp checkpoint would have flushed.  It completes by writing down two LSNs: the LSN at which the checkpoint started and the LSN at which it ended.  But the pages it flushed might not all be consistent with each other as of a single point in time, which is why it&#8217;s called &#8220;fuzzy.&#8221;  A page that got flushed early might have been modified since then, and a page that got flushed late might have a newer LSN than the starting LSN.  A fuzzy checkpoint can conceptually be converted into a sharp checkpoint by performing REDO from the starting LSN to the ending LSN.  Upon recovery, then, REDO can begin from the LSN at which the checkpoint started.</p>
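
<p>The recovery argument for a fuzzy checkpoint can be sketched in a few lines of Python.  This is a toy model, not InnoDB code: the names <code>write</code>, <code>flush</code>, and <code>recover</code> are hypothetical, every log record simply overwrites a whole page, and the model assumes that each page on disk remembers the LSN of the last change it contains, so that REDO can skip records the page already reflects.</p>

```python
# Toy model of a fuzzy checkpoint. All names are hypothetical.
log = []      # list of (lsn, page_id, value) records, in LSN order
memory = {}   # page_id -> (value, page_lsn): the buffer pool
disk = {}     # page_id -> (value, page_lsn): pages as last flushed


def write(page_id, value):
    """Log a change and apply it to the in-memory page only."""
    lsn = len(log) + 1
    log.append((lsn, page_id, value))
    memory[page_id] = (value, lsn)
    return lsn


def flush(page_id):
    """Copy the current in-memory version of a page to disk."""
    disk[page_id] = memory[page_id]


def recover(disk_pages, start_lsn):
    """Replay log records after start_lsn on top of the on-disk pages."""
    pages = dict(disk_pages)
    for lsn, page_id, value in log:
        page_lsn = pages.get(page_id, (None, 0))[1]
        # Skip records the page already contains (assumed page-LSN check).
        if lsn > start_lsn and lsn > page_lsn:
            pages[page_id] = (value, lsn)
    return pages


for page in ("A", "B", "C"):
    write(page, 1)              # LSNs 1, 2, 3: three dirty pages

start_lsn = len(log)            # the checkpoint starts at LSN 3
flush("A")
write("A", 2)                   # LSN 4: A is modified after its flush
flush("B")
write("C", 2)                   # LSN 5: C is modified before its flush
flush("C")
end_lsn = len(log)              # the checkpoint ends at LSN 5

recovered = recover(disk, start_lsn)
print(start_lsn, end_lsn, recovered == memory)   # prints: 3 5 True
```

<p>On disk, page A is as of LSN 1 while page C is as of LSN 5, so the flushed pages are not consistent with each other.  Replaying the log from the starting LSN still reproduces the in-memory state.</p>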

<p>It is often said that InnoDB does fuzzy checkpointing.  The truth is, it does both types.  When the database shuts down, it performs a sharp checkpoint.  During normal operation, it performs fuzzy checkpoints.  And InnoDB&#8217;s implementation of fuzzy checkpoints is not exactly the same as that described in Gray &amp; Reuter.</p>

<p>Here is where the weeds get deep: I will try to explain some of the subtleties that let InnoDB provide uniform quality of service by performing checkpoints almost constantly, instead of checkpoints being significant events that occur periodically.  It can be said that InnoDB actually never does a checkpoint during normal operation.  Instead, the state of the database on disk is a constantly advancing fuzzy checkpoint.  The advances are performed by regular flushing of dirty pages as a normal part of the database&#8217;s operation.  The details are far too many and complex to write here, in part because they have changed significantly as new versions have been released, but I will try to sketch the outline.</p>

<p>InnoDB maintains a large buffer pool in memory with many database pages, and doesn&#8217;t write modifications to disk immediately.  Instead, it keeps dirty pages in memory, hoping that they will be modified many times before they are written to disk.  This is called write combining, and is a performance optimization.  InnoDB keeps track of the pages in the buffer pool through several lists: the free list notes which pages are available to be used, the LRU list notes which pages have been used least recently, and the flush list contains all of the dirty pages in LSN order, least-recently-modified first.</p>

<p>These lists are all important, but for the simplified explanation here, I will focus on the flush list.  InnoDB has limited space in the buffer pool, and if there aren&#8217;t any free spots to store a page that InnoDB needs to read from disk, it must flush and free a dirty page to make room.  This is slow, so InnoDB tries to avoid the need for this by flushing dirty pages continually, keeping a reserve of clean pages that can be replaced without having to be flushed.  It flushes the oldest-modified pages from the flush list on a regular basis, trying to keep from hitting certain high-water marks.  It chooses the pages based on their physical locations on disk and their LSN (which is their modification time).</p>

<p>In addition to avoiding the high-water marks, InnoDB must avoid a very important low-water mark as well.  The transaction logs (aka REDO logs or write-ahead logs) in InnoDB are fixed-size, and are written in a circular fashion.  But spaces in the logs cannot be overwritten if they contain records of changes to a dirty page that hasn&#8217;t been flushed yet.  If that happened and the server crashed, all records of those changes would be lost.  Therefore, InnoDB has a limited amount of time to write out a given page&#8217;s modifications, because the ongoing transaction logging is hungry for space in the logs.  The size of the logs imposes the limit.  If the log writing activity wraps around in a circle and bumps into its own tail, it will cause a very bad server stall while InnoDB scrambles to free up some room in the logs.  This is why InnoDB generally chooses to flush in oldest-modification order: the oldest-modified pages are the furthest behind in the logs, and will be bumped into first.  The oldest unflushed dirty page&#8217;s LSN is the low-water mark in the transaction logs, and InnoDB tries to raise that low-water mark to keep as much room available in the transaction logs as it can.  Making the logs larger reduces the urgency of freeing up log space and permits various performance optimizations to do the flushing more efficiently.</p>
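
<p>The arithmetic behind the low-water mark is simple enough to write down.  The sketch below assumes that an LSN is a byte offset into the log stream and ignores log file headers and any safety margins a real server keeps; the function names and all of the sizes are made up for illustration.</p>

```python
MB = 1024 * 1024


def checkpoint_age(current_lsn, oldest_dirty_lsn):
    """Bytes of log that are still needed to recover unflushed pages."""
    return current_lsn - oldest_dirty_lsn


def free_log_space(log_capacity, current_lsn, oldest_dirty_lsn):
    """Bytes that can be logged before the head bumps into the tail."""
    return log_capacity - checkpoint_age(current_lsn, oldest_dirty_lsn)


current_lsn = 500 * MB
oldest_dirty_lsn = 492 * MB       # the low-water mark

# With 10MB of log, only 2MB of room is left.
print(free_log_space(10 * MB, current_lsn, oldest_dirty_lsn) // MB)    # 2

# Flushing the oldest pages raises the low-water mark and frees log space.
print(free_log_space(10 * MB, current_lsn, 498 * MB) // MB)            # 8

# A larger log leaves far more slack for the same amount of dirty data.
print(free_log_space(100 * MB, current_lsn, oldest_dirty_lsn) // MB)   # 92
```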

<p>And now, with that simplified explanation in place, we can understand how InnoDB actually makes a fuzzy checkpoint.  When InnoDB flushes dirty pages to disk, it finds the oldest dirty page&#8217;s LSN and treats that as the checkpoint low-water mark.  It then writes this to the transaction log header.  You can see this in the functions log_checkpoint_margin() and log_checkpoint(). Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system.  And that is how continual fuzzy checkpointing is implemented without ever &#8220;doing a checkpoint&#8221; as a separate event.  If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.</p>
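
<p>Here is a minimal sketch of that idea in Python.  It is not what <code>log_checkpoint()</code> actually does; the <code>FlushList</code> class and its methods are hypothetical, LSNs are simple counters, and the model assumes that an empty flush list means the checkpoint can advance to the current LSN.</p>

```python
from collections import OrderedDict


class FlushList:
    """Dirty pages in oldest-modification order (toy model)."""

    def __init__(self):
        self.pages = OrderedDict()   # page_id -> LSN that first dirtied it
        self.current_lsn = 0

    def modify(self, page_id):
        self.current_lsn += 1
        # A page joins the list when it is first dirtied.  Later changes
        # do not move it, so the list stays in oldest-modification order.
        self.pages.setdefault(page_id, self.current_lsn)

    def flush_oldest(self):
        """Write out the page at the head of the list, if there is one."""
        if self.pages:
            self.pages.popitem(last=False)

    def checkpoint_lsn(self):
        """The oldest LSN still needed for recovery."""
        if self.pages:
            return next(iter(self.pages.values()))
        return self.current_lsn      # assumption: nothing is dirty


flush_list = FlushList()
for page_id in (10, 11, 10, 12):     # page 10 is modified twice
    flush_list.modify(page_id)

print(flush_list.checkpoint_lsn())   # 1: page 10 was dirtied at LSN 1
flush_list.flush_oldest()
print(flush_list.checkpoint_lsn())   # 2: page 11 is now the oldest
flush_list.flush_oldest()
print(flush_list.checkpoint_lsn())   # 4: only page 12 remains dirty
```

<p>Nothing in the sketch ever &#8220;does a checkpoint&#8221;; the checkpoint LSN moves forward as a side effect of flushing from the head of the list.</p>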

<p>When InnoDB shuts down, it does some additional work.  First, it stops all updates to data; then it flushes all dirty buffers to disk; then it writes the current LSN to the transaction logs.  This is the sharp checkpoint.  Additionally, it writes the LSN to the first page of each data file in the database, which serves as a signal that they have been checkpointed up to that LSN.  This permits further optimizations during recovery and when opening these data files.</p>

<p>There is a lot more to study if you want to learn how it&#8217;s really done in detail; there are many fine points to the process.  This is one area where the usually excellent manual is a bit lacking.  Some of the best resources are as follows:</p>

<ul>
<li><a href="http://www.amazon.com/dp/1558601902?tag=xaprb-20">Gray and Reuter&#8217;s book</a></li>
<li><a href="http://www.facebook.com/note.php?note_id=408059000932">Mark Callaghan&#8217;s note on fuzzy checkpoints</a></li>
<li><a href="http://www.mysqlperformanceblog.com/2006/05/10/innodb-fuzzy-checkpointing-woes/">Peter Zaitsev&#8217;s post on why the flushing algorithm in older InnoDB used to cause spikes</a></li>
<li><a href="http://www.percona.com/ppc2009/PPC2009_Life_of_a_dirty_pageInnoDB_disk_IO.pdf">Mark Callaghan&#8217;s slides from Percona Performance Conference 2009</a></li>
</ul>

<p>A related topic that is equally complex is how InnoDB flushes dirty pages at the right speed to keep up with the database&#8217;s workload.  Too fast and the server does too much work; too slow and it gets behind and hurries to catch up, causing spikes of furious flushing activity and degraded quality of service.  Percona Server has arguably the most advanced and effective algorithms for this, in the XtraDB storage engine (a variant of InnoDB).  Percona Server calls it &#8220;adaptive checkpointing.&#8221;  InnoDB followed suit by implementing something similar, but harder to tune correctly.  InnoDB calls it &#8220;adaptive flushing,&#8221; which is a more accurate name.  Much (and I do mean much!) has been written about this.  I know that Vadim has done hundreds of benchmarks to analyze how flushing and checkpointing work, some of them many hours long to study long-term performance characteristics.  I will point you to a couple of pages that I think are the most succinct summaries of the implementation and how it performs:</p>

<ul>
<li><a href="http://dimitrik.free.fr/blog/archives/2010/07/mysql-performance-innodb-io-capacity-flushing.html">Dimitri Kravtchuk&#8217;s blog post about adaptive flushing and the innodb_io_capacity variable</a></li>
<li><a href="http://www.mysqlperformanceblog.com/2011/01/03/mysql-5-5-8-in-search-of-stability/">Vadim&#8217;s benchmarks of Percona Server and MySQL 5.5.8</a>, showing how to tune so that &#8220;adaptive flushing&#8221; works well</li>
<li><a href="http://www.percona.com/docs/wiki/percona-server:features:innodb_io">Percona Server documentation for adaptive checkpointing</a></li>
<li><a href="http://www.xaprb.com/blog/2010/05/25/dirty-pages-fast-shutdown-and-write-combining/">My own blog post about balancing dirty page flushing and write combining</a></li>
</ul>

<p>If these types of topics interest you, you should attend <a href="http://www.percona.com/events/percona-live-san-francisco-2011/">Percona Live</a> in San Francisco in a couple of weeks.  Both Peter Zaitsev and Mark Callaghan will be speaking there on topics such as InnoDB internals, along with a variety of other speakers; there is a several-hour class on InnoDB internals.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2007/09/18/how-to-debug-innodb-lock-waits/' rel='bookmark' title='Permanent Link: How to debug InnoDB lock waits'>How to debug InnoDB lock waits</a></li>
<li><a href='http://www.xaprb.com/blog/2007/12/21/how-i-patched-innodb-to-show-locks-held/' rel='bookmark' title='Permanent Link: How I patched InnoDB to show locks held'>How I patched InnoDB to show locks held</a></li>
<li><a href='http://www.xaprb.com/blog/2009/10/25/what-do-the-innodb-insert-buffer-statistics-mean/' rel='bookmark' title='Permanent Link: What do the InnoDB insert buffer statistics mean?'>What do the InnoDB insert buffer statistics mean?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/13/innodb-is-a-nosql-database/' rel='bookmark' title='Permanent Link: InnoDB is a NoSQL database'>InnoDB is a NoSQL database</a></li>
<li><a href='http://www.xaprb.com/blog/2010/03/04/a-growing-trend-innodb-mutex-contention/' rel='bookmark' title='Permanent Link: A growing trend: InnoDB mutex contention'>A growing trend: InnoDB mutex contention</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2011/01/29/how-innodb-performs-a-checkpoint/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Free webinar on MySQL performance this Thursday</title>
		<link>http://www.xaprb.com/blog/2010/08/23/free-webinar-on-mysql-performance-this-thursday/</link>
		<comments>http://www.xaprb.com/blog/2010/08/23/free-webinar-on-mysql-performance-this-thursday/#comments</comments>
		<pubDate>Mon, 23 Aug 2010 16:35:26 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[InnoDB]]></category>
		<category><![CDATA[ODTUG]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[Webinar]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=1984</guid>
		<description><![CDATA[ODTUG invited me to give a webinar and I said yes, so this Thursday you&#8217;re invited to join me as I talk about MySQL performance. We&#8217;ve come a very long way towards a MySQL that can perform well on modern hardware, and there really isn&#8217;t broad recognition of this. A lot of the best work [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2011/10/13/free-webinar-on-preventing-mysql-downtime/' rel='bookmark' title='Permanent Link: Free webinar on preventing MySQL downtime'>Free webinar on preventing MySQL downtime</a></li>
<li><a href='http://www.xaprb.com/blog/2012/01/16/free-webinar-wednesday-verifying-replication-integrity/' rel='bookmark' title='Permanent Link: Free webinar Wednesday: verifying replication integrity'>Free webinar Wednesday: verifying replication integrity</a></li>
<li><a href='http://www.xaprb.com/blog/2008/04/16/get-a-free-sample-chapter-of-high-performance-mysql-second-edition/' rel='bookmark' title='Permanent Link: Get a free sample chapter of High Performance MySQL Second Edition'>Get a free sample chapter of High Performance MySQL Second Edition</a></li>
<li><a href='http://www.xaprb.com/blog/2012/02/07/three-free-mysql-webinars/' rel='bookmark' title='Permanent Link: Three free MySQL webinars'>Three free MySQL webinars</a></li>
<li><a href='http://www.xaprb.com/blog/2010/08/16/speaking-at-novarug-on-thursday/' rel='bookmark' title='Permanent Link: Speaking at NovaRUG on Thursday'>Speaking at NovaRUG on Thursday</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>ODTUG invited me to give a webinar and I said yes, so this Thursday you&#8217;re invited to join me as I talk about MySQL performance.  We&#8217;ve come a very long way towards a MySQL that can perform well on modern hardware, and there really isn&#8217;t broad recognition of this.  A lot of the best work has gone into the InnoDB &#8220;plugin&#8221; storage engine, which was announced after my co-authors and I sent <a href="http://tinyurl.com/highperfmysql">High Performance MySQL</a> to the press.  I will explain what you should be doing differently now than you did two years ago, and suggest a performance-in-a-nutshell configuration baseline for MySQL that&#8217;s quite different from what I&#8217;d have said in 2008. You can <a href="https://www2.gotomeeting.com/register/470088995">register for free through GoToWebinar</a>.  See you there.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2011/10/13/free-webinar-on-preventing-mysql-downtime/' rel='bookmark' title='Permanent Link: Free webinar on preventing MySQL downtime'>Free webinar on preventing MySQL downtime</a></li>
<li><a href='http://www.xaprb.com/blog/2012/01/16/free-webinar-wednesday-verifying-replication-integrity/' rel='bookmark' title='Permanent Link: Free webinar Wednesday: verifying replication integrity'>Free webinar Wednesday: verifying replication integrity</a></li>
<li><a href='http://www.xaprb.com/blog/2008/04/16/get-a-free-sample-chapter-of-high-performance-mysql-second-edition/' rel='bookmark' title='Permanent Link: Get a free sample chapter of High Performance MySQL Second Edition'>Get a free sample chapter of High Performance MySQL Second Edition</a></li>
<li><a href='http://www.xaprb.com/blog/2012/02/07/three-free-mysql-webinars/' rel='bookmark' title='Permanent Link: Three free MySQL webinars'>Three free MySQL webinars</a></li>
<li><a href='http://www.xaprb.com/blog/2010/08/16/speaking-at-novarug-on-thursday/' rel='bookmark' title='Permanent Link: Speaking at NovaRUG on Thursday'>Speaking at NovaRUG on Thursday</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2010/08/23/free-webinar-on-mysql-performance-this-thursday/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The new hotness in open-core: InnoDB</title>
		<link>http://www.xaprb.com/blog/2010/07/02/the-new-hotness-in-open-core-innodb/</link>
		<comments>http://www.xaprb.com/blog/2010/07/02/the-new-hotness-in-open-core-innodb/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 17:58:35 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[Commentary]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Domas Mituzas]]></category>
		<category><![CDATA[InnoDB]]></category>
		<category><![CDATA[Open Core]]></category>
		<category><![CDATA[Xtrabackup]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=1914</guid>
		<description><![CDATA[There&#8217;s lots of buzz lately about the so-called &#8220;open-core&#8221; business model of Marten Mickos&#8217;s new employer. But this is nothing new. Depending on how you define it, InnoDB is &#8220;open-core,&#8221; and has been for a long time. The InnoDB Hot Backup (ibbackup) tool was always closed-source. Did anyone ever cry foul and claim that this [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2008/12/23/does-mysql-really-have-an-open-source-business-model/' rel='bookmark' title='Permanent Link: Does MySQL really have an open-source business model?'>Does MySQL really have an open-source business model?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/06/08/xtrabackup-is-for-innodb-tables-too-not-just-xtradb/' rel='bookmark' title='Permanent Link: Xtrabackup is for InnoDB tables too, not just XtraDB'>Xtrabackup is for InnoDB tables too, not just XtraDB</a></li>
<li><a href='http://www.xaprb.com/blog/2011/07/04/measuring-open-source-success-by-jobs/' rel='bookmark' title='Permanent Link: Measuring open-source success by jobs'>Measuring open-source success by jobs</a></li>
<li><a href='http://www.xaprb.com/blog/2009/04/29/what-does-an-open-source-sales-model-look-like/' rel='bookmark' title='Permanent Link: What does an open source sales model look like?'>What does an open source sales model look like?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/03/08/making-maatkit-more-open-source-one-step-at-a-time/' rel='bookmark' title='Permanent Link: Making Maatkit more Open Source one step at a time'>Making Maatkit more Open Source one step at a time</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s lots of buzz lately about the so-called <a href="http://www.computerworlduk.com/community/blogs/index.cfm?entryid=3048&#038;blogid=41">&#8220;open-core&#8221; business model of Marten Mickos&#8217;s new employer</a>. But this is nothing new.  Depending on how you define it, InnoDB is &#8220;open-core,&#8221; and has been for a long time.  The InnoDB Hot Backup (ibbackup) tool was always closed-source.  Did anyone ever cry foul and claim that this made InnoDB itself not open-source, or accuse Innobase / Oracle of masquerading as open-source?  I don&#8217;t recall that happening, although sometimes people got suspicious about <a href="http://mituzas.lt/2010/05/08/on-hot-backups/">the interplay between the backup tool and the storage engine</a>.  Generally, though, the people I know who use InnoDB Hot Backup have no gripes about paying for it.</p>

<p>What is the difference between open-source with closed-source accessories, and crippleware?  I think it depends on how people define the core functionality of software.  Some might say that backup is core functionality for a database; and others would point to mysqldump and say that InnoDB isn&#8217;t crippleware as long as there is <em>some</em> alternative.</p>

<p>I think InnoDB is an interesting case that illustrates what can happen when commercial and GPL play together.  Part of that story is <a href="http://www.mysqlperformanceblog.com/2009/02/24/xtrabackup-open-source-alternative-for-innodb-hot-backup-call-for-ideas/">the appearance of XtraBackup</a>, an open-source competitor to InnoDB Hot Backup.  Everyone&#8217;s subject to the rules of the game, unless they restrict the &#8220;core,&#8221; which would make it non-open-source to begin with.</p>

<p><strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2008/12/23/does-mysql-really-have-an-open-source-business-model/' rel='bookmark' title='Permanent Link: Does MySQL really have an open-source business model?'>Does MySQL really have an open-source business model?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/06/08/xtrabackup-is-for-innodb-tables-too-not-just-xtradb/' rel='bookmark' title='Permanent Link: Xtrabackup is for InnoDB tables too, not just XtraDB'>Xtrabackup is for InnoDB tables too, not just XtraDB</a></li>
<li><a href='http://www.xaprb.com/blog/2011/07/04/measuring-open-source-success-by-jobs/' rel='bookmark' title='Permanent Link: Measuring open-source success by jobs'>Measuring open-source success by jobs</a></li>
<li><a href='http://www.xaprb.com/blog/2009/04/29/what-does-an-open-source-sales-model-look-like/' rel='bookmark' title='Permanent Link: What does an open source sales model look like?'>What does an open source sales model look like?</a></li>
<li><a href='http://www.xaprb.com/blog/2009/03/08/making-maatkit-more-open-source-one-step-at-a-time/' rel='bookmark' title='Permanent Link: Making Maatkit more Open Source one step at a time'>Making Maatkit more Open Source one step at a time</a></li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2010/07/02/the-new-hotness-in-open-core-innodb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dirty pages, fast shutdown, and write combining</title>
		<link>http://www.xaprb.com/blog/2010/05/25/dirty-pages-fast-shutdown-and-write-combining/</link>
		<comments>http://www.xaprb.com/blog/2010/05/25/dirty-pages-fast-shutdown-and-write-combining/#comments</comments>
		<pubDate>Wed, 26 May 2010 02:24:19 +0000</pubDate>
		<dc:creator>Xaprb</dc:creator>
				<category><![CDATA[SQL]]></category>
		<category><![CDATA[InnoDB]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=1873</guid>
		<description><![CDATA[One of the things that makes a traditional transactional database hard to make highly available is a relatively slow shutdown and start-up time. Applications typically delegate most or all writes to the database, which tends to run with a lot of &#8220;dirty&#8221; data in its (often large) memory. At shutdown time, the dirty memory needs [...]


<strong>Further Reading:</strong><ul><li><a href='http://www.xaprb.com/blog/2007/05/12/how-fast-is-mysql-table-checksum/' rel='bookmark' title='Permanent Link: How fast is MySQL Table Checksum?'>How fast is MySQL Table Checksum?</a></li>
<li><a href='http://www.xaprb.com/blog/2008/03/09/a-very-fast-fnv-hash-function-for-mysql/' rel='bookmark' title='Permanent Link: A very fast FNV hash function for MySQL'>A very fast FNV hash function for MySQL</a></li>
<li><a href='http://www.xaprb.com/blog/2007/10/23/how-fast-is-mysql-replication/' rel='bookmark' title='Permanent Link: How fast is MySQL replication?'>How fast is MySQL replication?</a></li>
<li><a href='http://www.xaprb.com/blog/2008/05/21/get-maatkit-fast-command-line/' rel='bookmark' title='Permanent Link: Get Maatkit fast from the command line'>Get Maatkit fast from the command line</a></li>
<li><a href='http://www.xaprb.com/blog/2009/12/19/how-to-write-a-good-mysql-conference-proposal/' rel='bookmark' title='Permanent Link: How to write a good MySQL conference proposal'>How to write a good MySQL conference proposal</a></li>
</ul>]]></description>
			<content:encoded><![CDATA[<p>One of the things that makes a traditional transactional database hard to make highly available is a relatively slow shutdown and start-up time.  Applications typically delegate most or all writes to the database, which tends to run with a lot of &#8220;dirty&#8221; data in its (often large) memory.  At shutdown time, the dirty memory needs to be written to disk, so the recovery routine doesn&#8217;t have to run at startup.  And even upon a clean startup, the database probably has to warm up, which can also take a very long time.</p>

<p>Some databases let the operating system handle most of their memory management needs.  This has its own challenges, especially if the <a href="http://blog.2ndquadrant.com/en/2010/05/postgresql-freebsd-and-free-do.html">operating system&#8217;s design doesn&#8217;t align exactly with the database&#8217;s goals</a>.  Other databases take matters into their own hands.  InnoDB (the de facto transactional MySQL storage engine) falls into this category; when properly configured to take advantage of modern hardware, it will use basically all of the server&#8217;s memory in a huge buffer pool, with files opened in O_DIRECT mode, bypassing the operating system for I/O operations.</p>

<p>The design choices, and the results, are worth thinking about.  Assuming you shut down and restart infrequently, the choice to hold a lot of dirty memory has huge performance benefits, which have to be balanced against the desire for fast shutdown and recovery.  In InnoDB, there are a few things you can configure that change the startup and shutdown behaviors, but you should understand the performance effects during normal operation.</p>

<p>First, let&#8217;s look at why it&#8217;s nice to run with lots of dirty data in memory.</p>

<h3>Write combining</h3>

<p>Most databases have a concept called a page, buffer, or block.  This is a physical unit of data, which can typically store many logical units (rows).  InnoDB defaults to 16KB pages of data.  Imagine that your typical row is only 80 bytes long.  Roughly 200 such rows can fit into a 16KB page, ignoring page and row overhead.</p>

<p>Suppose you insert, update, or delete a row.  Should InnoDB write the result to disk right away?  If it does, it has to write the entire 16KB page, plus any affected index pages and so forth, which can add up to a lot of pages.  That&#8217;s a lot of work for a little bitty 80-byte row!  Instead, InnoDB leaves the pages dirty in its memory.  When you commit the transaction, the write-ahead log ensures that if there&#8217;s a crash, the change is still permanent.  (The log has very compact entries and is not page-oriented.)</p>

<p>Now suppose you make another little change.  In many cases, there&#8217;s a decent probability that both of the changes touched the same page(s).  In fact, if you had the <a href="http://www.percona.com/docs/wiki/patches:innodb_io_pattern">statistics</a> to prove it, you would probably see that the vast majority of your changes focus on a small fraction of the total pages, or even <a href="http://www.facebook.com/note.php?note_id=392581440932">a small fraction of the rows</a>.  Most workloads have a very tall head and a very long tail.  Tens, hundreds, even thousands of times more changes go to those same pages and rows, as compared to the less-active ones.</p>

<p>Eventually, our favorite &#8220;hot page&#8221; gets written so a checkpoint can complete.  Tons of changes were written in a single write.  This is <em>write combining</em>, and it&#8217;s a huge efficiency.  Huge!  Servers can accept many tens of thousands of writes per second, and guarantee ACID properties, because of write combining.  If they didn&#8217;t combine writes, they&#8217;d be asked to do many more I/O operations per second.</p>
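<p>To get a feel for the size of the effect, here is a toy simulation in Python.  It is only a sketch under made-up assumptions (10,000 single-row writes, 1,000 pages, 90% of the writes landing on 10 hot pages, one checkpoint at the end).  None of these numbers come from a real workload, and it does not model InnoDB&#8217;s actual flushing algorithm.</p>

```python
import random


def simulate_write_combining(num_writes=10_000, num_pages=1_000,
                             hot_pages=10, hot_fraction=0.9, seed=42):
    """Return (page writes without combining, page writes with combining).

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    dirty_pages = set()
    for _ in range(num_writes):
        if hot_fraction > rng.random():
            page = rng.randrange(hot_pages)              # the tall head
        else:
            page = rng.randrange(hot_pages, num_pages)   # the long tail
        dirty_pages.add(page)
    # Without combining, every row change costs one page write.
    # With combining, each dirty page is written once at the checkpoint.
    return num_writes, len(dirty_pages)


if __name__ == "__main__":
    naive, combined = simulate_write_combining()
    print(f"page writes without combining: {naive}")
    print(f"page writes with combining:    {combined}")
```

<p>Under these assumptions the combined count can never exceed the 1,000 pages that exist, so the checkpoint does at most a tenth of the page writes that write-through would have done.</p>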

<h3>Dirty pages and the long tail</h3>

<p>The downside to this is the number of dirty pages in memory, which have to be written out during shutdown.  Shutdown is equivalent to a forced checkpoint.  The server has been lazily delaying lots of work, because it knows it&#8217;s going to be able to combine writes.  Suddenly, all the bills come due at once: time to write tons of data to disk!  And the problem here is that the server&#8217;s memory can actually be mostly dirty data.  By default, InnoDB lets the buffer pool get up to 90% dirty before it starts to get worried and work hard to flush pages.</p>
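<p>A back-of-the-envelope estimate shows why this matters.  The sketch below assumes a 32GiB buffer pool, the 90% dirty limit, and a disk subsystem that sustains 2,000 page writes per second.  The buffer pool size and the write rate are hypothetical figures chosen for illustration, and the model ignores everything else that happens at shutdown.</p>

```python
PAGE_SIZE = 16 * 1024  # InnoDB default page size, in bytes


def shutdown_flush_estimate(buffer_pool_bytes, dirty_fraction, pages_per_second):
    """Return (dirty pages, seconds needed to flush them all at shutdown)."""
    total_pages = buffer_pool_bytes // PAGE_SIZE
    dirty_pages = int(total_pages * dirty_fraction)
    return dirty_pages, dirty_pages / pages_per_second


if __name__ == "__main__":
    # Hypothetical server: 32 GiB buffer pool, 90% dirty, 2,000 page writes/s.
    pages, seconds = shutdown_flush_estimate(32 * 2**30, 0.90, 2_000)
    print(f"{pages} dirty pages, about {seconds / 60:.0f} minutes to flush")
```

<p>With those inputs there are nearly 1.9 million dirty pages, and flushing them takes roughly a quarter of an hour.</p>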

<p>If most writes go to the hottest pages, why should there be so many dirty pages?  The answer is the long tail.  The few writes that don&#8217;t go to the tall head go to a very scattered long tail.  Again, this is hard to prove, but many of those one-off writes are dirtying entire pages just for themselves, and those pages will not be dirtied by any other writes.  So the long tail is full of 16KB pages that had only 80 bytes written to them.  This ends up being a lot of pages of data.</p>
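<p>The arithmetic behind that last point can be sketched in a few lines of Python.  The page and row sizes are the ones used in this post; the count of 100,000 one-off writes is an arbitrary assumption, and the model pretends that each of those writes lands on its own page.</p>

```python
PAGE_SIZE = 16 * 1024   # InnoDB default page size, in bytes
ROW_SIZE = 80           # the example row size used in this post


def tail_flush_cost(one_off_writes):
    """Return (bytes changed, bytes flushed) when every write dirties its own page."""
    changed = one_off_writes * ROW_SIZE
    flushed = one_off_writes * PAGE_SIZE
    return changed, flushed


if __name__ == "__main__":
    changed, flushed = tail_flush_cost(100_000)  # hypothetical tail size
    print(f"changed: {changed / 2**20:.1f} MiB")
    print(f"flushed: {flushed / 2**30:.2f} GiB")
    print(f"amplification: {flushed / changed:.1f}x")
```

<p>Each long-tail page costs about 205 times as many bytes to flush as were actually changed on it.</p>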

<h3>Fast shutdown on demand</h3>

<p>If you want your database to be able to shut down quickly if needed, what can you do about this?  This is a tough question to answer.  There are a few different strategies you might take.</p>

<ul>
<li>You can configure InnoDB to keep the dirty pages to a minimum.  The problem is, it starts to do a lot less write combining.  Take an average web application&#8217;s database and lower the dirty page percent, and watch the disk activity.  It will go through the roof.  It starts furiously flushing pages, only to turn around and flush the same pages again an instant later.  InnoDB isn&#8217;t particularly designed or optimized for this, by the way.  Things will suffer.  However, this is actually a useful technique for a planned fast shutdown.</li>
<li>You can lower the page size.  If you make the pages smaller, then in theory you&#8217;ll do less work flushing those long-tail pages.  Be careful with this!  There is research (actual math, mind you) indicating that InnoDB&#8217;s default page size is already too small, and there isn&#8217;t a lot of real-world experience with non-default page sizes.  The Tokutek folks know a lot about the math, by the way.</li>
<li>You can configure InnoDB not to flush dirty pages before a shutdown (innodb_fast_shutdown=2).  This is essentially the same thing as shutting down without a checkpoint, which is the same as crashing.  The recovery routine will have to run at startup before the database becomes available.  Recovery is likely to take much longer than a clean shutdown would have, due to the mechanics of crash recovery.</li>
<li>You might be able to make InnoDB capable of doing a lot more flushing by upgrading to a version that has separate threads for this purpose, and/or using native asynchronous I/O.  This might not really help in shutdown; to tell the truth, I haven&#8217;t checked it.</li>
</ul>
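<p>The first strategy in the list above, used only as preparation for a planned shutdown, might look something like the following Python sketch.  It lowers innodb_max_dirty_pages_pct and then polls the Innodb_buffer_pool_pages_dirty status counter until it drops.  The <em>run_query</em> callable, the target of 1,000 pages, and the polling limits are all hypothetical; plug in whatever client library and thresholds suit your setup.</p>

```python
import time


def prepare_fast_shutdown(run_query, target_dirty_pages=1000,
                          poll_seconds=1.0, max_polls=600, sleep=time.sleep):
    """Ask InnoDB to flush aggressively, then wait until few dirty pages remain.

    run_query is a hypothetical callable supplied by the caller: it takes a
    SQL string and returns a list of row tuples (empty if there is no result).
    """
    run_query("SET GLOBAL innodb_max_dirty_pages_pct = 0")
    for _ in range(max_polls):
        rows = run_query(
            "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'")
        dirty = int(rows[0][1])  # each row is (Variable_name, Value)
        if target_dirty_pages >= dirty:
            return dirty         # now a normal shutdown should be quick
        sleep(poll_seconds)
    raise TimeoutError("dirty pages did not fall to the target in time")
```

<p>Expect disk activity to climb while this runs, for the reasons given in the first bullet, so it is something to do just before the shutdown rather than all the time.</p>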

<h3>No free lunch</h3>

<p>InnoDB is a complex system that is trying to balance a lot of different factors for efficiency, while giving nice ACID properties.  And it&#8217;s actually doing a pretty decent job of it by default.  When you say you&#8217;d like more or less of such-and-such performance characteristic, then something else gets traded off.  This is a really hard problem, and I&#8217;m not aware of anyone who has a brilliant solution to it, although I am far from a database research specialist.</p>

<p>Even the question of how much data to write, and how quickly, is a hard one.  It&#8217;s hard and expensive to really answer accurately because the <em>real</em> answer requires knowledge of things such as the frequency and distribution of page dirtying.  Therefore, InnoDB kind of avoids this and lets you configure its &#8220;I/O capacity&#8221; and &#8220;dirty page percent&#8221; and maybe a few other things, depending on which version you use.  These are just models that approximate the true answers to the real questions.  All models are wrong.  Some models are useful.  InnoDB employs useful models that work a lot of the time.</p>

]]></content:encoded>
			<wfw:commentRss>http://www.xaprb.com/blog/2010/05/25/dirty-pages-fast-shutdown-and-write-combining/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

