<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.2.2" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: MySQL&#8217;s FEDERATED storage engine: Part 2</title>
	<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/</link>
	<description>Stay curious!</description>
	<pubDate>Fri, 08 Aug 2008 18:59:09 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.2.2</generator>

	<item>
		<title>By: Xaprb</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4899</link>
		<author>Xaprb</author>
		<pubDate>Fri, 09 Mar 2007 22:12:48 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4899</guid>
		<description>&lt;p&gt;I think the closest thing MySQL offers is NDB Cluster.  You may also look into partitioned tables (in version 5.1 only), but that is single-server only.  I have no experience with either approach myself, but I'm sure some people on the #mysql IRC channel do.  I hope this helps!&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>I think the closest thing MySQL offers is NDB Cluster.  You may also look into partitioned tables (in version 5.1 only), but that is single-server only.  I have no experience with either approach myself, but I&#8217;m sure some people on the #mysql IRC channel do.  I hope this helps!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hongliu Li</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4897</link>
		<author>Hongliu Li</author>
		<pubDate>Fri, 09 Mar 2007 19:28:19 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4897</guid>
		<description>&lt;p&gt;I want to find a way to search across MySQL servers partitioned by user id (we might have tens of millions of users). That is, each MySQL server has a user_info table.

For example, if a search for a user name is entered, how do I construct a query that will be executed on all MySQL servers, then combine the results into one set and return it to the client application? I think Federation is not a solution, but can you suggest a practical one?&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>I want to find a way to search across MySQL servers partitioned by user id (we might have tens of millions of users). That is, each MySQL server has a user_info table.</p>
<p>For example, if a search for a user name is entered, how do I construct a query that will be executed on all MySQL servers, then combine the results into one set and return it to the client application? I think Federation is not a solution, but can you suggest a practical one?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Xaprb</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3793</link>
		<author>Xaprb</author>
		<pubDate>Fri, 02 Feb 2007 17:10:39 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3793</guid>
		<description>&lt;p&gt;Hi Patrick, thanks for writing in.&lt;/p&gt;

&lt;p&gt;About defining more indexes so WHERE clauses don't get stripped: that could help somewhat, but I think it will still choose only one of the possible indexes.  It doesn't have statistics to decide which is best.&lt;/p&gt;

&lt;p&gt;In general I would say push down every condition that is valid on the remote server, without regard to whether it will be useful; let the remote server decide that.&lt;/p&gt;

&lt;p&gt;I don't fully understand the requirements for checking the remote server's data before update/delete; why not just push down the conditions and report back a row count affected?  I would say the read-before-write is unnecessary.  But I'm not an expert on data federation! (I'll read the article you referenced, thanks for the link).  I think I recall that storage engines are designed to always read before write in MySQL, so maybe this is non-trivial to solve.&lt;/p&gt;

&lt;p&gt;Thanks for your good work!&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Hi Patrick, thanks for writing in.</p>
<p>About defining more indexes so WHERE clauses don&#8217;t get stripped: that could help somewhat, but I think it will still choose only one of the possible indexes.  It doesn&#8217;t have statistics to decide which is best.</p>
<p>In general I would say push down every condition that is valid on the remote server, without regard to whether it will be useful; let the remote server decide that.</p>
<p>I don&#8217;t fully understand the requirements for checking the remote server&#8217;s data before update/delete; why not just push down the conditions and report back a row count affected?  I would say the read-before-write is unnecessary.  But I&#8217;m not an expert on data federation! (I&#8217;ll read the article you referenced, thanks for the link).  I think I recall that storage engines are designed to always read before write in MySQL, so maybe this is non-trivial to solve.</p>
<p>Thanks for your good work!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Patrick Galbraith</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3791</link>
		<author>Patrick Galbraith</author>
		<pubDate>Fri, 02 Feb 2007 16:39:21 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3791</guid>
		<description>&lt;p&gt;Dear Xaprb,&lt;/p&gt;

&lt;p&gt;Being the author of Federated, I agree with how you describe the way Federated works and some of its limitations. There remains a lot of work to do to enhance Federated. One of the first things you bring up is how Federated retrieves all data. This is certainly true for select count(*) as well as queries not using indexes. However, if you were to utilise indexes, then you avoid retrieving all rows. You can define indexes on the federated table even if they aren't defined remotely. What happens is that the query being 'built' within Federated utilises this column in a where clause, resulting in a query that returns only the rows you want. This is a bit of a hack, I admit, but gives you the ability to avoid getting all rows retrieved. With this all in mind, I'm currently working on a patch that implements pushdown conditions on non-indexed columns. This will make it so queries with where clauses on non-indexed columns result in specific rows being returned as opposed to the whole table. I'm not sure about how things like select count(*) are pushed down, but it would be great to be able to construct a query that only ran the 'select count(*)' remotely and returned that simple number!&lt;/p&gt;

&lt;p&gt;The next issue you mention is update and delete, in how it has to retrieve the data first before modifying it. This has to do again with how the remote query to be issued by Federated is to be built. In a nutshell, there is a loop that goes through all the fields in the table, appends the field names, and then appends the values. It would be nice if it were to somehow only retrieve the specific row(s) it intends to either delete or update and use that/those row(s) to build its query, and perhaps intelligently build a query that would append the ids into a query using ".. IN (...)" or using a range.&lt;/p&gt;

&lt;p&gt;Then there is INSERT, where you talk about how in this case it doesn't check the table first before inserting. This I admit could be problematic in not knowing if perhaps someone on the remote end inserts the same data, resulting in a conflict. I'm not sure how I would deal with this. One of the main features of Federated Databases is that they have to maintain the remote data source's autonomy, "that the operation of the source is not affected when it is brought into a federation" (see the good article http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html). How to insert data that might conflict with data being inserted locally is an important consideration in how inserts should occur. Should the Federated storage engine make sure that there is no conflict? I would say most likely, which then requires more functionality be added to the engine to ensure that data being inserted is not going to conflict, just as the update and delete do currently.&lt;/p&gt;

&lt;p&gt;Another possible improvement that can be added to INSERT (write_row) is using bulk inserts for multiple rows.&lt;/p&gt;

&lt;p&gt;About the weaknesses, it's good to be critical. In being critical, there arises a list of improvements to be made with Federated.&lt;/p&gt;

&lt;p&gt;* The first weakness you mention will hopefully be addressed with the addition of pushdown conditions.&lt;/p&gt;

&lt;p&gt;* The second issue of index optimisations adds the need to perhaps abstract some of Federated to a higher level than the storage engine itself, perhaps the query processor. I can say that currently there is no easy way to push down joins, because Federated deals with data one table at a time, as a storage engine by design handles only one table at a time.&lt;/p&gt;

&lt;p&gt;* The issue with EXPLAINing a query against a federated table resulting in a remote query - I'm not sure how you would handle this. EXPLAIN gives information on how a query will be executed. How can we know how this will be executed without having to do some sort of query on the remote server since Federated is all about knowing how to deal with remote data? Good question.&lt;/p&gt;

&lt;p&gt;* No "memory" of what data has been fetched. Again, this is something I'm not sure, offhand, how to deal with. Is this something that cursors could be employed to deal with? Or again, would having the Federation functionality be above the storage engine deal with this issue better? (I ask this not to you specifically but as a means of openly thinking aloud)&lt;/p&gt;

&lt;p&gt;Furthermore, the issue of moving huge amounts of data; I think that cursors would be a possible solution to this so that the issue of fetching all rows and running out of memory isn't as much an issue. Some of the poor query optimisation can be improved by using pushdown conditions so that the predicates aren't stripped from the where clause. Send as much information as is needed by the remote server.&lt;/p&gt;

&lt;p&gt;The other issue is auto-discovery of remote tables, which will allow one to create a Federated schema, and the federated tables to be created with the same exact definitions (except of course engine type) as the remote tables.&lt;/p&gt;

&lt;p&gt;About Marketing speak. I haven't read all of it, and will be the first to say that Federated as it is, is a first generation, first release that is intended to get the idea out there in a simple working model, and generate feedback as your post here is doing.&lt;/p&gt;

&lt;p&gt;I intend to start doing more development on Federated than I had been doing over the past year. Ideas, criticisms, and advice such as yours help me to think about what users and developers would like to see out of it and to prioritise what features need to happen sooner rather than later. I appreciate your article, and feel free to contact me any time with suggestions, patches, any form of help.&lt;/p&gt;

&lt;p&gt;Thanks much! &lt;/p&gt;

&lt;p&gt;--Patrick&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Dear Xaprb,</p>
<p>Being the author of Federated, I agree with how you describe the way Federated works and some of its limitations. There remains a lot of work to do to enhance Federated. One of the first things you bring up is how Federated retrieves all data. This is certainly true for select count(*) as well as queries not using indexes. However, if you were to utilise indexes, then you avoid retrieving all rows. You can define indexes on the federated table even if they aren&#8217;t defined remotely. What happens is that the query being &#8216;built&#8217; within Federated utilises this column in a where clause, resulting in a query that returns only the rows you want. This is a bit of a hack, I admit, but gives you the ability to avoid getting all rows retrieved. With this all in mind, I&#8217;m currently working on a patch that implements pushdown conditions on non-indexed columns. This will make it so queries with where clauses on non-indexed columns result in specific rows being returned as opposed to the whole table. I&#8217;m not sure about how things like select count(*) are pushed down, but it would be great to be able to construct a query that only ran the &#8216;select count(*)&#8217; remotely and returned that simple number!</p>
<p>The next issue you mention is update and delete, in how it has to retrieve the data first before modifying it. This has to do again with how the remote query to be issued by Federated is to be built. In a nutshell, there is a loop that goes through all the fields in the table, appends the field names, and then appends the values. It would be nice if it were to somehow only retrieve the specific row(s) it intends to either delete or update and use that/those row(s) to build its query, and perhaps intelligently build a query that would append the ids into a query using &#8220;.. IN (&#8230;)&#8221; or using a range.</p>
<p>Then there is INSERT, where you talk about how in this case it doesn&#8217;t check the table first before inserting. This I admit could be problematic in not knowing if perhaps someone on the remote end inserts the same data, resulting in a conflict. I&#8217;m not sure how I would deal with this. One of the main features of Federated Databases is that they have to maintain the remote data source&#8217;s autonomy, &#8220;that the operation of the source is not affected when it is brought into a federation&#8221; (see the good article <a href="http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html" rel="nofollow">http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html</a>). How to insert data that might conflict with data being inserted locally is an important consideration in how inserts should occur. Should the Federated storage engine make sure that there is no conflict? I would say most likely, which then requires more functionality be added to the engine to ensure that data being inserted is not going to conflict, just as the update and delete do currently.</p>
<p>Another possible improvement that can be added to INSERT (write_row) is using bulk inserts for multiple rows.</p>
<p>About the weaknesses, it&#8217;s good to be critical. In being critical, there arises a list of improvements to be made with Federated.</p>
<p>* The first weakness you mention will hopefully be addressed with the addition of pushdown conditions.</p>
<p>* The second issue of index optimisations adds the need to perhaps abstract some of Federated to a higher level than the storage engine itself, perhaps the query processor. I can say that currently there is no easy way to push down joins, because Federated deals with data one table at a time, as a storage engine by design handles only one table at a time.</p>
<p>* The issue with EXPLAINing a query against a federated table resulting in a remote query - I&#8217;m not sure how you would handle this. EXPLAIN gives information on how a query will be executed. How can we know how this will be executed without having to do some sort of query on the remote server since Federated is all about knowing how to deal with remote data? Good question.</p>
<p>* No &#8220;memory&#8221; of what data has been fetched. Again, this is something I&#8217;m not sure, offhand, how to deal with. Is this something that cursors could be employed to deal with? Or again, would having the Federation functionality be above the storage engine deal with this issue better? (I ask this not to you specifically but as a means of openly thinking aloud)</p>
<p>Furthermore, the issue of moving huge amounts of data; I think that cursors would be a possible solution to this so that the issue of fetching all rows and running out of memory isn&#8217;t as much an issue. Some of the poor query optimisation can be improved by using pushdown conditions so that the predicates aren&#8217;t stripped from the where clause. Send as much information as is needed by the remote server.</p>
<p>The other issue is auto-discovery of remote tables, which will allow one to create a Federated schema, and the federated tables to be created with the same exact definitions (except of course engine type) as the remote tables.</p>
<p>About Marketing speak. I haven&#8217;t read all of it, and will be the first to say that Federated as it is, is a first generation, first release that is intended to get the idea out there in a simple working model, and generate feedback as your post here is doing.</p>
<p>I intend to start doing more development on Federated than I had been doing over the past year. Ideas, criticisms, and advice such as yours help me to think about what users and developers would like to see out of it and to prioritise what features need to happen sooner rather than later. I appreciate your article, and feel free to contact me any time with suggestions, patches, any form of help.</p>
<p>Thanks much! </p>
<p>&#8211;Patrick</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Xaprb</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3729</link>
		<author>Xaprb</author>
		<pubDate>Thu, 01 Feb 2007 12:52:14 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3729</guid>
		<description>&lt;p&gt;Thanks Peter, I agree.  I only have experience with MS SQL Server otherwise, where "federated servers" are really just "distributed views" which can sometimes process queries badly and/or have many limitations depending on the application.  This is not very similar to MySQL's storage engine.  Can anyone else comment?  What about PostgreSQL or Oracle or Firebird?&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Thanks Peter, I agree.  I only have experience with MS SQL Server otherwise, where &#8220;federated servers&#8221; are really just &#8220;distributed views&#8221; which can sometimes process queries badly and/or have many limitations depending on the application.  This is not very similar to MySQL&#8217;s storage engine.  Can anyone else comment?  What about PostgreSQL or Oracle or Firebird?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Zaitsev</title>
		<link>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3723</link>
		<author>Peter Zaitsev</author>
		<pubDate>Thu, 01 Feb 2007 10:26:35 +0000</pubDate>
		<guid>http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3723</guid>
		<description>&lt;p&gt;Great summary. &lt;/p&gt;

&lt;p&gt;The problem with Federated is to a large extent its name: people can compare it to what other vendors call Federated, make assumptions accordingly, and be surprised that it does not work as expected.&lt;/p&gt;

&lt;p&gt;The limitations of the Federated storage engine come to a large extent from the fact that it is simply a storage engine, and a storage engine does not handle GROUP BY or ORDER BY; this is why these operations are not passed on to the remote server.&lt;/p&gt;

&lt;p&gt;In my opinion the Federated storage engine is OK for light-duty access to remote data or as a source for conveniently importing data from remote servers.  I would, however, be careful about having heavy-duty applications rely on it.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Great summary. </p>
<p>The problem with Federated is to a large extent its name: people can compare it to what other vendors call Federated, make assumptions accordingly, and be surprised that it does not work as expected.</p>
<p>The limitations of the Federated storage engine come to a large extent from the fact that it is simply a storage engine, and a storage engine does not handle GROUP BY or ORDER BY; this is why these operations are not passed on to the remote server.</p>
<p>In my opinion the Federated storage engine is OK for light-duty access to remote data or as a source for conveniently importing data from remote servers.  I would, however, be careful about having heavy-duty applications rely on it.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
