Comments on: MySQL’s FEDERATED storage engine: Part 2
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/
Feed updated: Fri, 10 May 2013 18:25:19 +0000

By: CaptTofu, Mon, 21 Dec 2009 14:38:00 +0000
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-17469

I was asked to clarify this, despite this being an older post.

First of all, if you want a more up-to-date Federated and you are able to use newer versions of MySQL, I would encourage you to use FederatedX, which is now part of MariaDB, a branch of MySQL being developed by Monty Program Ab. Please see: http://askmonty.org/wiki/index.php/MariaDB

As for the question you asked, here is what I mean:

When you define an index on a regular table such as InnoDB or MyISAM, it creates an actual on-disk index, whether that index lives in an index file (MyISAM) or is a clustered index in a tablespace file (InnoDB). For Federated, an index means something different:

Federated works by taking the SQL statement you are running against the federated table and constructing query fragments based on your query as well as the table definition. When you define an index in Federated, the column or columns you create that index on are used in the WHERE clause that the federated engine constructs and ultimately runs against the remote database. To further clarify, assume the local federated table:

create table t1 (
  id int not null auto_increment,
  name varchar(32) not null default '',
  age int not null default 0,
  primary key (id)
) engine=federated connection='mysql://remote/test/t1';

and the remote table:

create table t1 (
  id int not null auto_increment,
  name varchar(32) not null default '',
  age int not null default 0,
  primary key (id),
  key name (name)
) engine=innodb;

So, the federated table has an index on id (primary).

The remote table has indexes on id and name.

What this means is that even if you write the query with a WHERE clause:

select * from t1 where name = 'Subhadra';

the query that the federated engine builds will be:

SELECT `id`, `name`, `age` FROM `t1`;

A full table scan will be performed, and if t1 has a million records, you will be waiting a while for that whole result set to be shipped over the network!

However, say that you added an index on name to the federated table. The query would then be:

SELECT `id`, `name`, `age` FROM `t1` WHERE `name` = 'Subhadra';

Much better. Say then you wanted to specify age:

select * from t1 where name = 'Subhadra' and age = 33;

The federated engine will still only use the name index:

SELECT `id`, `name`, `age` FROM `t1` WHERE `name` = 'Subhadra';

However, if you add an index on age to your federated table, the query that federated builds and runs becomes:

SELECT `id`, `name`, `age` FROM `t1` WHERE `name` = 'Subhadra' AND `age` = 33;

Even though the remote table t1 does not have an index on age, this “trick” of adding an index on the federated table will reduce the result set size (if there is more than one “Subhadra” record).

You won’t get the real benefit of an index on the remote table, since the remote server still has to check age row by row, but this will help to reduce the size of the result set shipped back. The more data the remote table has, the more this matters!
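
As a sketch of how the extra indexes might be added: as far as I know, the FEDERATED engine does not support ALTER TABLE, so the local table has to be dropped and recreated, which removes only the local definition and leaves the remote data alone. The table, columns, and connection string below are the ones from this comment. The index names are assumptions, and a real connection string normally also carries a user name and possibly a port.

```sql
-- Sketch only: recreate the local FEDERATED table with extra indexes
-- so that `name` and `age` can appear in the remote WHERE clause.
-- Dropping a FEDERATED table affects the local definition only.
DROP TABLE t1;

CREATE TABLE t1 (
  id   INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(32) NOT NULL DEFAULT '',
  age  INT NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  KEY name (name),  -- index name is an assumption
  KEY age (age)     -- not indexed on the remote table
) ENGINE=FEDERATED CONNECTION='mysql://remote/test/t1';
```

The remote InnoDB table is left unchanged; only the local definition gains the two keys.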

I hope this helps to explain it!

By: Xaprb, Fri, 09 Mar 2007 22:12:48 +0000
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4899

I think the closest thing MySQL offers is NDB Cluster. You may also look into partitioned tables (in version 5.1 only), but that is single-server only. I have no experience with either approach myself, but I’m sure some people on the #mysql IRC channel do. I hope this helps!
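
For readers who have not seen the partitioned tables mentioned above, here is a minimal single-server sketch in MySQL 5.1 syntax. Only the user_info table name comes from this thread. The column names, storage engine, and partition count are assumptions for illustration.

```sql
-- Hypothetical sketch: one server, rows spread over 16 partitions
-- by user id. Column names and the partition count are assumptions.
CREATE TABLE user_info (
  user_id   INT UNSIGNED NOT NULL,
  user_name VARCHAR(64) NOT NULL,
  PRIMARY KEY (user_id),
  KEY user_name (user_name)
) ENGINE=MyISAM
PARTITION BY HASH (user_id)
PARTITIONS 16;

-- A search by name still has to look in every partition, because
-- the table is partitioned on user_id, not on user_name.
SELECT user_id, user_name FROM user_info WHERE user_name = 'Subhadra';
```

This keeps everything in one table on one server, so it does not address searching across several MySQL servers.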

By: Hongliu Li, Fri, 09 Mar 2007 19:28:19 +0000
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-4897

I want to find a way to search across MySQL servers partitioned by user id (we might have tens of millions of users). That is, each MySQL server has a user_info table.

For example, if a search for a user name is entered, how do I construct a query that will be executed on all the MySQL servers, then combine the results into one and return it to the client application? I think Federation is not a solution, but can you suggest a practical one?

By: Xaprb, Fri, 02 Feb 2007 17:10:39 +0000
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3793

Hi Patrick, thanks for writing in.

About defining more indexes so WHERE clauses don’t get stripped: that could help somewhat, but I think it will still choose only one of the possible indexes. It doesn’t have statistics to decide which is best.

In general I would say push down every condition that is valid on the remote server, without regard to whether it will be useful; let the remote server decide that.

I don’t fully understand the requirements for checking the remote server’s data before update/delete; why not just push down the conditions and report back a row count affected? I would say the read-before-write is unnecessary. But I’m not an expert on data federation! (I’ll read the article you referenced, thanks for the link). I think I recall that storage engines are designed to always read before write in MySQL, so maybe this is non-trivial to solve.

Thanks for your good work!

By: Patrick Galbraith, Fri, 02 Feb 2007 16:39:21 +0000
http://www.xaprb.com/blog/2007/01/31/mysqls-federated-storage-engine-part-2/#comment-3791

Dear Xaprb,

Being the author of Federated, I agree with how you describe the way Federated works and some of its limitations. A lot of work remains to enhance Federated in various ways.

One of the first things you bring up is how Federated retrieves all data. This is certainly true for select count(*) as well as for queries not using indexes. However, if you utilise indexes, you avoid retrieving all rows. You can define indexes on the federated table even if they aren’t defined remotely. What happens is that the query being ‘built’ within Federated uses the indexed column in a WHERE clause, resulting in a query that returns only the rows you want. This is a bit of a hack, I admit, but it gives you the ability to avoid retrieving all rows.

With all this in mind, I’m currently working on a patch that implements pushdown conditions on non-indexed columns. This will make it so queries with WHERE clauses on non-indexed columns result in specific rows being returned as opposed to the whole table. I’m not sure how things like select count(*) would be pushed down, but it would be great to be able to construct a query that ran only the ‘select count(*)’ remotely and returned that simple number!

The next issue you mention is update and delete, and how Federated has to retrieve the data first before modifying it. This again has to do with how the remote query issued by Federated is built. In a nutshell, there is a loop that goes through all the fields in the table, appends the field names, and then appends the values. It would be nice if it were to somehow retrieve only the specific row(s) it intends to delete or update and use those row(s) to build its query, and perhaps intelligently build a query that would append the ids using “.. IN (…)” or a range.
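
To make the idea concrete, these are statements of the kind such an approach might send to the remote server. This is only an illustration of the proposal, not what the engine currently generates. It reuses the t1 table from the example elsewhere in this thread, and the id values are made up.

```sql
-- Illustration only: identify the affected rows by primary key
-- instead of matching on every field of every row.
DELETE FROM `t1` WHERE `id` IN (3, 7, 12);

-- The same idea using a range of ids rather than a list.
UPDATE `t1` SET `age` = 34 WHERE `id` BETWEEN 100 AND 120;
```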

Then there is INSERT, where you talk about how it doesn’t check the table first before inserting. This, I admit, could be problematic: if someone on the remote end inserts the same data, the result is a conflict. I’m not sure how I would deal with this. One of the main features of Federated Databases is that they have to maintain the remote data source’s autonomy, “that the operation of the source is not affected when it is brought into a federation” (see the good article http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html). How to handle inserting data that might conflict with data being inserted locally is an important consideration in how inserts should occur. Should the Federated storage engine make sure that there is no conflict? I would say most likely, which then requires more functionality to be added to the engine to ensure that data being inserted is not going to conflict, just as update and delete do currently.

Another possible improvement that can be added to INSERT (write_row) is using bulk inserts for multiple rows.
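
As an illustration of what a bulk insert looks like on the wire, a single multi-row statement can carry several rows in one round trip to the remote server. The t1 table is the one from the example elsewhere in this thread, and the row values are made up.

```sql
-- One statement for three rows, instead of three single-row INSERTs.
INSERT INTO `t1` (`name`, `age`) VALUES
  ('Subhadra', 33),
  ('Patrick', 41),
  ('Hongliu', 28);
```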

About the weaknesses, it’s good to be critical. In being critical, there arises a list of improvements to be made with Federated.

* The first weakness you mention will hopefully be addressed with the addition of pushdown conditions.

* The second issue, index optimisations, suggests that some of Federated’s functionality may need to be abstracted to a higher level than the storage engine itself, perhaps the query processor. I can say that currently there is no easy way to push down joins, because a storage engine by design works on only one table at a time, and so Federated deals with data one table at a time.

* The issue with EXPLAINing a query against a federated table resulting in a remote query – I’m not sure how you would handle this. EXPLAIN gives information on how a query will be executed. How can we know how this will be executed without having to do some sort of query on the remote server since Federated is all about knowing how to deal with remote data? Good question.

* No “memory” of what data has been fetched. Again, this is something I’m not sure, offhand, how to deal with. Is this something that cursors could be employed to deal with? Or again, would having the Federation functionality be above the storage engine deal with this issue better? (I ask this not to you specifically but as a means of openly thinking aloud)

Furthermore, the issue of moving huge amounts of data; I think that cursors would be a possible solution to this so that the issue of fetching all rows and running out of memory isn’t as much an issue. Some of the poor query optimisation can be improved by using pushdown conditions so that the predicates aren’t stripped from the where clause. Send as much information as is needed by the remote server.

The other issue is auto-discovery of remote tables, which will allow one to create a Federated schema, and the federated tables to be created with the same exact definitions (except of course engine type) as the remote tables.

About marketing speak: I haven’t read all of it, and will be the first to say that Federated, as it is, is a first-generation, first release that is intended to get the idea out there in a simple working model and generate feedback, as your post here is doing.

I intend to start doing more development on Federated than I have done over the past year. Ideas, criticisms, and advice such as yours help me to think about what users and developers would like to see out of it, and help me to prioritise which features need to happen sooner rather than later. I appreciate your article, and feel free to contact me any time with suggestions, patches, or any form of help.

Thanks much!

–Patrick
