MySQL’s FEDERATED storage engine: Part 2

In part 1 of this series I examined the behavior of MySQL’s FEDERATED storage engine by running queries with the general query log enabled on the remote server. In this article I take a higher-level view. I summarize my findings from the first article, then give my thoughts on the engine’s strengths and weaknesses.

How the engine works

The FEDERATED engine accesses a remote table via a normal mysql client connection. Every constraint of normal connections applies, such as timeouts. The user account used for the connection is subject to normal grants on the remote table, though permissions can be further restricted for the local “copy” of the remote table. At least one connection to the remote server is opened per FEDERATED table. If a table needs to be opened several times at once (e.g. in a self-join), additional connections are opened, and cached for future use.

The storage engine issues SHOW TABLE STATUS against the remote table before (nearly) every query. There is one important exception, which I’ll get to.

The storage engine appears to follow only a few simple rules. It can fetch a single row, a range of rows, or all the rows in the table. If you ask it for anything else, such as a COUNT(*) query, it will execute your query by fetching all the rows over the network connection, then running the COUNT() over them. It always fetches entire rows. For example, if you ask for just one column, it’ll fetch all columns for each row, and discard the ones you didn’t ask for.

The engine reads all data before it writes, so if you ask it to delete or update a row it will fetch it first, then send it back to the remote server with every column in the WHERE clause and a LIMIT of 1 (apparently the FEDERATED engine would rather be safe than sorry). If you ask for many rows to be changed or deleted, it will fetch them all, then issue the queries back one row at a time. TIMESTAMP columns do not get auto-updated in this process.

The most interesting case is INSERT. The engine doesn’t do a read before an INSERT, nor does it issue SHOW TABLE STATUS first. This means INSERT statements have less overhead than other statements, though even INSERT is done a row at a time when there are many rows to do. Another nice thing about INSERT: if the remote table has an AUTO_INCREMENT column, LAST_INSERT_ID() will work normally.

The engine does not support REPLACE, ON DUPLICATE KEY UPDATE, or INSERT IGNORE. These get translated to INSERT.

Contrary to the manual, you don’t need to make the FEDERATED table’s definition match the table definition on the remote server. You can omit columns or indexes, re-order columns, add columns that don’t exist, and change data types. The engine copes well. It fills in missing columns with defaults, casts values when possible, and omits columns that don’t exist on the remote server. There are cases where you’ll get an error, but fewer than I expected. The engine doesn’t seem to honor column names, but only pays attention to the relative position of columns.

However, you mismatch table definitions at your own risk. The FEDERATED engine provides features based on what it knows about the table definition you give it, not the table definition that actually exists on the remote server. For example, if you don’t tell it the table has an AUTO_INCREMENT column, you won’t get any LAST_INSERT_ID() values after an INSERT. And if you don’t tell it which indexes exist, the query optimizer will strip columns out of the WHERE clause, so a query that might use an index on the remote server could be a table scan instead.

Weaknesses

Frankly, there are many weaknesses in the engine’s modus operandi. I’m not trying to be critical, but given the relatively inefficient way of fetching, updating and deleting rows, there are a lot of queries you might not want to execute against a FEDERATED table. For example:

  • If you expect your query to use a covering index because it only accesses certain columns, you’ll be disappointed; all columns will be fetched, whether needed or not.
  • Likewise, other index optimizations are useless. Indexes exist only on the remote server, not the FEDERATED copy of the table, which stores no data. Indexes can’t be used to optimize ORDER BY or GROUP BY queries. These clauses are stripped from the query before sending to the remote server.
  • Any query that doesn’t request rows exactly as they exist in the remote table will be converted to a query that fetches them as they exist, so plan accordingly. For example, a grouped query will fetch all the rows over the network, then do the work, probably using temp tables and filesorts.
  • Don’t try to EXPLAIN a query against a FEDERATED table if you don’t want it to actually be executed on the remote server.
  • There is no “memory” of what data has been fetched, even within one query; it appears that the engine only considers one row “current” at a time. For example, if you’re joining against a FEDERATED table and the join requires re-outputting a row that was fetched earlier, it’ll re-fetch it from the remote server. What if something changed the data on the remote server in the meantime? For that reason, you probably don’t want to do a join you expect to repeatedly output rows from the remote table.

This rules out some uses right away. For example, I started looking at the engine to see if I could move a very large table off a transaction-processing server, for read-only remote access. FEDERATED is not suited for this; it would probably cause remote table scans and run the local server out of memory.

The poor query optimization is troublesome. The query optimizer on the local machine seems to think it knows what’s best for the remote machine, even though it doesn’t have any index statistics (and might not even have the right index definitions!). As a result, it strips predicates from the WHERE clause when it ought to send them through and let the remote server decide what to do with them. You could easily get into trouble with this, especially if there are several indexes on the table and it chooses the wrong one.

The read-before-write, row-at-a-time strategy for DELETE and UPDATE also makes it impractical for these queries on large datasets.

Strengths

The engine’s simplicity makes it relatively easy to understand. It may not perform well on large queries, but you know what you’re dealing with.

That makes the engine best suited for very small, simple queries, in my opinion. And the way INSERT is implemented makes it the most optimal of the types of queries I tried (again, only for small data sets), especially since LAST_INSERT_ID() works correctly.

For example, pretend you have a set of distributed servers working on small parts of a large task, and their results need to be merged back together when done without conflict. Many solutions to this problem involve modulo arithmetic for generating primary keys. This could be a good use of a FEDERATED table: just federate one central table on all the servers, have the processes INSERT into the table, and they’ll get non-conflicting primary key numbers. That’s a trivially easy way to coordinate distributed resource requests.

The way it lets you mis-define tables holds great potential. For example, Giuseppe Maxia has already noted that you can define a FEDERATED table against a view. Views don’t have indexes (yet), but that shouldn’t stop you from telling the engine it does! That way, your WHERE clauses are sent through to the remote server unharmed, where the view can execute GROUP BY queries and the like. Giuseppe even outlines a way to get the remote server to execute arbitrary commands via a FEDERATED table!

What about combinations with replication, triggers and so forth? There must be many more cool hacks waiting to be discovered.

Marketing speak, deflated

The main article I’ve found on the MySQL site, Accessing Distributed Data with the Federated Storage Engine, somehow makes it sound as if FEDERATED is designed for unifying vast amounts of data all over an enterprisey-sounding corporate network. There are no real claims, but opening sentences like “corporations are swimming in more data than ever before. IDC has recently pegged data growth at 48% a year,” followed by a section heading titled “Federated to the Rescue,” gives an impression I don’t think is realistic. This is not a solution for such a problem, in my opinion.

Summary

This was mostly a fun exercise for me. Once I realized this storage engine wasn’t a viable choice for reading lots of data from a large table on another server, I decided to investigate it more and see how it works.

It turned out to be time well spent. I was able to answer many questions I see floating around on the Internet, which the manual doesn’t answer (I hope to contribute to the manual and help solve that problem). I learned a lot about how the engine works. And I hope I helped you, too.

I think this engine has an exciting (though limited) range of uses, and I’m keen to see what happens with it.

A broader context

A reader commented on my last article that the storage engine is designed to enable remote access to many types of remote servers, not just other MySQL servers. Accessing data on any generic “SQL table” outside the currently running MySQL instance is an ambitious aim, and many of the problems I see might be design decisions in support of that goal.

Point taken, but that’s only planned at the moment, not reality.

If this is MySQL’s goal for the FEDERATED engine, it needs to be made clearer. I don’t think an average reader of the manual would come to that conclusion otherwise. In the meantime, this series of articles is about understanding what this storage engine does and what it’s useful for, here and now. Future functionality is fair game for future articles.

Technorati Tags:No Tags

You might also like:

  1. MySQL’s FEDERATED storage engine: Part 1
  2. When to use surrogate keys in InnoDB tables
  3. SQL Server 2000 date and time puzzler
  4. How to do efficient forward-only SQL maintenance jobs
  5. How to exploit MySQL index optimizations

6 Responses to “MySQL’s FEDERATED storage engine: Part 2”


  1. 1 Peter Zaitsev

    Great summary.

    The problem with Federated is at large extent its name - people can compare it to what over vendors call Federated make appropriate assumptions and be surprised it does not work as it is expected.

    The limitations Federated storage engine has at large extent come from the fact it is simply storage engine and Storage engine does not handle group by or order by this is why these operations are not passed by to remote server.

    In my opinion Federated Storage Engine is OK for light duty access of remote data or as a a source for conveniently importing data from remote servers. I would be however careful having heavy duty applications to relay on it.

  2. 2 Xaprb

    Thanks Peter, I agree. I only have experience with MS SQL Server otherwise, where “federated servers” are really just “distributed views” which can sometimes process queries badly and/or have many limitations depending on the application. This is not very similar to MySQL’s storage engine. Can anyone else comment? What about PostgeSQL or Oracle or Firebird?

  3. 3 Patrick Galbraith

    Dear Xaprb,

    Being the author of Federated, I agree with you on how you describe how Federated works and some of its limitations. There remains a lot of work to get various things to enhance federated.One of the first things you bring up is how Federated retrieves all data. This is certainly true for select count() as well as queries not using indexes. However, if you were to utilise indexes, then you avoid retrieving all rows. You can define indexes on the federated table even if they aren’t defined remotely. What happens is that the query being ‘built’ within Federated utilises this column in a where clause resulting in a query that returns only the rows you want. This is a bit of a hack, I admit, but gives you the ability to avoid getting all rows retrieved. With this all in mind, I’m currently working on a patch that implements pushdown conditions on non-indexed columns. This will make it so queries with where clauses on non-indexed columns result in specific rows being returned as opposed to the whole table. I’m not sure about how things like select count(*) are pushed down, but it would be great to be able to construct a query to run remotely that only ran the ’select count(*)’ remotely and returned that simple number!

    The next issue you mention is update and delete, in how it has to retrieve the data first before modifying it. This has to do again with how the remote query to be issued by Federated is to be built. In a nutshell, there is a loop that goes through all the fields in the table, appends the field names, and then appends the values. It would be nice if it were to somehow only retrieve the specific row(s) it intends to either delete or update and use that/those row(s) to build it’s query, and perhaps intelligently build a query that would append the ids into a query using “.. IN (…)” or using a range.

    Then there is INSERT, where you talk about how in this case it doesn’t check the table first before inserting. This I admit could be problematic in not knowing if perhaps someone on the remote end inserts the same data, resulting in a conflict. I’m not sure how I would deal with this. One of the main features of Federated Databases is that they have to have to maintain the remote data source’s autonomy “that the operation of the source is not affected when it is brought into a federation” (see the good article http://www-128.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html). How inserting data that might conflict with data that might be locally being inserted is an important consideration in how inserts should occur. Should the Federated storage engine make sure that there is no conflict? I would say most likely, which then requires more functionality be added to the engine to ensure that data being inserted is not going to conflict, just as the update and delete do currently.

    Another possible improvement that can be added to INSERT (write_row) is using bulk inserts for multiple rows.

    About the weaknesses, it’s good to be critical. In being critical, there arises a list of improvements to be made with Federated.

    * The first weakness you mention will hopefully be addressed with the addition of pushdown conditions.

    * The second issue of index optimisations adds the need to perhaps abstract some federated to a higher level than the storage engine itself, perhaps the query processor. I can say that currently there is no easy way to pushdown joins because federated deals with data on a one table at a time basis because a storage engine by design is only for one table at a time.

    * The issue with EXPLAINing a query against a federated table resulting in a remote query - I’m not sure how you would handle this. EXPLAIN gives information on how a query will be executed. How can we know how this will be executed without having to do some sort of query on the remote server since Federated is all about knowing how to deal with remote data? Good question.

    * No “memory” of what data has been fetched. Again, this is something I’m not sure, offhand, how to deal with. Is this something that cursors could be employed to deal with? Or again, would having the Federation functionality be above the storage engine deal with this issue better? (I ask this not to you specifically but as a means of openly thinking aloud)

    Furthermore, the issue of moving huge amounts of data; I think that cursors would be a possible solution to this so that the issue of fetching all rows and running out of memory isn’t as much an issue. Some of the poor query optimisation can be improved by using pushdown conditions so that the predicates aren’t stripped from the where clause. Send as much information as is needed by the remote server.

    The other issue is auto-discovery of remote tables, which will allow one to create a Federated schema, and the federated tables to be created with the same exact definitions (except of course engine type) as the remote tables.

    About Marketing speak. I haven’t read all of it, and will be the first to say that Federated as it is, is a first generation, first release that is intended to get the idea iout there in a simple working model, and generate feedback as your post here is doing.

    I intend to start doing more development on Federated than I had been doing over the past year. With ideas, criticisms, advice such as yours, it helps me to think about what users and developers would like to see out of it and helps me to prioritise what features need to happen sooner than later. I appreciate your article, and feel free to contact me any time with suggestions, patches, any for of help.

    Thanks much!

    –Patrick

  4. 4 Xaprb

    Hi Patrick, thanks for writing in.

    About defining more indexes so WHERE clauses don’t get stripped: that could help somewhat, but I think it will still choose only one of the possible indexes. It doesn’t have statistics to decide which is best.

    In general I would say push down every condition that is valid on the remote server, without regard as to whether it will be useful; let the remote server decide that.

    I don’t fully understand the requirements for checking the remote server’s data before update/delete; why not just push down the conditions and report back a row count affected? I would say the read-before-write is unnecessary. But I’m not an expert on data federation! (I’ll read the article you referenced, thanks for the link). I think I recall that storage engines are designed to always read before write in MySQL, so maybe this is non-trivial to solve.

    Thanks for your good work!

  5. 5 Hongliu Li

    I want to find a solution to do search on MySQL servers partitioned by user id (we might have tens millions of users). That is, each MySQL Servers have a user_info table.

    For example, if a search for a user name is entered, how do I construct a query that will be executed on all MySQL servers, then join the result into one and return to client application. I think Federation is not a solution, but can you suggest me a practical solution.

  6. 6 Xaprb

    I think the closest thing MySQL offers is NDB cluster. You may also look into partitioned tables (in version 5.1 only), but that is single-server only. I have no experience with either approach myself, but I’m sure some people on #mysql IRC channel do. I hope this helps!

Leave a Reply

Please do not use this blog to get help with problems or bugs in Maatkit or innotop: use the Sourceforge forums, mailing list, or bug trackers. If you're asking for help with MySQL, please use the MySQL mailing list instead. I'm writing a book and my time is extremely limited :-)