Xaprb

Stay curious!

How to eliminate temporary tables in MySQL

with 25 comments

I’ve written before about how to make MySQL replication reliable. One thing I think you need to do to make statement-based replication reliable is eliminate temporary tables. I found an elegant way to replace temporary tables with real tables in the systems I maintain. This article explains how.

The problem

Temporary tables are anathema to reliable MySQL slave servers. If you have a temporary table and the slave crashes in between accesses to the temporary table, when you restart replication the temporary table no longer exists, and you are in trouble.

This is only a problem with statement-based replication, which is how MySQL replication works until version 5.1, which is currently in beta.

If you want to be able to stop and start slave servers at will (for backups, failover, etc) or recover smoothly from crashes, in my opinion you must completely eliminate temporary tables on the master. Note that I’m talking about true temporary tables created with CREATE TEMPORARY TABLE, not temporary tables created internally by MySQL for sorting or processing subqueries.

Properties of temporary tables

Temporary tables have some nice properties:

  • They are private to a connection. No other connection can see them, and there are no naming conflicts when many connections create temporary tables with the same name.
  • They go away by themselves when the connection closes, so you don’t have to clean up after crashed connections.

These are compelling reasons to use temporary tables for many purposes. However, these very properties have drawbacks:

  • Since they’re private to a connection, you can’t debug an application easily if you need to see the data it has generated and is working on.
  • Since they don’t persist after a connection closes, they cause problems with replication, as I mentioned.

Any good solution to the problems should ideally offer some of the benefits of temporary tables. Applications need to be able to avoid naming conflicts, and cleanup needs to be easy. And it would be nice to avoid the problems too — I’d like to see a connection’s data, and I’d like to avoid replication issues. Can all this happen? I think so, if you’re willing to give up a small amount of convenience, and if there are no privacy issues.

One possible solution

One common solution I’ve seen is to use real tables, which the application will drop when it’s finished. I call these “scratch tables” to distinguish them from temporary tables.

The usual advice is to create randomly named scratch tables in a database reserved for the purpose. This is a way to avoid naming conflicts, though you still have to check before you create your table, on the off chance there is a conflict. Cleanup should usually be handled by the application itself doing a DROP TABLE when it’s done, but in case it crashes or something else happens, you can just run a periodic job to delete all tables in the scratch database that are over some age (two hours would usually suffice for my applications).

mysql-find can do the purging easily, so that’s no trouble. But there are some drawbacks to this method:

  • Randomly named tables, and checking for an existing table before creating, is a bit messy and is prone to race conditions.
  • Quick, which connection is using table d41d8cd98f00b204e9800998ecf8427e, and what is it used for? Suppose the application creates a scratch table of users, one of messages, and one of interests — what kind of table is that? I don’t know either.
  • I don’t like guessing when a table isn’t in use anymore. What if an application was running slowly because, for example, it’s talking to an API that’s responding slowly? Dropping its tables would be a bad thing to do.

My solution

I’ve solved all these problems by not naming tables randomly. Instead, I name scratch tables sensible names, just like normal tables, but I then append the connection ID to the table name. The resulting tables are named things like users_1234. This approach has some obvious benefits:

  • It’s easy to see what a table is for.
  • It’s easy to see which connection created the table.
  • The table name is guaranteed to be unique, since no two connections have the same ID (this is not true of UNIX process IDs if you have processes connecting from more than one host).
  • It’s really easy to generate the table name. Finding out your connection ID is trivial. In Perl, you can say $dbh ->{'mysql_thread_id'} on any database connection, and similar functionality exists for other programming languages. If you need to do it from within SQL, you can just use the CONNECTION_ID() function.
  • Finally, it’s really easy to know if a table is still in use.

That last point deserves a bit of explanation. You can run SHOW FULL PROCESSLIST and see a list of all current connections. If the number isn’t there, and you have the PROCESS privilege so you’re certain you can see all connections, the table is deadwood and should be pruned. I’m sure you saw this coming, but mysql-find can do all this for you, too; you just give it a regular expression pattern to capture the connection ID from the table name, and it takes care of the rest (including the privilege check):

mysql-find --pid '\D_(\d+)$' --exec_plus "DROP TABLE IF EXISTS %s"

Conclusion

What’s not to love?

  • Simple.
  • Elegant.
  • Solves world hunger.

Well, I’m still working on that last part…!

Written by Xaprb

May 11th, 2007 at 10:01 pm

Posted in Uncategorized

25 Responses to 'How to eliminate temporary tables in MySQL'

Subscribe to comments with RSS

  1. [...] Xarb: code, -often [...]

  2. [...] able to deal with some race conditions like a table being dropped between the time it’s discovered and the time it’s locked (this is very relevant for me because I avoid temporary tables so replication is restartable). [...]

  3. Topic: How to eliminate temporary tables in MySQL

    Regarding the statement
    ‘The table name is guaranteed to be unique, since no two connections have the same ID (this is not true of UNIX process IDs if you have processes connecting from more than one host)’,

    does it mean that it’s not advisable to use your method on UNIX/LINUX platforms?

    rodney

    31 Oct 07 at 12:48 pm

  4. Rodney,

    Thanks for pointing out this unclear sentence. Actually, what I was trying to say is my technique works no matter what. I’ve seen people name their temporary tables after the UNIX process ID, rather than the connection ID. Because it’s possible for two processes connecting from two separate UNIX machines to have the same UNIX process ID on their respective machines, the process ID isn’t safe to use. The connection ID will always be unique.

    Xaprb

    31 Oct 07 at 12:53 pm

  5. While your table names are guaranteed to be unique while things are running smoothly, you could get into BIG trouble if you experience any InnoDB corruption. Queries to corrupted InnoDB tables currently crash MySQL. The angel process will then restart the server resetting the connection count. The InnoDB table most likely to be corrupted is the one you write to most often so your server will be waffling a lot. Rapid up and down of the server will cause a lot of your connections to have the id of 1 (or 2 or 3…) which will cause major problems.

    I don’t really think that this workaround is that advisable. I think MySQL should fix their replication scheme. You imply they have with 5.1 but I’m not up on the internals. Do you have a link where these changes are detailed?

    Thanks, Rich(ard)

  6. Hi Richard,

    The changes I’m referring to are row-based replication, which you can read about in the manual.

    I partially disagree with some of your points. However, I can see your point of view, and I strongly agree with your view that fixing problems in replication is better than using workarounds.

    Some notes:

    Queries to corrupted InnoDB tables don’t always crash MySQL. It depends on the corruption.

    I’m not sure exactly what you mean by “have the id of 1… major problems.” I think you mean “crash, write to the same table again, crash, ad infinitum.” However, this won’t be a problem with the way I’m advocating using the tables. They are truly meant to be temporary and used only by one connection, so I do something like “drop table if exists foo_1, create table foo_1(a int), … drop table if exists foo1.” Dropping the table before using it should avoid the problems I think you’re worried about.

    Finally, an add-on to the original article. I now add both the connection’s ID and the server’s ID to the table name. This way it’s easy to know whether the table was created on the server I’m looking at (in case I’m looking at a slave). So my tables are now named like this: foo_21_198282. This is also integrated into mk-find (in Maatkit).

    Xaprb

    7 Dec 07 at 5:52 pm

  7. Oh — I meant to say, I’m in agreement with you that InnoDB corruption can cause major pain in general (though thankfully InnoDB is very reliable in general, bless Heikki and team), but I don’t understand why the issue you’re raising isn’t just as much of a concern for non-temporary tables.

    Xaprb

    7 Dec 07 at 5:54 pm

  8. I think you missed the point of what I was saying. When the angel process respawns the server, the connection number resets to zero so if the process keeps respawning, you’ll see a lot of connections with id 1. Yes, the IF EXISTS thing helps but maybe doing foo_$serverid_$timestamp_$connectionid would be even better.

    I’m probably just being overly paranoid here from some of the production disasters I’ve seen with MySQL, particularly involving InnoDB and/or replication. When things go bad, they go real bad…and quickly.

    Have you seen a performance hit using this scheme though? I wouldn’t want to writing to disk for temp tables.

  9. Sure, I understand the paranoia. I’ve seen bad problems too.

    I still feel that I’m missing your point though. Are you saying that my program might connect to the server, get connection ID 1, create table_1, and then there’s a crash? And then that another program would also get connection ID 1 and the two programs would conflict?

    Just to clarify what I’m using these tables for: it’s things like rolling up large amounts of performance data into summary tables, holding short lists of things to work on (like client numbers), generating reports, etc. Many of these would go to disk anyway because they’d exceed the size limit on in-memory tables. So no, I haven’t seen a performance hit; I’m much happier with the data on disk anyway, where it will survive a crash.

    Xaprb

    8 Dec 07 at 5:56 pm

  10. Yeah, that’s what I was saying.

    The situation I had occur frequently on one install (still a mystery to this day) was the server had a forum using InnoDB tables and it had some index table that liked to corrupt. Ppl were always on this forum and thus always accessing the index so whenever it corrupted (which was about once a week), the server rarely got it’s connection count over 2.

    My “solution” was to ditch all the transactional code (only a handful of changes) and change all the tables to MyISAM.

  11. I’m glad I figured out what you meant. It should not be a problem, because the connection dies when the server dies, so there will never be more than one connection with a given ID. Now, there may be more than one program that thinks its magical suffix should be 1, if the code is bad and the program doesn’t quit when the server bounces. But any program that thinks it can keep going when the server bounces is a problem. It should either die, or throw away its work and begin again. Certainly using “real” temporary tables wouldn’t avoid this issue either, because they won’t survive the restart either.

    Xaprb

    9 Dec 07 at 10:39 am

  12. why not just use –Replicate-Wild-Ignore-Table=DBNAME.TEMP_PREFIX%

    and force the scripts to use the TEMP_PREFIX you specify?

    e

    eDawg

    10 Jan 08 at 6:49 pm

  13. That won’t work if there are any multi-table updates or deletes, or an INSERT…SELECT query. If you do that, then the following queries will still cause trouble:

    create temporary table temp_prefix.foo as
       select * from main_db.bar;
    
    insert into main_db.fizzle
       select * from temp_prefix.foo;

    Xaprb

    10 Jan 08 at 7:15 pm

  14. Sorry to circle back to an old post but…

    What happens if you wrap your entire process inside of an explicit transaction, e.g. start transaction; create temporary table; populate it using various methods; further manipulate the data; use the data to update a real table; commit;

    If your slave crashes for some reason or another during the middle of this transaction, will it roll back correctly; and more importantly, when you restart the slave, where will it pick up?

    I’m looking at this from a difficult angle because I want to “manufacture” a result set inside of a stored procedure. The answer I keep getting as a replacement for Oracle’s varray is “use a temp table” and because I’m using replication I’m now looking at a work-around for a a work-around.

    Gregory Haase

    1 May 08 at 4:09 pm

  15. Gregory, I have not tested that to the point of exhaustion, but I don’t think your scenario is safe either. I know at least one other person doing the same workaround for a workaround, for the same reason :-)

    Xaprb

    1 May 08 at 5:21 pm

  16. Good idea to create a random name for new table and drop it after use. But in some case, I still like to use temporary table, especially when using PHP to collect some data from other table, and also after the PHP script done, the temporary table automatically deleted. This temporary table also can replace the use of array.

    I only use the temporary table only for one script of PHP. If I would like to use table data for multiple script, I think what Author ide is very good.

    trn

    16 Feb 09 at 10:25 pm

  17. Very good idea for replacing temporary tables.

    I am using temporary tables in Stored Procedures. The only thing is, by your method, I will have to give DROP privilege to the user. Till now, I have revoked DROP privilege and using DROP TEMPORARY TABLE (which skips the check for access rights) in my Stored Procs.

    Rishi Agarwal

    27 Feb 09 at 3:49 am

  18. Quick aside – Baron I enjoyed your talks @ Mysql 2009!

    Regarding using slave-side ignores on SBR, wouldn’t it be desireable in the case of tmp data used to satisfy queries only (selects) to not have the tmp table data replicated at all (since it serves no purpose on the slave)? I’m interested to avoid the IO on the master and for the slave thread when the result will always be discarded. Particularly since I’m running in the cloud, every bit of IO saved seems to count.

    Currently the only “safe” option I see for this is SET SESSION SQL_LOG_BIN=0/1 during the tmp table operations from the app, which has the unfortunate (but also reasonable) side effect of requiring giving your app user SUPER.

    Any thoughts or master-side solution that you would endorse?

    That said, your solution, variants of which I have also used in the past, seems to be more useful under row-based repl where it appears the master-side ignore* params seem to behave correctly.

    Ryan

    1 Aug 09 at 7:54 am

  19. Hi Ryan,

    I’m glad my talks were helpful at the conference. Whether or not the temporary table should be replicated depends on whether any queries depend on the data in it. If you’re inserting into the temporary table then selecting back out of it to insert into another table, or updating from a join, or any other statement that changes data depending on the data in the temporary table, the temporary table needs to be replicated to the slave, or replication will break on the slave.

    I’ve thought a lot about this over the years and had many interesting conversations with people who are doing different things with their data. I have yet to find a situation where temporary tables can be used safely in replication. Replication has many flaws, and it is not possible to make it reliable even in the best of circumstances, but I find the temporary tables are a pretty severe problem in many common scenarios. So in my mind, they are a top 10 problem and it’s well worth avoiding them entirely.

    Xaprb

    2 Aug 09 at 7:10 pm

  20. I’m curious how you drop the table after you’re done with it. Do you use a user variable for the table name? (i.e. CONCAT(‘table_’, CONNECTION_ID()) – then a prepared statement to create the table?

    At this point I can’t figure out how to drop the table since I didn’t think you could drop a table using a prepared statement?

    neil

    8 Jan 10 at 5:28 pm

  21. This article was written about a system that didn’t use prepared statements. The connection-id substitution was done client-side.

    Xaprb

    8 Jan 10 at 9:16 pm

  22. [...] harsh. There are a few workarounds that can be done from an application level that you can read in http://www.xaprb.com/blog/2007/05/11/how-to-eliminate-temporary-tables-in-mysql/ which I think they could be worth thinking [...]

  23. Are there any new schools of thought on this yet (2012)?

    Has MySql 5.5 changed this at all? I have ordered the 3rd edition of your MySql performance book but have yet to receive it and read any recent thoughts on temporary tables and replication.

    Can I assume that as row-based replication does not actually replicate temporary tables at all, unless I create ‘real’ tables to store this ‘temporary’ data then I will always be reading from the master and not the slave and thus negate the reason for using replication at all?

    In short… would it be true to say that if you are doing a lot of analysis/reporting work that reads its final data from a temporary table then unless you use the ‘real’ table alternative above you are never going to be able to scale such an application using replication?

    Termine

    8 Jun 12 at 7:26 am

  24. Hey Barron…long time no talk. I too would like to know if 5.5-6 (and specifically RBR) affects this discussion. It seems that whether you use a temp table or “real” tables inside a scratch-schema, if you are using row-based replication (RBR) AND either turn of bin-logging or block replication of your scratch-schema, you should be safe. What’s your perspective on this?
    Hope you are well!

    Dewey

    2 Aug 12 at 11:29 pm

  25. Temp tables are only a problem with statement-based replication.

    Xaprb

    3 Aug 12 at 1:10 pm

Leave a Reply