Archive for November, 2006

More GnuCash to MySQL tools and queries

Download gnucash2mysql and queries

I wrote a while ago about a program I wrote to export GnuCash data into a MySQL database, including a couple of queries against the resulting schema. I’ve made some improvements since then to allow a simple overlay of my wife’s categories onto the GnuCash hierarchy. This article explains the improved schema, and includes some more useful tools and queries.

The improved schema

The heart of the schema that holds the GnuCash data remains the same, but I’ve added a table to overlay our categorization system onto it. The fundamental issue is the old multiple-hierarchy problem: some transactions should live in different places in the GnuCash account hierarchy than they belong in our expense summary. To remedy this, I just created a table to define our own categories. The resulting system is very simple, and defines a single level of hierarchy, where everything is at the top level; there are no nested categories (though you can simulate nested categories with a naming convention, such as “Auto:Repair”).

This system is about as simple as possible. Here’s the query to create the new table:

create table account_category (
   account char(31) not null primary key,
   category char(20)
) ENGINE=InnoDB;

It would be more normalized to place this data in the account table itself, but that makes it too easy to wipe out your category data when you recreate the schema. For that reason I decided to separate the account and its category.

Category setup

There’s a required initial setup before this is usable, of course: you have to define the categories and associate accounts with them. To ease this step, you should go through your GnuCash file and check the “placeholder” properties checkbox for any accounts you know only exist to hold other accounts. This way you won’t have to worry about assigning them to categories. If you do this, you should run gnucash2mysql to re-create the schema and re-import the data.

Next, run the included setup-categories.pl script. It will ask you to enter a category name for each unassigned account. It prints the account hierarchy so you can easily tell what the account is. If you want to leave an account un-categorized, just press Enter and it’ll be skipped.

The categorization requires that you understand the double-entry accounting principles on which GnuCash is built. The updated queries I’ve included in the download will include transaction splits that transfer money both into and out of a GnuCash account, so think about that as you assign the categories. For example, our GnuCash account “Bills:Groceries” is associated with “Groceries.” When there is a transaction involving “Bills:Groceries,” the money has to come from somewhere — our checking account, for example. We don’t want to include the checking account in any categories, or that money will get counted twice — once as it leaves the checking account, once as it enters the “Bills::Groceries” expense account. This is the fundamental simplification I made to create a non-double-entry report of what we spend. On the other hand, maybe you want to do it differently. Suit yourself :-)

When you’re done

When you’re done associating categories with accounts, you will find new queries in the package for your enjoyment.

Miscellaneous improvements

In the time since I first wrote this program I’ve realized the benefits of using Perl libraries that are already installed on most systems. I’ve gotten rid of the dependencies on non-standard libraries in the new versions. That change should make the scripts easier to install and use. The scripts also read your .my.cnf file if it exists, so you don’t have to specify MySQL options.

At the end of the day, these little scripts and queries are about having more than one way to do things. I hope you find them useful.

Technorati Tags:No Tags

You might also like:

  1. How to flatten hierarchies with awk
  2. GnuCash to MySQL export script
  3. How to find data distributions with SQL
  4. How to use foreign key cascades in MySQL
  5. Advanced HTML table features, Part 2

How to write SQL JOIN clauses more compactly

Here’s a way to write join clauses so they are more compact, more readable, and less confusing.

Standard SQL:2003 defines a USING clause that can serve the same function as the ON clause in the familiar JOIN syntax. For example, the following join

select a.col1, b.col2
from a
   inner join b on a.col3 = b.col3

May be written as follows:

select a.col1, b.col2
from a
   inner join b using(col3)

That may not look like much of an improvement, but it is a big help in larger joins where many tables have columns with the same names. In these cases, not only is it tedious to write out every pair of columns that must match in the join, you often have to refer to the tables with aliases too. And it’s tough to read such a join and understand it, or debug it. For example, what’s wrong with this join?

select tbl1.col1, tbl2.col2, tbl3.col2, tbl4.col1
from apples as tbl1
   inner join oranges as tbl2 on tbl1.col3 = tbl2.col3
   inner join grapes as tbl3 on tbl3.col3 = tbl2.col3
   inner join peaches as tbl4 on tbl3.col3 = tbl2.col3

The statement is valid and will execute, but it won’t give the results you probably wanted (tbl4’s join clause doesn’t refer to any columns from tbl4). The bug is even harder to find if the statement isn’t neatly indented and consistently organized. That statement is better written with USING clauses:

select tbl1.col1, tbl2.col2, tbl3.col2, tbl4.col1
from apples as tbl1
   inner join oranges as tbl2 using(col3)
   inner join grapes as tbl3 using(col3)
   inner join peaches as tbl4 using(col3)

USING matches the specified columns from each table, eliminating the need to write them out twice explicitly with aliases. In MySQL 5, you can see how the statement gets rewritten by the optimizer with EXPLAIN EXTENDED followed by SHOW WARNINGS. The result shows that it gets rewritten as an old-style join with the column-matching done in the WHERE clause.

USING isn’t a drop-in replacement for the ON clause in normal join syntax. There are some differences, especially on different platforms (and in MySQL’s case, even differences between versions of the product — see MySQL JOIN Syntax). This probably makes relational purists hot under the collar. Of course, everything about SQL makes relational purists mad, because SQL isn’t relational, and database products are even less relational than the SQL standard… oh well.

In my opinion, use it if it makes your life easier. Programming is hard — use whatever your tools give you to ease the burden.

One more note: there’s also a NATURAL JOIN syntax that automatically discovers columns with the same names in both tables and matches them in the join. I don’t use this in programs, because if someone adds more columns to one of the tables involved in such a join, the join criteria will silently change. I think join criteria should always be explicit, for the same reason I avoid blind inserts. However, this syntax can be convenient for writing one-off queries at the command line.

Technorati Tags:No Tags

You might also like:

  1. How to simulate optional parameters in SQL
  2. Why large IN clauses are problematic
  3. How to write a SQL exclusion join
  4. My personal SQL coding standards
  5. How to simulate FULL OUTER JOIN in MySQL

Debian/Ubuntu package now available for innotop 0.1.160

A kind soul has contributed a Debian/Ubuntu package for the innotop MySQL and InnoDB monitor. Thanks Sebastien Estienne!

You can find the package in the source directory. Sebastien wrote me that he tested it on Ubuntu Dapper Drake, and I just verified that it works on Edgy Eft.

Since I use Ubuntu on my own machine now, I may integrate this into my build process for future releases of innotop. But for right now, at least there’s another installation option.

Technorati Tags:No Tags

You might also like:

  1. MySQL Toolkit released as one package
  2. I need your advice on how to package MySQL Toolkit as one file
  3. innotop version 1.0 released
  4. Innotop version 1.152 released
  5. How to install innotop on Microsoft Windows

To Gentoo or not to Gentoo?

Some people who know I’ve used Gentoo asked me my thoughts on using it for MySQL servers. Here are my opinions and experiences while using Gentoo, both for desktop systems and for servers.

This is long, but hopefully the brain dump is useful to you. I have so many thoughts on this subject that I’m having a hard time really organizing them well, and I’m censoring myself too because I know I represent only a very small, focused viewpoint on Gentoo.

This article is highly subjective, so please take what I say with a spoonful of salt. Your mileage may vary.

Finally, before I begin: I have a lot of reasons for moving away from Gentoo myself, and this article is mostly about why I don’t think Gentoo is a good idea for what I use computers for, but I don’t mean to criticize anyone or say it’s no good in general (in fact I know of some uses where it works great, and lots of places use it). Nothing here is personal. This is simply the train of thought I went through when deciding not to use a source-based, continuous upgrade distribution for my own systems.

The 50,000 foot view

I recently switched away from Gentoo, after using it since 2001 or 2002. I used Red Hat, Slackware and other distributions before that, and my main gripe with them was the unavailability of reasonably recent software; I felt they became stale and forced you to upgrade your entire system to get something other than an ancient version of [insert package name here]. There were also endless dependency conflicts. By contrast, Gentoo let me build a recent version of whatever I wanted, and I felt a lot happier with it.

In recent years though, I’ve become more aware of the pain of a constant, incremental upgrade process, which is what Gentoo gave me. Even though in theory it’s not necessary to constantly keep updated with the most recent version of all installed packages, in practice I find it is. I found if I didn’t keep abreast of changes, I couldn’t catch back up again without a lot of pain.

I also started to question some of the things people assert about Gentoo, such “it is small, light and focused on speed.” I find packages compiled under Gentoo, with the precisely right CFLAGS customized just for my processor and hardware, and with just the right USE flags customized to only install what’s needed, are often much less performant than software compiled for generic i386 architecture with no optimizations. Finally, some of the software I’ve installed under Gentoo seems to be broken.

The following are more details of my experiences.

Upgrading is painful

Upgrading can be an enormous pain:

  1. First you sync,
  2. Then you run emerge -avuD --newuse world and look at what’s going to be upgraded,
  3. Then you cancel and tweak some USE flags,
  4. Then you pull your hair out as you still can’t understand what package is trying to install 301MB of dependencies you don’t want,
  5. Then you use equery and friends to try to find out,
  6. Then you pull your hair out because these tools say there is no such package, or can’t find the files for it, or find the package but say nothing at all,
  7. Finally you get on the forums and figure it out, and get the flags/whatever right. It turns out there’s a new default USE flag for some package that doesn’t show up in the upgrade list, and it wants to install a bunch of “dependencies” it doesn’t need. For example, there are lots of things that want to install multiple versions of QT, including QT itself. If you let it install version 3 for just one program, it will want to simultaneously upgrade itself to 4 and keep the version 3, because that program needs it,
  8. Then you let it upgrade, which still has to download 194MB (hey! you saved 105MB, be happy!) over your DSL connection and takes 4 hours to compile on a fast 64-bit machine loaded with RAM — why? Why, because of course GCC has to be upgraded from 4.1.1-pre5_r16a to 4.1.1-pre5_r16b, which is probably three lines of source that changed. But as I say later, you don’t dare not upgrade GCC.

Now your system is upgraded, right? Wrong, everything is broken because it isn’t linked against the new glibc/QT/whatever. Time to revdep-rebuild — oh goody, it wants to rebuild your entire system. Another 7 hours of compiling. Cozy up to the back of your computer’s case, it’s cold in the basement and that thing is warm. Who needs a fireplace? Aaahhhhh. Watch a movie.

You won’t have to do this again for another three days, don’t worry.

I’m exaggerating for effect. It isn’t like this every three days. There is something to upgrade every three days, or even every day, but it’s usually small and incremental. However, if you let it slip, it becomes a nightmare, and even when you keep up with it, occasionally you still get this kind of mess. I am absolutely not exaggerating about the size of these. Recent memory contains painful un-repressed experiences with xorg and GCC, which took weeks to solve on certain machines (how much pain you experience totally varies from system to system, depending on what you’ve installed and how it’s configured). And even when this doesn’t happen, it is a royal pain to do this every few days. A constant upgrade process is just way too much work.

Upgrading breaks things

Upgrading can badly break things too frequently for my taste. For example, these are things that used to break on my desktop machine all the time:

  1. My atheros network card. Every time I upgraded anything to do with networking (such as my kernel, or the drivers), networking broke. I know it has to be this way, but when upgrading a kernel, it’s tough to remember everything else that has to be rebuilt against it. I usually figured it out by rebooting and noticing that the network wouldn’t come up, at which point I didn’t have a working network to fetch the latest madwifi drivers for building (oops, I should have been upgrading constantly; I’d have the .tar.gz files cached if I did).
  2. Samba. Any idea how many things in a typical Samba-enabled desktop system are linked against Samba? Me neither, I’m trying to forget. Upgrading a kernel means re-compiling all that stuff (if there’s a way to avoid it, I never found it, and that’s what revdep-rebuild told me to do — and the only thing that worked for me).

There is no way around this, as far as I know.

Even in the best case, it’s too slow

Downloading and building everything from source may promise “infinite flexibility” (more on this later), but it’s too slow. It doesn’t seem that way at first, but after a while, it gets tedious. Compiling everything is just a waste of time, for most people and most purposes. Why should everyone have to start from source code with every package? If we all compiled all our own software, we’d waste billions of hours of processor time every year. There’s something to be said for doing things once and then enjoying the benefits, and compilation is a good example, in my opinion.

Downloading the source is also usually a larger download than getting pre-compiled packages. It’s a waste of bandwidth.

Gentoo is not optimized for my hardware!

As I said above, the whole “Gentoo is optimized” train of thought doesn’t ring true for me. Here are some experiences I’ve had where the Gentoo “optimized” build ran very badly on my machine:

  1. Firefox. When I used an official binary package, Firefox ran like a dream. But if I did this, I couldn’t link against it for building the desktop environment, so then that desktop environment wanted to build the Mozilla suite. So one way or another, I ended up either compiling Firefox from source, or Mozilla from source. And Firefox from source ran like a dog. Even scrolling in a text field was slow. I know, I know — I read all the forum entries about locales and blah blah, and it helped, but it still ran badly. One note about desktop environments — I like to use Fluxbox or XFCE, but there are certain things that have at least some dependency on gnome-base. I don’t mind having that installed, except that it in turn depends on things… and that makes for a lot more things to upgrade.
  2. Any DVD playing software.
  3. OpenOffice.org
  4. MySQL server (I think — preliminary results discussed below).

These are the places where I most notice the speed difference. It beats me why packages I compile from source are slower. I bet I’ve spent a hundred hours trying to learn enough to fix these and other speed problems, but I never could.

The only thing I think might have been more responsive on Gentoo, before I switched to a binary distribution on one machine, is GTK+. On this machine now I notice a small lag in bringing up dialog boxes and so forth. (QT has always been lightning quick for me, no matter what).

Gentoo doesn’t have the latest and greatest

While Gentoo usually has more up-to-date versions of packages in its source tree than Red Hat sometimes used to in 5.1 (ah, the good old days), it doesn’t always have updated versions. I can’t think of a good example right now, except for how long it took to get XFCE 4.2 into the tree, but I know I’ve seen it — common desktop apps that are a year and a half out of date or something like that. Sorry I’m getting a bit hand-wavy here, but I’m sure if you browse Portage you’ll find something really stale, which binary distributions provide much more up-to-date. In general, I don’t think Gentoo is any better or worse than a binary distro.

Gentoo builds are sometimes really broken

Other things I’ve written about are painful, but this is worse. Slow is bad, incorrect is worse. I can’t tell you all the programs I’ve seen behave bizarrely or crash after being built from source on Gentoo (and I stress I don’t do anything exotic with USE flags or compiler optimizations). Most programs work fine, it’s true, but some don’t. Here’s a partial list:

  1. Kexi was completely and bizarrely broken on my Gentoo box. Even though it was supposedly version 1.6.something, dialog boxes popped up saying version 0.8, large sections of functionality simply didn’t exist or didn’t work, and everything crashed constantly. I installed a lower version number on a binary distribution and everything worked like a dream. This was really strange; it was as though the Gentoo build was mixing together source from different trees!
  2. MySQL Administrator and MySQL Query Browser were similarly busted beyond all recognition. Yet when I downloaded the official builds from MySQL and ran them, everything worked flawlessly. I tried this multiple times on many different machines. Why can’t the Gentoo version — built from the same source, I assume — be made to work? Granted, it was masked in portage, but why should it be if the official builds work great?
  3. Perl’s MySQL support broke, then unbroke, then broke… I don’t know whether it was MySQL, Perl, or the DBI or DBD libraries (or something else). I couldn’t connect with AutoCommit disabled; it would throw an error saying “transactions are not supported.” I found a number of places where other people complained of the same problem, but never found a solution, and recompiling endlessly never fixed it.
    • Update: this is not Gentoo’s fault, it’s MySQL’s fault (bug 21829). The reason I’ve never seen it elsewhere is because other distributions choose a non-broken MySQL version. In this case I shot myself in the foot by choosing a broken MySQL.
  4. GRUB mysteriously broke one day after an upgrade. I could never get it to work again. It would start and then have an error, which I forget now, and drop into a command-line. I had to manually type the boot commands at the GRUB command-line every time I booted. I probably spent five or six hours trying to solve this too. Again I saw many other people had the same problem, and none of the tutorials or FAQs fixed it, even when they described the problem exactly.
  5. The madwifi libraries were flakey at best. Half the time when I booted, the Atheros module wouldn’t find the card because the device didn’t exist. I suspected something was wrong with my hardware, but then I switched distributions. Nary a problem since then.

Experienced developers will likely see this list and see I’m confused about the source of some problem, or “oh, you just need to do X” or something. Why does it take an experienced Gentoo developer to properly configure a Gentoo machine? Have you ever noticed that’s who gives the answers to the sticky problems on the forums? Users shouldn’t have to be C programmers to know how to configure their system. Unfortunately, when you run into these issues you do, as far as I can tell — and that’s a risk.

Infinite flexibility and minimality are myths

Gentoo isn’t infinitely flexible, no matter how many USE flags there are. Here are two examples where you can’t get just some part of the required functionality.

The first is MySQL. You can’t just get the mysql client, or C libraries, which means you have to download and build the entire server if anything on your system needs MySQL support. For example, suppose you’re setting up a web server, which will run Perl web apps that talk to other servers running MySQL. You can’t just build Perl with MySQL support. You build dev-perl/DBD-mysql, and it depends on dev-db/mysql. Look at the ebuild:

DEPEND="dev-lang/perl
	dev-perl/DBI
	dev-db/mysql"

If you want anything to do with MySQL, you just installed a MySQL server, even if you wanted your web servers to be “minimal.” Other distributions let you get just the client programs or C libraries.

The next example is Samba. It’s much the same — you can’t just get Samba client libraries and programs, you install a Samba server too.

Even USE flags don’t give you full flexibility. For example, mytop will use ANSI color codes and hi-res time if there’s support for them, but it doesn’t really need them. However, they are listed as dependencies in Portage, so even though they’re optional for mytop, they’re not optional in Gentoo. In theory you could add USE flags for that, but in practice, you don’t get the choice. In theory, Gentoo is infinitely flexible, but in practice it’s not. This should not be a surprise.

I’m sure there are other examples, but I don’t want to get into it too much. I don’t mean to put down the developers’ hard work, but in these cases this just doesn’t do what I want. I want a system that can talk to MySQL or Samba servers, without being one itself, and without having to compile all that extra code.

It’s reinventing the wheel

As a distribution, Gentoo has been and continues to be great. I can’t thank the Gentoo people enough. But I personally don’t want to run any system that I can’t just install and run, with a modicum of knowledge and reasonable installation/configuration time. I’m not a “protect me from complexity because it scares me” person. I’m a “don’t make me do all this work, my time is scarce” person. I do not want to solve the same problems other people are solving, over and over. I want a distribution that solves the problems and gives me the result. If you don’t think Gentoo users reinvent the wheel thousands of times a day, go read the forums!

Package staleness vs. flexibility vs. stable release cycles

Gentoo is all about choice and flexibility. While that is true, as I said infinite flexibility isn’t a reality. What you get instead is the ability to choose exactly which versions of packages you want (theoretically).

For example, if I use a binary distribution with regular “snapshot” releases and only security upgrades, I might be limited to versions 1, 2 and 3 of certain packages. Maybe I want those packages to be versions 2, 2 and 3, but when I upgrade to a newer release of my binary distribution, I can only get versions 2, 3 and 3. Gentoo will likely let me install exactly the versions I want.

That’s the upside of being able to mix and match versions of software. The downside is “infinite flexibility” really means much greater system complexity, when you consider the number of combinations of packages, versions, and interdependencies between them.

Binary or “snapshot” releases are much simpler and the packages can be much better tested with each other. Simple math shows this: if you have N packages in a release snapshot, you have a graph with potentially (N-1)^2 edges. If you have a distribution that lets you install many versions of a given package, say 5 on average (I think this is about right for Gentoo), you have (5(N-1))^2 edges in the graph — 25 times as many. In real numbers, Gentoo currently has 11453 packages in the Portage tree. A hypothetical snapshot release would have about 131 million package inter-relationships, while Gentoo with an average of 5 versions per package would have about 3.27 billion inter-relationships.

In practice, most of these packages aren’t installed and don’t actually interact with each other, so the number of relationships to test and integrate is much smaller, but you can see how the combinatorial explosion makes it much less likely that any given set of packages has really been installed widely and tested together widely.

Combine this with the fact that most binary distro users are using basically the same system, starting from the same installation CD, and you get an idea how much less variability there is in, say, people running the stable version of Ubuntu. A person who installs Gentoo, doesn’t tweak USE flags or other settings at all, syncs her Portage tree, and runs emerge -uD world probably ends up with an entirely unique system, different from any other system anywhere in the world. Just like a snowflake. That’s bad, not good.

Gentoo is like a continuously variable function, whereas a snapshot-based distribution is like a step function.

There’s something else I haven’t added into the mix: USE flags and compiler and CPU settings. I doubt your web browser has enough zeroes to express the combinatorial explosion that causes.

For this reason, I feel far less comfortable with building everything from source and continuously upgrading, than accepting something that’s a few months old but better tested.

This isn’t just theory. Many packages won’t even build unless you get everything just right. For example, a lot of “package won’t compile” forum entries are solved with answers like “you can’t build that with GCC 4.1, you have to upgrade to 4.1.1.” Who knew?

While we’re on the subject…

Certain upgrades can be either difficult or impossible, depending on how much software you have installed. When the most recent “large” GCC upgrade appeared, I tried to put off upgrading my home desktop machine to it because I saw forum threads from people having huge problems. I thought maybe these would get ironed out over time. But then I needed to upgrade or install something myself, and it wouldn’t build with my current GCC version. The only option was to upgrade GCC, I think to 4.1.1.

“Nightmare” doesn’t really capture the experience. Everything broke, and I mean everything. It would have been far easier to just re-install the system from scratch — except that when you install your system, you get whatever GCC is on the CD, so you still have to complete the upgrade. When you go through one of these upgrades, you have to re-compile your entire system, all at once in the proper order. If it breaks partway through — which it will unless you’ve got a really minimal system (e.g. not a desktop machine) — you have to restart, because the dependencies and order of compilation really matters. And there’s no facility for only re-compiling the things that aren’t already re-compiled. After about three or four cycles of emerge --emptytree world, watch it break after eight or ten hours, read the forums and try something, restart only to have the same package break again, it was just too much for me. I switched to Ubuntu.

I’ve had a good bit of experience with other people’s installations too. I know of several servers at various sites that have bumped up against that GCC upgrade cliff. I have only seen it solved by rebuilding the machine from scratch, and upgrading the system before adding non-base-system packages in. I have seen servers remain in the queue a long time for that, as they can’t always be taken down and rebuilt at will. If you try to put it off, every time you try to change anything you’ll likely find its Portage tree is out of sync and refers to packages that don’t exist anymore, so you have to re-sync Portage, and then tons of stuff wants to be upgraded as a dependency. Sometimes I’ve seen machines get so stale they want to upgrade hundreds of packages, but that can’t happen or everything would bust. This is an ugly reality of life with Gentoo, as far as I’m concerned.

If you’re running a desktop machine on Gentoo, there are other painful upgrades too. My main home desktop machine somehow had no problems upgrading xorg to the modular distribution, but on another machine, I basically had to nuke xorg and install it from scratch — and then the modular builds failed, because they were really buggy and didn’t properly list all dependencies, etc (and when there are over 300 modular xorg packages alone to build, this is pretty ugly too). So everything failed and I didn’t have a graphical desktop. I’m okay working in that environment — I spend most of my day at a terminal anyway — but it’s nice to have a web browser. I could not resolve this, and spending days reading forums was simply not an option. I may have had that luxury in the past when I wasn’t really busy on other things (ha, ha), but not anymore. I ended up upgrading that machine to Ubuntu as well.

Configuration management is tough

It’s really tough to get all your machines uniform. There are ways around it, but you still have to expend that effort, and it can be significant. I know of people who set up a build machine that will host binaries for other machines, such as white-box workhorses. This machine is a proxy for the others, and they sync up with it. You can also use g-cpan to manage Perl code in a separate Portage overlay, so you’ll have uniform Perl module versions too. But this is a fair amount of work no matter what, at least with the tools I know about.

That might only be feasible with cheap white-box workhorses. What if you need this same process for really expensive machines? Can you spare a really expensive machine big-iron server just to act as a build host? Even if you do, do all your machines have the same hardware? That’s kinda doubtful. Virtual machines are only a partial solution — a virtual machine isn’t the same as the server it runs on, and installing a virtual machine by definition makes the server different from the others in its “class.”

You might try documenting your install process. I guess you could get machines reasonably close to identical, as long as you have a single Portage tree you’re syncing against. (Whatever you do, don’t sync against the main Gentoo mirrors, or you’re finished before you start. It changes constantly.) I’ve personally had mixed success with this type of process. I don’t think it works as well as it might. There are still too many places to mess up, too many things that introduce variability between machines.

There are ways to deal with all this, but they require extra work, and are basically imposing a “snapshot release cycle” on top of Portage’s stream-of-consciousness continuous change, which is a wheel that’s been invented in almost every other distro out there. I think Gentoo is enough work already.

When you have a problem, where is the problem?

When you’ve compiled everything from source, with your own USE flags, linked against your own libraries etc, you have a version of the software that’s different from anyone else’s anywhere else in the world. And when you have problems with that, you don’t know what the problem is. Is it a bug, or is it just your system?

You can’t always know. Based on my other experiences with things being broken, I am not prepared to say any Gentoo build isn’t subtly buggered.

For example, I’ve seen a business have trouble with MySQL replication. Is it MySQL, is it the way they’ve been doing replication, is it the setup and configuration, or is it that they have a broken MySQL build? I don’t know. They are not executing the same bits as anyone else. I don’t even feel comfortable submitting certain of their bugs as bug reports to MySQL because of this.

I suggest using the official MySQL builds

One site using Gentoo recently decided to download MySQL’s official build and untar it instead of building it from Gentoo’s source tree. Preliminary results indicate this was a wise decision. For one thing, the official build seems to be getting dramatically higher throughput. There were queries that ran in 90+ minutes on Gentoo’s build, and after switching to the official build, ran in 20 minutes (granted, they also went from 5.0.21-log to 5.0.26-standard, but that isn’t the cause of the increased performance!). This is not the only indicator of better performance, as the disk activity increased significantly (by about 2x) on the official build too. That might indicate the CPU is being used much more efficiently. In fact, the official build seems to saturate the I/O, so the disk is now the bottleneck, which is as it should be in that site’s workload.

Regardless of whether you run Gentoo or a binary distribution, I generally advocate running the same bits as other MySQL users. I’ve chatted with dozens of people on IRC who are having trouble with broken builds from third parties, not just Gentoo users. The official builds are tested by thousands of people already, so the bits are much more likely to be good. And if there’s a problem, it makes it dramatically easier to troubleshoot and/or give good bug reports.

This has its downsides too. Getting other things in Portage to link against MySQL libraries, etc etc, without installing the MySQL package from Portage, can be non-trivial. You probably don’t want to install both, or you’ll never know what your programs are linked against.

Conclusion

That’s a heck of a brain dump, but hopefully you can sort through it and glean something from it. I would summarize by saying any source-based, continuous-upgrade distribution is less likely to be well-tested and widely verified, the complexity is much higher, and it is probably much slower and more difficult to install, maintain and upgrade, and keep everything at the same configuration.

If you’re using Gentoo for a MySQL server, I would at a minimum install the official builds, not build from source. I don’t want to say what distribution to use for a MySQL server, but my personal experience has made me want to stay away from Gentoo.

But that’s just my opinion :-)

Technorati Tags:No Tags

You might also like:

  1. Why I (still) like Gentoo
  2. How to update a GCC profile on Gentoo
  3. Installing innotop on FreeBSD and Gentoo
  4. I need your advice on how to package MySQL Toolkit as one file
  5. Debian/Ubuntu package now available for innotop 0.1.160

MySQL Camp 2006

I recently attended the MySQL Camp 2006 un-conference at Google’s headquarters in Mountain View, California. This article is a high-level overview of the event. If you didn’t go, you really missed something good. Go to the next one!

The event

The event brought together the MySQL community in a distinctly un-conference setting, to gather in groups to discuss or hack on something, rather than sitting in chairs with donuts and coffee listening to someone show Powerpoint slides branded with corporate logos. There were no vendor booths, very few projected presentations, and no boredom — at least not for me.

The conference was organized via a wiki, and there was no set schedule. We were editing the list of sessions constantly, even during the event. Likewise, attendance was free and “registration” only meant writing your name down so Google could issue you a visitor pass. Google provided a bunch of rooms with tables and comfy chairs, catered meals, and helpful and friendly organizing staff (seriously, they went above and beyond).

As of a few days before the event, there were fewer than a hundred attendees registered, but in the last few days everyone and their cousin signed up. At last count there were 225 participants from all over the world. This included people from all walks of life. There were government employees, folks from startups, Google employees, interested hackers, both of MySQL’s founders, MySQL developers and the CEO, and lots more. Many of the names were well-known to wider communities. These people founded and architected companies who are household names today: Friendster, YouTube, Flickr, and so on.

Despite that, the atmosphere was generally mutual respect and fun. There were no wannabe rock star blogger personalities parading around. I was happy being able to talk with the MySQL founders and developers and be treated as an equal. For example, at one point Monty Widenius invited me to hop in his car and explain an idea on the way to supper with the group. While I talked on the drive, I felt completely listened to and respected, and learned a lot. I’ve never met a group of more down-to-earth, unassuming programmers. I find the smartest programmers are sometimes impatient and egotistical, but not in this group. It’s proof that people go far by being really nice.

The sessions

Sessions were impromptu, often written on the wiki but sometimes just posted on a whiteboard. I didn’t make it to every session, but here are some of the highlights of those I did attend (you can look around on other blogs for notes on the sessions: try Sheeri Kritzer’s blog, Ronald Bradford’s blog, and James Briggs’ blog).

Friday started off with everyone in one big room introducing themselves and getting a sense of what the group was interested in. I had some prior commitments, so I didn’t make it to sessions much of Friday, but even the morning session was great. I met a ton of people with whom I’d emailed or chatted with on IRC. I also met authors of some of my favorite books, people running projects and companies about which I’ve read with great interest (Greenplum, for example), and lots of people from the wider community. Also present was Oracle’s Ken Jacobs, who is responsible for InnoDB (Oracle acquired InnoDB last year).

In the afternoon I attended two sessions by Google employees. In case you didn’t know, Google runs “very large” clusters of “many” MySQL servers behind such applications as Adwords. Whenever anyone asks how many, they just pause and say “many.” They’ve developed tools to help manage servers, such as killing queries simultaneously on many servers, dumping and reloading data at high speed, and so forth. Some of these tools are available to the public. They also explained how they scale to mind-boggling size (but not what for — that is a secret). And finally, they don’t run a stock MySQL server; they have their own internal patches to both the server and InnoDB (they use InnoDB exclusively). Mark Callaghan explained some of these. The MySQL developers and Mr. Jacobs listened intently.

On Saturday I gave a demo of innotop and asked people to discuss monitoring strategies in general. I listened for ideas, as we haven’t really scaled our operation to where we need many tools to help us monitor a bunch of servers. I heard a lot of experience with tools like cacti and Groundwork. I also heard a lot of nice feedback on innotop. Some people told me things like it “pulled them out of the fire” several times. One of MySQL’s Professional Services consultants told me he considers it “indispensable,” and has recommended it to all his clients, as well as sending it around among the rest of the Professional Services team. The people in the session also gave me good ideas on where to go with future features, which was probably the best part of all. I got a better sense of how people use it and what they want to do with it than I could ever get from this blog, and I’ll be acting on that feedback. I also learned that it is time to write a book or other significant documentation for innotop, as I got requests for several features that are already included, but I had to think hard to remember how to get to them!

I attended half of Sheeri Kritzer’s session on boolean optimizations, and then sat in on half of the session on GIS extensions in the MySQL server. I found this particularly interesting, as my wife’s archaeology projects often store spatial data, and this could be a more powerful way to represent and manipulate that data. Several other attendees have used or are using the spatial extensions in real applications, such as MotionBased. Another attendee mentioned his plans to scale a database of astronomical objects to astronomical size: 100 petabytes. Nobody in the room knew if MySQL could scale to that size, but some were certain Oracle and Postgres couldn’t. The GIS extensions are in a funny place right now, as they were implemented in 2001 (as Monty recalled) but have stagnated since then because no one gave any GIS feedback to MySQL. Here again the MySQL developers asked what the users needed, and gave every evidence of willingness to move things in that direction. Apparently there are some low-hanging fruit that could be plucked to start with, and then take it from there.

Later that evening Jeremy Cole and others set up a general discussion about MySQL replication. The discussion ranged all over the map — anything about replication, and many things not about replication, were fair game. When you have that many smart, experienced people in a room, you can bring up just about anything and get either an answer or some options you didn’t know about before.

On Sunday the attendance was lower, so we re-organized everything into one big room for convenience. We had a discussion about several new storage engines for MySQL, including one the community is watching eagerly: Falcon, which is the brainchild of a relational database pioneer Jim Starkey. Monty also discussed the features in the upcoming MyISAM++, and Brian Aker went over his recent work on an engine that uses memcache for its storage. Finally, we heard about ScaleDB, which is an upcoming storage engine that uses Patricia Tries instead of standard B-Trees. I’m no expert on this, but it was a fascinating talk and sounds like a solution to certain problems. We also had a lively and technical discussion with Nitin Borwankar about designing schemas to support tagging and folksonomy applications like del.icio.us. Finally, we rounded it out with a demo from Sheeri, who showed us her company’s production installation of the new MySQL Enterprise Dashboard, which is a monitoring system that’s part of the new MySQL Network offerings (for certain license levels). It looks like a useful tool, and there was lots of feedback from the participants, which the MySQL developers noted down.

Storage engines and the pluggable architecture

One thing that really clicked for me at this conference is the MySQL architecture. Have you ever thought about what a breakthrough it is to have a database server that supports many storage engines, each with their own strong points? Take a look around the market — you find specialized, single purpose servers, or servers that try to be all things to everyone. MySQL is the only product I know of that lets you keep the same server, the same management and tuning, same tools and administration and query language etc etc — and plug an entirely different back-end into it. This is truly an amazing architectural decision. Look at the choices you already have:

  • MyISAM
  • Cluster
  • Federated
  • Archive
  • Merge
  • Memory
  • CSV
  • Blackhole
  • InnoDB

This positions MySQL extremely well to compete in a market that has bloated servers that are not quite what anyone actually needs. I think MySQL the company has a bright future for this reason alone, but that’s not the only thing going for them. I’ve been reading about pluggable storage engines for a while now, probably much like you, and never thought much about it, but at this event I sensed an almost tangible electricity in the air. What if you took an innovative product that does what nobody’s done before (your choice of storage engines) and made it possible to just plug storage engines in without recompiling the server? That’s a quantum leap for both users and developers. I may have missed the significance till now, but developers haven’t. Do you know how many new storage engines are being developed for MySQL? In addition to the existing ones, I’m aware of at least the following, in various degrees of development, some of which are pluggable:

  • Falcon: designed specifically for modern enterprise and Web 2.0 applications.
  • solidDB: a multi-threaded, transactional storage engine for MySQL Server. It is designed for mission-critical implementations that require a robust, transactional database.
  • OpenOLAP: An open source OLAP (On-Line Analytical Processing) tool.
  • PrimeBase XT (PBXT): a new transactional database engine for MySQL. It has been designed for modern, web-based, high concurrency environments.
  • NitroEDB: an extreme performance and highly scalable relational database technology.
  • Filesystem: will allow you to access your filesystem through SQL, doing such things as querying and updating attributes of image files.
  • memcache: access memcache via SQL.
  • ScaleDB: “currently in stealth mode. Of course, the name itself speaks volumes.”
  • I’ve heard discussion of a storage engine whose physical storage is on Amazon’s S3.
  • BrightHouse: a highly-compressed engine designed for archiving data from multi-terabyte data warehouses. BrightHouse compresses data at an average ratio of 10:1 (peak compression ratios exceed 30:1), whilst maintaining immediate and comprehensive query capability.

Notice how nobody’s developing a storage engine for okay performance on moderate amounts of data? I guess that’s up to me. I’ll call it SoSoDB.

What I got out of it

I got a lot out of this conference:

  • I met lots of great people.
  • I learned a lot. Mostly I learned about new options I didn’t know about before, which is something I always love. I believe there are always more options in life overall than I know about, and it’s great when I find out about them. It opens up so many more possibilities. I have so many ideas now about better ways to architect our systems for another round of scale-up before we scale out, and how to do that better when the time comes.
  • I got a better sense of what the MySQL community and company are like, and I’m prouder than ever to be part of it.
  • I learned a lot about what MySQL and its users do well and not so well. For example, the MySQL developers are probably getting less feedback from their users than they would if the users paid for the product, in my opinion, and that hurts everyone in a sense. On the other hand, I’ll be the first to say MySQL listens to feedback as much or more than if they were charging for the product — more, actually.

I’d absolutely go to the next one, and I encourage you to do so also. By the way, I’m planning to be at the bigger conference in April as well, so maybe I’ll see you there.

It was pretty eye-opening to attend this conference. I’ve been using MySQL since about 2000 I think, and have bought and read many books. I’m studying for certification. I’ve been blogging about it for a while, obviously, and reading other people’s blogs as well. But I gained an entirely new perspective on the company, the product, and the community around it.

Technorati Tags:No Tags

You might also like:

  1. Sessions I want to see at the MySQL Conference
  2. I’m going to the upcoming MySQL Camp
  3. MySQL Camp 2007
  4. How to get your session accepted to MySQL Conference 2008
  5. Summary of beCamp 2008

How to install innotop on Microsoft Windows

I recently tested innotop on Microsoft Windows. There was one slight glitch, but I changed a couple lines of code, and now it runs out of the box under ActivePerl. Version 0.1.156 contains those changes for Windows compatibility.

How to install

I installed it under ActivePerl 5.8.8 build 819:

  1. Download and install ActivePerl
  2. After installation, open the Perl Package Manager from Start/Programs/ActivePerl 5.8.8 Build 819. Select Time-HiRes, DBD-mysql, and install them.
  3. Download and extract innotop and run it.

I also watched another person run it under ActivePerl 5.8.7.

That’s it. Here is a screenshot of innotop on Windows XP:

innotop no Windows XP

Differences under Windows

If you have term-ansicolor , it is disabled because Windows terminals don’t like the formatting characters. That makes the display a bit less compact, but such is life.

If you have any experience with other ways to run innotop under Windows (cygwin?), please let me know.

My next project is to take a number of samples of SHOW INNODB STATUS from MySQL running under Windows and make sure they’re parsed correctly. I’ve read the source, but reading the source doesn’t mean I’ve done it right. (Update: yep, I did it right).

Update: version 0.1.159 has additional fixes. A new Windows-specific configuration variable, max_height, defines how high the viewable window area is. Otherwise innotop can’t figure out how much space it has to work with. I also fixed the code to work with MySQL 5.1’s new SHOW ENGINE output (there is an extra column in the output).

Update 2006-11-14: version 0.1.160 has more Windows compatibility. An astute Perl hacker showed me how to get ANSI formatting codes to work in a Windows console, which allows bold and highlighted text and a more compact display (so you can get more on the screen). To get this to work, just install Win32::Console::ANSI. Yay!

Technorati Tags:No Tags

You might also like:

  1. Version 0.1.132 of innotop released
  2. How to install innotop
  3. Innotop version 1.152 released
  4. Installing innotop on FreeBSD and Gentoo
  5. MySQL Toolkit released as one package

Installing innotop on FreeBSD and Gentoo

I recently got a message letting me know FreeBSD users will soon be able to install the innotop MySQL and InnoDB monitor through ports. Gentoo GNU/Linux users can find innotop in Portage.

FreeBSD adds innotop to ports

When this is finalized, FreeBSD users will be able to install innotop with the following commands:

cd /usr/ports/databases/innotop
make all install

This is great news. It makes innotop easier to find, install and use. It also means a lot to me that a FreeBSD maintainer thinks innotop is worth including in ports.

You can track the status of the PR at ports/104722: New port: databases/innotop.

innotop ebuilds for Gentoo

I’ve had an open request to add innotop on the Gentoo bug system for a long time, and it appears to have stagnated, but you can download ebuilds for innotop anyway. Since I’m no Gentoo developer, please do give feedback if you have any.

Since the Gentoo bug doesn’t seem to be getting any attention, I’ve also opened an innotop new-ebuild bug report with the fine folks at breakmygentoo. This might result in innotop being included in their Portage overlay.

Edit One of the Gentoo developers saw this post and added innotop to the main Portage tree! Thank you! Gentoo users can now install innotop like this:

emerge innotop

Yay!

Technorati Tags:No Tags

You might also like:

  1. Why I (still) like Gentoo
  2. How to install innotop
  3. How to prelink mozilla-firefox-bin

How to create input masks in HTML

Download HTML Input Mask

Note that this is not compatible with all browsers, has known problems and limitations, and I am not maintaining it or replying to requests for help. Thanks! (But also note that you are free to change and redistribute under the license terms, which you should read after downloading)

Have you ever wanted to apply an input mask to an HTML form field? Input masks are common in traditional GUI applications, but HTML has no such feature. This article introduces a library that adds input masks to form fields with unobtrusive JavaScript.

What’s an input mask?

View the Demo

Input masks are guides to help users enter data in the correct format. They typically do not actually validate data; they just ensure the right types of characters are entered in the right places. Typical uses are for dates, times, social security numbers, phone numbers, and credit card numbers. The user enters un-formatted input, and the mask takes care of adding dashes and other separators in appropriate places.

For example, in the United States most people use MM/DD/YY format to write dates. A well-written GUI application honors the user’s locale and creates an appropriate input mask, such as ##/##/##, for date entry. The user types the numbers, and the program inserts the slashes. If the user types something other than a number, that character is discarded, not entered into the field.

How to do this with JavaScript

There are several problems you need to solve to simulate this in a web browser. First things first: let’s state the requirements.

  1. Help the user avoid entering invalid characters.
  2. Automatically insert separators as the user types.
  3. Constrain the length of the input.

Second, let’s create a spec for the masking syntax. In Windows Forms programming, controls have a Mask property, and other GUI libraries have similar functionality. The full behavior of these masks is complex. For an example, see the MSDN documentation for masked edit controls. You can get a lot of that functionality with a simpler specification, though. The following will suffice for many uses:

  1. The mask only allows one type of character for the entire mask. For example, the mask can allow either all digits or all alphanumerics, but you can’t constrain one character to be a digit while letting other characters accept alphanumerics.
  2. The mask specifies the placeholders for input with spaces, and separators as non-spaces.

An example mask, then, has two parts: the format, which says which places can accept user input, and the type, which says what type of character can go in those places. We’ll see how to actually do this later.

The third problem is to unobtrusively attach the masking functionality to input fields, with gracefully degrading behavior if the browser doesn’t support it, and without adding a lot of markup to your forms to specify the mask format and type. This is easy, using the principles I laid out in an earlier article on using classes to specify data types. This technique is 100% appropriate because classes aren’t just hooks for CSS, they’re general-purpose processing information. This lets you easily specify a) which inputs get masks, and b) which type of mask they get.

How it works

To add masks to form fields, reference my library, then make the page’s load event fire the Xaprb.InputMask.setupElementMasks() function in my library. This will find all elements with the class input_mask, which specifies that the element should get a mask. Each element should also have a mask_??? class, where the ??? specifies which mask to attach. The library takes care of the rest.

By the way, this library depends on the Prototype library, so you will also need to reference that in your page. If you don’t, you won’t get an error, but nothing will happen.

The setup function iterates over the elements and connects a callback to the onkeypress event. The callback is created by another function. To decide which mask to apply, it does a regular expression match against the element’s className. If the element’s class is “input_mask mask_date_us“, the regular expression captures “date_us,” and looks up the date_us mask. Here’s how that is defined:

      date_us: {
         format: '  /  /    ',
         regex:  /\d/,
      }

The format property is a string with spaces where input should go, and other characters get inserted automatically. The regex property is a regular expression that matches a valid character, in this case a digit.

Here’s how the callback function works: when it fires, it checks each character in the form field’s value. If there’s a space in that place in the mask’s format string, it looks to see if the character matches the mask’s regular expression. If so, the character is valid for that place in the input; if not, the character is rejected. If there isn’t a space in that place in the format string, the character from the format string is copied into the form field (this is how separators are automatically inserted).

Demonstration

Enough talk, let’s see it in action. This demonstration of Javascript form input masks shows a few of the masks I discussed above: US date, time, and phone number.

If you like the way the form input fields look, you can thank the fine folks at Particletree. I borrowed the styling from their article on how to make forms suck less (it makes the borders of the input areas easier to see).

Limitations

Since this is really just a hack on top of existing HTML form inputs, there are some things that will never work quite as well as a natively designed widget (the same is true for my JavaScript Combo Box widget). Here are some of the limitations:

  • No unicode or international characters (this might be easy to fix).
  • No spaces as placeholders. Sometimes you might want spaces between user input, rather than non-space separators.
  • Only one type of character for the entire input; you can’t constrain the first character to be a digit, and the second a letter.
  • It doesn’t show the mask ahead of time and let the user ‘fill in’ the missing characters; instead, it reveals the mask as the user types.
  • You can’t have two adjacent separators.
  • You can’t type into the middle of the text; all input you type is appended to the end.
  • It hijacks things like Ctrl+A to select all.

Despite the length of that list, these are such minor things (except for maybe international characters) that it’s practically a complete implementation. And as far as I know, everything here could be solved easily. I just haven’t done it, because you haven’t yet told me which things are problems for you (hint, hint: leave a comment, and patches are very welcome). I deliberately kept things really simple in this first version. Future versions can get fancier, or not.

Conclusion

So that’s it! Simple, lightweight, intuitive input masks. With a proper form validation library on the back-end, you should be able to use this to help your users enter data in the format you desire. Again, let me know what you think, and by all means improve this, and send me the results!

Technorati Tags:No Tags

You might also like:

  1. JavaScript date chooser
  2. Javascript date parsing and formatting, Part 2
  3. JavaScript number-formatting library updated
  4. JavaScript date parsing and formatting, Part 1
  5. How to find and fix invalid character data in MySQL