Archive for December, 2006

A shell for a new Perl MySQL program

MySQL Perl script skeleton

Brian Aker recently wrote about a “skeleton project” for quickly boot-strapping a development environment for a new software project. I do something similar for Perl programs that I want to connect to MySQL.

Brian’s skeleton project is different from mine. Mine is just a skeleton for a Perl script, not an entire software project. But perhaps you’ll find it useful anyway.

The code included in the file takes care of getting all the info you need to connect to a MySQL instance. By default, it starts by opening and reading your .my.cnf file if it exists. If it can’t find that, it asks you for any missing information (username, password, etc). There’s a section for generating a command-line help output, and a section where you can add more command-line options easily. And finally, it actually opens a DBI connection to a MySQL instance, in case you can’t remember the syntax (I never can).

I haven’t fully commented what it does, but if you’re familiar with Perl it ought to be pretty self-evident. Let me know if I need to clarify anything. Enjoy!

bread machine breadman bread machine toastmaster bread machine panasonic bread machine cuisinart bread machine sunbeam bread machine oster bread machine west bend bread machine westinghouse bread machine regal ware bread machine indoor grill george foreman indoor grill hamilton beach indoor grill cuisinart indoor grill star manufacturing indoor grill anvil indoor grill presto indoor grill delonghi indoor grill cadco indoor grill eurodib indoor grill krups indoor grill sanyo indoor grill farberware indoor grill oster indoor grill breville indoor grill chefmaster indoor grill electric indoor grill sunbeam indoor grill sandwich maker george foreman sandwich maker cuisinart sandwich maker anvil sandwich maker toastmaster sandwich maker breville sandwich maker presto sandwich maker hamilton beach sandwich maker krups sandwich maker tefal sandwich maker delonghi sandwich maker durabrand sandwich maker diablo sandwich maker eletric sandwich maker oster sandwich maker waffle maker villaware waffle maker nemco waffle maker chefs choice waffle maker cuisinart waffle maker star manufacturing waffle maker gold medal waffle maker hamilton beach waffle maker toastmaster waffle maker oster waffle maker krups waffle maker belgian waffle maker kitchenaid waffle maker delonghi waffle maker george foreman waffle maker westinghouse waffle maker sunbeam waffle maker waffle iron villaware waffle iron nemco waffle iron chefs choice waffle iron cuisinart waffle iron star manufacturing waffle iron waring waffle iron gold medal waffle iron hamilton beach waffle iron toastmaster waffle iron oster waffle iron krups waffle iron belgian waffle iron black and decker waffle iron kitchenaid waffle iron delonghi waffle iron george foreman waffle iron sunbeam waffle iron pancaker maker perfect pancake maker toastmaster pancake maker george foreman pancake maker deep fryer frymaster deep fryer presto deep fryer rival deep fryer cecilware deep fryer gold medal deep fryer delonghi deep fryer wells deep fryer star manufacturing deep fryer tefal deep fryer anvil deep fryer anetsberger deep fryer dean deep fryer garland deep fryer imperial deep fryer keating deep fryer pitco deep fryer waring deep fryer electric deep fryer bravetti deep fryer oster deep fryer nesco deep fryer ge deep fryer black and decker deep fryer cuisinart deep fryer euro pro deep fryer hamilton beach deep fryer durabrand deep fryer sunbeam deep fryer
Technorati Tags:No Tags

You might also like:

  1. How to create a VB6 console program
  2. SSH public-key forwarding
  3. How to convert MySQL output to HTML tables
  4. How to track what owns a MySQL connection
  5. MySQL Toolkit released as one package

Why I use explicit date math in SQL

I sometimes see advice to do SQL date operations with the + and - operators on platforms where they are overloaded for date types. I try to avoid that, because it can give unexpected results. I prefer to explicitly use the built-in date/time functions. I’ll show you an example where the operators cause problems, but the functions do the right thing.

My example is in MySQL, but it applies to some other systems too. Suppose you have a table with something keyed on date, such as a count of alien sightings per day. Now you want to see how the count has changed over time. Today is 11th December 2006. What does this query return?

select day, num from counter
where counter = 'aliens sighted'
   and day >= current_date - 15;

It doesn’t return the last 15 days, if that’s what you expected:

+------------+-----+
| day        | num |
+------------+-----+
| 2006-12-01 |  19 | 
| 2006-12-02 |  20 | 
| 2006-12-03 |  21 | 
| 2006-12-04 |  20 | 
| 2006-12-05 |  19 | 
| 2006-12-07 |  23 | 
| 2006-12-08 |  21 | 
| 2006-12-09 |  19 | 
| 2006-12-10 |  20 | 
| 2006-12-11 |  27 | 
+------------+-----+

Why not? Well, that current_date - 15 doesn’t result in a date 15 days ago. It results in an integer that is not a valid date:

select current_date - 15;
+-------------------+
| current_date - 15 |
+-------------------+
|          20061196 | 
+-------------------+

That’s because this operation casts the date to MySQL’s internal 3-byte integer representation (20061211) and subtracts 15 from it to get 20061196. What is the result?

select date(current_date - 15);
+-------------------------+
| date(current_date - 15) |
+-------------------------+
| NULL                    | 
+-------------------------+

It’s an invalid date. It is better to use the date-manipulation functions and a) do date math, not integer math b) get a date back, not an integer. The query can be written as follows in MySQL:

select day, num from counter
where counter = 'aliens sighted'
   and day >= date_sub(current_date, interval 15 day);
+------------+-----+
| day        | num |
+------------+-----+
| 2006-11-26 |  23 | 
| 2006-11-27 |  26 | 
| 2006-11-28 |  24 | 
| 2006-11-29 |  23 | 
| 2006-11-30 |  24 | 
| 2006-12-01 |  19 | 
| 2006-12-02 |  20 | 
| 2006-12-03 |  21 | 
| 2006-12-04 |  20 | 
| 2006-12-05 |  19 | 
| 2006-12-07 |  23 | 
| 2006-12-08 |  21 | 
| 2006-12-09 |  19 | 
| 2006-12-10 |  20 | 
| 2006-12-11 |  27 | 
+------------+-----+

Much better!

I continue to find that date math and date operations are confusing, and silently do something I don’t expect. You can find two examples of this in my past articles: one about SQL Server 2000 and one about BETWEEN in MySQL.

Both problems were very hard to solve. That’s why I’m careful with date operations. I find it’s safest to leave nothing to chance.

Note: I’m taking a break from computers and blogging. This is pre-recorded. I’ll moderate your comments shortly.

Technorati Tags:No Tags

You might also like:

  1. Type conversion semantics of MySQL’s BETWEEN operator
  2. How to subtract in SQL over samples that wrap
  3. How to avoid imprecise DECIMAL math in MySQL

Advanced MySQL user variable techniques

MySQL’s user variables have interesting properties that enable the useful techniques I wrote about in recent articles. One property is that you can read from and assign to a user variable simultaneously, because an assignment can be an r-value (the result of the assignment is the final value of the variable). Another property, which sometimes causes confusing behavior, is un-intuitive evaluation time. In this post I’ll show you how to make sure your variables get updated at the time they’re used, instead of potentially reading and updating them at different stages of query execution. This technique enables a whole new range of applications for user variables. As a bonus, it also avoids extra columns of output created by variable manipulations.

I will cover several things in this article: assignments as r-values and its side effects, lazy evaluation and its side effects, and finally a technique that lets you have non-lazy evaluation and avoid some side effects.

Setup

I’ll use the same data as in recent articles:

CREATE TABLE fruits (
  type varchar(10) NOT NULL,
  variety varchar(20) NOT NULL,
  price decimal(5,2) NOT NULL default 0,
  PRIMARY KEY  (type,variety)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

insert into fruits(type, variety, price) values
('apple',  'gala',       2.79),
('apple',  'fuji',       0.24),
('apple',  'limbertwig', 2.87),
('orange', 'valencia',   3.59),
('orange', 'navel',      9.36),
('pear',   'bradford',   6.05),
('pear',   'bartlett',   2.14),
('cherry', 'bing',       2.55),
('cherry', 'chelan',     6.33);

Simultaneous assignment and reading

MySQL lets you assign and read a variable at the same time. This is familiar in many programming languages where an assignment can be an r-value. For example,

set @test1 := 0;
set @test2 := @test1 := 5;
select @test1, @test2;
+--------+--------+
| @test1 | @test2 |
+--------+--------+
| 5      | 5      | 
+--------+--------+

The second statement sets @test1 to 5, and then sets @test2 to the result of that assignment, which is the current value of @test1. My previous articles have shown you how to exploit this to number rows in a result set, among other things. For example, you can keep a running count as MySQL processes rows, updating and returning the count at the same time.

Side effects

Unfortunately, it got a bit messy sometimes. For example, the following batch, which restarts the numbering every time type changes, spews an extra dummy column into the output, because that column is where the calculations are taking place:

set @type := '', @num := 1;

select type, variety,
   @num := if(@type = type, @num + 1, 1) as row_number,
   @type := type as dummy
from fruits
order by type, variety;

+--------+------------+------------+--------+
| type   | variety    | row_number | dummy  |
+--------+------------+------------+--------+
| apple  | fuji       |          1 | apple  | 
| apple  | gala       |          2 | apple  | 
| apple  | limbertwig |          3 | apple  | 
| cherry | bing       |          1 | cherry | 
| cherry | chelan     |          2 | cherry | 
| orange | navel      |          1 | orange | 
| orange | valencia   |          2 | orange | 
| pear   | bartlett   |          1 | pear   | 
| pear   | bradford   |          2 | pear   | 
+--------+------------+------------+--------+

In previous articles I suggested wrapping that query in a subquery so you can pick which columns you want in the output. That is a bit inefficient (it creates a temporary table internally) and feels kind of kludgey.

Lazy evaluation

MySQL doesn’t evaluate expressions containing user variables until they are sent to the client, so some expressions don’t work as expected. Setting a variable in one place (such as the SELECT list) and reading it another (such as the HAVING clause) might give weird results, like as those I demonstrated in my last article where every row was numbered 1 instead of getting incremented as expected.

Here’s further clarification from the manual:

In a SELECT statement, each expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, you cannot refer to an expression that involves variables that are set in the SELECT list. For example, the following statement does not work as expected:

mysql> SELECT (@aa:=id) AS a, (@aa+3) AS b FROM tbl_name HAVING b=5;

The reference to b in the HAVING clause refers to an alias for an expression in the SELECT list that uses @aa. This does not work as expected: @aa contains the value of id from the previous selected row, not from the current row.

In other words, the “alias” in the HAVING clause is probably a pointer to a memory location, whose content is not determined for the current row until the current row is output to the client — at which point it’s too late to apply any HAVING criteria to the row.

Side effects of lazy evaluation

In my last article I showed you how to select the top N rows from each group with user variables. To make that work right, I had to group the query, use a HAVING clause, and force a certain index order for that query — because of lazy evaluation. Otherwise, I might have been able to just use the variable in a WHERE clause, right? Lazy evaluation is why this doesn’t work:

set @type := '', @num := 1;

select type, variety, price,
       @num := if(@type = type, @num + 1, 1) as row_number,
       @type := type as dummy
from fruits
where @num <= 2;

+-------+------------+-------+------------+-------+
| type  | variety    | price | row_number | dummy |
+-------+------------+-------+------------+-------+
| apple | gala       |  2.79 |          1 | apple | 
| apple | fuji       |  0.24 |          2 | apple | 
| apple | limbertwig |  2.87 |          3 | apple | 
+-------+------------+-------+------------+-------+

That last row gets output even though it seems @num should have the value 3, eliminating it from the results. However, you can infer from this behavior that @num really had the value 2 at the time the WHERE clause was evaluated, and was only incremented to 3 after the row was sent to the client.

This aspect of user variable behavior makes user variables significantly harder to understand. Sometimes the results are non-deterministic and/or hard to predict. It would be great if there were a way to update those variables in the context in which they’re declared, so they get assigned and read at the same time, instead of having to wait for rows to be sent to the client — a different step in the query execution plan.

Forcing variable evaluation with multi-staged queries

If you understand the order of the steps MySQL uses to execute a query, you can see there are opportunities to make MySQL “finish up” variable assignments before sending the query to the next step. In fact, perhaps it’s a bit misleading to say assignments in the SELECT are done when rows are sent. I think it’s more accurate to say they’re done when rows are generated for each stage in query execution.

You can see this in a subquery in the FROM clause, which is internally stored as an intermediate temporary table. Variable assignments are done before or as the rows are stored in the temporary table, so when results are read from the temporary table, there are no funny side effects.

Let me show you the previous query slightly rewritten, and you’ll see what I mean:

set @type := '', @num := 1;

select type, variety, price, row_number
from (
   select type, variety, price,
       @num := if(@type = type, @num + 1, 1) as row_number,
       @type := type as dummy
   from fruits
) as x
where row_number <= 2;

+--------+----------+-------+------------+
| type   | variety  | price | row_number |
+--------+----------+-------+------------+
| apple  | gala     |  2.79 |          1 | 
| apple  | fuji     |  0.24 |          2 | 
| orange | valencia |  3.59 |          1 | 
| orange | navel    |  9.36 |          2 | 
| pear   | bradford |  6.05 |          1 | 
| pear   | bartlett |  2.14 |          2 | 
| cherry | bing     |  2.55 |          1 | 
| cherry | chelan   |  6.33 |          2 | 
+--------+----------+-------+------------+

Just by introducing an intermediate step in the query, I forced the variables to be evaluated so the results, when they get to the outer WHERE clause, are deterministic. But as I mentioned before, this is kind of kludgey, and depending on the data, it might not be very efficient to create an intermediate temporary table for the results.

Are there better ways? You bet!

Try 1: Use functions to force immediate evaluation

Here’s an idea: what if certain functions evaluate their arguments immediately? You could exploit that to create a context that has to be evaluated first, sort of like parenthesizing an expression in an equation. You know, a = (a + b) * (b + c) means “do the additions first,” which wouldn’t be the case without the parentheses — normally multiplication comes before addition.

For this to work, you’d need a function that guarantees the expression is evaluated. For example, COALESCE() might be a good choice as long as you put the expression first in the argument list, since COALESCE() shortcuts and doesn’t evaluate any more arguments as soon as it find a non-NULL argument.

Theoretically, then you could write something like the following and get the desired results:

set @type := '', @num := 1;

select type, variety, price,
   coalesce(@num := if(@type = type, @num + 1, 1)) as row_number
...

It doesn’t work. Why not? Because the COALESCE itself isn’t evaluated until the rows are generated. So much for that idea.

What about a scalar subquery, then?

set @num := 0, @type := '';

select type, variety, price,
   (select(@num := if(@type = type, @num + 1, 1))) as row_number,
...

Sorry, no dice. This gives exactly the same results.

This idea will not work, period. Each and every expression in the SELECT list is evaluated as the rows are generated. Functions are expressions, scalar subqueries are expressions… the only things that will work are operations that result in rows being evaluated for a final value.

Try 2: Force my will on the query

One thing I do know: subqueries in the FROM clause are materialized to a temp table, so this will definitely result in rows being generated. This might do what I want, at the expense of generating temporary tables willy-nilly:

set @type := '', @num := 1;

select type, variety, price,
   (select n from (select @num := if(@type = type, @num + 1, 1) as n) as x) as row_number,
   (select t from (select @type := type as t) as x) as dummy
from fruits
where @num <= 2;

That won’t work either, as it turns out. The subqueries are correlated — they refer to columns from the outer table. That isn’t allowed because of the intermediate step, which insulates the inner queries from the outer. This is a limitation of correlated subqueries: you can’t nest a subquery in the FROM clause inside them.

This is really getting silly. It’s time to stop trying to force this to work.

Try 3: Work with me, son

What if I stop trying to get the SELECT clause to be evaluated at the same time as the WHERE clause? What if I work with the server’s order of operations, and do all the evaluating and updating in the WHERE clause instead of in two places? Maybe it looks like this:

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where
   2 >= @num := if(@type = type, @num + 1, 1)
   and @type := type;

+--------+------------+-------+------+
| type   | variety    | price | @num |
+--------+------------+-------+------+
| apple  | gala       |  2.79 | 0    | 
| apple  | fuji       |  0.24 | 0    | 
| apple  | limbertwig |  2.87 | 0    | 
| orange | valencia   |  3.59 | 0    | 
| orange | navel      |  9.36 | 0    | 
| pear   | bradford   |  6.05 | 0    | 
| pear   | bartlett   |  2.14 | 0    | 
| cherry | bing       |  2.55 | 0    | 
| cherry | chelan     |  6.33 | 0    | 
+--------+------------+-------+------+

Hmm, that was not really what I wanted. It looks like the variable is never getting updated at all! I’m not sure why not. Maybe if I ‘parenthesize’ the variable expression like I tried before? I’ll use the GREATEST() function, which I know will evaluate all its arguments instead of short-cutting like COALESCE():

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where
   2 >= @num := greatest(0, if(@type = type, @num + 1, 1))
   and @type := type;

No, that gives the same result. I feel like I’m getting close, though. What if I separate out the assignment and comparison?

set @num := 0, @type := '';

select * from fruits
where @num := if(type = @type, @num + 1, 1)
   and @type := type
   and @num <= 2;
Empty set (0.00 sec)

select @num, @type;
+------+-------+
| @num | @type |
+------+-------+
| 0    | 0     | 
+------+-------+

That didn’t work either. How did @type get assigned an integer? It should be a string. It turns out the := operator has the lowest possible operator precedence, so that WHERE clause is actually equivalent to

where @num := (
   if(type = @type, @num + 1, 1)
      and (@type := (
         type and @num <= 2)));

If I use parentheses right, maybe I can get it to do what I want:

select * from fruits
where (@num := if(type = @type, @num + 1, 1))
      and (@type := type)
      and (@num <= 2);
Empty set (0.00 sec)

select @num, @type;
+------+--------+
| @num | @type  |
+------+--------+
| 9    | cherry | 
+------+--------+

Now I’ve gotten the variables to be assigned, but the WHERE clause is still eliminating all the rows. This feels so close to being right. What’s missing?

Pay dirt: do the assignment inside the function

In fact, I was very close. All I need to do is move the entire assignment and the evaluation inside the function. It seems the variable expressions need to be sealed away from the comparison operator. In the example below, I’ve put everything inside the GREATEST() function, but the expression that updates @type has an incompatible type (string), so I convert it to a number with LENGTH() and mask its value with LEAST().

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where 2 >= greatest(
   @num := if(@type = type, @num + 1, 1),
   least(0, length(@type := type)));

The entire GREATEST() expression evaluates to the resulting value of @num, which is what I want on the right-hand side of the comparison. And guess what? This works:

+--------+----------+-------+------+
| type   | variety  | price | @num |
+--------+----------+-------+------+
| apple  | gala     |  2.79 | 1    | 
| apple  | fuji     |  0.24 | 2    | 
| orange | valencia |  3.59 | 1    | 
| orange | navel    |  9.36 | 2    | 
| pear   | bradford |  6.05 | 1    | 
| pear   | bartlett |  2.14 | 2    | 
| cherry | bing     |  2.55 | 1    | 
| cherry | chelan   |  6.33 | 2    | 
+--------+----------+-------+------+

After playing with more and more combinations, I found another way that works too:

select *, @num
from fruits
where
   (@num := if(type = @type, @num + 1, 1)) is not null
   and (@type := type) is not null
   and (@num <= 2);

I confess, I don’t fully understand this. I figured it out through trial and error. If the user manual explains it well enough for me to have gotten there by reason, I don’t know where. Can someone make it make sense please? I don’t want to have to read the source…

What’s so great about this?

Two words: one pass. One pass through the table — no quadratic-time algorithms, no grouping or sorting. This is highly efficient. I showed you another technique with UNION in my last article, which might be more efficient in some cases. But if you have lots of types of fruit, each of which has just a few varieties, you will be hard-pressed to find a more efficient algorithm to output the first two rows from each group. In fact, I doubt it can be done.

Spurious columns are gone

Putting the variable assignments inside functions not only let me put everything into the WHERE clause, it also got rid of the extra columns in the output — without kludges like subqueries. You can use this technique to clean up your output whenever you’re doing row-by-row calculations.

Notice the order of rows!

As in previous articles, rows are processed and numbered in order. I never really stated what I was trying to accomplish in the example above. The query I showed you will just output a maximum of two consecutive rows of the same type, in the order they’re read from the table (actually, I guess that’s the order they pass through the WHERE filter, which might not be the same). If I want to do something specific, such as get the two cheapest varieties from each type of fruit, I need to add an explicit ORDER BY to get the rows in order of price:

set @num := 0, @type := '';

select type, variety, price, @num
from fruits
where 2 >= greatest(
   @num := if(@type = type, @num + 1, 1),
   least(0, length(@type := type)))
order by type, price;

Exercise for the reader: run this query without an index that can be used for ordering. What’s in the @num column? Why? Add an index on (type, price) and try again. How does it change? Why? EXPLAIN the queries to find out.

Is that all?

Nope. If you can put user-variable evaluations inside a function, you can put them anywhere you can put a function. That means you could, for example, put them in the ORDER BY clause, in the JOIN clause, in the HAVING clause… anywhere. Now that you know you can do this, you can manipulate variables in lots of places you couldn’t do otherwise.

Conclusion

In this article I showed you how two properties of MySQL’s user variables (assignment is an r-value, and lazy evaluation) simultaneously cause side effects and give you great power. I showed you why you simply can’t get around the fact that the WHERE clause and the SELECT list are evaluated at different times (I proved it by figuratively banging my head against a wall). I then showed you how you can tuck variable manipulations inside functions, masking out the manipulations and just getting the result, which can be used in a WHERE clause or anywhere else. You now have the tools you need to avoid the side effects of those properties I mentioned.

Finally, I showed you one place you might want to use such a technique to get the first N rows from each group. In certain cases, I think this is the most efficient algorithm possible, requiring just one pass through the table.

I don’t know about you, but this opens up a lot of interesting possibilities. I have one particular use in mind that I’ll write about next — another way to linearize a query that’s normally extremely expensive.

What do you think? Leave a comment and let me know!

Note: I’m taking a break from computers. This is pre-recorded. I’ll moderate your comments shortly.

Technorati Tags:, ,

You might also like:

  1. How to number rows in MySQL
  2. How to simulate the SQL ROW_NUMBER function
  3. How to select the first/least/max row per group in SQL
  4. How to select the first or last row per group in SQL
  5. What is a SQL blind insert?

Upcoming innotop features

It’s been a while since I released an update to the innotop InnoDB and MySQL monitor, but I have not been idle. I’m currently working hard to add major new features and functionality. Here’s a quick list of what’s coming, much of which is done already but still slightly broken:

  1. Arbitrary user-defined expressions can be the source of a column in a tabular view. You are not limited to choosing from the columns I’ve defined; you can add your own, and base it directly on the available data or write an expression to calculate what you want.
  2. Connect to and monitor multiple servers simultaneously.
  3. More powerful filtering to show only the data you care about. Built-in filters are applied by default, but you can define your own of course.
  4. Highlighting based on user-defined criteria. Maybe you want a transaction to appear red if it has been active for more than a minute, or you want queries running on different servers to have different colors.
  5. Easily monitor replication across many machines; watch master and slave status.
  6. Readline support for easier configuration (tab-completion, etc).

Those are just the major features. Many of you have requested features by e-mail, and I’m either building those now, or they’re in the TODO list.

Many of these features require gutting and re-building significant portions of the code (For example, going from one to many servers is a complete paradigm change. Most of what I’m doing right now is cleaning up the mess that caused). I’m not sure how to do some of the planned changes, but a few weeks ago I didn’t know how I’d handle multiple servers. I’m sure it will all come to me.

If you have specific requests, please be sure to ask! I’ll get you started: would it be helpful to monitor NDB status on many machines at once, if you run a large NDB cluster? I personally don’t monitor any such clusters, so this isn’t an itch I’ve scratched, but it would be very little work for me to transform that ugly semicolon-separated stuff into a nice tabular display.

My current TODO list should be finished up in a month or so, and then I’ll release a new version. Right now I’m going to take a little break from programming and blogging, so I’ll read and reply to your comments when I return.

Technorati Tags:No Tags

You might also like:

  1. A look at innotop’s new features
  2. A new home for innotop in the new year
  3. The innotop MySQL and InnoDB monitor
  4. Version 1.6.0 of the innotop monitor for MySQL released
  5. innotop 1.4.0 released

Why I (still) like Gentoo

I wrote a post recently that focused only on things I see as shortcomings or problems with Gentoo GNU/Linux. That was the intent of the article, to explain why I switched to Ubuntu for my personal systems. On the flip side, nothing’s perfect, but nothing’s perfectly flawed, either. There are still many things I like about Gentoo.

For one thing, the developer and user community are great (I have written about this several times before). Two developers took the time to reply to my article, and the bug I filed on MySQL is fixed now. Gentoo has great people behind it, and I really appreciate it.

There are also many things you can do with a Gentoo system you can’t do with most (any?) other systems. Portage is a great tool. You can learn about how things really work with Gentoo. You can use it as a meta-distribution to build a custom binary distribution. It’s really a blank slate. That’s part of the point.

So, while I continue to enjoy the easy installation and low maintenance of Ubuntu on my desktop computers, I don’t want to leave my appreciation for Gentoo, especially the Gentoo people, unsaid. Thanks.

Technorati Tags:No Tags

You might also like:

  1. How to update a GCC profile on Gentoo
  2. How to prelink mozilla-firefox-bin
  3. Installing innotop on FreeBSD and Gentoo
  4. Ubuntu on Dell Inspiron 1501
  5. How to switch from linuxthreads to NPTL on Gentoo

Version 3.0 of mysqlreport released

Daniel Nichter has released version 3.0 of mysqlreport, one of my favorite tools for quickly comprehending the overall state of a MySQL server. The new version prints out the most important information about InnoDB.

It looks like this:

 $ perl mysqlreport --innodb-only
MySQL 5.0.26-standard-l  uptime 3 9:57:51       Fri Dec  8 17:29:07 2006

__ InnoDB Buffer Pool __________________________________________________
Usage           1.25G of   1.25G  %Used: 100.00
Read ratio      0.002
Pages
  Free              1            %Total:   0.00
  Data         78.94k                     96.37 %Drty:   0.01
  Misc           2976                      3.63
  Latched           0                      0.00
Reads           3.47G   11.7k/s
  From file     6.30M    21.4/s            0.18
  Ahead Rnd    216772     0.7/s
  Ahead Sql    181211     0.6/s
Writes        811.05M    2.7k/s
Flushes         4.16M    14.1/s
Wait Free           0       0/s

__ InnoDB Lock _________________________________________________________
Waits             680     0.0/s
Current             0
Time acquiring
  Total        492478 ms
  Average         724 ms
  Max            5182 ms

As always, very helpful… just the facts, nothing more. I have 1.25 GB of buffer pool, 100% used, very small percentage of dirty pages, etc etc. You can see it all at a glance.

It pulls the data from SHOW STATUS, which means it only works on newer versions of MySQL. Those variables are available in 5.0.3 and later, if memory serves me.

Technorati Tags:No Tags

You might also like:

  1. Progress on High Performance MySQL, Second Edition
  2. A case study in profiling queries in MySQL

How to select the first/least/max row per group in SQL

Here are some common SQL problems, all of which have related solutions: how do I find the most recent log entry for each program? How do I find the most popular item from each category? How do I find the top score for each player? In general, these types of “select the extreme from each group” queries can be solved with the same techniques. I’ll explain how to do that in this article, including the harder problem of selecting the top N entries, not just the top 1.

This topic is related to numbering rows, which I just wrote about (see my articles about MySQL-specific and generic techniques to assign a number to each row in a group). Therefore I’ll use nearly the same table and data as I used in those articles, with the addition of a price column:

+--------+------------+-------+
| type   | variety    | price |
+--------+------------+-------+
| apple  | gala       |  2.79 | 
| apple  | fuji       |  0.24 | 
| apple  | limbertwig |  2.87 | 
| orange | valencia   |  3.59 | 
| orange | navel      |  9.36 | 
| pear   | bradford   |  6.05 | 
| pear   | bartlett   |  2.14 | 
| cherry | bing       |  2.55 | 
| cherry | chelan     |  6.33 | 
+--------+------------+-------+

Selecting the one maximum row from each group

Let’s say I want to select the most recent log entry for each program, or the most recent changes in an audit table, or something of the sort. This question comes up over and over on IRC channels and mailing lists. I’ll re-phrase the question in terms of fruits. I want to select the cheapest fruit from each type. Here’s the desired result:

+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+

There are a few common solutions to this problem. All involve two steps: finding the desired value of price, and then selecting the rest of the row based on that.

One common solution is a so-called self-join. Step one is to group the fruits by type (apple, cherry etc) and choose the minimum price:

select type, min(price) as minprice
from fruits
group by type;
+--------+----------+
| type   | minprice |
+--------+----------+
| apple  |     0.24 | 
| cherry |     2.55 | 
| orange |     3.59 | 
| pear   |     2.14 | 
+--------+----------+

Step two is to select the rest of the row by joining these results back to the same table. Since the first query is grouped, it needs to be put into a subquery so it can be joined against the non-grouped table:

select f.type, f.variety, f.price
from (
   select type, min(price) as minprice
   from fruits group by type
) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| cherry | bing     |  2.55 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
+--------+----------+-------+

Another common way to do this is with a correlated subquery. This can be much less efficient, depending on how good your system’s query optimizer is. You might find it clearer, though.

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type);
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+

Both queries are logically equivalent, though they may not perform the same.

Select the top N rows from each group

This is a slightly harder problem to solve. Finding a single row from each group is easy with SQL’s aggregate functions (MIN(), MAX(), and so on). Finding the first several from each group is not possible with that method because aggregate functions only return a single value. Still, it’s possible to do.

Let’s say I want to select the two cheapest fruits from each type. Here’s a first try:

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type)
   or price = (select min(price) from fruits as f where f.type = fruits.type
      and price > (select min(price) from fruits as f2 where f2.type = fruits.type));
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | gala     |  2.79 | 
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| orange | navel    |  9.36 | 
| pear   | bradford |  6.05 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
| cherry | chelan   |  6.33 | 
+--------+----------+-------+

Yuck! That can be written as a self-join, but it’s just as bad (I leave it as an exercise for the reader). This gets worse as you go to higher numbers (top 3, top 4…). There are other ways to phrase the statement, but they all boil down to the same thing, and they’re all pretty unwieldy and inefficient.

There’s a better way: select the variety from each type where the variety is no more than the second-cheapest of that type.

select type, variety, price
from fruits
where (
   select count(*) from fruits as f
   where f.type = fruits.type and f.price < fruits.price
) <= 2;

This is elegant, and lets you vary N without rewriting your query (a very good thing!), but it’s functionally the same as the previous query. Both are essentially a quadratic algorithm relative to the number of varieties in each type. And again, some query optimizers may not do well with this and make it quadratic with respect to the number of rows in the table overall (especially if no useful index is defined), and the server might get clobbered. Are there better ways? Can it be done with one pass through the data, instead of the many passes required by a correlated subquery? You know it can, or I wouldn’t be writing this, now would I?

Use UNION

If there’s an index on (type, price), and there are many more records to eliminate than to include in each group, a more efficient single-pass method (especially for MySQL, but also for some other RDBMSs) is to break the queries out separately and put a limit on each, then UNION them all back together. Here’s the syntax you need for MySQL:

(select * from fruits where type = 'apple' order by price limit 2)
union all
(select * from fruits where type = 'orange' order by price limit 2)
union all
(select * from fruits where type = 'pear' order by price limit 2)
union all
(select * from fruits where type = 'cherry' order by price limit 2)

Peter Zaitsev has written in detail about this technique, so I won’t go into it too much more here. If it suits your purposes, it can be a very good solution.

One note: use UNION ALL, not just UNION. It prevents the server sorting the results to eliminate duplicates before returning them. In this case there will be no duplicates, so I’m telling the server to skip that (useless, expensive) step.

Do it with user variables on MySQL

The UNION trick is an especially good idea when the results are a small fraction of the rows in the table and there is an index that can be used for sorting the rows. Another linear-time technique, which might be a good option in cases where you are selecting most of the rows from the table anyway, is user variables. This is MySQL-specific. Please refer to my previous post on how to number rows in MySQL for the gory details of why this works:

set @num := 0, @type := '';

select type, variety, price
from (
   select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
  from fruits
  order by type, price
) as x where x.row_number <= 2;

This isn’t one pass through the table, by the way. The subquery is implemented as a temporary table behind the scenes, so filling it with data is one pass; then selecting every row from it and applying the WHERE clause is another. However, twice through is still O(n) with respect to the table size. That’s a lot better than correlated subqueries, which are O(n2) with respect to the group size — even moderate group sizes cause bad performance (say there are five varieties of each fruit. That’s on the order of 25 passes through the table, all told).

One-pass technique on MySQL… maybe?

If you want to leave your queries up the the query optimizer’s whims, you can try this technique, which builds no temporary tables and makes just one pass through:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits
group by type, price, variety
having row_number <= 2;

This theoretically ought to work if MySQL orders by the GROUP BY criteria, which it sometimes does for efficiency and to produce the expected results. Does it work? Here’s what it returns on MySQL 5.0.27 on Windows:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | gala     |  2.79 |          1 | apple  |
| apple  | fuji     |  0.24 |          3 | apple  |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          3 | orange |
| pear   | bradford |  6.05 |          1 | pear   |
| pear   | bartlett |  2.14 |          3 | pear   |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          3 | cherry |
+--------+----------+-------+------------+--------+

Look closely… it’s returning rows one and three from each group, and they’re not numbered in order of increasing price? Huh? But the HAVING clause says the row_number should be no greater than 2! Here’s what it returns on version 5.0.24a on Ubuntu:

+--------+------------+-------+------------+--------+
| type   | variety    | price | row_number | dummy  |
+--------+------------+-------+------------+--------+
| apple  | fuji       |  0.24 |          1 | apple  |
| apple  | gala       |  2.79 |          1 | apple  |
| apple  | limbertwig |  2.87 |          1 | apple  |
| cherry | bing       |  2.55 |          1 | cherry |
| cherry | chelan     |  6.33 |          1 | cherry |
| orange | valencia   |  3.59 |          1 | orange |
| orange | navel      |  9.36 |          1 | orange |
| pear   | bartlett   |  2.14 |          1 | pear   |
| pear   | bradford   |  6.05 |          1 | pear   |
+--------+------------+-------+------------+--------+

Look, this time everything is numbered 1 and every row is returned. Wonky. This is exactly what the MySQL manual page on user variables warns about.

This technique is pretty much non-deterministic, because it relies on things that you and I don’t get to control directly, such as which indexes MySQL decides to use for grouping. However, if you need to use it — and I know there are some folks out there who do, because I’ve consulted for them — you can still tweak it. We’re getting into the realm of really bastardizing SQL, but the results above came from a table without indexes other than the primary key on (type, variety). What happens if I add an index MySQL can use for grouping?

alter table fruits add key(type, price);

Nothing changes, and EXPLAIN shows why: the query doesn’t use the index I just added. Why? Because the grouping is on three columns, and the index is only on two. In fact, the query is using a temp table and filesort anyway, so this is still not achieving the once-through goal. I can force it to use the index:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits force index(type)
group by type, price, variety
having row_number <= 2;

Let’s see if that works:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | fuji     |  0.24 |          1 | apple  | 
| apple  | gala     |  2.79 |          2 | apple  | 
| cherry | bing     |  2.55 |          1 | cherry | 
| cherry | chelan   |  6.33 |          2 | cherry | 
| orange | valencia |  3.59 |          1 | orange | 
| orange | navel    |  9.36 |          2 | orange | 
| pear   | bartlett |  2.14 |          1 | pear   | 
| pear   | bradford |  6.05 |          2 | pear   | 
+--------+----------+-------+------------+--------+

Ah, now we’re cooking! It did what I wanted, without a filesort or temporary table. Another way to do this, by the way, is to take variety out of the GROUP BY so it uses the index on its own. Because this selects a non-grouped column from a grouped query, this only works if you are running with ONLY_FULL_GROUP_BY mode turned off, which I hope you are not doing without good reason.

Other methods

Be sure to check the comments for user-contributed methods. There are some really novel approaches. I always learn so much from your comments… thank you!

Conclusion

Well, that’s it. I’ve shown you several ways of solving the common “get the extreme row from each group” query, and then moved on to how you can get the top N rows from each group in various ways. Then I dove into MySQL-specific techniques which some (including myself, depending on my mood) would regard as mildly foolish to utterly stupid. But if you need the last bit of speed out of your server, you sometimes have to know when to break the rules. And for those who think this is just MySQL foolishness, it’s not; I’ve seen people desperately do these types of things on other platforms too, such as SQL Server. There are hacks and tweaks on every platform, and people who need to use them.

I hope you’ve enjoyed and profited from this discussion. If you did, maybe you’d like to subscribe for convenient notification of future articles. Happy coding!

Technorati Tags:No Tags

You might also like:

  1. How to number rows in MySQL
  2. How to simulate the SQL ROW_NUMBER function
  3. Advanced MySQL user variable techniques
  4. How to select the first or last row per group in SQL
  5. What is a SQL blind insert?

How to number rows in MySQL

I wrote before about a generic, cross-platform way to simulate the SQL ROW_NUMBER() function in any RDBMS. There is a much more efficient way to do this on MySQL with user variables.

Background

Please see my previous article on how to simulate the ROW_NUMBER() function for the background. I’ll use the same table structure and data in this article.

Unfortunately, that’s a quadratic algorithm, so it’s not something I’d do much (though I once did it over small sets of data in SQL Server 2000 at a jobsite).

A more efficient method

In MySQL you can simultaneously assign to and read from user variables in a SELECT statement. This allows the following method of numbering rows:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as row_number,
   @type := type as dummy
from fruits;

+--------+------------+------------+--------+
| type   | variety    | row_number | dummy  |
+--------+------------+------------+--------+
| apple  | fuji       |          1 | apple  | 
| apple  | gala       |          2 | apple  | 
| apple  | limbertwig |          3 | apple  | 
| cherry | bing       |          1 | cherry | 
| cherry | chelan     |          2 | cherry | 
| orange | navel      |          1 | orange | 
| orange | valencia   |          2 | orange | 
| pear   | bartlett   |          1 | pear   | 
| pear   | bradford   |          2 | pear   | 
+--------+------------+------------+--------+

How does that work? I’m restarting the row number each time the type column changes, by keeping track of the value it had in the last row. And I’m simultaneously incrementing and selecting the row number in each row.

The spurious dummy column has to be there, but if your version of MySQL supports it, you can use a subquery in the FROM clause to eliminate columns you don’t want in the results.

Efficiency

All I’m doing is maintaining a bit of extra memory and performing a few small comparisons and assignments for each row, so this technique is very efficient.

Playing with fire

You can refer to the generated row_number column in a HAVING or GROUP BY clause, but don’t burn your fingers. This technique is very much like playing with fire. The result of assigning to a variable and using it in the same statement (in the HAVING, for example) depends on the query plan the server chooses, the phase of the moon, and probably other things too. Before you use this technique, you should read and understand the section on user-defined variables in the MySQL Manual, and decide whether it’s safe for your query.

Now that you’ve read that section of the manual, particularly the part about the aliased expression, you should understand why the following query might be a safer paradigm when using the result in the HAVING clause, even though it produces another dummy column:

set @type = '';
set @num  = 1;

select
   type,
   variety,
   @num := if(@type = type, @num + 1, 1) as dummy_1,
   @type := type as dummy_2,
   @num as row_number
from fruits
group by type, variety
having row_number = 1;

+--------+----------+---------+---------+------------+
| type   | variety  | dummy_1 | dummy_2 | row_number |
+--------+----------+---------+---------+------------+
| apple  | fuji     |       1 | apple   | 1          | 
| cherry | bing     |       1 | cherry  | 1          | 
| orange | navel    |       1 | orange  | 1          | 
| pear   | bartlett |       1 | pear    | 1          | 
+--------+----------+---------+---------+------------+

(If I’m wrong about that, somebody please correct me).

A safer technique is to use a subquery in the FROM clause. This will cause the results to be materialized in a temporary table behind the scenes. It might be less efficient for some uses, though:

select type, variety
from (
   select
      type,
      variety,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
   from fruits
) as x
where row_number = 1;

+--------+----------+
| type   | variety  |
+--------+----------+
| apple  | fuji     | 
| cherry | bing     | 
| orange | navel    | 
| pear   | bartlett | 
+--------+----------+

Conclusion

This is an efficient, flexible way to generate and use row numbers in MySQL. I’ll leave it to you to find uses for it for right now, but I’m going to show you at least one application for this in an upcoming article.

Technorati Tags:, ,

You might also like:

  1. How to simulate the SQL ROW_NUMBER function
  2. How to select the first/least/max row per group in SQL
  3. Advanced MySQL user variable techniques
  4. How to select the first or last row per group in SQL
  5. How to simulate the GROUP_CONCAT function