Archive for February, 2007

Introducing MySQL Table Checksum

Get MySQL Table Checksum

MySQL Table Checksum is a tool to efficiently verify the contents of any MySQL table in any storage engine. You can use it to compare tables across many servers at once. The output is friendly and easy to use, both by eyeball and in UNIX command-line scripts. The provided MySQL Checksum Filter helps you winnow output so you only see tables that have problems.

MySQL Table Checksum

Following up on my earlier article about how to calculate a table checksum in MySQL, I’ve integrated that methodology, with improvements suggested by the commenters and others, into a single easy-to-use tool. It is distributed as part of the MySQL Toolkit, available from SourceForge.net.

The tool takes server-side checksums using user-variables, so it is very efficient. It can checksum tables on many servers at once, running in parallel for speed. It has options to help you guarantee your tables are in the same state on your master and slave servers, and you can even checksum only some rows. These features can help you verify replication without locking tables or taking servers offline.

Here’s some sample output, in this case generated by comparing my server against itself:

DATABASE TABLE    HOST      ENGINE      COUNT                                 CHECKSUM TIME WAIT STAT  LAG
test     chapters localhost MyISAM         21                                218345624    0    0 NULL NULL
test     chapters 127.0.0.1 MyISAM         21                                218345624    0    0 NULL NULL
test     foo      localhost InnoDB          1 f14825835a0c07091c7b6a28c8a9f7120667815d    0    0 NULL NULL
test     foo      127.0.0.1 InnoDB          1 f14825835a0c07091c7b6a28c8a9f7120667815d    0    0 NULL NULL
test     samples  127.0.0.1 MyISAM          7                               2103838486    0    0 NULL NULL
test     samples  localhost MyISAM          7                               2103838486    0    0 NULL NULL

MySQL Checksum Filter

For efficiency reasons and to be as general-purpose as possible, the checksum tool itself doesn’t process its output, and in fact doesn’t even output in sorted order. However, the output is specifically designed to be easy to parse and manipulate with standard command-line tools like awk and sort.

It’s not in my nature to make you do that work yourself, so I included a tool that will do it for you. It sorts input and makes sure the checksums and row counts for a given table match on all servers. You can either pipe the checksums directly into it, or give it a list of files to process (handy when you need to run the checksum in different places, pipe their outputs to files, and then process the files).

If you use it to filter the output I showed above, you’ll see nothing by default, because the tables have identical checksums — thus there’s nothing to see.

About MySQL Toolkit

MySQL Toolkit is a new project I started on SourceForge to contain many of the MySQL utilities I’ve written and am writing (yes, there are more goodies in progress). Eventually these and other tools will all be bundled together so you can get them in one package.

About me

I like making you happy. Make me happy in return: donate.

Technorati Tags:No Tags

You might also like:

  1. Introducing MySQL Parallel Restore
  2. Introducing MySQL Duplicate Key Checker
  3. How to know if a MySQL slave is identical to its master
  4. Introducing MySQL Deadlock Logger
  5. Introducing MySQL Table Maintainer

How to subtract in SQL over samples that wrap

This article explains how to do subtraction in SQL over samples that wrap back to zero when they exceed a boundary.

A reader wrote in with a question about how to find how much traffic has passed through a network interface. The reader has a script that samples the interface’s statistics and stores them in a database. The statistics wrap back around when they exceed the maximum size of an integer, so it’s not a strictly increasing sequence. The question, paraphrased, is “how can I find out how much traffic has gone through the interface in any given time period?”

A key assumption is that the counter never wraps back to zero more than once between samples. If it does, all hope is lost.

Setup

To simplify the math, pretend the counter wraps at 1,000 and you have the following table:

create table samples(
   num int not null auto_increment primary key,
   bytes int not null
);

insert into samples(bytes) values
   (100), (900),
   (230), (700), (982),
   (163), (600);

select * from samples;
+-----+-------+
| num | bytes |
+-----+-------+
|   1 |   100 | 
|   2 |   900 | 
|   3 |   230 | 
|   4 |   700 | 
|   5 |   982 | 
|   6 |   163 | 
|   7 |   600 | 
+-----+-------+

How much traffic?

A manual calculation is easier than it looks, and solving this by hand is the key to solving it in SQL. You don’t have to do a bunch of nasty math, like subtracting 982 from 163 (that’s already too hard for me). You just have to notice where the counter wraps. You can find these places by seeing where the number decreases from one sample to the next. In the example, the counter wraps twice: from 900 to 230, and from 982 to 163. Here’s the data, graphed with wraps “unrolled.”

Graph of sample data

There are several ways to proceed from here. One way is to calculate the traffic as 1,000 times the number of wraps. Then you just do a little math to “clean up the edges:” subtract the first number in the sequence, and then add the last number. This gives (2 * 1000) - 100 + 600, which is 2500.

Another approach is to go row by row, summing the differences from the previous row and the last row. When the counter wraps, you add 1000 before taking the difference. This math gives the same answer. This is a lot harder to do by hand.

Either technique works given an arbitrary start and end point in the sequence. Now let’s see how to do these in SQL.

Problem: find the “previous” row

While these methods seem easy to humans, they resist many relational solutions because of the notion of “previous row.” SQL is set-oriented, and doesn’t do iterative row-by-row data manipulation. If you try to do this by grouping each strictly increasing set of data together and using aggregate functions like SUM, you’ll have trouble. You need the values from the “previous set” to do that, and that doesn’t work like you might want it to.

However, it’s not that hard to get the current and last row matched up side-by-side so you can operate upon them within the context of a single row:

select cur.num, cur.bytes, prev.bytes as prev_bytes
from samples as cur
   left outer join samples as prev on cur.num = prev.num + 1;
+-----+-------+------------+
| num | bytes | prev_bytes |
+-----+-------+------------+
|   1 |   100 |       NULL | 
|   2 |   900 |        100 | 
|   3 |   230 |        900 | 
|   4 |   700 |        230 | 
|   5 |   982 |        700 | 
|   6 |   163 |        982 | 
|   7 |   600 |        163 | 
+-----+-------+------------+

Once you think of “previous” in SQL terms, it becomes easy. Armed with this tool, we are ready to take on the queries.

Technique 1: count wraps and clean up the edges

Now that we’ve figured out how to find the “previous row,” how can we express the “count wraps and clean up edges” algorithm in SQL? Brace yourself:

select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples order by num limit 1
   ) as t2
where o.num = 7
group by o.num

Anything I say about that query will probably make it harder to understand, so I’ll just count on you reading it carefully. It may help to remove some of it so you can see the intermediate results:

select sum(t1.wraps) as wraps, t2.start, o.bytes
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples order by num limit 1
   ) as t2
group by o.num;
+-------+-------+-------+
| wraps | start | bytes |
+-------+-------+-------+
|     0 |   100 |   100 | 
|     0 |   100 |   900 | 
|     1 |   100 |   230 | 
|     1 |   100 |   700 | 
|     1 |   100 |   982 | 
|     2 |   100 |   163 | 
|     2 |   100 |   600 | 
+-------+-------+-------+

You can also write it as a correlated subquery, instead of a subquery in the FROM clause:

select 1000 * (
      select count(*) from samples as cur
         inner join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      where cur.num <= samples.num
   )
   - (select bytes from samples order by num limit 1)
   + samples.bytes as total
from samples
where num = 7

Both queries need WHERE clauses in multiple places to make them behave if you want anything other than the full range of data summed up. For example, if you want to sum over rows 3 through 6, the first query becomes

select 1000 * sum(t1.wraps) - t2.start + o.bytes as total
from samples as o
   inner join (
      select cur.num, count(prev.num) as wraps
      from samples as cur
         left outer join samples as prev on cur.num = prev.num + 1
            and cur.bytes < prev.bytes
      where cur.num > 3
      group by cur.num
   ) as t1 on t1.num <= o.num
   cross join (
      select bytes as start from samples where num >= 3 order by num limit 1
   ) as t2
where o.num = 6
group by o.num

The problem with both queries is the <= predicate, which turns them into O(n2) algorithms. They’re essentially a cross-join. Plus, they’re hard to understand. It turns out that the simplest method by hand is complicated in SQL.

Method 2: Adjust when there’s a wrap

The second method I showed above is more complex for humans, but it’s actually simpler to do in SQL:

select sum(
   if (cur.bytes >= prev.bytes,
      cur.bytes - prev.bytes,
      cur.bytes - prev.bytes + 1000)) as total
from samples as cur
   inner join samples as prev on cur.num = prev.num + 1
-- optional WHERE clause for choosing start/end:
-- where cur.num > 3 and cur.num <= 6

A slightly more compact way to write this is

select sum(
   cur.bytes - prev.bytes + if(cur.bytes >= prev.bytes, 0, 1000)) as total
from samples as cur
   inner join samples as prev on cur.num = prev.num + 1
-- where cur.num > 3 and cur.num <= 6

This query is both simpler and more efficient than the first method I showed. If your platform doesn’t support IF(), use a CASE statement.

Method 3: do it with user-variables in MySQL

It’s possible to do even better than the simple join technique on MySQL. Using some MySQL-specific tricks, you can make this query a once-through, low-cost algorithm, much the way you might do it by hand or in a programming language that supports iteration. If you want to know how this works, and why the query has to be written in such a contorted way, read my article on advanced user-variable techniques in MySQL.

set @last_bytes := null;

select sum(greatest(
      if(bytes >= @last_bytes,
         bytes - @last_bytes,
         coalesce(bytes + 1000 - @last_bytes, 0)),
      least(0, @last_bytes := bytes)
   )) as bytes
from samples order by num;

This is a bit trickier to write than some of the other user-variable examples I’ve shown, because you can’t use @last_bytes is null in any IF() or CASE statement. If you do, the query optimizer will look at @last_bytes at compile time, see that the statement can be optimized out and replaced with a constant, and your query will not work as you expect it to.

Summary

In this article I’ve shown you three methods to do SQL math on a counter that wraps around to zero when it reaches a limit. I showed them to you in order of increasing efficiency, but the second method is both the simplest and the most platform-independent (and probably the most sane).

Can you think of any other ways to do this, or any other uses for these kinds of techniques? Write into the comments!

Technorati Tags:, ,

You might also like:

  1. How to find next and previous records in SQL
  2. Why I use explicit date math in SQL
  3. How to find contiguous ranges with SQL

Review of Pro Nagios 2.0 and Nagios System and Network Monitoring

Last week I read two books on Nagios. I found one easy to use and the other difficult.

The books

Cover of Nagios System and Network MonitoringNagios System and Network Monitoring (Wolfgang Barth, No Starch Press, 2006) is a delight. It explained Nagios briefly and clearly, showed me how to get it running, and continues to be a useful reference. I rate it 4 out of 5 stars.

Cover of Pro Nagios 2.0Pro Nagios 2.0 (James Turnbull, Apress, 2006) is hard for me to read, understand and use. Mr. Turnbull is clearly an expert on Nagios, but the book doesn’t communicate his knowledge effectively. I give it two stars.

Despite the differences, the books are identical in many ways. Both have chapters on the same topics. They cover exactly the same subject matter in the same level of detail, have the same target audiences, were written about the same time, are about the same size, and so on. The differences are in organization and writing style.

Mr. Barth’s book lets me do a breadth-first search of the subject matter, and Mr. Turnbull’s makes me do a depth-first search. Breadth-first is a better way to learn a topic like Nagios.

Pro Nagios 2.0 (James Turnbull)

I read James Turnbull’s Pro Nagios 2.0 first. I brought it home for the weekend, and read past page 200 — more than half the book — but I did not learn much about Nagios. Here’s why I found it hard to use:

  • It explains difficult concepts abstractly. Here are some of the first sentences from Chapter 2’s introductory paragraph:

    You do this by defining a series of objects that represent the characteristics of the environment being monitored. You begin by defining your assets to the Nagios server. Nagios calls them hosts. Then you define the attributes and functions of these assets…

    The book would have been clearer if it used the concrete Nagios metaphors (hosts, services, contacts), but it describes many things in object-oriented programming terms.

  • It’s hard to skim because it’s organized depth-first. It usually begins a topic with a few sentences, then interrupts with a tip, note or caution. Then the text resumes with specific details, and demonstrates them with code, usually a configuration sample or a shell script.

    There are few unbroken paragraphs of prose. This makes it hard to scan for headings, because they don’t stand out visually.

  • It puts unimportant material before important topics, but you don’t know that in advance. Here’s an example: if a service is okay, Nagios assumes the service’s host is too, so it’s generally unnecessary to monitor hosts. Turnbull writes about how to monitor the hosts in detail before saying it’s not important — for instance, beginning on page 39; then on page 42 he writes, “Remember: don’t schedule regular checks of your hosts!” Again in the beginning of Chapter 5, “Monitoring Hosts and Services,” the section on hosts is before the section on services.

  • Its confusing sentences and passive voice make it hard to understand. For instance,

    These directives can either be unique to the type of object being defined — for example, a host object has a directive that defines its address, or generic and applicable to a number of object definitions.

    This confuses me because I can’t immediately find the “or” that should correspond to the “either,” and when I do it seems to go with the “for example,” where it makes no sense.

  • It’s too verbose, often using a paragraph when a few words would be clearer. Here is an example from page 37:

    Table 2-2 contains four columns. The first two columns indicate the directive name and description. The third and fourth columns indicate if the directive’s value can be used as a macro and whether it is mandatory to the object definition. An object definition must have all the required mandatory directives included in the definition to be valid. An invalid object definition will result in Nagios failing to start. I’ll discuss macros in a bit and in the “Defining Commands” section later in this chapter.

    I can count the columns myself, so I don’t need the first sentence, and the columns have headings, so I don’t need the next two either. The fourth is redundant both to the column headings and to itself. The fifth is in the passive voice, and the sixth is a forward reference. The paragraph’s essence is “Nagios will not start if you omit a mandatory directive,” which states the obvious. I’d delete the whole paragraph.

  • It’s self-contradictory. For example, on page 44 the book shows how to enable “freshness” checks:

    In order to enable freshness checking, there are some other directives in the main nagios.cfg file that also need to be set. These control freshness checks for hosts. I’ve listed these directives in Example 2-8.

    Example 2-8. The nagios.cfg Freshness Items

    check_host_freshness=0
    host_freshness_check_interval=60

    This talks about enabling freshness checks and then shows the configuration setting that disables them. It makes me backtrack to be sure I read it right. This happens fairly often.

  • It interrupts with frequent references, which rarely have page or chapter numbers with them and usually introduce repeated material. Since they don’t have page numbers, you can’t find where they refer to; in one case I couldn’t find a reference even after looking for quite a while and consulting the index. Again, it’s hard to locate sections by scanning for headings, so the lack of page numbers makes the references nearly useless. Forward references are especially distracting.
  • It over-uses Note, Tip, and Caution sections. Chapter 2 has at least 47 of them in 52 pages, and there are seven in pages 220-223. They are highlighted with whitespace and two horizontal rules that span the page from edge to edge, so they’re very intrusive. Some examples: on page 59, while explaining equations that control a scheduling algorithm, a note says “ceil indicates that you need to round up the result to the nearest integer.” And on page 221, while discussing third-party tools, a note says “The NAN client and daemon are written in Perl,” which is irrelevant. Some of these would have been okay as footnotes, but many should have been omitted.

After a while I tried to skim through and read the book at a higher level, but I found it very difficult.

It’s not impossible to use, and I don’t mean to criticize too much. For example, once I was familiar with Nagios I found it useful for reference a few times. If you know what you want to find, it’s easy to navigate the book with the table of contents or the index (though the index doesn’t seem to be complete). I remembered seeing something about how to organize configuration files in various ways, and I referred back to that since Mr. Barth’s book didn’t mention it. This discussion, on pages 32-35, helped me think about how to arrange configuration files into subdirectories with the cfg_dir directive.

On the whole, I found the book a poor introduction to Nagios.

Nagios System and Network Monitoring (Wolfgang Barth)

Mr. Barth’s book succeeds through clarity, organization, detail, and ease of use. His first language is apparently German, so his English is not always elegant, but it is concise and communicates his knowledge of Nagios well.

The best thing about the book is the organization. The first part, “From Source Code to a Running Installation,” is three compact chapters on everything you need to get Nagios working. The first chapter is about compiling, installing, and basic web interface configuration. The second is an overview of general Nagios configuration — the stuff you need to actually monitor things. The third is about checking your configuration and starting Nagios.

By page 55 I understood more about Nagios than I was able to glean from Pro Nagios 2.0. When I finished the first three chapters, I had Nagios running and monitoring itself.

The second part is called “In More Detail,” and it was very helpful while I configured Nagios to monitor other servers. It was a useful reference when I looked up specific topics, and easy to skim for important instructions and pitfalls. The good introductory sentences make it easy to decide whether I need to read a section, and the uninterrupted text makes it easy to skip if I don’t. Large blocks of text make headings stand out, so there’s less need to use the table of contents. References have page numbers.

There are few footnotes, and no interruptions in the text. When there are configuration examples, the book shows the whole configuration definition for context, but makes the relevant parts bold, so they’re easy to see. There are lots of diagrams, which clearly show concepts that would require hundreds of words to explain. Captions are in the margins so they don’t interrupt the text.

I could go on, but suffice to say it’s a delight to read and use this book.

On the downside, I dislike the typeface the Barth book uses. Though I’m no expert on typefaces, I think it has a hard-on-the-eyes combination of shapes, heavy line-weights in odd places, and internal white-space. The typeface is sans-serif, so the letters are harder to recognize; for example, the “c” looks a lot like the “e.” It is difficult to speed-read. On the other hand, the text is fully justified and hyphenated (the publisher told me he believes it was typeset with LaTeX, which explains a lot — I love LaTeX). This makes it easier to read than Apress’s standard ragged-right typesetting.

The difference between the books

Let me show you the differences between the two books by comparing how they explain “flapping,” a Nagios feature to prevent a flood of notifications from services that are cycling between states.

The earliest index entry for flapping in the Turnbull book is on page 47. On pages 47 and 48, Turnbull explains the algorithm in detail:

So how does Nagios determine if a host or service is flapping? When you enable flap detection, Nagios keeps a record of the last 21 states of the host or service in an array. It then counts the number of times the states in the array have changed. This is calculated as a percentage. For example, with 21 states recorded we have a possible 20 state changes. For a normal-behaving host or service, it may be that the last recorded 21 states were OK. Hence, the percentage of states change is 0 percent. If, however, during the last 21 recorded states the host or service changed state 9 times, then the percentage of state change is 45 percent. But there is another added layer of complexity in this calculation. The states in this array are weighted — the newest state is considered 50 percent more important than the oldest state. This is because Nagios considers that the newer states are more indicative of the current behavior of the host or service than the older states.

Compare those 175 words to page 219 in the Barth book:

If a regular test shows that service or computer is changing its data continuously, this is called flapping in Nagios.

The Barth book says no more on the topic, but refers interested readers to the appendix on page 401, where it explains flapping with a nice diagram.

This illustrates the major differences between the books:

  1. The Barth book tells you where to learn more, if you want to know. The Turnbull book practically forces you to read through irrelevant details.
  2. The Barth book presents minor concepts like flapping later in the book (page 219 and 401). The Turnbull book puts the trivia in the beginning (page 47).
  3. The Barth book uses a dozen or so words, or a diagram. The Turnbull book uses hundreds of words instead.

Summary

I like Wolfgang Barth’s Nagios System and Network Monitoring better than James Turnbull’s Pro Nagios 2.0.

Subscribe to get my future articles conveniently delivered to you via email or feeds.

Technorati Tags:No Tags

You might also like:

  1. Progress on High Performance MySQL, Second Edition
  2. Progress on High Performance MySQL Backup and Recovery chapter
  3. Progress report on High Performance MySQL, Second Edition
  4. Organizing High Performance MySQL, 2nd Edition
  5. Get a free sample chapter of High Performance MySQL Second Edition

A review of Pro MySQL (Kruckenberg and Pipes, Apress 2005)

Pro MySQL, by Mike Kruckenberg and Jay Pipes, covers advanced MySQL topics, especially focusing on optimization and internals. I give it a solid 4 stars.

Overview

Pro MySQL book cover image Pro MySQL covers a lot of topics you need to know, and probably will not learn about anywhere else.

The book’s target audience is MySQL professionals. The authors assume significant depth of background in the subject area, and strive to put the finishing touches on your education — to take you from apprentice to journeyman, so to speak. I think they do a great job.

The writing is lucid and well-organized. It’s also well-edited; apparently this book was not rushed to press as so many technical books are. It’s exceptionally easy to read and understand. There are plenty of examples, but they do not interfere with the writing. Paragraphs are long enough that the pages are filled with large, solid chunks of text, without interruptions or distractions. Even footnotes are unusual. Unlike many other technical books, the page count is not inflated with endless code listings! Oh, and I don’t think I saw any typos — I can’t remember the last time I read a typo-free technical book.

Pro MySQL actually goes beyond “easy to read” and “well-organized.” In fact, it’s downright engaging. I found it so much fun to read that I put aside my wife’s Christmas present (two volumes on the history of World War II, which is super-interesting) and started carrying this book with me everywhere till I finished reading it. I read it over lunch at work, in the waiting room at the doctor’s office, and everywhere else I got a chance. It’s that interesting.

On the downside, there are a couple of places where the book restates the official manual, and a technical blip or three. However, these are few and relatively minor.

Organization

The book is organized in two parts. The first, “Design and Development,” comprises the first 13 chapters, which I would roughly group into advanced technical topics and new features in MySQL 5.0. The second part, “Administration,” covers the usual topics like installation, security and backup, but also has sections on replication and clusters.

I wanted to see how long and how many of each kind of chapter there are, so of course I threw it into a SQL table (spreadsheets are for wimps).

Note: the “section” is my own way of grouping the chapters, as I explained above. Here’s a graphical view of the book’s organization, using one of the techniques I learned from SQL Hacks:

select section,
   count(*) as num_chaps,
   avg(len) as avg_chap_len,
   repeat('#', sum(len)/15) as total_pages
from chapters
group by section
order by field(section, 'Background', 'MySQL 5', 'Administration');

You can see they devote significant space to explaining the background and theory of using MySQL (and indeed relational databases in general), but not so much to the new features in MySQL 5. This is to be expected, as MySQL 5 was brand-new at the time.

The following sections will explore these chapters in more detail.

Part 1

Part 1, “Design and Development,” has 13 chapters that pretty much fully cover the topics a well-rounded software engineer needs to master for professional MySQL usage. There are a few weak spots, but for the most part, this material is all necessary to the book — no long, boring chapters I wanted to skip to “get to the good stuff.” It’s almost all “good stuff.”

Chapter 1 might be considered skip-worthy, because it’s about team roles, why you might choose MySQL, and so forth. However, it still deserves reading because a) it’s short and b) you should be interested in what Mike and Jay have to say about it. They’re people you can learn from. They also give a very good overview of many RDBMSs, not just MySQL, and the strengths and weaknesses of each. If you’re trying to decide which products to consider for your business, you should read this section. It is very balanced, one of the best such I’ve seen.

Chapter 2 covers index concepts in significant technical detail. This is the missing link most engineers need: how indexes are built, how disks work, clustered indexes, etc. If you don’t know this material, you will make bad decisions about indexes.

Chapter 3 is about transactions, both in theory and how MySQL implements them. Topics include logging, recovery, and checkpointing. Many people consider this relevant only if you’re using the InnoDB storage engine, but in my opinion you’re pretty likely to need InnoDB for professional applications, so if you’re trying to learn what you’ll need to know, you should not only read this chapter, but come back to it after you finish the rest of the book.

Chapter 4 really digs in. It takes you on a guided tour of the MySQL system architecture, with frequent looks at tastefully abridged source code (and lots of references to where you should read more on your own). The chapter moves from an overview into resource management, storage engines and handlers, caching and memory, and so forth. It finishes up by tracing the execution path of a simple query, which is invaluable insight into how things work in the server; if you’ve ever tried to read the source code yourself, you’ll really appreciate this. MySQL is well-designed, but there’s a lot of code, and it helps a lot to have someone give such a clear overview.

Chapter 5 discusses storage engines in detail, especially MyISAM and InnoDB, which is appropriate given their popularity. For these two engines, the book explains everything down to the individual bits in the record format on disk. Other storage engines get about one page each. The chapter closes with a nine-page overview of what you need to think about when choosing a storage engine.

Chapter 6 covers benchmarking and profiling. The authors claim that these skills set professionals apart, and I agree. In fact, long before I read this book I wrote a MySQL query profiling tool, which as far as I know is still the only one of its kind (most developers seem to think it’s enough to run a query and see how long it takes).

Chapter 7, “Essential SQL,” is one of the most important chapters, but it isn’t as uniformly good as the rest of the book. It’s a bit too basic in places, and I think the authors should have assumed their readers were already trained in some of the topics they cover, such as the different types of joins. On the plus side, they open the chapter with recommendations for writing good SQL code, and I was happy to see that; this is something professionals care about, but few people advocate otherwise (the authors agree almost wholly with my own SQL coding standards, which made me smile).

This chapter also repeats some material from the official manual, and contains some technical bloopers, such as an erroneous explanation of dependent subquery optimizations on page 268, which might lead some unsuspecting programmers to write a subquery that will run longer than the half-life of hydrogen. Alas.

The more advanced parts of Chapter 7 also made me think “yes, but there’s more to it than that” a few times. On page 282, they show different ways to rewrite a “find the max row” type of query that aren’t really equivalent in all cases, but they don’t say so. Sometimes there are ways to write queries as joins when they claim it can only be done with subqueries. And on page 286, they actually give the correct answer to the situation they incorrectly analyzed on page 268. However, on the whole Chapter 7 is well worth spending a lot of time with.

Chapter 8, “SQL Scenarios,” is really good. It’s filled with the types of things Peter Zaitsev and I tend to write about — how to optimize so multiple indexes can be used, duplicate entries, and so forth. There’s a very clear explanation of the nested set model of storing hierarchical data, the best I’ve read in fact. This chapter has more code listings than others, but there’s still a high prose-to-code ratio, so your eyes won’t glaze over.

You should spend a lot of time reading chapters 7 and 8. These are probably the two chapters you’ll come back to most often, too. They’re the core of the book for most readers, in my opinion.

Chapters 9 through 13 cover features new in MySQL 5: stored procedures, functions, cursors, views and triggers. As I showed above, these chapters are quite a bit shorter on average than the others. The depth of coverage is correspondingly less. The tone tends to be a bit guarded, too; I think the authors wanted to explain the new features, but since they hadn’t been hammered on by the community very extensively yet, there just wasn’t enough practical experience to know and cover all the nuances.

Part 2

The second part of the book is about administration. While some of the material isn’t unique to the book, it is well-written and organized, making it useful anyway. Some parts cover topics I haven’t found well-presented elsewhere.

Chapter 14 is a fairly routine discussion of installation and the most basic configuration, which is at roughly the same level as Chapter 1 in the first part of the book. Chapter 15 covers user administration. The sections discuss privileges, how they work, and how to manage them from the command line and with the GUI MySQL Administrator tool (which unfortunately I have never been able to get to work for this purpose; it always hangs. Maybe I should file a bug on that). Chapter 16, “Security,” is also not terribly in-depth, but again it provides full coverage of topics a professional needs to know. Ditto Chapter 17, “Backup and Restoration.”

Chapter 18, “Replication,” gets into the more advanced topics again. There’s a lot of information about how replication really works; this can be gleaned from the manual, but it’s nicely presented here. Some of it I haven’t been able to find in the manual, for example, what each line in the master.info file means. Chapter 19, “Cluster,” is at a similar level of depth. If you want to know whether replication or clustering will solve some specific scenario, so you know whether to look into them in more detail, these chapters will probably answer that for you. They will probably not be sufficient information for you to successfully set up and manage these features, though. You will need to read the manual, and in the case of NDB Cluster, you should probably buy the book from MySQL (MySQL Clustering by Alex Davies and Harrison Fisk, MySQL Press, 2006). To be fair, at the time Pro MySQL was published, I think it was the only material on Cluster in print.

Chapter 20 is about troubleshooting, and explains where you should look and what to investigate when something goes wrong. This will help with basic trouble, but the real troubleshooting guide is the whole book — if you’ve read the rest of the book, you already know more than you’ll learn from this chapter. And finally, Chapter 21 introduces another feature new to MySQL 5.0, the INFORMATION_SCHEMA database, which is part of the SQL:2003 standard.

On the whole, Part 2 isn’t the strongest part of the book, but it’s definitely worth reading.

Summary

Pro MySQL belongs on your bookshelf if your company uses MySQL. It belongs in your hands if you are responsible for MySQL at your company.

This isn’t a typical 10-pound doorweight book full of “code listings” and screenshots. It does have a high page count, but for the most part it’s all packed full of information, most of which you either won’t find elsewhere or won’t find in any one place.

The best part of the book is the excellent writing and organization. It makes it clear and engaging. It’s an easy read, even when the material is highly technical and difficult to understand. Kudos to Mike and Jay for bucking the trend and writing a really good technical book!

Technorati Tags:No Tags

You might also like:

  1. Progress report on High Performance MySQL, Second Edition
  2. Progress on High Performance MySQL, Second Edition
  3. Organizing High Performance MySQL, 2nd Edition
  4. More progress on High Performance MySQL, Second Edition
  5. High Performance MySQL, Second Edition: Schema Optimization and Indexing

How to delete duplicate rows with SQL, Part 2

By reader request, this article explains ways to remove duplicate rows when they are completely identical, and you don’t have a primary key or other criterion to identify which rows to “save.”

This is a special case of deleting duplicates. I’ve written another article about the more general case, so I assume you have the background it gives. If not, you should probably go read my article about how to delete duplicate rows in SQL.

Introduction

In general, this is a hard problem. Suppose you have the following data, and you want to delete everything but the first row of its type (you don’t care which, because all duplicate rows are completely identical).

When you’re done, you want just two rows in the table:

Why this is hard

This is hard because there is no way to do this in standard SQL (correct me if I’m wrong). SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set. That’s why SQL doesn’t give you tools to solve this problem.

No database product is truly relational, so in real life it’s possible for duplicates to occur. When it happens, you will have to resort to platform-specific methods to solve it. There should always be a way to do it, because there is always a difference between apparently identical rows. It might be an internal row ID, for example (as in Oracle). If nothing else, the rows have different memory and disk locations in the computer.

The easy way

The easiest thing to do is add a column with a unique number. This is called something different on every platform: it’s an IDENTITY column in SQL Server, an AUTO_INCREMENT column in MySQL, a SERIAL in PostgreSQL, and so on. Look at your platform’s documentation for instructions how to do it.

Once you’ve done that, you’re on easy street. Now go read my previous article to do the actual deleting.

If that won’t do…

Build a new table with distinct values from the old table, then drop and rename:

CREATE TABLE new_fruits ...;

INSERT INTO new_fruits(fruit)
   SELECT DISTINCT fruit FROM fruits;

DROP TABLE fruits;

RENAME TABLE new_fruits fruits;

If you can’t do that…

Perhaps you simply can’t do either of the above. Maybe your table is too large, for example. In that case you’re going to have to use some sort of iterative technique to do it; loop through the rows one at a time and delete every row you see more than once. This is also going to be a platform-specific solution; you may need to use a WHILE loop or server-side cursor. Consult your platform’s documentation for more; I can’t possibly cover all the bases here.

Two examples for MySQL

Here’s a quick technique that uses advanced user-variable techniques on MySQL to delete the rows. MySQL’s server-side cursors are read-only, so some other technique has to be used. User-variables can do the trick, if you write the statement just right — it’s very touchy.

set @num := 0, @type := '';

delete from fruits
where greatest(0,
   @num := if(type = @type, @num + 1, 0),
   least(0, length(@type := type))) > 1
order by type;

If you don’t understand that, go read the article :-) This can be very efficient because it doesn’t require any GROUP BY clause. If your rows are “naturally ordered” with all the duplicates adjacent to each other, you can even omit the ORDER BY clause (if your rows aren’t “sorted naturally,” you will miss some duplicate rows).

The other obvious option is to repeatedly identify a duplicated row, find how many times it’s duplicated, and delete one less than that many rows. You will need to either do this in a stored routine, or get help from some programming language. For example, in pseudo-code:

set @num := 0;

select @type := type, @num := count(*)
   from fruits
   group by type
   having count(*) > 1
   limit 1;

while @num > 0

   delete from fruits where type = 'type'
      limit @num - 1;

   set @num := 0;
   select @type := type, @num := count(*)
      from fruits
      group by type
      having count(*) > 1
      limit 1;

end while

That is pseudo-code, by the way; if you’re doing this in a stored procedure, you’re going to have to concatenate strings together to make an executable statement and execute it. If you’re using an external programming language, you’ll need to fetch the values that are duplicated and dynamically build a statement that deletes all but one row.

Summary

In this article I explained how to solve the special-case problem of removing duplicate rows with no distinguishing columns at all. It’s a harder case of the general problem, and SQL has no built-in way to solve it, so you have to learn your platform’s tricks to solve it. I showed you how to add a unique column so you can use the “easy” techniques I explained in an earlier article. You might also be able to put the rows into another table and drop the original table. Failing that, you have to use something like cursors. As a bonus, I explained two ways to do this in MySQL, one of them sneaky and the other not.

Technorati Tags:, ,

You might also like:

  1. How to implement a queue in SQL
  2. How to write multi-table, cross-database deletes with aliases in MySQL
  3. How to avoid many-to-one problems in SQL
  4. How to simulate FULL OUTER JOIN in MySQL
  5. What is a SQL blind insert?

innotop 1.3.5 released

innotop 1.3.5 is the latest release of the increasingly popular MySQL and InnoDB monitor. I recommend everyone upgrade to this release. Aside from incomplete documentation, it’s close to a stable 1.4 release (I’m counting on you to find the bugs!). There are many significant new features since version 1.3, which make it more powerful and easier to use. Here’s what’s new:

  • Support for colorizing rows. Default color rules are included for Q, T and M modes; it’s easy to write your own. Color rules are really simple. You don’t have to hack the source to customize colors!
  • Support for multi-column sorting so you can make sense of the output from many servers at once. The default sort rules will group things together sensibly, but of course you can customize them easily.
  • A more powerful, simpler syntax for doing awk-like calculations on column values. No need to know Perl. More about this in a future blog post.
  • Improvements to constrain column widths so the display is more stable and easier to watch. Headers are compacted by default in some modes so you can fit more on the screen.
  • A smoother config file upgrade from 1.x.
  • D mode (InnoDB deadlocks) is cleaned up and can handle multiple servers at once.
  • Lots of bug fixes.
  • Much better configuration-editing tools.

Many of the improvements are in default configuration settings, and the upgrade will not upgrade your configuration settings. So if you want (for example) the new default multi-column sorts, you will have to tweak your config file, or just delete those lines from the file and let innotop fill in the new defaults for you.

Acknowledgements

Many thanks to Sebastien Estienne, Christian Hammers, and many others for suggesting improvements, finding bugs, helping with packaging, helping me learn how to use version control software better, and especially for all the compliments and positive feedback!

The future

Lots of features are planned. You can always read the ROADMAP file to find out what’s coming next. And if you want something sooner, please ask. A user told me he likes mytop’s color display for easy readability, so I added color rules sooner than I’d planned. Let me know what’s important to you.

Technorati Tags:No Tags

You might also like:

  1. innotop 1.5.0 released
  2. Version 0.1.106 of innotop MySQL/InnoDB monitor released
  3. The innotop MySQL and InnoDB monitor
  4. What to do when innotop crashes
  5. innotop 1.4.2 released