Duplicate index checker version 1.8 released

What was a 5-minute script has become a bigger project. I’ve made more improvements to the duplicate index checker. Soon it will require product activation and have security vulnerabilities every week. Seriously: this is the third iteration, and three strikes and you automate, so I automated. I have a test suite now (your contributions welcome), and I addressed two shortcomings readers pointed out in comments on the original article. You now get better foreign key checking, and FULLTEXT indexes are ignored.

Several people pointed out that FULLTEXT indexes shouldn’t be considered duplicates of regular indexes. In fact, Peter even wrote that on his original article. I ignored it because I don’t use them on anything right now, so it’s not a priority for me. But it is for you, so I stopped ignoring them. It was a 7-character change to the script! I really should have done that sooner.

Roland pointed out that the order of columns in a foreign key doesn’t matter, so a foreign key from (a,b) to (a,b) is functionally the same as one from (b,a) to (b,a). This is where my simplistic string-comparison algorithm is too low-fidelity; it won’t catch duplicate foreign keys unless the columns are in the same order. Fortunately, it’s easy to solve and still use string matching: I can just sort the columns in the foreign key definition. Assuming your column names don’t have any commas in them, the following Perl code will find the columns inside parentheses, split them into an array, sort the array, and join them back together again:

$fk =~ s#(?<=\()([^\)]+)(?=\))#join(', ', sort(split(/, /, $1)))#ge;

It’s possible because of the magic of Perl substitutions. The final ge; says to do the search globally, and then execute the results (the part between the rightmost two # characters) as Perl code. Nasty, but it does the job quickly. Now

FOREIGN KEY (`seq`, `name`) REFERENCES `p` (`seq`, `name`)

canonicalizes to

FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`name`, `seq`)

and I can continue to use string matching. You can probably tell it’s my favorite technique, and I’m stubbornly trying to use it as long as possible. To a man with a hammer, everything looks like a nail.

There’s a catch. If you have two foreign keys with “crossed columns” like this:

FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`name`, `seq`)
FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`seq`, `name`)

They aren’t duplicates, but they’ll canonicalize to the same thing and my tool will warn you. But if you’ve done that, you have probably made a very big mistake.

You can download it from the original article, as always.

Technorati Tags:No Tags

You might also like:

  1. Introducing MySQL Duplicate Key Checker
  2. Duplicate index checker improved
  3. Duplicate index checker version 1.9 released
  4. How to find duplicate and redundant indexes in MySQL
  5. Why does InnoDB create two indexes per foreign key?

6 Responses to “Duplicate index checker version 1.8 released”


  1. 1 Roland Bouman

    Wow - Baron I already explained I know nothing about perl, but I sure did not expect it could be written as compact as this. Phew!

    Lemme see if I get this:

    $fk =~ s#(?<=()([^)]+)(?=))#join(', ', sort(split(/, /, $1)))#ge;

    So,

    match a literal “(”:

    (
    ?<=
    (
    )

    followed by one or more characters not like “)”:

    (
    [
    ^)
    ]+
    )

    followed by a literal “)”:

    (?=))

    So far I get it. Now it works “inward-out” starting a the split

    split(/, /, $1)

    which probably means: split the matched group $1 at the comma’s , but I expected $1 to correspond with with the match made by (?< =\(), but clearly this is wrong as it should split whatever was matched by ([^\)]+). Or start the indices at 0?

    I think I know what sort and join then do, but I don’t understand the semantics of the # and the leading ~ s (ignore whitespace? guessing..)

    Anyway, thanks for this addition. I’m still not a perl user, but this kind of thing is seriously tempting me…:)

  2. 2 Xaprb

    Very close… I shouldn’t have just posted it without explaining it. I’ll try to explain it:

    $fk =~ s#(?<=()([^)]+)(?=))#join(’, ‘, sort(split(/, /, $1)))#ge;

    $fk =~ s means “match and replace on the $fk variable.” This alters the variable in place.

    The # is an alternate delimiter for the s operation, which is usually of the form s/pattern/replacement/modifiers. But you can choose your own delimiters if you’re going to have to backslash a lot of / characters within the pattern or something. I chose #.

    The whole line is then $fk =~ s#pattern#replacement#ge. The modifiers are g for ‘global’ and e for ‘execute’. The ‘execute’ option means to search for the pattern, interpolate $1 (the first capturing group) into the substitution pattern, then instead of just doing string replacement, execute the substitution pattern as code, and use the result as the replacement value.

    As for the pattern itself, there are three groups: (?<=\(), ([^\)]+), and (?=\)). The first is a non-capturing assertion that says “the pattern must be preceded by a left paren.” This is called a lookbehind assertion. This does not get included in the match, just asserts something must be true at the lefthand side of the match. The second is the actual capture, which finds every character that isn’t a right paren, and then there’s a lookahead assertion, which says “the pattern must be followed by a right paren.” So this pattern is grabbing whatever is inside the parentheses. That’s the column names.

    Finally, when the substitution code executes, it splits the column names on a comma followed by a space, which returns a list of the column names themselves; this gets sorted alphabetically, and then joined together with a comma and space again.

    Whew!

    Perl is a neat language. It can be the devil to maintain, but when it’s written with care, it’s as maintainable as any other language. That said, I should follow the suggestions in Perl Best Practices (the “Dog Book”) and make this pattern more readable with inline comments. There’s another modifier, x, which allows patterns to span multiple lines and have comments. That helps a lot.

  3. 3 Roland Bouman

    Thanks Baron, that is quite enlightning!

    I just assumed that the parenthesis would define a capturing group, but they don’t. If I understand it well, parenthesis are necessary, but not sufficient to form a capturing group.

    I have some experience with regular expressions but this perl dialect really is new to me (shame on Roland).

    If I understand well, that the non-capturing patterns are quite like the “anchors” $ and ^. So, they match a particular *position* in the text, *not* a piece of text as such, right?

    kind regards,

    Roland Bouman

  4. 4 Xaprb

    That’s right. There are several of these (?something) constructs. Whenever there’s a ? right inside the (, it’s magical :-) You can do “man perlre” to find out the full syntax.

  1. 1 Five great Perl programming techniques to make your life fun again at Xaprb
  2. 2 Five great Perl programming techniques to make your life fun again « Jfree’s dreaming

Leave a Reply

Please do not use this blog to get help with problems or bugs in Maatkit or innotop: use the appropriate forums, mailing list, or bug trackers. If you're asking for help with MySQL, please use the MySQL mailing list instead.