Xaprb

Stay curious!

Duplicate index checker version 1.8 released

with 6 comments

What was a 5-minute script has become a bigger project. I’ve made more improvements to the duplicate index checker. Soon it will require product activation and have security vulnerabilities every week. Seriously: this is the third iteration, and three strikes and you automate, so I automated. I have a test suite now (your contributions welcome), and I addressed two shortcomings readers pointed out in comments on the original article. You now get better foreign key checking, and FULLTEXT indexes are ignored.

Several people pointed out that FULLTEXT indexes shouldn’t be considered duplicates of regular indexes. In fact, Peter even wrote that on his original article. I ignored it because I don’t use them on anything right now, so it’s not a priority for me. But it is for you, so I stopped ignoring them. It was a 7-character change to the script! I really should have done that sooner.

Roland pointed out that the order of columns in a foreign key doesn’t matter, so a foreign key from (a,b) to (a,b) is functionally the same as one from (b,a) to (b,a). This is where my simplistic string-comparison algorithm is too low-fidelity; it won’t catch duplicate foreign keys unless the columns are in the same order. Fortunately, it’s easy to solve and still use string matching: I can just sort the columns in the foreign key definition. Assuming your column names don’t have any commas in them, the following Perl code will find the columns inside parentheses, split them into an array, sort the array, and join them back together again:

$fk =~ s#(?<=\()([^\)]+)(?=\))#join(', ', sort(split(/, /, $1)))#ge;

It’s possible because of the magic of Perl substitutions. The final ge; says to do the search globally, and then execute the results (the part between the rightmost two # characters) as Perl code. Nasty, but it does the job quickly. Now

FOREIGN KEY (`seq`, `name`) REFERENCES `p` (`seq`, `name`)

canonicalizes to

FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`name`, `seq`)

and I can continue to use string matching. You can probably tell it’s my favorite technique, and I’m stubbornly trying to use it as long as possible. To a man with a hammer, everything looks like a nail.

There’s a catch. If you have two foreign keys with “crossed columns” like this:

FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`name`, `seq`)
FOREIGN KEY (`name`, `seq`) REFERENCES `p` (`seq`, `name`)

They aren’t duplicates, but they’ll canonicalize to the same thing and my tool will warn you. But if you’ve done that, you have probably made a very big mistake.

You can download it from the original article, as always.

Written by Xaprb

September 17th, 2006 at 6:43 pm

Posted in SQL, Tools

6 Responses to 'Duplicate index checker version 1.8 released'

Subscribe to comments with RSS or TrackBack to 'Duplicate index checker version 1.8 released'.

  1. Wow – Baron I already explained I know nothing about perl, but I sure did not expect it could be written as compact as this. Phew!

    Lemme see if I get this:

    $fk =~ s#(?<=\()([^\)]+)(?=\))#join(', ', sort(split(/, /, $1)))#ge;

    So,

    match a literal “(”:

    (
    ?<=
    \(
    )

    followed by one or more characters not like “)”:

    (
    [
    ^\)
    ]+
    )

    followed by a literal “)”:

    (?=\))

    So far I get it. Now it works “inward-out” starting a the split

    split(/, /, $1)

    which probably means: split the matched group $1 at the comma’s , but I expected $1 to correspond with with the match made by (?< =\(), but clearly this is wrong as it should split whatever was matched by ([^\)]+). Or start the indices at 0?

    I think I know what sort and join then do, but I don't understand the semantics of the # and the leading ~ s (ignore whitespace? guessing..)

    Anyway, thanks for this addition. I'm still not a perl user, but this kind of thing is seriously tempting me...:)

    Roland Bouman

    19 Sep 06 at 9:37 am

  2. Very close… I shouldn’t have just posted it without explaining it. I’ll try to explain it:

    $fk =~ s#(?<=\()([^\)]+)(?=\))#join(', ', sort(split(/, /, $1)))#ge;

    $fk =~ s means “match and replace on the $fk variable.” This alters the variable in place.

    The # is an alternate delimiter for the s operation, which is usually of the form s/pattern/replacement/modifiers. But you can choose your own delimiters if you’re going to have to backslash a lot of / characters within the pattern or something. I chose #.

    The whole line is then $fk =~ s#pattern#replacement#ge. The modifiers are g for ‘global’ and e for ‘execute’. The ‘execute’ option means to search for the pattern, interpolate $1 (the first capturing group) into the substitution pattern, then instead of just doing string replacement, execute the substitution pattern as code, and use the result as the replacement value.

    As for the pattern itself, there are three groups: (?<=\(), ([^\)]+), and (?=\)). The first is a non-capturing assertion that says “the pattern must be preceded by a left paren.” This is called a lookbehind assertion. This does not get included in the match, just asserts something must be true at the lefthand side of the match. The second is the actual capture, which finds every character that isn’t a right paren, and then there’s a lookahead assertion, which says “the pattern must be followed by a right paren.” So this pattern is grabbing whatever is inside the parentheses. That’s the column names.

    Finally, when the substitution code executes, it splits the column names on a comma followed by a space, which returns a list of the column names themselves; this gets sorted alphabetically, and then joined together with a comma and space again.

    Whew!

    Perl is a neat language. It can be the devil to maintain, but when it’s written with care, it’s as maintainable as any other language. That said, I should follow the suggestions in Perl Best Practices (the “Dog Book”) and make this pattern more readable with inline comments. There’s another modifier, x, which allows patterns to span multiple lines and have comments. That helps a lot.

    Xaprb

    19 Sep 06 at 10:09 am

  3. Thanks Baron, that is quite enlightning!

    I just assumed that the parenthesis would define a capturing group, but they don’t. If I understand it well, parenthesis are necessary, but not sufficient to form a capturing group.

    I have some experience with regular expressions but this perl dialect really is new to me (shame on Roland).

    If I understand well, that the non-capturing patterns are quite like the “anchors” $ and ^. So, they match a particular *position* in the text, *not* a piece of text as such, right?

    kind regards,

    Roland Bouman

    Roland Bouman

    19 Sep 06 at 11:19 am

  4. That’s right. There are several of these (?something) constructs. Whenever there’s a ? right inside the (, it’s magical :-) You can do “man perlre” to find out the full syntax.

    Xaprb

    19 Sep 06 at 12:53 pm

  5. [...] If you want to know how that works, read the comments on my earlier post about a duplicate index and foreign key checker for MySQL. [...]

  6. [...] If you want to know how that works, read the comments on my earlier post about a duplicate index and foreign key checker for MySQL. [...]

Leave a Reply