Tag Archive for 'perl'

What it’s like to write a technical book, continued

My post on what it’s like to write a technical book was a stream-of-consciousness look at the process of writing High Performance MySQL, Second Edition. I got a lot of responses to it and learned some neat things I wouldn’t have learned if I hadn’t written the post. I also got a lot of questions, and my editor wrote a response too. I want to follow up on these things.

Was I fair, balanced and honest?

I really intended to write the post as just “here’s what it’s like, just so you’re prepared.” But at some point I got really deep into it and lost my context. That’s when I started to write about the things that didn’t go so smoothly with the publisher, and some of these things had a little extra sting in them that I would have done well to edit out.

All of us are human and the process wasn’t that bad, all things considered — the book was just a massive project that put huge demands on all of us and stressed everything from the capabilities of our chosen tools to our patience. As the editor points out in his response to my blog post, this is precisely why nobody else has ever been able to pull this off. This book stands head and shoulders above the crowd. It’s just hard to write, and very few people in the world actually have the knowledge to do it, much less the time, inclination, and ability.

Everything I said was (I believe) factual and correct, although as the editor points out there are different stories behind them. I also want to mention that I’d shared all those concerns with my editor; I avoid criticizing people behind their backs. In hindsight, throwing all of my concerns onto a blog post without warning isn’t the kind of thing I like to do either.

So I believe I was honest, but unfair to the editor. I’ve apologized to him. And by the way, yes, I would work with him again, and I fully expect that it would be easier because I have learned more about the process.

I ran this post by my editor before publishing it.

A deeper explanation of my heuristics

Several people asked me to say more about my heuristics for improving the quality of the writing. I’ve already explained many of them, but here’s more:

(were|was|is|are|has been|be)( [a-zA-Z]+)? [a-zA-Z]+ed\>
This regular expression can help find some occurrences of passive voice. It finds a word or phrase that’s some variation on the verb “to be,” usually in the past tense; followed by an optional word (probably an adverb); followed by another word that ends in “-ed,” which is potentially a past participle. The trailing \> matches the end of a word in Vim and GNU grep. This is not the only way to write in the passive voice, but it’s kind of the classic. Here are some examples: “the blog post was posted,” “the benchmark was rapidly created.”
(were|was|is|are|ha[sd] been|be)( [a-zA-Z]+)? [a-zA-Z]+e[dn]\>
An enhanced version. As I looked at the preceding point, I saw some other simple examples it doesn’t catch. For example, it doesn’t catch “had been” and it doesn’t catch verbs like “written.” Ironically, the first thing that came to mind as I thought about examples was “the book had been written.”
while|since
There’s nothing wrong with these words, except when they’re used in lieu of “because” to indicate causality. This is a problem for non-native English speakers, because these words have a temporal meaning too. For example, “Since MySQL 4.1 has no stored procedures, you have to use MySQL 5.0 if you want stored procedures.” If you aren’t a native English speaker, and even if you are, it’s easy to read that as “MySQL has had no stored procedures since version 4.1, …” and then when your eyes reach the part about MySQL 5.0, it makes no sense. My rule for this is to say “because” when I mean “because.”
using
This word often pads a sentence that could be more direct. Real examples: “Using MyISAM tables works very well” can become “MyISAM tables work very well.” And “A final possibility is simply to switch to using a table” can become “Finally, you can use a table.”
in order
The phrase “in order to” can almost always be replaced by “to.” It also tends to show a rough transition between the first and second phrases in a sentence. Perhaps these phrases should be integrated into a single phrase. “You can use this regex in order to find poorly constructed sentences” can become “this regex can find poorly constructed sentences” or “You can find poorly constructed sentences with this regex.” I prefer the latter; it is very direct, and that straightforward, simple writing style is really important in complex subject matter.
of course|without saying|obviously|clearly|needless
It goes without saying, but of course these words obviously point out when I’m writing stupid things that I clearly need to take a closer look at. Needless to say, most of the phrases in this paragraph are indeed needless to say. They are a red flag for lazy writing, such as glossing over a difficult point that should instead be explained — hard work, but necessary.
whether
I found quite a few places where the phrase “whether or not” was used. This can be shortened: “to see whether or not the disk is the problem” can become “to see whether the disk is the problem.” But better yet, the phrase often glues together poorly written phrases into an awkward sentence, just as “in order to” does. Can “whether” be replaced by “if”? Or does the sentence or paragraph just need to be reworked completely?
allow
This word can usually be replaced by “let.” “The remaining settings allow MySQL to allocate more RAM” can become “The remaining settings let MySQL allocate more RAM.” Occasionally, it is part of a larger phrase or thought that needs to be shortened and clarified. “When nobody is writing, readers obtain read locks that allow other readers to do the same” became “When nobody is writing, readers can obtain read locks, which don’t conflict with other read locks.”
ensure
I found that this word is often subtly misused. It really means “guarantee” but is often used as “double-check” or “make sure.” I don’t want to be too dogmatic about this word: its usage in modern English is complex (see the usage note on assure here; that in itself might be a reason to avoid it). But I found many places where I wanted to remove it in favor of an explicit instruction that tells the reader to take action. “Ensure” as an instruction is kind of a politically correct way to tell someone to do something, and I’m not afraid to just tell you to do it if I think you need to. I don’t want you to miss my meaning.
only
I have a habit of using this word incorrectly. “I only have ten fingers” should be “I have only ten fingers.”
as (we|you)|again,
These phrases usually show a place where the writing is confused and redundant. They show up in places like “as we already said, you should tune your server” and “again, you should tune your server.” Any instruction to the reader to break the narrative flow is a place to examine whether the concepts are in the right order. Cross-references, footnotes, and reminders are not always evil, but they’re to be regarded with suspicion.
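The patterns above can be collected into one small script. This is a minimal sketch, not the tooling actually used on the book: the %heuristics table and the flag_sentence() helper are hypothetical names, the \> end-of-word anchor is written as Perl's \b, and a leading \b is added so that "is" does not match inside a word such as "this".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of the patterns listed above. \b stands in for the
# \> end-of-word anchor, which is Vim/GNU grep syntax rather than Perl.
my %heuristics = (
    passive   => qr/\b(?:were|was|is|are|ha[sd] been|be)(?: [a-zA-Z]+)? [a-zA-Z]+e[dn]\b/,
    causality => qr/\b(?:while|since)\b/i,
    using     => qr/\busing\b/i,
    in_order  => qr/\bin order\b/i,
    filler    => qr/\b(?:of course|without saying|obviously|clearly|needless)\b/i,
    whether   => qr/\bwhether\b/i,
    allow     => qr/\ballow/i,
    ensure    => qr/\bensure/i,
    only      => qr/\bonly\b/i,
    reminder  => qr/\bas (?:we|you)\b|\bagain,/i,
);

# Return the sorted names of every heuristic that matches the sentence.
sub flag_sentence {
    my ($sentence) = @_;
    return sort grep { $sentence =~ $heuristics{$_} } keys %heuristics;
}

for my $sentence (
    'The book had been written.',
    'You can use this regex in order to find poorly constructed sentences.',
    'MyISAM tables work very well.',
) {
    my @flags = flag_sentence($sentence);
    printf "%-12s %s\n", (@flags ? join(',', @flags) : '-'), $sentence;
}
```

A match is only a prompt to reread the sentence. As the descriptions above say, most of these words are fine in the right context.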

Readability metrics

The tools I used to find sentences and phrases that score badly on some readability metric were pretty helpful to me as I tightened the writing up more and more. Nobody has reviewed the book yet, but I think when they do, they’ll be unlikely to mention “oh, and by the way the writing is wonderfully compact!” If we pulled this off right, you won’t notice that the writing is clear and compact. Writing is like a stereo system: you’re supposed to hear the music, not the speakers.

Anyway, my point is that we expanded the first edition’s actual coverage many times over, and ended up with only 658 pages of actual material. So the writing is much more compressed, and to do that you have to find and eliminate confusing writing. Confusing writing usually means that the concepts don’t flow clearly, and it takes more words to say the same thing because you’re kind of bumbling about, gesturing at your meaning from several angles instead of saying it clearly just once.

Here’s how I analyzed each chapter:

  • I used OpenOffice’s export feature to export the file to MediaWiki format. This is a plain-text markup format. I forget now why I didn’t just export to text, but there was something about MediaWiki format that made it easier to munge with Perl.
  • I ran my clean_text.pl program against the exported file to convert the format to a simpler one without special characters and markup. Some of the markup (footnotes, for example) stayed in the text and confused the metrics, but that’s life.
  • I ran my analyze_text.pl program against this to find the “worst” places.

As I wrote in my previous post, the analyzer uses a combination of readability metrics and “other stuff” to measure the badness of each sentence and paragraph. It aggregates sentences and paragraphs by the metrics. I calculated the number of words, percent of complex words, syllables per word, number of sentences, words per sentence, and a bunch of other things, as well as the standard readability metrics. Each sentence and paragraph got scored on these. Then I printed overall metrics, and sorted the sentences and paragraphs worst-first and printed out a snippet of the offending text. Here’s a sample of chapter 3’s metrics (originally numbered chapter 4) at some intermediate stage in the writing process.

This was a lot of work. If I had been writing with Vim, I could have done better. I could have used the compiler integration and set my “make” program to the analysis program. If you use Vim and you don’t know about this, it’s a pity. My next book will be written in Vim, by the way.

Actually, I probably could have done better regardless, but this was good enough. I just searched for the snippets and then examined what was going on.

There were some false positives. For example, bullet points often scored badly on the readability metrics, so a five-word bullet point would look like terrible writing just because it was short enough to have a high percentage of complex words. It’s not an exact science. Maybe next time will be better.
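The scoring described above can be sketched roughly as follows. This is not analyze_text.pl: all function names are hypothetical, the syllable counter is a crude vowel-group heuristic, a "complex" word is assumed to mean three or more syllables, and Flesch Reading Ease stands in for the unnamed "standard readability metrics." The two-word sentence in the sample text shows how a short fragment can rank as the worst writing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Crude syllable estimate (an assumption, not a dictionary lookup):
# count vowel groups after dropping a trailing "e", minimum one.
sub count_syllables {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/[^a-z]//g;
    return 0 unless length $word;
    $word =~ s/e$// if length($word) > 2;
    my $count = () = $word =~ /[aeiouy]+/g;
    return $count > 0 ? $count : 1;
}

# Per-sentence metrics. "Complex" is assumed to mean 3+ syllables.
sub sentence_metrics {
    my ($sentence) = @_;
    my @words = $sentence =~ /[A-Za-z']+/g;
    my ($syllables, $complex) = (0, 0);
    for my $word (@words) {
        my $n = count_syllables($word);
        $syllables += $n;
        $complex++ if $n >= 3;
    }
    my $n_words = @words || 1;    # avoid division by zero
    return {
        text         => $sentence,
        words        => scalar @words,
        syllables    => $syllables,
        pct_complex  => 100 * $complex / $n_words,
        syl_per_word => $syllables / $n_words,
    };
}

# Flesch Reading Ease: higher scores mean easier text.
sub flesch_reading_ease {
    my ($n_words, $n_sentences, $n_syllables) = @_;
    return 206.835
        - 1.015 * ($n_words / $n_sentences)
        - 84.6  * ($n_syllables / $n_words);
}

# Sort sentences worst-first: most complex words, then longest.
sub rank_worst_first {
    my (@sentences) = @_;
    return sort {
        $b->{pct_complex} <=> $a->{pct_complex}
            or $b->{words} <=> $a->{words}
    } map { sentence_metrics($_) } @sentences;
}

my $text = 'The cat sat. Replication topology. '
         . 'The remaining settings let MySQL allocate more memory.';
my @sentences = split /(?<=[.!?])\s+/, $text;
my @ranked    = rank_worst_first(@sentences);

my ($words, $syllables) = (0, 0);
for my $m (@ranked) {
    printf "%5.1f%% complex, %2d words: %s\n",
        $m->{pct_complex}, $m->{words}, substr($m->{text}, 0, 40);
    $words     += $m->{words};
    $syllables += $m->{syllables};
}
printf "Flesch Reading Ease: %.1f\n",
    flesch_reading_ease($words, scalar @sentences, $syllables);
```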

If you’d like to see the source code, here’s the clean_text.pl and here’s the analyze_text.pl. Enjoy!


You have the right to see code samples in an interview

Joel Spolsky writes about 12 steps to better code, and elsewhere about how candidates should write code in interviews.

The reverse conditions are true, too. If you’re a candidate, you should evaluate the employer against the 12 steps, and you should also see code samples. How else will you know what you’re getting into? You really have the right to do this, and you should exercise the right. If you don’t, you’ll get stuck in a crap job maintaining crap code. [dramatic voice] It happened to me.

In many companies, you can see code they’ve released as open-source. (The fact that they’ve done this says a lot about them.) But in others, you’re going to need to surprise someone and say “pick some code that’s not sensitive and show it to me.” Something simple, like the HTML for the search form on their website, or a utility to do some systems administration task. Any company is going to have a lot of code like this that they can show you.

The other approaches I see are to ask about it, assume, or ask the interviewer to write some code for you.

  1. Asking is a valid approach. If you see hesitation, or if someone says “well, it’s not as nice as we’d like, and we’re hoping you will offset that,” then run, don’t walk, is my advice. If you’re reading this as you consider your first job out of college or something, I strongly suggest not getting a job with a company that wants you to improve the way they do things. You should be learning from them, not vice versa.
  2. You can also assume. “Oh, they use Perl? Never mind.” That’s a stupid approach. Really. Is it acceptable to judge people’s character by the color of their skin? Then why would you judge their code by the language? In all seriousness, I have actually written very elegant, clean VBScript. And I mean, good-quality code by anyone’s standards. It’s hard in VBScript. It’s easy in Perl if you follow the Dog, which is a sign of great intelligence. Think about it this way: people who write beautiful Perl are people you should be eager to work with; they are rocket scientists. You will be the dumbest person in the room, and that should make you happy.
  3. I’ve never asked an interviewer to write code for me. Let me know how it works out for you.

Progress on Maatkit bounty, part 4

… I didn’t get two-way sync done, and I didn’t get the Nibble algorithm done. That much I expected. But I also didn’t get the current work released tonight because I’m paranoid about breaking things. I’m trying to go through all the tools and write at least a basic test for them to be sure they can do the simplest “unit of work” (such as mk-find running and printing out that it finds the mysql.columns_priv table).
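A basic "unit of work" test of this kind might look like the sketch below. It is an illustration under stated assumptions, not the project's test suite: output_of() is a hypothetical helper, and the command is a stand-in that asks perl itself to print a table name, so that the example runs without a MySQL server or the tools installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 2;

# Run a command (list form, no shell) and return everything it prints.
sub output_of {
    my (@cmd) = @_;
    open my $fh, '-|', @cmd or die "Cannot run $cmd[0]: $!\n";
    local $/;    # slurp mode
    my $output = <$fh>;
    close $fh;   # sets $? to the command's exit status
    return defined $output ? $output : '';
}

# Stand-in for a real tool invocation (assumption): a real test would
# run the tool itself against a test server.
my @cmd = ($^X, '-e', 'print "mysql.columns_priv\n"');

my $output = output_of(@cmd);
like($output, qr/^mysql\.columns_priv$/m, 'tool finds mysql.columns_priv');
is($? >> 8, 0, 'tool exits with status 0');
```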

It’s good that I’m doing this. I found that mk-heartbeat suddenly doesn’t work on my Ubuntu 7.10 laptop. It goes into infinite sleep. Can anyone repro this and/or diagnose? The same code works fine on Gentoo servers at work, and I have heard no complaints.

Update: the problem is the combination of sleep() and alarm(), which I inherited in the code from the contributors. I even had a comment in the code about it not being safe in general, but I assumed it would work OK because there was no argument to sleep() (infinite sleep). But it doesn’t; the results are undefined and system-dependent. I re-implemented this code without using alarm() and will release it soon.
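One way to run something for a fixed time without alarm() is to sleep in short slices and check a deadline. The sketch below shows that general technique only; it is not the code that went into mk-heartbeat, and run_for() and its parameters are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: call $work->() every $interval seconds until
# $run_time seconds have passed, using a deadline instead of alarm().
# Returns the number of times $work ran.
sub run_for {
    my ($run_time, $interval, $work) = @_;
    my $deadline = time() + $run_time;
    my $runs     = 0;
    while ( (my $remaining = $deadline - time()) > 0 ) {
        $work->();
        $runs++;
        $remaining = $deadline - time();
        last if $remaining <= 0;
        # Never sleep past the deadline.
        sleep( $remaining < $interval ? $remaining : $interval );
    }
    return $runs;
}

my $runs = run_for(0.3, 0.1, sub { print "tick\n" });
print "ran $runs times\n";
```

Because the loop rechecks the clock after every sleep, it does not rely on a signal arriving to interrupt a sleep() call.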

Hopefully I’ll be able to release something very soon. Release early/often is fine, but “knowingly release brokenness” isn’t in my code of conduct :)


Things I love about Perl

I don’t love everything about Perl, but I love its sense of humor, which I think probably comes from its creators’ senses of humor. From the Perl function documentation for redo:

“last”, “next”, or “redo” may appear within a “continue” block. “last” and “redo” will behave as if they had been executed within the main block. So will “next”, but since it will execute a “continue” block, it may be more entertaining.

“Entertaining,” in this context, means “if we were omniscient and looking over your shoulder while you spend a day debugging your occasional infinite loop or other bizarre behavior, we would be wildly entertained.”

At least that’s how I read it.
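For reference, here is how the three loop controls behave in the ordinary case, when they are called from the main block of a loop that has a continue block. This is a minimal sketch with arbitrary variable names; it deliberately does not try the "entertaining" case of calling next inside the continue block itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @log;
my $i      = 0;
my $redone = 0;

while ($i < 5) {
    if ($i == 1 && !$redone) {
        $redone = 1;
        push @log, "redo at $i";
        redo;    # restarts the block; the continue block is skipped
    }
    if ($i == 2) {
        push @log, "next at $i";
        next;    # jumps to the continue block, then the condition
    }
    if ($i == 4) {
        push @log, "last at $i";
        last;    # leaves the loop; the continue block is skipped
    }
    push @log, "body $i";
}
continue {
    push @log, "continue $i";
    $i++;
}

print "$_\n" for @log;
```

Only next passes through the continue block, which is why the counter still advances on the iteration that calls it.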

Sometimes the sense of humor, especially when imitated by neophytes trying to pretend to be part of The Gang Of Perl Greats, degrades into obnoxious sarcasm that obscures rather than documents. But this is fairly rare in the core documentation or other writings from the language’s authors.

If you’ve never read Programming Perl, you’re missing out on a lot of extremely subtle, very sharp and intelligent wit. I don’t have my copy of the book at hand, but one joke that comes to mind is how to write the Lord of the Rings trilogy with a regular expression substitution:

($lotr = $hobbit) =~ s/bilbo/frodo/g;

Or something like that. There are many fun examples that manage to teach the matter at hand more clearly, and keep me engaged more, than even the clearest straightforward explanation could.
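The idiom in that one-liner is worth spelling out: the parentheses make the assignment happen first, so the substitution runs on the copy and the original is untouched. A small runnable sketch, with sample text made up for the purpose:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hobbit = 'bilbo finds the ring and bilbo keeps it';

# Assign first, then substitute on the copy; $hobbit is left untouched.
(my $lotr = $hobbit) =~ s/bilbo/frodo/g;

# On Perl 5.14 or later, the /r flag returns the changed copy instead.
my $lotr_r = $hobbit =~ s/bilbo/frodo/gr;

print "$hobbit\n$lotr\n$lotr_r\n";
```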

Often imitated, but never equaled in any other book I’ve read. For example, I tried to read Extreme Programming Refactored (I really really tried, honest!) but could not make it through. I found myself getting irritated and wanting them to get to the point.

When Larry Wall et al. make a joke about Gandalf, it is the point.


Growth limits of open-source vis-a-vis MySQL Toolkit

Si Chen wrote recently about the growth limits of open-source projects. He points out that as a project becomes larger, it gets harder to maintain. I can only agree. As the MySQL Toolkit project has grown, it’s become significantly more work to maintain, document, and enhance. (This is why I’m asking you to sponsor me for a week off my regular job to work on MySQL Table Sync, by the way. Please toss some money in the hat.)

Rewriting code so it’s testable is a major focus for me now. Some of these tools have gotten complicated enough that I can’t keep track of all the code. In other words, they’re collapsing under their own weight.

Back in the project’s humble beginnings, it seemed adequate to just copy and paste a few lines here and there; after all, these are just scripts, right? Right. So I’ll just copy a few lines of code that do command-line option parsing and help screens. Hey, it turns out that several of the tools can connect to more than one server, so simple -u, -h and -p options won’t do; so I invent a DSN-like notation that lets the tools connect to an arbitrary number of servers. Copy and paste that code, too. It’s only ten lines, no big deal. Pretty soon I find out that many of the standard Perl modules aren’t available for a lot of people. And even when they’re available, people have old versions and can’t upgrade, so I can’t rely on basic things like the quote_identifier() function in DBI modules; time to write my own. Well, that’s only a single line! Surely that’s okay to copy and paste.

As Kurt Vonnegut says, “So it goes.” This is the death not only of quality, but of maintainability and extensibility. The Right Answer ™ is to write everything as modules, with proper test suites, and then make the scripts as minimalistic as possible — essentially gluing the modules together with a few lines of harder-to-test code. That’s how I’m used to working, too, but for some reason I can’t explain, it seemed okay not to work that way with this project. That has turned out to be a big mistake, which I’m slowly correcting out of necessity.
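To make the "small, testable pieces" idea concrete, here is a sketch of the two helpers mentioned above, written as plain functions that a test suite could exercise. Everything here is an assumption for illustration: the post does not show the DSN notation, so comma-separated key=value pairs are assumed, and parse_dsn() and this quote_identifier() are hypothetical stand-ins, not the toolkit's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical DSN notation, assumed for this sketch: comma-separated
# key=value pairs, e.g. "h=db1,u=appuser,p=secret".
sub parse_dsn {
    my ($dsn) = @_;
    my %parts;
    for my $pair (split /,/, $dsn) {
        my ($key, $value) = $pair =~ /^(\w+)=(.*)$/
            or die "Malformed DSN part '$pair'\n";
        $parts{$key} = $value;
    }
    return \%parts;
}

# Quote MySQL identifiers with backticks, doubling any embedded
# backtick, and join the parts with a dot (db.table).
sub quote_identifier {
    my (@names) = @_;
    my @quoted;
    for my $name (@names) {
        (my $escaped = $name) =~ s/`/``/g;
        push @quoted, "`$escaped`";
    }
    return join '.', @quoted;
}

my $dsn = parse_dsn('h=db1,u=appuser');
print "host=$dsn->{h} user=$dsn->{u}\n";
print quote_identifier('mydb', 'my`table'), "\n";
```

Functions this small are easy to cover with a handful of assertions, which is the point of pulling them out of the scripts.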

But it turns out it’s not that simple, either. I’ve gotten a lot of emails, phone calls from friends, and bug reports about how hard it is to install or update Perl, or get a CPAN module, on many systems. It turns out that a lot of companies are rightfully suspicious about CPAN (I have a tolerate-hate relationship with it myself), and won’t let my consultant friends install or upgrade any module without a lot of red tape. OK, you say, so bundle and distribute the modules the toolkit needs, and they can be installed locally with the toolkit. That sounds nice, but it’s even worse for a variety of reasons. Just to mention one: did you know that it can be a pain in the butt even to set @INC so a module sitting in the same directory with the script will be found by the script? (Please don’t tell me how easy it is, or I’ll let you respond to the next person trying to get it to work on an obscure platform with a Perl installation from the Middle Ages.) Okay, I’ll mention two reasons: some Perl modules have to be compiled and customized just for the operating system you’re installing them on, or they’ll segfault (of all things)! Don’t get me wrong, I don’t think the grass is greener on the other side; no way do I want to try writing these things in C or Java. Perl is about as portable as it gets.

The net result is that I have to do a lot of little tricks to make these things standalone programs, as much as humanly possible. I’m trying to reduce dependencies on external modules, even those that are part of core Perl. I’m re-inventing functionality because it’s not available in all versions. I’m writing modules that can be tested, but I’m not shipping them as separate modules; I’m basically using sed to copy-and-paste the module’s code into the scripts.
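The copy-the-module-into-the-script step can be sketched in Perl as well. The project does this with sed; the version below is only an illustration of the idea, and the "# INLINE Name" marker convention, the inline_modules() function, and the DemoModule name are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace each hypothetical "# INLINE Name" marker line in $script with
# the source of the named module, looked up in the %$modules hash.
sub inline_modules {
    my ($script, $modules) = @_;
    my @out;
    for my $line (split /^/, $script) {    # split into lines, keep "\n"
        if ($line =~ /^# INLINE (\w+)\s*$/) {
            my $name   = $1;
            my $source = $modules->{$name};
            die "No source for module $name\n" unless defined $source;
            $source .= "\n" unless $source =~ /\n\z/;
            push @out, "# --- begin $name ---\n", $source,
                       "# --- end $name ---\n";
        }
        else {
            push @out, $line;
        }
    }
    return join '', @out;
}

my %modules = (
    DemoModule => "package DemoModule;\nsub hello { return 'hi' }\n1;\n",
);
my $script = "#!/usr/bin/perl\nuse strict;\n# INLINE DemoModule\n"
           . "package main;\nprint DemoModule::hello(), qq{\\n};\n";

print inline_modules($script, \%modules);
```

The module stays in its own file where tests can load it, and the shipped script is generated from it rather than edited by hand.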

Why am I doing all this work?

Because it’s less work than not doing it.

But it is significantly more work than just whacking together some “scripts” and uploading them to SourceForge. That’s why there is a critical mass beyond which it gets harder to grow a project. The solution to this is to find a way to do things differently, work smarter, not harder. The challenge is to switch the fight against the demons of bad code and maintainability so it’s on my terms. In other words, don’t fight against these characteristics of growth; make them work for me. I won’t say I’ve learned that lesson completely, but I’m starting. That’s why I’m automating basically everything about this project (though for some reason I can’t get WWW::Mechanize to stay logged into SourceForge, so I’m having a hard time automating part of the release process).

I’m also considering ways to provide this toolkit without taking so much out of my own pocket. What started out as me developing tools for my employer, and them graciously agreeing to let me make them available on SourceForge, has gone far beyond my employer’s needs now. I can’t ask my employer to carry the weight, so it has fallen to me for a while now. That’s okay for some period while I work out how to do it differently, but not indefinitely. Among other things, it cuts into time I want to spend with my wife. Charging for support has definitely crossed my mind, as has some kind of community/enterprise split (such as the one Zmanda has). I don’t want to go there yet, so I’m just asking for a week of sponsored time off work, to begin with.

By the way, the process of replacing copy/pasted code isn’t without its hitches. I just found and fixed a bug in MySQL Table Checksum that I caused by moving the DSN parsing code to a module. And someone else just reported a different bug in another tool, where it turns out the copy/pasted code wasn’t quite identical and I changed the functionality by moving it to the module. Release early, release often. Rely on users to find bugs and report them. So it goes.
