Archive for December, 2005

Excel vs. OpenOffice.org Calc in number formatting

I was playing with custom format strings in Excel recently and noticed something odd. The number of digits to the left of the decimal place seemed to vary bizarrely when using custom format strings for exponential notation, in ways that contradict the documentation. OpenOffice.org doesn’t exhibit the same wackiness. Which spreadsheet formats numbers more sanely?

Excel vs. Calc

If you’re not familiar with custom format strings, it’s just a way of making your own format. You can right-click on a cell, choose “Format Cells…” and click Custom at the bottom of the list, then enter a formatting string. The formatting syntax is used throughout Windows, and you rub elbows with it any number of ways when you program the Windows API, but the documentation seems to be only about 90% consistent between, say, Excel and the .NET Numeric Format Strings documentation.

Back to the data I was seeing with Excel. I wanted to have two digits to the left of the decimal point and two to the right, for a total of four significant figures; the exponent should be zero-padded so it’s two digits as well. Here is how you can do that with Excel, according to the documentation: 00.00E+00. Sounds good, but it doesn’t work in practice. Here’s some data, the actual results, and what I expected:

NumberActual ResultExpected Result
1209384 120.94E+04 12.09E+05
192939393 01.93E+08 19.29E+07
1293 1293.00E+00 12.93E+02
3910102935348 03.91E+12 39.10E+11

Wacky! Why on earth would Excel choose to format these numbers like it does? I played around with it for a while; it seemed to be unpredictable. It would put too many digits to the left of the decimal point, then when the number got one digit larger, suddenly switch to too few with a much larger exponent — what the heck? Then my coworker noticed the exponent Excel chose was always a multiple of four. I played around with more formats and figured out why: the exponent is a multiple of the number of digit placeholders to the left of the decimal point. Here is a table that makes this clear:

Number ##00.00E+00 #00.00E+00 00.00E+00
1 01.00E+00 01.00E+00 01.00E+00
12 12.00E+00 12.00E+00 12.00E+00
123 123.00E+00 123.00E+00 01.23E+02
1234 1234.00E+00 01.23E+03 12.34E+02
12345 01.23E+04 12.35E+03 01.23E+04
123456 12.35E+04 123.46E+03 12.35E+04
1234567 123.46E+04 01.23E+06 01.23E+06
12345678 1234.57E+04 12.35E+06 12.35E+06
1234567890 12.35E+08 01.23E+09 12.35E+08
12345678901 123.46E+08 12.35E+09 01.23E+10
123456789012 1234.57E+08 123.46E+09 12.35E+10
1234567890123 01.23E+12 01.23E+12 01.23E+12

Notice how the exponent is always a multiple of four in the first column, three in the second and two in the third.

I can’t find where this is documented, and it definitely contradicts the existing documentation which says those digits are used to control how the number, not the exponent, is formatted. I searched around the web and found other people agreed with me. OpenOffice.org’s Calc, on the other hand, clearly specifies that you control the number of significant digits, and the exponent is dependent on the value and the number format — so you really do have control over how the number itself is formatted. Here is the same data in Calc:

Number ##00.00E+00 #00.00E+00 00.00E+00
1 1000.00E-03 100.00E-02 10.00E-01
12 1200.00E-02 120.00E-01 12.00E+00
123 1230.00E-01 123.00E+00 12.30E+01
1234 1234.00E+00 123.00E+01 12.34E+02
12345 1230.00E+01 123.50E+02 12.30E+03
123456 1235.00E+02 123.46E+03 12.35E+04
1234567 1234.60E+03 123.00E+04 12.30E+05
12345678 1234.57E+04 123.50E+05 12.35E+06
1234567890 1235.00E+06 123.00E+07 12.35E+08
12345678901 1234.60E+07 123.50E+08 12.30E+09
123456789012 1234.57E+08 123.46E+09 12.35E+10
1234567890123 1230.00E+09 123.00E+10 12.30E+11

Which is better? Well, it depends. I’m categorically in favor of OpenOffice.org because it’s Free Software, of course. I also think it’s a superior product in many ways: openness, standards compliance, price, security, interoperability, and so forth. But in this particular aspect, I can’t really say which is “better.” The two products have different ways of doing it, that’s all. If I want to control how my numbers are formatted, I go with OpenOffice.org. On the other hand, Excel’s “feature” seems to lend itself well to engineering notation — a way to write numbers with exponents that are multiples of three, which correspond to Metric unit prefixes such as kilo, nano and so forth. Engineering notation requires control over significant figures though, which Excel throws out the window! (You can choose the “scientific” number format and select a number of decimal places, but that’s not exactly the same thing).

Even though OpenOffice.org is better, I’m going to avoid the “which is better” question and ask the questions I’ve not been able to answer: can I use “engineering notation” in Excel and get it to honor my instructions about significant figures? Can I get OpenOffice.org to do engineering notation? And finally, if anyone knows of some authoritative specification of how Microsoft products do number formatting, I’d be grateful for a link, because I can’t find it — the documentation I see is really poor (I have no such problem with Calc, and if I did, I could look at the source code). If I find any answers of my own to these questions, I’ll update this post.

And in case you’re wondering whether this post is related to my work on date formatting in JavaScript, yes it is. I’ll be finishing up some work on number formatting soon. I just need to write more unit tests.

Technorati Tags:No Tags

You might also like:

  1. How to format numbers in JavaScript flexibly and efficiently
  2. How to label Excel and OpenOffice.org XY scatter plots
  3. How to convert text to columns in OpenOffice.org Calc
  4. Javascript date parsing and formatting, Part 2
  5. Tables and data part 1: CSS and data types

How triggers affect ROWCOUNT and IDENTITY in SQL Server 2000

It’s safe to use @@ROWCOUNT in SQL Server 2000 even when there is a trigger on the base table. The trigger will not skew your results; you’ll get what you expect. @@ROWCOUNT works correctly even when NOCOUNT is set.

Silly rowcount/identity graphic

To quote the documentation:

@@ROWCOUNT

Returns the number of rows affected by the last statement.

Couldn’t be much simpler. Getting the last IDENTITY value is not as straightforward though. There are several options: @@IDENTITY, SCOPE_IDENTITY(), and IDENT_CURRENT(). You can read up on the differences, but the one you probably want to use is SCOPE_IDENTITY(). It gives you the last value in the current scope, which means it won’t be affected by triggers or other connections. The other two methods could give bizarre results depending on what else is going on in the database. They have their uses, but only in specialized cases.

Technorati Tags:No Tags

You might also like:

  1. 3 ways to maintain rollup tables in SQL

Don’t use future reserved words as identifiers in JavaScript

Just a quick note: I found recently that some browsers don’t complain when I use future reserved words as identifiers in JavaScript. Specifically, I used char as a variable name, and no Windows browser had a problem with it — but when I tested on Mac, both Safari and Firefox complained. Apparently there are differences in JavaScript implementation between the Windows and Mac versions of Firefox 1.5!

Nitty-gritty details: the ECMA-262 spec defines reserved words and future reserved words in section 7.5.1. Many OO-ish keywords like implements are reserved for the future, as well as specific data types like int and … you guessed it, char.

Technorati Tags:No Tags

You might also like:

  1. Browser variations in RegExp.exec()

Javascript date parsing and formatting, Part 2

Download date-functions v1.1

In this post I’ll explain how I built on my runtime date-formatting functionality into the date-parsing realm. The result is a date-parsing library that literally creates itself at runtime.

The demo

I have a demo of the date-parsing library online for your enjoyment.

How it works

The technique is similar to my date-formatting library:

  • accept some input such as 2005-10-11
  • accept a format specifier such as Y-m-d
  • use the format specifier to create a function capable of interpreting date strings in the given format

This allows parsing dates very efficiently and flexibly. In fact, the function that gets built will parse dates with as much detail as possible, down to the second, defaulting to a less precise date when there’s less information.

The date-parsing code is a bit more complex than the formatting code. The parsing code has to build a regular expression which will successfully match a well-formed input, as specified by the format string. It inserts groups into the regular expression wherever it sees some data it can use to deduce the value of the date, and keeps track of the groups so it can use the captured values as parameters to the Date constructor. For example, if it sees the character Y in the format string, it knows that value can be captured in the regular expression and used as the year parameter to the Date constructor. It matches, but doesn’t use, other data to ensure it is validly formatted. For example, the day of the week isn’t helpful when parsing a date. The demo will make this clear.

Round-trip processing

In many cases, depending on the format string, it should be possible to use the date-parsing code together with the date-formatting code for round-trip processing. Take a date, format it with some format string, then read it back in with the same format string, and you should get the same date. Of course, you need to preserve whatever level of detail you want to get back — you won’t get everything back if you throw it away during the formatting step. You’ll see that in the demo too.

The files

I’m wrapping the date-formatting and date-parsing code up into a single file. I’ve also upgraded the Javascript date-chooser to use both the date-parsing and date-formatting functionality. Please use and enjoy, and if you find any bugs, let me know. Likewise, if you make any improvements, that’s great too — please pass them on to me so I can pass them on to others.

What it’s not

This code is not a JavaScript implementation of strtotime. That’s too difficult and not that useful in my opinion. (For those not familiar with it, strtotime can understand input like “two weeks ago next Sunday”). It’s also not internationalized. It only works for my little slice of the universe: the English language — though international date-formatting standards (ISO 8601, highly recommended) make that a moot point anyway.

Technorati Tags:No Tags

You might also like:

  1. JavaScript formatting library update
  2. How to format numbers in JavaScript flexibly and efficiently
  3. JavaScript date chooser
  4. JavaScript date formatting benchmarks
  5. JavaScript number-formatting library updated

How to use CSS to go beyond separation of content and presentation

One of CSS’s loftiest goals was to help separate content and presentation on the Web1. It has succeeded to some degree, bringing great benefits in terms of maintainability, usability, and accessibility.

Think at a higher level

There is much noise about content and presentation, but it’s not the only beneficial separation. In fact, pick any dimensions you wish, and I think you can benefit by distinguishing among them at least to some degree: content, presentation, structure, behavior, and so forth.

One reason separation is good that it actually lets me factor out repetition and shared aspects, and deal with them just once. In this sense, I’m not just separating things in one place, I’m bringing them together in another. Aspect-Oriented Programming is a great example from the programming world. However, it is technique, rather than technology, that is important. CSS doesn’t guarantee the separation — it’s how I use it that matters.

With this in mind, I am currently ruminating the following: I think it’s a Good Thing to separate layout and presentation within CSS. Again, these are just dimensions within one technology; I think making the proper distinction between the two is just a good practice.

Definitions

First let me define the terms as I use them in this article:

Layout Styles
dictate how the elements of the page are arranged: where is the navigation, is the content a fixed-width column that’s centered on the page, and so forth. Layout markup is usually about setting some properties of elements that hold the “real content” of the page.
Presentational Styles
specify how the “real content” appears: font size, colors, borders, link hover effects and so forth.

What this means on a real website

In terms of a typical website that uses CSS for both purposes (layout and presentation), many elements on the page will have both types of styles. Global navigation, for example, will almost certainly have some positioning styles attached to a container, and often will have hover effects on links too. What if I want to change how the page is laid out, but not be distracted by all the font sizes on headers, margins on the lists, and so forth? I’ve done this myself and found it difficult to sort through an entire stylesheet, ignoring the presentation haystack to find the layout needle. Since then I’ve tried to keep the two separate, either by just using comments to mark separate sections within the one file, or by using two files. @include is great for this — the presentation stylesheet can include the layout stylesheet at the beginning. It helps a lot.

I don’t consider myself an expert on this. Not much attention has been paid to good practices when writing stylesheets. Everyone’s writing about standards, browsers, and so forth. For some reason, good writing practices don’t seem to be sexy (yet), and maybe that’s why everyone is eager to write about other things. I think this may have a lot to do with who’s writing CSS. In many cases, it’s not programmers, so as a programmer, I often find people’s stylesheets to be big freakin’ messes. In traditional programming, much effort has gone into studying what makes code maintainable, easily verifiable, and so forth. Similar work doesn’t seem to exist for CSS (though I personally feel the same principles apply). There’s not much, if any, guidance about style when writing CSS. How to indent, naming guidelines, and order of properties come to my mind when I think of this. Again, I don’t think I have anything more to offer than other people, but I’m thinking about it. Maybe at some point I’ll put some time into it and produce my own code standards for CSS. I’m sure others will start doing so as well, especially as CSS becomes more widely used and some people start getting stuck with bad code. When people get fed up with it, they’ll do something about it :-)

Perfection is an illusion

As a side note, I would like to say complete separation of content and presentation (or anything else) is not possible in many cases, despite some people’s insistence to the contrary. It’s a myth. A lot of this is because any given feature often belongs in several domains, so imposing a strict hierarchy rarely works (the realization that hierarchical taxonomies are flawed is exactly where all the “tagging” in Web 2.0 is coming from). I wrote my thesis on visual and informational representations of music and I’ve found it’s rarely clear-cut, be it music or text. My experience with music made me take a fresh look at HTML, and I was amazed at how my perspective changed. There’s much to be gained from an 80% solution, though, so fire away!


[1] People say this many different ways, such as “structure and presentation” and so forth. My memory says “content and presentation” was the first such phrase to be bandied about, and the terminology has changed over the last 6-8 years.

Technorati Tags:No Tags

You might also like:

  1. Why not to use CSS for columnar layouts
  2. Simple and complex types in XML Schema

More alternatives to openxml

Following up on my earlier post, here are some alternate usage scenarios for the UDF I wrote to avoid openxml in SQL Server 2000.

The most obvious variation is to create different versions to return various data types, such as INTEGER or DATETIME or whatnot. I started to do this, but then realized it’s silly. If I duplicate the code for every datatype, what have I accomplished besides duplicating code? The input is VARCHAR, so it’s getting implicitly converted at some point no matter what. It’s better for me to let the end user do the conversion, and not duplicate any code.

My next thought was a function to split apart name-value pairs, such as width=100 height=200 color=blue. Such a function should return three columns: name, value and ident. Again, once I wrote the code, I felt silly. There are at least two fine ways to do it without a new function. One is to pass the names in one input and the values in another. The other is to use two delimiters and just use SUBSTRING to split them apart. In either case, it’s pretty simple; the UDF is doing the looping, and the rest can be done with standard SQL. Here are two ways to do this:

declare @Names varchar(8000),
    @Values varchar(8000),
    @NameValues varchar(8000),
    @Delim1 char(1),
    @Delim2 char(1)
select @Names = 'width height color',
    @Values = '100 200 blue',
    @NameValues = 'width=100 height=200 color=blue',
    @Delim1 = ' ',
    @Delim2 = '='

select l.word as name, r.word as value, l.ident
from dbo.fn_SplitWords(@Names, @Delim1) as l
    inner join dbo.fn_SplitWords(@Values, @Delim1) as r
        on l.ident = r.ident

select 
    substring(word, 1, charindex(@Delim2, word) - 1) as name,
    substring(word,
        charindex(@Delim2, word) + 1,
        len(word) - charindex(@Delim2, word)) as value,
    ident
from dbo.fn_SplitWords(@NameValues, @Delim1)
where charindex(@Delim2, word) > 0

Both queries yield the same results:

These methods both have a shortcoming: it’s not possible to pass missing or zero-length values for a given name. Here is a query that does:

declare @NameValues varchar(8000),
    @Delim1 char(1),
    @Delim2 char(1)
select @NameValues = 'width=100 height=200 color=blue weight= length',
    @Delim1 = ' ',
    @Delim2 = '='

select 
    case when charindex(@Delim2, word) > 0
        then substring(word, 1, charindex(@Delim2, word) - 1)
        else word end
    as name,
    case when charindex(@Delim2, word) > 0
        then substring(word,
        charindex(@Delim2, word) + 1,
        len(word) - charindex(@Delim2, word)) end
    as value,
    ident
from dbo.fn_SplitWords(@NameValues, @Delim1)

From first to last, each of these queries is more flexible and complex than the preceding one. Therefore I prefer them in that order.

Technorati Tags:No Tags

You might also like:

  1. Efficient alternatives to Microsoft SQL Server’s openxml
  2. A bug in Microsoft SQL Server’s replace() function
  3. How to use meta-data to sort itself

JavaScript date parsing and formatting, Part 1

Download date-functions v1.1

Have you ever written JavaScript code whose behavior is undefined until runtime, is inefficient, and could be written efficiently if you knew the behavior ahead of time? Here’s a way to overcome this: defer it all till runtime, then dynamically build a more efficient function once you know what it has to do. In this article I’ll use date formatting as an example of how to take a performance hit once on the first invocation, and get extremely fast performance thereafter.

Motivation

Picture the following scenario:

  • a user supplies a formatting string for a date
  • the script parses the formatting string
  • the script loops through the formatting string and builds the formatted date

Lots of people (1, 2, 3) have written JavaScript date-formatting routines with this type of algorithm, which is usually fine unless you want efficiency (for example, processing a large table of values). Then it is a terrible approach.

An example of the inefficient way to do it

I used to use the hard-coded, slow method on my JavaScript date chooser. Here’s the old, inefficient formatting code:

Date.prototype.dateFormat = function(format) {
    var result = "";
    for (var i = 0; i < format.length; ++i) {
        result += this.dateToString(format.charAt(i));
    }
    return result;
}

Date.prototype.dateToString = function(character) {
    switch (character) {
    case "Y":
        return this.getFullYear();
    // snip a bunch of lines
    default:
        return character;
    }
}

Here’s how I used the code:

var d = new Date();
alert(d.dateFormat("Y-m-d"));

What’s wrong with that? Well, looping is slow, and I shouldn’t have to re-evaluate the format specifier every time I format a date with the same format specifier. That part of the code should be evaluated just once. For example, the formatting string Y-m-d should just be “compiled” into something like this:

return this.getFullYear() + '-' + this.getMonthName()...

That would be great! But, since the format specifier is passed in at runtime, how is it possible to hard-code that ahead of time?

Programs that write programs

It’s not possible to hard-code the more efficient method, because I don’t know what it’s going to do at runtime, but I can code it at runtime, since JavaScript has interesting features common to many languages that allow code to be treated as data and vice versa: JavaScript can write functions dynamically.

Programs that write programs are the happiest programs in the world.

Lots of programming languages can do this. Perl and LISP in particular are great for it, and it’s “what you do” in Artificial Intelligence work. In JavaScript, it’s not too hard either: all I need to do is parse the format string and build a definition of the function that will implement it. Assuming I want to name the function format0, I would get something like this:

var funcString = "Date.prototype.format0 = function() {return this.getFullYear() + '-' + (zeroPad(this.getMonth() + 1)) + '-' + zeroPad(this.getDate());}"

This isn’t a function, it’s a string. But by eval()ing it, I can make it a function, and then when I want the date formatted as Y-m-d, I just do this:

var d = new Date();
alert(d.format0());

How can I call a function without a name?

There’s still an obvious problem — I can’t call this code at runtime because I don’t know the name format0 of my formatting function. It might be the 99th different formatting string that’s been turned into a function, not the first. I need to save some way to look it up. How about the following?

  • call a function with a format specifier
  • do a hash lookup with the format specifier, looking for the function that implements the format
  • if the function doesn’t exist, make one and store its name in the hashtable, keyed on the format string
  • call the function and return the result

Sounds good to me! Since JavaScript has associative arrays (objects are associative arrays), I’ll use those as the hashtable. The resulting function looks for an entry in the hashtable, builds the function if it’s not there, and then uses it. There’s an initial cost the first time I format a string with a given format, but after that, it’s cheap, so I can go to town. Essentially I’m unrolling and pre-compiling a loop. To prove it, I can document.write(Date.prototype.format0); and see what the code ended up being:

function () { return this.getFullYear() + "-" + String.leftPad(this.getMonth() + 1, 2, "0") + "-" + String.leftPad(this.getDate(), 2, "0"); }

The functions are named in the order in which they’re defined: format0, format1 and so forth. Why do that? Why not just use the formatting string’s own value as the name of the function? Hashtable keys can be any string value, but functions have to have legal, callable names (at least in some browsers). That’s why I’m making a legal name for the function and using the hashtable to look it up.

Compatibility with PHP’s date function

In order to implement nearly full compatibility with PHP’s date function, I had to write a good bit of code, adding my own functionality to the Date object (both to the object itself, and to the prototype). I created several named formatting patterns, similar to the Microsoft .NET formatting patterns, which can be used in lieu of defining your own (see the demo for more). I also added a few functions to some other built-in objects. Please let me know if you find anything wrong with this code. I only wrote it for English; it’s not implemented for other cultures.

Here is the JavaScript, and here is a demo of the whole shebang.

Benchmarks

Is it really faster to do it the way I’ve described? Yes, it is. It is way faster, even when you compare it to code that has far fewer features.

Technorati Tags:No Tags

You might also like:

  1. Javascript date parsing and formatting, Part 2
  2. JavaScript Number Formatting Library v1.3 released
  3. How to format numbers in JavaScript flexibly and efficiently
  4. JavaScript date formatting benchmarks
  5. JavaScript date chooser

Everyone is wrong sometimes

Just a quick note on the newness of Web 2.0 and how little weight you should assign to the many opinions floating around, including mine. The web has exploded in the last few years. A lot of people are making a lot of great things. There are also many self-proclaimed pundits and people proclaiming others to be pundits. Whether labeled as pundits by themselves or others, these people aren’t infallible; they don’t know things for certain. They are writing their best guesses and educated opinions, and they can be and frequently are wrong.

One of the most highly-regarded voices in Web technology today is the online magazine A List Apart. I appreciate their great content as much as anyone, and I’m not here to cut them down. But the year is 2005, and I would like to draw your attention to something they wrote about Web standards in February 2001, in an article entitled To Hell With Bad Browsers:

In six months, a year, or two years at most, all sites will be designed with these standards.

They were wrong. Even today we are pitifully far from “all” sites; probably fewer than 1% of sites are reasonably standards-compliant. I want to use that to illustrate my point: Anyone can be wrong, so don’t put anyone on a pedestal. Anyone! Not even someone who writes for A List Apart. It’s a most uncomfortable place to be, because falling off is inevitable. And it’s not fair to yourself, either — by looking at others as beyond failure, you’re selling yourself short, not valuing highly enough your own abilities.

Above all, please don’t put me on a pedestal! Much appreciated.

Technorati Tags:No Tags

No related posts.

How to simulate optional parameters in SQL

This article explains how to write WHERE clauses that accept optional parameters, reducing the need for separate queries. I have seen a lot of SQL code where the SELECT is duplicated many times, but the WHERE clause is different in each one, and only one will execute. It’s easy to factor out the differences in the WHERE clause. For example, consider the following code, assuming two optional parameters to the query:

if @param1 is not null
    select * from table1 where col1 = @param1
else
    select * from table1 where col2 = @param2
This can be rewritten:
select * from table1
where (@param1 is null or col1 = @param1)
    and (@param2 is null or col2 = @param2)

Why it’s better this way

The duplication above may not seem like a big deal, but when the SELECT statement is 50 lines of code and there are 10 optional parameters, it’s a huge problem to have all that code duplicated:

  • it’s hard to test
  • the code is likely to get buggy when someone tries to change it
  • it’s hard to be sure the repeated code is exactly the same in all the places it’s duplicated

To the last point: even worse, it’s hard to be sure the code isn’t the same either. In other words, if I suspect there are differences in the SELECT, all the duplication makes it hard to find the differences, so the code becomes much harder to read, and I’m much less confident I haven’t missed something. Note: if the code isn’t really repeated, insert a comment so someone like me doesn’t come along and try to “fix” it!

How it works

Any decent query optimizer should be able to recognize when a clause is always true or always false, and either delete the clause or terminate the query with no results. For example, pretend I have the following query:

select * from t1
where (? is null or c1 = ?)
    and (? is null or c2 = ?)

Now suppose my parameters are 3 and NULL. What happens to the query?

select * from t1
where (3 is null or c1 = 3)
    and (NULL is null or c2 = NULL)

Any decent optimizer, including MySQL’s optimizer, will rewrite that query:

select * from t1
where (c1 = 3)

You can use EXPLAIN EXTENDED, followed by SHOW WARNINGS, to see this at work in MySQL.

Possible reasons not to do this

There is a potential downside (besides “fixers” like myself getting into the code). The query optimizer might not be able to optimize the combined statement as well as the separate ones, so if this is mission-critical or gets called all the time, measure its performance. Be especially watchful as regards indexes. If there is an index on one of the columns in the WHERE clause, the combined statement might prevent the query optimizer from using the index, especially as the query becomes more complex. The bottom line, as always, is to consider whether performance matters, and if it does, measure performance, analyze the query plan, and proceed accordingly.

Technorati Tags:No Tags

You might also like:

  1. How to write a SQL exclusion join
  2. SQL Server 2000 date and time puzzler
  3. How to write SQL JOIN clauses more compactly
  4. Maatkit version 1674 released
  5. How to find duplicate rows with SQL

How to write good comments: say why, not what

I have written before about badly commented code, but I’ve recently encountered two examples of code that warns someone of something bad, then never tells them what it is. This is a great reason to write comments that say why, not what — my first rule of thumb when commenting code.

Example: SQL stored procedure

The first example of a bad comment is in a SQL stored procedure, where a table is updated from a view:

-- *** THIS MUST BE SEPARATE ***
update table1 set col1 = someVal
    from table1 join view1 on [join critieria]

I did a quick search and found 18 places this code had been copied and pasted. I asked about it, and after some time found someone who knew what the code did. Here’s the answer I got about it:

A bug in the query optimizer prevents left joining onto view1 in a large join statement. col1 was not getting set from isnull(someVal, 0).

That’s a serious bug! Other programmers definitely deserve to know about it in detail. There are lots of problems here, though. First of all, that’s not a left join. Second, there’s no isnull() at all. What does that comment have to do with this query? Answer: nothing. Someone copied and pasted the meaningless, paralyzing comment along with code.

I can’t think of a more effective way to make code impossible to maintain. It’s brilliantly simple: just hint at a vague, terrifying possibility of something bad happening if you change the code. Voila, nobody will ever touch it again.

Example: ASP article editor

The second instance of this “worst practice” is an internal ASP article editor. If a user viewed the editor with Internet Explorer on a Macintosh, the following code would print out an error:

sBrowserType = request.ServerVariables("HTTP_USER_AGENT")
if instr(sBrowserType, "MSIE") > 0 and instr(sBrowserType, "Mac") > 0 then
    response.Write "You can not edit the body text of this article with Mac IE - please use a different browser"
    ' some code to prevent further use of the page
end if

There was no comment explaining why, and the intern who wrote the code several summers ago was long gone. After I asked everyone — programmers, writers, team leaders — one person dredged up a vague recollection that this particular browser wouldn’t save all the article’s text, causing your article to truncate after a certain point.

The first rule of Xaprb

I think it’s self-evident why these are examples of spectacularly unmaintainable programming, but I’ll just sum it up for easy digestion:

  • Comments should say why, not what.

I’m probably not the first to say this, but I don’t know of anyone else who’s said it, so I claim naming rights: henceforth, I refer to this as the first rule of Xaprb :-)

What the rule means

The rule follows from the principle of Don’t Repeat Yourself (DRY), which states “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” The code already says what and is the authoritative representation. In the first example, the code and comment repeat each other and omit a piece of system knowledge (the missing knowledge is why), so the comment is about as useful as saying

-- *** THIS CODE IS SEPARATE ***

In the second example, there’s no comment, but there’s a behavior and a message. The message repeats the behavior. Both say “you can’t use this page,” which is repeated knowledge. Neither says why. Again, a piece of system knowledge (”why”) is not represented at all. DRY, together with Once and Only Once, requires not only at most one representation of a piece of knowledge, but also at least one. Both examples in this article fail these criteria.

From a usability standpoint, any interface that warns but doesn’t inform is a faux pas, too. “You can’t do that” is irritating. “Your browser will not save your work correctly” is helpful.

Technorati Tags:No Tags

No related posts.