Archive for February, 2006

Priorities are a project risk

If you have to prioritize projects (and you do), you are sacrificing one thing for another. What happens when that starts to work against itself? What happens when you can’t just put the highest priority project on top? Can you imagine your priorities looking like this?

Playing cards demonstrating circular priorities

I see this happen all the time. That’s not to say priorities are evil in and of themselves. I’m just pointing out that at some point priorities will conflict. This affects many things, but most importantly it affects everyone’s morale, including the customer’s. This is a project risk like any other.

My current employment is at a modified XP shop, and I’m really psyched about it. I’m a big fan of XP, Agile and Scrum methodologies — and a big fan of Scott Ambler, too. I have always been risk-focused when managing software projects. Right now I’m not managing any, but I appreciate the XP approach because of its focus on mitigating risk by dealing with it (possible) instead of trying to eliminate it (impossible).

This is the first job I’ve had in a long time where I’ve been this excited about getting right into it and building something great in a small team of smart people. I feel like I’ve got a Maserati and an empty road. At my previous job, there was an entry on the timesheet for time spent filling out the timesheet. Here’s a quote from my present boss:

If tracking time becomes something we have to become aware of, then we have a problem.

I think it’s going to be a very good journey.

Technorati Tags:No Tags

No related posts.

How to write flexible INSERT and UPDATE statements in MySQL

MySQL provides several variations on INSERT and UPDATE to allow inserting and updating exactly the desired data. These features provide a lot of power and flexibility, making MySQL significantly more capable than it otherwise might be. In this article I’ll give an overview of each feature, help you understand how to choose among them, and point out some things to watch out for.

Setup

I am using MySQL 4.1.15 to create my examples. I assume MyISAM tables without support for transactions, with the following sample data:

create table t1 (
    a int not null primary key,
    b int not null,
    c int not null
) type=MyISAM;

create table t2 (
    d int not null primary key,
    e int not null,
    f int not null
) type=MyISAM;

insert into t1 (a, b, c) values
    (1, 2, 3),
    (2, 4, 6),
    (3, 6, 9);

insert into t2 (d, e, f) values
    (1, 1, 1),
    (4, 4, 4),
    (5, 5, 5);

Overview

Suppose I wish to insert the data from t2 into t1. This data would violate the primary key (a row exists where column a is 1) so the insert will fail: ERROR 1062 (23000): Duplicate entry '1' for key 1. Recall that in MySQL, a primary key is simply a unique index named PRIMARY. Any data that violates any unique index will cause the same problem.

This situation occurs frequently. For example, I might export some data to a spreadsheet, send it to a client, and the client might update or add some data and return the spreadsheet to me. That’s a terrible way to update data, but for various reasons, I’m sure many readers have found themselves in a similar situation. It happens a lot when I’m working with a client who has multiple versions of data in different spreadsheets, and I’m tasked with tidying it all up, standardizing formatting and importing it into a relational database. I have to start with one spreadsheet, then insert and/or update the differences from the others.

What I want to do is either insert only the new rows, or insert the new rows and update the changed rows (depending on the scenario). There are several ways to accomplish both tasks.

Inserting only new rows

If I want to insert only the rows that will not violate the unique index, I can:

  • Delete duplicate rows from t2 and insert everything that remains:

    delete t2 from t2 inner join t1 on a = d;
    insert into t1 select * from t2;

    The first statement deletes the first row from t2; the second inserts the remaining two. The disadvantage of this approach is that it’s not transactional, since the tables are MyISAM and there are two statements. This may not be an issue if nothing else is altering either table at the same time. Another disadvantage is that I just deleted some data I might want in subsequent queries.

  • Use an exclusion join. This query should work on any RDBMS that supports LEFT OUTER JOIN:

    insert into t1 (a, b, c)
        select l.d, l.e, l.f
        from t2 as l
            left outer join t1 as r on l.d = r.a
        where r.a is null;

    The downside to this method may be lower efficiency, depending on the data and how complex the join needs to be. The join is done up front, which can be a lot of work on large datasets, especially when only a few duplicate rows might exist.

  • Use INSERT IGNORE to ignore the duplicate rows:

    insert ignore into t1 select * from t2;

    The duplicate rows are ignored. Note that MySQL does not do a generic “duplicate-check” to see if the rows contain duplicate values when determining if a row is a duplicate and should be ignored. It only ignores rows that violate a unique index. If I have no unique index on a column, IGNORE has no effect.

    When using IGNORE, MySQL prints information about duplicates:

    Query OK, 2 rows affected (0.02 sec)
    Records: 3  Duplicates: 1  Warnings: 0

    This method is probably the fastest of all, especially if very few duplicate keys exist in t2. MySQL simply tries to insert every row and keeps going when one fails. The disadvantage is that IGNORE is a proprietary, non-standard extension.

Inserting new rows and updating existing rows

Now suppose I want to insert new rows and update existing rows. Again, there are several ways to do it, in one or two steps:

  • Use standard SQL in a two-step process to insert new rows and update existing rows. The following is one way to do it, but it’s not the best way:

    insert into t1 (a, b, c)
        select l.d, l.e, l.f
        from t2 as l
            left outer join t1 as r on l.d = r.a
        where r.a is null;
    
    update t1 as l
        inner join t2 as r on l.a = r.d
        set l.b = r.e, l.c = r.f;

    The benefit to this approach is standards compliance. This should work on a wide variety of database platforms.

    The downside is poor efficiency. Imagine the datasets are huge and there are only a few duplicate rows. The first statement inserts the (huge number of) new rows by joining the two huge datasets together. The next statement joins them together again, except this time the join is even bigger because of all the new rows in t1! And worse yet, it updates the rows that just got inserted, which is certainly not needed. It is far better to do the update first, which should only affect a few rows, then insert the new rows:

    update t1 as l
        inner join t2 as r on l.a = r.d
        set l.b = r.e, l.c = r.f;
    
    insert into t1 (a, b, c)
        select l.d, l.e, l.f
        from t2 as l
            left outer join t1 as r on l.d = r.a
        where r.a is null;

    This is far more efficient, but it still might be very bad. It could lock the tables for a long time with large datasets, and like all two-step processes, it is not transactional.

  • Use non-standard MySQL extensions to make the two-step process more efficient. MySQL allows multiple-table updates, which can be used to mark which rows are duplicates during the UPDATE, eliminating the need for an exclusion join in the INSERT. To accomplish this, t2 needs a new column to record its “status,” which I will call done.

    alter table t2 add done tinyint null;
    
    update t1
        inner join t2 on t1.a = t2.d
        set t1.b = t2.e,
            t1.c = t2.f,
            t2.done = 1;
    
    insert into t1 (a, b, c)
        select d, e, f from t2
        where done is null;

    This can be significantly more efficient on the large datasets I’ve been imagining. The downside to this approach is non-portability to other database platforms.

  • Use MySQL’s non-standard ON DUPLICATE KEY UPDATE extension to accomplish the insert and update in a single step. As with the non-standard INSERT IGNORE above, this is probably the fastest method:

    insert into t1(a, b, c)
        select d, e, f from t2
        on duplicate key update b = e, c = f;

    There are other ways to write this statement, for example using the VALUES function, which can help simplify complex queries by referring to the value which would have been inserted into the given column:

    insert into t1(a, b, c)
        select d, e, f from t2
        on duplicate key update b = values(b), c = values(c);

    The disadvantage to this approach is lack of portability, of course. Inserting and updating in a single statement is highly non-standard.

Complexities, incompatibilities, and bugs

If the queries above look like perfect solutions, don’t be fooled. I chose my data and table structures to demonstrate successful scenarios. There are many ways things can fail or be confusing:

  • Some versions of MySQL simply don’t support the features I discussed above. Check the MySQL documentation for details about which features are available on any given version.
  • Some versions of MySQL have bugs that involve the above types of queries. For example, version 4.1.7 doesn’t like queries of the form INSERT... SELECT... ON DUPLICATE KEY UPDATE at all. It will complain about a syntax error. It allows inserting literal values, but not the results of a SELECT statement. In these buggy versions, the MySQL features may be available but not usable.
  • Some versions of MySQL get confused by the VALUES syntax I demonstrated above. If your source and destination tables have similar column names, you may have this problem. For example, in MySQL 5.0.15-log with tables that have the same column names,

    insert into t1 (a, b, c)
        select a, b, c
        from t2
        on duplicate key update b = values(b);
    ERROR 1052 (23000): Column 'b' in field list is ambiguous
  • Multi-table updates can be tricky if there is ambiguity in column names. Using an alias, such as ‘r’ for ‘right’ and ‘l’ for ‘left’ as I did above, can help.
  • MySQL reports values for the number of rows affected. When an operation affects rows in multiple tables, or when a duplicate row causes an update to existing values, the rows-affected statistics change in odd ways. The MySQL manual explains how this works, so I don’t want to go into it; I just want to point out that you should expect odd values.
  • The above techniques all assume t2 contains no duplicate values of d, which is enforced in my examples by the primary key on that column. If this is not the case, it becomes significantly more difficult to write the queries. It may be best to clean up t2 and create a primary key on the appropriate column(s), to avoid these problems.
  • I have caused crashes and binary log corruption in MySQL 4.1.7 with a combination of temporary tables, subqueries, and ON DUPLICATE KEY UPDATE. I think this is a bug with this specific version, but I suggest testing everything on a non-production database server, even if it seems harmless. The statements that caused crashes for me seemed very innocuous.

Choosing a technique

Which method to use largely depends on requirements. If the software must support multiple database backends or versions, perhaps the generic, standard queries are the best bet. If performance is the goal and the queries don’t need to be portable, I see no reason not to use the solution that performs best. If the software must support multiple database backends and performance is critical, there’s probably no way to avoid writing different queries for each supported backend.

I belive fully portable or “platform-independent” SQL is mostly a myth. Writing generic “standard” SQL to the lowest common denominator almost certainly results in under-utilizing the RDBMS’s abilities. Getting the most from my software is more important than dreaming of “platform-independent” queries.

Summary

Sometimes a proprietary extension to standards provides something unavailable by any other means. In this article I have discussed several ways to use such non-standard extensions in MySQL for performance and convenience. Divergence from standards is a double-edged sword. Not only does it potentially make code non-portable, it can encourage mediocrity by teaching bad habits instead of teaching people the “right” way to do things. For example, updating multiple tables in a single statement, or inserting and updating at the same time, are definitely strange and ugly things to do. As in life, the most important thing is to find a good balance and determine which criteria really matter.

If this article was useful to you, you should subscribe to stay current with my upcoming articles. It’s free and convenient.

Edit 2006-02-26 I forgot to cross-reference INSERT IF NOT EXISTS queries in MySQL, a related post where I explain some variations on a particular scenario — for example, where there is no unique index on the column you want to avoid duplicating, or you want to allow only n duplicates.

Technorati Tags:No Tags

You might also like:

  1. How to write INSERT IF NOT EXISTS queries in standard SQL
  2. How to find duplicate rows with SQL
  3. How to simulate FULL OUTER JOIN in MySQL
  4. Mutex tables in SQL
  5. How to avoid unique index violations on updates in MySQL

How to set up Gentoo wireless networking on AMD64

This is a quick note on wireless networking on Gentoo GNU/Linux with a dual-core AMD 64-bit processor. It’s not hard to do, but the choices for hardware are more limited.

First, installing Gentoo is just as easy as a standard installation. There is a separate installation document for it. All I needed to do was follow the instructions. I’m not expert with Gentoo, but I’ve installed it a half-dozen times or so since 2002… if it’s not easy for you, wait 4 years and try again ;-)

Here’s my hardware:

  • ECS KV2 Extreme Socket 939 VIA K8T800 Pro ATX AMD Motherboard
  • ASUS V9520-X/TD/64 Geforce FX5200 64MB DDR AGP 4X/8X Low Profile Video Card
  • Sunbeam PSU-CGMI450-US-SV ATX 450W Silver Tri Mirrored X-Plug Modular Power Supply
  • VANTEC Clear 18″ ATA 66/100/133 IDE Round Cable, 2-Connector Model CBL-100IDE18-CS
  • AMD Athlon 64 3700+ San Diego 1GHz FSB Socket 939 Processor Model ADA3700BNBOX (dual core, 2.2GHz)
  • Patriot Signature 2GB (2 x 1GB) 184-Pin DDR SDRAM DDR 400 (PC 3200) Unbuffered System Memory Model PSD2G400KH
  • Western Digital Raptor WD360GD 36.7GB 10,000 RPM Serial ATA150 Hard Drive
  • Seagate Barracuda 7200.8 ST3250823AS 250GB 7200 RPM Serial ATA150 Hard Drive
  • 2 Pioneer Beige IDE DVD Burners, Model DVR-110 D
  • COOLER MASTER Centurion 5 CAC-T05-UW Black Aluminum Bezel, SECC Chassis ATX Mid Tower Computer Case

I bought it all from NewEgg for about $900 and paid my brother’s friend for his time helping me choose the hardware and install it (actually, that was my brother’s friend’s brother). If you want similar help, by the way, send me an email and I’ll hook you up with him. I’d have been lost without his guidance.

As you can see, I forgot a network card! Oops! So I went to a local store and bought something that looked pretty standard. I looked online for it first, but I should have spent more time figuring out compatibility, because as I found out, it’s not guaranteed on AMD64.

I bought a Linksys WRT54G, which has an RT2500 chipset. For those who don’t know, it is the chipset — the guts of the card — which really matters. Most cards have a chipset that may be found in any number of other cards, all of which use the same driver. And manufacturers often change the chipset, sometimes to something totally different, which may be completely unsupported and nothing at all like the original (from a device driver point of view). I thought this chipset was going to be a breeze because the manufacturer has open-sourced the drivers (thank you!). Unfortunately, I found out it’s not supported on more exotic platforms, including mine.

I searched around on the Gentoo forums, and found a lot of weird stories of things working sometimes and not working other times; and I even had the bizarre experience of the card working for me, for just a few minutes. But there are two things in the way of full support: a) 64-bit processor and b) two of them (dual core). The RT2500 chip can be coaxed to work, apparently completely sporadically, in non-SMP mode. In other words, if you build your kernel without SMP support so it only uses one of the two cores, sometimes the drivers will compile. And, when that happens, sometimes the hardware will work and sometimes not. I paid too much for this machine to use half of my processor, so that wasn’t good enough for me even if I could have gotten it to work reliably without SMP.

After some banging of the head against the wall, I returned the card and got one with an Atheros chipset. The card is a D-Link DWL-G510, hardware version B1, firmware version 2.11. Again, you may buy exactly the same card and find a different chip in it, so there are no guarantees. But there is no problem getting the card to work if it has the Atheros chipset.

I built the 2.6.15-gentoo-r4 kernel with SMP support and built in the encryption routines used by the card:

Processor type and features --->
    [*] Symmetric multi-processing support

Cryptographic options ---> (I enabled them all)

then did emerge madwifi-driver (version 0.1443.20060207) and followed the instructions for setting up the ath0 interface via udev (version 079-r1). These are well documented at many websites, including the Gentoo forums, but the output of emerge is really all you need. I didn’t need to create the interface via a command as the instructions say; I was able to just let udev create it at boot.

Then all I needed to do was modprobe ath_pci and the interface appeared! Of course this module can be added to /etc/modules.autoload.d/kernel-2.6. Finally, I configured the network’s settings for my access point.

After this bump in the road, it’s been smooth sailing. The network works great — better than my DSL, that’s for sure. And I have to say, this machine absolutely flies. It probably has a lot to do with the large L1 cache (and everything else for that matter). Look at this:

tigger ~ # time emerge xorg-x11
[snip...]
real    24m59.155s
user    18m3.664s
sys     6m27.860s

Needless to say, there are no problems running Firefox on this machine.

Technorati Tags:No Tags

You might also like:

  1. Ubuntu on Dell Inspiron 1501
  2. Favorite USB wireless card for Ubuntu?
  3. How to update a GCC profile on Gentoo
  4. Why I (still) like Gentoo
  5. Credit card expiration dates should conform to standards

IE blog is a great experience

Have you noticed how consistent the terminology, tone and style are on the IE blog? Clearly, a lot of work is going into making this blog as good as possible at promoting the work the IE team is doing on the new browser. It remains to be seen how the browser itself evolves, but I think there is a lot to learn from studying the IE blog.

First, it looks to me as though Microsoft have carefully chosen a message, purpose and style for the blog’s articles. The articles are very consistently directed at promoting the improvements in the upcoming browser. Nobody puts this type of effort into a blog unless they take it very seriously. Microsoft must believe this particular blog is important to their business. It’s directed at what are probably small audiences (I’m guessing these include web developers, the interested public, and standards geeks) within an already small audience (people who read blogs), but Microsoft probably understands these are the folks who might influence the very people who left IE for Firefox.

Next, it seems to be edited by professionals. Most of the articles are by programmers, team leaders, and other geek types. I don’t want to disparage these folks. They may be excellent writers; I’m not assuming their geek status means they’re not. But you don’t just take a bunch of excellent writers and end up with the same tone and style, no matter how good they are — and that’s exactly what I see on the IE blog. I think you need professional editors to achieve that, so I can’t believe they’re not using them. This is an excellent move, in my opinion.

Just one example: the consistent focus on how IE6 is not a “great experience” and IE7 is a “great experience.” Here are some quotes:

The IE6 experience for managing your Favorites and History is fairly disjoint and inconsistent. In IE7 we wanted to address these issues as well as make sure that we had a great experience between Tabs and Favorites.

IE7 Printing: An Experience You Won’t Want To Miss

In my previous post, I gave a glimpse of what to come in IE7 and printing. Now that the Beta 2 Preview build is publicly available I like to walk you through all the additions we added to printing and print preview. Internal studies showed that printing is the second most used feature after navigation in a browser and traditionally has not seen the respect it deserves.

To address this IE7 focuses around 3 experiences:

  • Having a great default print experience
  • Intuitive UI, putting you in control
  • Letting you select the content you are interested in and preview it for adjustments

Having a great default experience

Most users just want to quickly put the content of the screen on …

I could go on, but instead I’ll just mention another couple of things I see consistently:

  • correct grammar, spelling, capitalization and punctuation
  • consistently interesting, active, and engaging tone
  • universal excitement about all the great new improvements in IE

Overall, I get the sense someone may have come up with “talking points,” similar to how politicians discuss their platforms. It’s not just the IE blog, either. The RSS Team Blog is getting the same love.

None of these ideas is new. Each is a best practice in traditional media, and even in other online content, but in my opinion, relatively few people are writing blogs to traditional standards of quality — or even expecting other bloggers to do so. I think Microsoft is doing a great job with these blogs. They really stand out head and shoulders. I’d love to emulate their consistency and quality myself.

These blogs bear witness to Microsoft’s consummate marketing ability. I don’t take it for granted their IE programming team is as good, but their blog certainly plants a good impression in the reader’s mind, and if you’re in business, convincing people you’re great may be more important, in the final analysis, than actually being great. I hope, not because I’m a fan of Microsoft as a software company (I’m not!) but because so many people use their products, that they really do get IE in good shape. But whether they do or not, their blog is great.

The only gripe I have, which is shared with many other blogs, is the off-topic posts. I read the blog to learn about what’s happening in IE. I honestly, truly don’t care about the Seattle Seahawks (or any other sports team). These posts detract from the blog as far as I’m concerned; they are like spam in my inbox.

Well done Microsoft! Now if only the rest of the blogging world would sit up and take notice!

Technorati Tags:No Tags

You might also like:

  1. Blogs as glamour magazines
  2. How to install and maintain multiple WordPress blogs easily

How to choose SQL column types: a case study

There are at least two important factors to consider when choosing a column type for a SQL database table: technical requirements and semantics. The choice can be confusing, but it is important to get it right. I’ve seen it done both ways and it’s sometimes hard to tell what’s wrong — it just feels wrong. In this article I will explain how to spot these mistakes, and more importantly, how to explain the mistake clearly to others.

I’ll assume the data requirements are already known — in other words, I have all the information I need to decide what type of data will live in the column.

Technical considerations

Technical considerations are often the first thing many technical people think about when choosing a column type. This is natural; technical prowess is a core strength, so there can sometimes be a tendency to make it the most important factor, or even the only factor, in the decision. Resist that tendency!

That said, it’s important to consider the storage requirements, indexing characteristics, foreign key relationships and so forth when choosing a column type. Most RDBMSs have extensive documentation about choosing column types. Any specific suggestion other than “consult your documentation” is beyond this article’s scope! If you’re not intimately familiar with your RDBMS, you might also consider getting help from someone who is.

I assume a highly technical audience for this website, so the only other thing I’ll say about technical considerations is “Beware! You need to think more broadly.”

Semantic considerations

It’s very valuable to choose column types that describe the data, because the type serves as documentation about the data. Even if the data is documented in a written specification, if I choose the wrong data type, I’ve contradicted the specification. The only true, authoritative specification of what a system does is the actual implementation, so it needs to be aligned with the reality of the system’s purpose, not just its function.

At some point, meaning becomes very important, because there is often a wide latitude within which the technical considerations are immaterial. For example, if data type A, B, and C can all store the data adequately, some people may think “well, at this point it doesn’t matter; I can choose any of the three.” Be careful with that assumption! When storing a price, for example, MONEY or a related type is a much better choice than other numeric types, even if they can store the same data. Conversely, one should never use a MONEY type to store something that’s not a currency value.

There are also variations on data types that might confuse the issue. For example, in SQL Server 2000 the DECIMAL data type has the following storage requirements, depending on the precision (maximum total number of digits, both to the left and right of the decimal point):

I recently saw a table designed with a DECIMAL(9, 1) column to hold average customer ratings, whose values range from 0 to 5 with one decimal place. The best data type, in my opinion, is DECIMAL(2, 1). Why did the developer add more precision?

According to the table above, the column requires 5 bytes of storage, whether the precision is 2 or 9. I can almost imagine the developer’s train of thought:

This data is (2, 1) but that needs 5 bytes, and for the same size, I could get (9, 1). That’s free storage! I should use it!

This imaginary line of reasoning is a fallacy. It values a factor that makes no difference, and ignores one that does.

Explaining the difference

How do you explain this mindset to someone who’s stuck in the “free storage” worldview? Let’s look again at the two viewpoints:

  • The developer’s line of reasoning was “I can get more capacity for the same storage, so I should.”
  • A better train of thought is “this data needs 5 bytes no matter what, so the unneeded extra storage is valueless, but the extra documentation provided by the smaller precision is valuable.”

In other words, one person thinks lack of control over data storage size makes the data storage capacity very important. The second person thinks the lack of control makes the data storage capacity irrelevant. I agree with the second person.

Notice the contrast between the technical and semantic factors. These two mindsets are in conflict. This is why I emphasized the need to temper the technical person’s mindset, because technical people concentrate on their areas of competence. It’s important to step back and take the technical blinders off, lest a sub-optimal choice look like the best thing.

How you approach it is up to you, but I might consider holding the unconscious decisions and gut feelings up in the light for conscious examination. I might also emphasize the value of the semantics. In the example I gave above, I simply emailed my thoughts to the DBAs, who agreed immediately and changed the column type to DECIMAL(2, 1). Perhaps it can be that simple, perhaps not.

Define your own types

Many RDBMSs allow user-defined data types. At my current employer, I’ve never seen this facility used. I would love to see us start doing so, though. It would be very helpful to define types such as ITEMNO, CREDITCARDNO and TRACKINGCODE. I think user-defined types are just as important as roles (another thing we don’t use).

User-defined types do bring their own challenges; they require more work, may have technical ramifications (how the RDBMS handles the type across databases or servers where it may not be defined), might cause extra bureaucracy, and there are external factors to think about — educating a consultant who may need to work with the data, for example. Standard tools and code libraries may not be aware of them, too — for example, the .NET SqlClient classes. I think user-defined types are worth considering, but of course that’s a decision that must be weighed carefully.

The moral of the story

The moral of the story is “think carefully.” Now, if only I were better at that myself! The other moral, of course, is “have fun.”

Technorati Tags:No Tags

You might also like:

  1. Why you should not use BIT columns in MySQL