Stay Curious!

Efficient alternatives to Microsoft SQL Server's openxml

Microsoft SQL Server’s openxml function is highly CPU-intensive. This article demonstrates more efficient alternatives to openxml.

Why people use openxml

A common usage for openxml is to use XML to pass a “list” or “table” of values into a stored procedure. Here is a scenario I’ve seen a lot at my current job:

  1. An ASP accepts a string as input, such as “ipod car adapter”
  2. The ASP parses the input into space-separated values, then concatenates them together into XML
  3. The ASP passes the XML to a stored procedure
  4. The sproc uses openxml to populate a temporary table from the XML
  5. The sproc does some work (such as searching a product catalog)

This is fine, except openxml can cause a huge CPU spike, which is bad news in a Web-facing search server during holiday shopping season.

I have rewritten such a search sproc to remove openxml, and found it to account for upwards of three-fourths of the total cost, even in an otherwise very expensive sproc. I have anecdotal wisdom from other DBAs about people who over-use openxml and end up with systems that run hot all the time.

A list of simple delimited values doesn’t need XML – it’s overkill. A better alternative is to pass the string directly into the sproc, and write a UDF to return a table with the elements of the string. It’s still string parsing, but it’s better than using XML.

Solution 1: a WHILE loop in a user-defined function

Here is a user-defined function that will split a delimited set of words into a table and return the table:

CREATE function dbo.fn_SplitWords (
    @Words varchar(8000),
    @Delim char(1))
    returns @Words_table table(word varchar(255), ident int identity not null)
as begin

    declare @Wordstart int, @WordEnd int, @DoubleDelim varchar(2)
    
    set @Wordstart = 1
    set @WordEnd = 1
    set @DoubleDelim = replicate(@Delim, 2)

    -- Prepare the data
    while charindex(@DoubleDelim, @Words) > 0
        set @Words = replace(@Words, @DoubleDelim, @Delim)
    -- Left-trim it
    if left(@Words, 1) = @Delim
    begin
        if @Delim = ' '
            set @Words = ltrim(@Words)
        else
            set @Words = substring(
                @Words,
                patindex('%[^' + @Delim + ']%', @Words),
                len(@Words) - patindex('%[^' + @Delim + ']%', @Words) + 1)
    end
    -- Right-pad it
    if right(@Words, 1) <> @Delim
        set @Words = @Words + @Delim

    while @Wordstart > 0
    begin
        -- Extract the next word
        set @WordEnd = charindex(@Delim, @Words, @Wordstart)
        if @WordEnd > @WordStart
        begin
            insert into @Words_table select substring(@Words, @Wordstart, @WordEnd - @Wordstart)
            set @Wordstart = @WordEnd + 1
        end
        else set @Wordstart = 0 -- Terminate the loop
    end
    return
end

Please note the bug I’m avoiding in the while loop above. I explain the SQL Server 2000 replace bug in another post.

Here are some test calls for the UDF:

select * from dbo..fn_SplitWords('this is a test call', ' ')
select * from dbo..fn_SplitWords(' this is a test call', ' ')
select * from dbo..fn_SplitWords('this is a test call ', ' ')
select * from dbo..fn_SplitWords('this-is-a-test-call', '-')
select * from dbo..fn_SplitWords('this is a test      call', ' ')
select * from dbo..fn_SplitWords(' ', ' ')
select * from dbo..fn_SplitWords('', ' ')

After I wrote this, I saw someone else did the same thing elsewhere, though in my opinion very poorly implemented and explained. In any case, here’s a link for the sake of completeness: Treat Yourself to Fn_Split()

A more efficient approach

Instead of using a loop as I did above, it’s actually much more efficient to use an integers table and a JOIN to parse the tokens apart. This approach is slightly less flexible, and doesn’t handle all the special cases I handled above with my UDF, such as tokens being separated by several delimiters instead of just one. Regardless, it is absolutely a better way to go, as long as the input is well-formed. It doesn’t use any nonstandard SQL, either – it’s a relational solution to the problem. Here are three resources where you can learn more about this extremely elegant technique:

Other advantages to a UDF

This approach has other advantages over openxml, too:

  1. it’s easy to string-ify a table for input to another sproc from within SQL (see my article about concatenating strings in SQL for more)
  2. it is much simpler, needs much less code, and is easier to understand and maintain
  3. the UDF can be used directly in a FROM clause; there is no need to create a temporary table or table variable unless you want to store and re-use the values
  4. you can use table variables instead of temporary tables, saving disk I/O and locks in the tempdb database
  5. you can pass more data without running into upper limits on the size of varchar, because space-delimited is much less verbose than XML

I would be remiss if I didn’t mention the downsides:

  1. string-parsing is never efficient, and can be error-prone
  2. you will need to create and maintain UDFs (in my case, I need at least two – one for strings and one for integers)
  3. you have less flexibility about types and schemas; this technique is only convenient for simple cases

More about efficiency

SQL server seems to be smart enough to reuse resources within a query batch, so the high cost of using openxml only seems to happen on the first invocation in a batch. When I benchmarked it with a thousand iterations, the string-parsing solution’s constant cost appeared to be about half the constant cost of openxml – not a significant improvement. However, in the common case where it’s used only once, the string-parsing is much more efficient because there is no startup cost.

All in all, I think string-parsing is the lesser of the evils.

Posted on Tue, Nov 15, 2005. Approximately 1000 Words.

Databases