Microsoft SQL Server’s openxml
function is highly CPU-intensive. This article demonstrates more efficient alternatives to openxml
.
Why people use openxml
A common usage for openxml
is to use XML to pass a “list” or “table” of values into a stored procedure. Here is a scenario I’ve seen a lot at my current job:
- An ASP accepts a string as input, such as “ipod car adapter”
- The ASP parses the input into space-separated values, then concatenates them together into XML
- The ASP passes the XML to a stored procedure
- The sproc uses
openxml
to populate a temporary table from the XML - The sproc does some work (such as searching a product catalog)
This is fine, except openxml
can cause a huge CPU spike, which is bad news in a Web-facing search server during holiday shopping season.
I have rewritten such a search sproc to remove openxml
, and found it to account for upwards of three-fourths of the total cost, even in an otherwise very expensive sproc. I have anecdotal wisdom from other DBAs about people who over-use openxml
and end up with systems that run hot all the time.
A list of simple delimited values doesn’t need XML—it’s overkill. A better alternative is to pass the string directly into the sproc, and write a UDF to return a table with the elements of the string. It’s still string parsing, but it’s better than using XML.
Solution 1: a WHILE
loop in a user-defined function
Here is a user-defined function that will split a delimited set of words into a table and return the table:
CREATE function dbo.fn_SplitWords (
@Words varchar(8000),
@Delim char(1))
returns @Words_table table(word varchar(255), ident int identity not null)
as begin
declare @Wordstart int, @WordEnd int, @DoubleDelim varchar(2)
set @Wordstart = 1
set @WordEnd = 1
set @DoubleDelim = replicate(@Delim, 2)
-- Prepare the data
while charindex(@DoubleDelim, @Words) > 0
set @Words = replace(@Words, @DoubleDelim, @Delim)
-- Left-trim it
if left(@Words, 1) = @Delim
begin
if @Delim = ' '
set @Words = ltrim(@Words)
else
set @Words = substring(
@Words,
patindex('%[^' + @Delim + ']%', @Words),
len(@Words) - patindex('%[^' + @Delim + ']%', @Words) + 1)
end
-- Right-pad it
if right(@Words, 1) <> @Delim
set @Words = @Words + @Delim
while @Wordstart > 0
begin
-- Extract the next word
set @WordEnd = charindex(@Delim, @Words, @Wordstart)
if @WordEnd > @WordStart
begin
insert into @Words_table select substring(@Words, @Wordstart, @WordEnd - @Wordstart)
set @Wordstart = @WordEnd + 1
end
else set @Wordstart = 0 -- Terminate the loop
end
return
end
Please note the bug I’m avoiding in the while
loop above. I explain the SQL Server 2000 replace
bug in another post.
Here are some test calls for the UDF:
select * from dbo..fn_SplitWords('this is a test call', ' ')
select * from dbo..fn_SplitWords(' this is a test call', ' ')
select * from dbo..fn_SplitWords('this is a test call ', ' ')
select * from dbo..fn_SplitWords('this-is-a-test-call', '-')
select * from dbo..fn_SplitWords('this is a test call', ' ')
select * from dbo..fn_SplitWords(' ', ' ')
select * from dbo..fn_SplitWords('', ' ')
After I wrote this, I saw someone else did the same thing elsewhere, though in my opinion very poorly implemented and explained. In any case, here’s a link for the sake of completeness: Treat Yourself to Fn_Split()
A more efficient approach
Instead of using a loop as I did above, it’s actually much more efficient to use an integers table and a JOIN
to parse the tokens apart. This approach is slightly less flexible, and doesn’t handle all the special cases I handled above with my UDF, such as tokens being separated by several delimiters instead of just one. Regardless, it is absolutely a better way to go, as long as the input is well-formed. It doesn’t use any nonstandard SQL, either—it’s a relational solution to the problem. Here are three resources where you can learn more about this extremely elegant technique:
- http://www.bizdatasolutions.com/tsql/sqlarrays.asp
- http://www.sommarskog.se/arrays-in-sql.html
- http://www.users.drew.edu/skass/sql/
Other advantages to a UDF
This approach has other advantages over openxml
, too:
- it’s easy to string-ify a table for input to another sproc from within SQL (see my article about concatenating strings in SQL for more)
- it is much simpler, needs much less code, and is easier to understand and maintain
- the UDF can be used directly in a FROM clause; there is no need to create a temporary table or table variable unless you want to store and re-use the values
- you can use table variables instead of temporary tables, saving disk I/O and locks in the
tempdb
database - you can pass more data without running into upper limits on the size of varchar, because space-delimited is much less verbose than XML
I would be remiss if I didn’t mention the downsides:
- string-parsing is never efficient, and can be error-prone
- you will need to create and maintain UDFs (in my case, I need at least two—one for strings and one for integers)
- you have less flexibility about types and schemas; this technique is only convenient for simple cases
More about efficiency
SQL server seems to be smart enough to reuse resources within a query batch, so the high cost of using openxml
only seems to happen on the first invocation in a batch. When I benchmarked it with a thousand iterations, the string-parsing solution’s constant cost appeared to be about half the constant cost of openxml
—not a significant improvement. However, in the common case where it’s used only once, the string-parsing is much more efficient because there is no startup cost.
All in all, I think string-parsing is the lesser of the evils.