Thursday, March 29, 2012

Finding if a REAL is LIKE %[49]9

Timings... sometimes there are almost too many ways to do the same thing.

The only significant findings I see from all the timings below are:

1) Integer math is generally fastest, naturally. Bigint math isn't much
slower, for integers that all fit within an integer.

2) Converting float to varchar is relatively slow, and should be avoided if
possible. Converting from integer to varchar or varchar to int is several
times faster.

3) Most significantly, and less obviously, CASE expr WHEN ... recomputes
expr for each WHEN condition, unfortunately, so it is the same speed as
(or perhaps slightly slower than) listing WHEN expr = value for each
condition. Perhaps an indexed computed column (somehow materialized) would
be advisable, when possible, to avoid repeated computation in CASE..WHEN
expressions; see the sketch just below this list.
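A rough sketch of that computed-column idea (my own illustration, not
something benchmarked in these timings; SQL 2000 can only index a computed
column whose expression is deterministic and precise, so this assumes srp
has already been converted to an integer column, since an expression over
a REAL doesn't qualify):

-- flag rows whose price ends in 49 or 99 cents, once, at write time
alter table sf add ends_in_49_or_99 as
    case when srp % 50 = 49 then 1 else 0 end

create index IX_sf_ends_in_49_or_99 on sf (ends_in_49_or_99)

-- the aggregate then just sums the precomputed flag
select sum(ends_in_49_or_99) from sf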

Note that if you divide by COUNT(*), most timings below are below one
microsecond per row, so this all may not be very significant in most
applications, unless you do frequent aggregations of some sort.

COUNT(*) FROM [my_sf_table] = 477446 rows

The result from each query = either 47527 or 47527.0

Platform: Athlon 2000 XP w/512MB RAM, table seems to be cached in RAM, SQL
2000, all queries run at least 3 times and minimum timings shown (msec).

SRP is a REAL (4 bytes)

Fastest ones are near the end.

CPU SQL
(ms)
-- Convert to varchar (implicitly) and compare right two digits
-- (original version -- no I didn't write it)
4546 select sum(case right(srp,2)
when '99' then 1 when '49' then 1 else 0 end)
from sf

-- Use LIKE for a single comparison instead of two, much faster
-- Note that the big speedup indicates that
-- CASE expr WHEN y then a WHEN z then b .
-- recalculates expr for each WHEN clause
2023 select sum(case when srp like '%[49]9' then 1 else 0 end)
from sf

-- Floating point method of taking a modulus (lacking fmod/modf)
2291 select sum(case round(srp - 100e*floor(srp*.01e),0)
when 99 then 1 when 49 then 1 else 0 end)
from sf

-- Round to nearest 50 and compare with 49
1322 select sum(case round(srp-50e*floor(srp*.02e),0)
when 49 then 1 else 0 end)
from sf

-- Divide by 49 by multiplying by (slightly larger than) 1e/49e
811 select sum(floor((cast(srp as integer)%50)*2.04082E-2))
from sf

-- Integer approach without using CASE
731 select sum(coalesce(nullif(sign(cast(srp as
integer)%50-48),-1),0))
from sf

-- My original integer approach
651 select sum(case cast(srp as integer)%100
when 99 then 1 when 49 then 1 else 0 end)
from sf

-- Modulus 50 integer approach without CASE
481 select sum((cast(srp as integer)%50)/49)
from sf

-- Modulus 50 integer approach
460 select sum(case cast(srp as integer)%50
when 49 then 1 else 0 end)
from sf

-- bigint without CASE
531 select sum((cast(srp as bigint)%50)/49)
from sf

-- bigint with CASE
521 select sum(case cast(srp as bigint)%50
when 49 then 1 else 0 end)
from sf

-- get SIGN to return -1 or 0, then add 1
-- much better than the coalesce+nullif approach
500 select sum(sign(cast(srp as integer)%50-49)+1)
from sf

-- SIGN with BIGINT
551 select sum(sign(cast(srp as bigint)%50-49)+1)
from sf

BTW, I know srp should be int to begin with for this to be faster... Okay,
so...

select cast(srp as int) srp into sf from [my_real_sf_table]

720 select sum(case when srp like '%[49]9' then 1 else 0 end) from sf

339 select sum(1+sign(srp%50-49)) from sf
310 select sum(srp%50/49) from sf
300 select sum(case srp%50 when 49 then 1 else 0 end) from sf

What if it were a char(7)?

select cast(cast(srp as integer) as char(7)) srp into sf2 from
[my_sf_table]

801 select sum(case right(rtrim(srp),2) when '49' then 1
when '99' then 1 else 0 end) from sf2
717 select sum(case when srp like '%[49]9' then 1 else 0 end) from sf2
405 select sum(srp%50/49) from sf2
391 select sum(case srp%50 when 49 then 1 else 0 end) from sf2

How about varchar(7)?

drop table sf2
select cast(cast(srp as integer) as varchar(7)) srp into sf2 from
[my_sf_table]

581 select sum(case right(srp,2) when '49' then 1
when '99' then 1 else 0 end) from sf2
569 select sum(case when srp like '%[49]9' then 1 else 0 end) from sf2

LIKE is faster on VARCHAR than on CHAR columns...
Apparently it has to effectively RTRIM the trailing spaces during the LIKE
operation.
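One way to probe that theory (an extra test I did not time above) would be
to RTRIM explicitly on the char(7) version of sf2 and see whether the
timing moves toward the varchar number:

-- explicit trim on the char(7) table; compare against the LIKE timings
select sum(case when rtrim(srp) like '%[49]9' then 1 else 0 end) from sf2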

Is binary collation any faster?

drop table sf2
select cast(cast(srp as integer) as varchar(7))
COLLATE Latin1_General_BIN srp
into sf2 from tbl_superfile

561 select sum(case right(srp,2) when '49' then 1
when '99' then 1 else 0 end) from sf2
530 select sum(case when srp like '%[49]9' then 1 else 0 end) from sf2

Binary collation comparisons are slightly faster, though it's not a big
difference (with just two characters being compared).

662 select sum(case convert(binary(2),right(srp,2))
when 0x3439 then 1 when 0x3939 then 1 else 0 end) from sf2

----

5037 select right(srp,2) srp,count(*) from my_sf_table
group by right(srp,2)
order by right(srp,2)

920 select cast(srp as int)%100 srp,count(*) from my_sf_table
group by cast(srp as int)%100
order by cast(srp as int)%100

--

On the one hand, premature optimization can be a waste of time and energy.
On the other hand, understanding the performance implications of various
operations can help you write more efficient systems.

In any case, an indexed computed column or one updated on a trigger could
virtually eliminate the need for any of these calculations to be performed,
except upon insertion or update, so maybe my comparisons aren't very
meaningful for most applications, considering we're talking about less than
3 microseconds per row here worst-case.

But the results remind me that some recommend avoiding Identity when it's
not necessary. I find Identity(int,1,1) to be a nice, compact surrogate key
that is useful for quick comparisons, grouping, and so on. It also seems
most appropriate as the primary key of the lookup tables in a star schema
for OLAP data warehousing. (?) Of course, in some situations it's not
appropriate: particularly when having a surrogate key violates data
integrity by allowing duplicates that would not be allowed under a proper
primary key constraint, or when the surrogate key is completely redundant
with a (especially a short) single-column unique key that would be a
better choice as the primary key. With multi-column primary keys, I think
it's sometimes convenient to have a surrogate Identity, if only for IN
clauses that reference that identity column (though EXISTS can usually
replace those, so maybe that's a weak excuse for an extra column).

----

Aaron W. West (tallpeak@hotmail.NO.SPAM) writes:
> CPU SQL
> (ms)
> -- Convert to varchar (implicitly) and compare right two digits
> -- (original version -- no I didn't write it)
> 4546 select sum(case right(srp,2)
> when '99' then 1 when '49' then 1 else 0 end)
> from sf
> -- Use LIKE for a single comparison instead of two, much faster
> -- Note that the big speedup indicates that
> -- CASE expr WHEN y then a WHEN z then b .
> -- recalculates expr for each WHEN clause
> 2023 select sum(case when srp like '%[49]9' then 1 else 0 end)
> from sf

I tried some variations of this, and indeed it seems that there is a cost
when the expression appears with several WHEN clauses. I tried a variation
of this, where I supplemented the test table with a char(2) column, so I
could check that the WHEN clauses themselves were not the culprits.

CREATE TABLE realthing (realta real NOT NULL,
lasttwo char(2) NOT NULL)
go
INSERT realthing (realta, lasttwo)
SELECT r, right(r, 2)
FROM (SELECT r = convert(real, checksum(newid()))
FROM Northwind..Orders a
CROSS JOIN Northwind..Orders b) AS f
go
DBCC DROPCLEANBUFFERS
go
DECLARE @start datetime
SELECT @start = getdate()
SELECT SUM(CASE right(realta, 2)
               WHEN '99' THEN 1
               WHEN '49' THEN 1
               WHEN '39' THEN 1
               ELSE 0 END)
FROM realthing
SELECT datediff(ms, @start, getdate()) -- 20766 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @start datetime
SELECT @start = getdate()
SELECT SUM(CASE WHEN right(realta, 2) LIKE '[349]9' THEN 1 ELSE 0 END)
FROM realthing
SELECT datediff(ms, @start, getdate()) -- 8406 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @start datetime
SELECT @start = getdate()
SELECT SUM(CASE lasttwo
               WHEN '99' THEN 1
               WHEN '49' THEN 1
               WHEN '39' THEN 1
               ELSE 0 END)
FROM realthing
SELECT datediff(ms, @start, getdate()) -- 920 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @start datetime
SELECT @start = getdate()
SELECT SUM(CASE WHEN lasttwo LIKE '[349]9' THEN 1 ELSE 0 END)
FROM realthing
SELECT datediff(ms, @start, getdate()) -- 1466 ms.

Thus, when using the char(2) column, LIKE is slower despite there being
only one WHEN condition. So indeed it seems that right(realta, 2)
is computed thrice in the first test.
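One possible way to compute the expression only once (a sketch for
illustration; whether the optimizer really evaluates it just once would
need the same kind of timing test) is to push it into a derived table:

-- r2 and d are names used only for this sketch
SELECT SUM(CASE r2 WHEN '99' THEN 1
                   WHEN '49' THEN 1
                   WHEN '39' THEN 1
                   ELSE 0 END)
FROM (SELECT r2 = right(realta, 2) FROM realthing) AS d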

Another funny thing is the actual results from the queries - they are
different. When I ran:

select count(*) from realthing where lasttwo <> right(realta, 2)

The result was about half the size of realthing! I can't see that
this difference affects the comparison, though.

Now, your article had a lot more tests, but I have to confess that
you lost me quite early, because you never discussed what the
actual problem is. Since you are working with floating-point numbers
there is a great risk that different methods not only have different
execution times, but also give different results.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

----

Thanks for reminding me... I should always start a post with a description
of the problem.

An existing trigger updates the column tbl_sales.aors to A if a price (SRP,
in pennies) ends in other than 49 or 99 (cents), and S otherwise. In this
case, the problem was largely in my head. That is, I didn't "like" the
existing UPDATE statement, and wanted to see if I could make it more
efficient. My first solution was near the best: convert real to integer
and take the modulus...
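To make that concrete, here is a rough sketch of what the modulus-based
trigger logic could look like (tbl_sales, aors and SRP come from the
description above, but the join key and the exact shape of the trigger are
assumptions, not the real code):

-- inside the INSERT/UPDATE trigger on tbl_sales
UPDATE s
SET aors = CASE WHEN CAST(i.srp AS int) % 50 = 49 THEN 'S' ELSE 'A' END
FROM tbl_sales s
JOIN inserted i ON i.sale_id = s.sale_id   -- sale_id is an assumed key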

Here, I'll start over, but with a significantly different problem, and one
that is a real problem, that of denormalization vs normalization:

Currently, I have a sort of data warehouse table of sales data. It stores
the date and time of each sale, client_id, UPC, and other information, but
no quantity, and no price. The quantity is assumed to be 1, and multiple
sales are stored as multiple records.

During import of a sales data file, many fields are populated from a sort of
product description table, in an insert trigger. The original author
apparently thought denormalization of the product data into the sales table
would be better for query performance. I don't think so, but am not really
sure.

There's a potential backfill issue with the current denormalized data
structure. After product information changes, or new product information
comes in, I have to backfill all the sales records missing the changed data.
I think it goes without saying that it's bad to have to backfill data just
because of denormalization.

1) Should I completely remove all redundant information from this table?

2) If I remove some fields and not others, which ones should I remove? (or
how do I decide which ones not to remove ... ones that are least likely to
change are most likely to be in GROUP BY clauses?)

(I'm thinking I should rename the table, remove all I can, create a view to
expose the lookup fields, and then work on any resulting performance issues
I may have...)

I've mostly decided to remove the redundant descriptive fields (artist_name,
title, label_name, etc), and replace them with title_id, but am unsure
whether to eliminate fields such as genre, market_id, region_id, and others
used in GROUP BY queries.
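For example, a rough sketch of the lookup-view idea (tbl_sales_fact and
tbl_titles are placeholder names; only the column names come from this
post):

CREATE VIEW vw_sales AS
SELECT s.*,                         -- the slimmed-down sales columns
       t.artist_name, t.title, t.label_name
FROM tbl_sales_fact s
JOIN tbl_titles t ON t.title_id = s.title_id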

(I should probably be using MS OLAP to aggregate the data, but haven't yet
put all the data in the proper form for this... adding title_id should help,
I think... now need a table of dates, a table of weeks, a table of chains, a
table of coalitions, and other dimension tables.. which reminds me,
client_ID can be chain_ID or coalition_ID, which is a violation of 1NF, so
I've read... I could create views that expose the chains and coalitions, and
wonder if that would be a reasonable solution or if I should break the table
into four tables for the different types of clients...)

Joining two or three tables doesn't seem like it should be a big expense
when the lookup tables are no more than a few hundred thousand rows for
the largest one (the information on products), for example, if I have the
proper indexes. But I think GROUP BY would be a little slower if I added
any columns from the other tables to my GROUP BY, so maybe a little
denormalization of data that rarely changes (a client is unlikely to
change its chain_ID) isn't a bad idea. But any denormalization should
include constraints so that if the data changes in the base table, it is
automatically changed in the sales table (a foreign key: add a
REFERENCES ... ON UPDATE CASCADE constraint on chain_id against the
referenced table, and make sure the column is indexed?)
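A minimal sketch of that cascading-update constraint (tbl_chains is a
placeholder name, and chain_id would have to be the primary key, or at
least a unique key, of the referenced table):

ALTER TABLE tbl_sales
    ADD CONSTRAINT FK_sales_chain
    FOREIGN KEY (chain_id) REFERENCES tbl_chains (chain_id)
    ON UPDATE CASCADE

CREATE INDEX IX_sales_chain_id ON tbl_sales (chain_id)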

"Erland Sommarskog" <esquel@.sommarskog.se> wrote in message
news:Xns951BD21B14996Yazorman@.127.0.0.1...
Aaron W. West (tallpeak@.hotmail.NO.SPAM) writes:
> CPU SQL
> (ms)
> -- Convert to varchar (implicitly) and compare right two digits
> -- (original version -- no I didn't write it)
> 4546 select sum(case right(srp,2)
> when '99' then 1 when '49' then 1 else 0 end)
> from sf
> -- Use LIKE for a single comparison instead of two, much faster
> -- Note that the big speedup indicates that
> -- CASE expr WHEN y then a WHEN z then b .
> -- recalculates expr for each WHEN clause
> 2023 select sum(case when srp like '%[49]9' then 1 else 0 end)
> from sf

I tried some variations of this, and indeed it seems that there is a cost
when the expression appears with several WITH clauses. I tried a variation
of this, where I supplemented the test table with a char(2) column, so I
could factor out that the WITH clauses themselves were not the culprits.

CREATE TABLE realthing (realta real NOT NULL,
lasttwo char(2) NOT NULL)
go
INSERT realthing (realta, lasttwo)
SELECT r, right(r, 2)
FROM (SELECT r = convert(real, checksum(newid()))
FROM Northwind..Orders a
CROSS JOIN Northwind..Orders b) AS f
go
DBCC DROPCLEANBUFFERS
go
DECLARE @.start datetime
SELECT @.start = getdate()
SELECT SUM(CASE right(realta, 2)
WHEN '99' THEN 1
WHEN '49' THEN 1
WHEN '39' THEN 1
ELSE 0 END)
FROM realthing
SELECT datediff(ms, @.start, getdate()) -- 20766 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @.start datetime
SELECT @.start = getdate()
SELECT SUM(CASE WHEN right(realta, 2) LIKE '[349]9' THEN 1 ELSE 0 END)
FROM realthing
SELECT datediff(ms, @.start, getdate()) -- 8406 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @.start datetime
SELECT @.start = getdate()
SELECT SUM(CASE lasttwo
WHEN '99' THEN 1
WHEN '49' THEN 1
WHEN '39' THEN 1
ELSE 0 END)
FROM realthing
SELECT datediff(ms, @.start, getdate()) -- 920 ms.
go
DBCC DROPCLEANBUFFERS
go
DECLARE @.start datetime
SELECT @.start = getdate()
SELECT SUM(CASE WHEN lasttwo LIKE '[349]9' THEN 1 ELSE 0 END)
FROM realthing
SELECT datediff(ms, @.start, getdate()) -- 1466 ms.

Thus, when using the char(2) column LIKE is slower despite that there
is only one WHEN condition. So indeed it seems that right(realta, 2)
is computed thrice in the first test.

Another funny thing is the actual results from the queries - they are
different. When I ran:

select count(*) from realthing where lasttwo <> right(realta, 2)

The result was about half of the size of realthing! I can't see that
this difference affects the results though.

Now, your article had a lot more tests, but I have to confess that
you lost me quite early, because you never discussed what is the
actual problem. Since you are working with floating-poiont numbers
there is a great risk that different methods not only has different
execution times, but also gives different results.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Aaron W. West (tallpeak@.hotmail.NO.SPAM) writes:
> An existing trigger updates the column tbl_sales.aors to A if a price
> (SRP, in pennies) ends in other than 49 or 99 (cents), and S otherwise.
> In this case, the problem was largely in my head. That is, I didn't
> "like" the existing UPDATE statement, and wanted to see if I could make
> it more efficient. My first solution was near to the best, convert real
> to integer and take the modulus...

If you have the price as real, I would be more worried that a price
which is exactly 76.99 is internally represented as 76.989998 and then
gets handled as 76.98.

> During import of a sales data file, many fields are populated from a
> sort of product description table, in an insert trigger. The original
> author apparently thought denormalization of the product data into the
> sales table would be better for query performance. I don't think so, but
> am not really sure.
> There's a potential backfill issue with the current denormalized data
> structure. After product information changes, or new product information
> comes in, I have to backfill all the sales records missing the changed
> data. I think it goes without saying that it's bad to have to backfill
> data just because of denormalization.

I would not say that I like this arrangement. Then again, I work only
with OLTP databases, and I understand that in data warehouses
denormalization is customary.

It is difficult to say what would be best in your case. On the one
hand, the denormalization appears to cause some problems when you update.
On the other, the denormalization may pay off when you actually query
the information. Actual benchmarking is probably the only way to
find out. The catch may be that the size of the data is such that it
does not really lend itself to massive experimenting.

The group microsoft.public.sqlserver.datawarehouse is likely to
be frequented by people with more experience of dilemmas like this one.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

----

SRP is declared as a REAL, but contains integer data (pennies). I just
haven't bothered to drop the indexes referring to SRP and alter the column
to integer yet... but in any case, integer data is always represented
exactly in a REAL, up to the limit of exact integer representation in a
REAL... which is 2^24, because the mantissa of a REAL is 24 bits.

----

if you care...

Three ways of proving that fact...

-- This finds the first power of 2 + 1 which is not representable accurately
as a REAL
declare @i int, @r real
set @i = 2
while @i < 2001002003
begin
    if (cast(@i as float) <> cast(@i as real))
    begin
        print @i
        break
    end
    set @i = ((@i & -2) * 2) | 1
end

--outputs 16777217
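A quick direct check of that boundary (a sanity test added here, not one
of the three proofs): round-tripping 2^24 + 1 through a REAL lands back on
2^24.

select cast(cast(16777217 as real) as int)  -- returns 16777216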

-- Same without a loop:
SELECT MIN(Number)
FROM Numbers
WHERE Number<31 AND CAST(POWER(2,Number)+1 AS REAL)
<> CAST(POWER(2,Number)+1 AS FLOAT)

----
24

(1 row(s) affected)

Or, given a table of numbers 1 to 1000 at least:

SELECT MIN(M*1000000+T*1000+U)-1 MaxExactRealInteger
FROM
(SELECT TOP 33 Number-1 M FROM Numbers) AS M,
(SELECT TOP 1000 Number-1 T FROM Numbers) AS T,
(SELECT TOP 1000 Number-1 U FROM Numbers) AS U
WHERE CAST(M*1000000+T*1000+U AS FLOAT) <>
CAST(M*1000000+T*1000+U AS REAL)

MaxExactRealInteger
------
16777216

(1 row(s) affected)

Which, again, is POWER(2,24).

Takes around half a minute to run...

"Erland Sommarskog" <esquel@.sommarskog.se> wrote in message
news:Xns951C95812A564Yazorman@.127.0.0.1...
Aaron W. West (tallpeak@.hotmail.NO.SPAM) writes:
> An existing trigger updates the column tbl_sales.aors to A if a price
> (SRP, in pennies) ends in other than 49 or 99 (cents), and S otherwise.
> In this case, the problem was largely in my head. That is, I didn't
> "like" the existing UPDATE statement, and wanted to see if I could make
> it more efficient. My first solution was near to the best, convert real
> to integer and take the modulus...

If you have the price as real, I would be more worried that a price
which exactly is 76.99 internally is represented as 76.989998 and the
gets handled as 76.98.

...sql
