Hello!
I need to find the best way to find items that are similar to each
other depending on several attributes.
Each item has 100 attributes with a numeric (float) value between 0 and
100.
Items that are similar have the least difference attribute value for
all 100 attribtes.
So what we need to do is calculate the difference (of all attributes)
between the item that we want to find similiarities of and the other
items in the database. But we cannot use SUM() over all attributes
because we need to do some calculation on most (if not all) attributes.
For example, if the difference is more than 20 we should consider that
as the item does not match very well on that attribute, therefore
return 0 (zero). If the difference is less then 5 we consider that as
very good match and therefore return 100 points in similiary. If its
between 5 and 20 then we return a score based on a formula we have.
We will also need to group these items into categories. For example
category1 can be calculated like
(ScoreOfAttribute1+ScoreOfAttribute2+Sco
reOfAttribute8)/3. We divide by
3 since there are 3 attributes in that category.
I have tried a couple of ways to implement this.
Example one:
Items
- ItemId
- Name
- Price
Attributes
- AttributeId
- Description
AttributeValues
- ItemId
- AttributeId
- AttributeValue
Example two:
Items
- ItemId
- Name
- Price
AttributeValue
- AttributeId
- Attribute1Value
- Attribute2Value
- .
- Attribute100Value
I have also carefully choosen the indexes and most of the time they
will be used but there will always be some table scan since we need to
do calculations to find the items with most simlarity, based on my
"similary calculations".
So now I need to get some new ideas on how to solve this problem.
I have been thinking of a table with precalculated points but that
table will probably get very big. We have over 150000 items that will
need to be compared with each other.
I have been thinking if it would be possible to sum all attribute
values and store that and then divide by the number of attributes, and
then we could just calculate the difference of the best 500 hits and
get the top 100 from those? This will probably be quicker but it may
also result in faults, since we cannot be sure that the most similar
item is within this top 500.
So if you have any ideas on this problem please write!
The table design can be changed. The performance on the SELECT is the
most important and the UPDATE/DELETE/INSERT is acceptable to take some
time.
What we are looking for here is to get the result (top 100 most similar
items (out of 150000 items with 100 attributes)) within 1 second.Post some sample data and the result you want
Madhivanan|||Please post DDL and maybe some insert statements for sample data.
http://www.aspfaq.com/etiquette.asp?id=5006
Just to take a wild guess, lets imagine the following table.
(Note: I created 3 views below just to break up the logic into simple chunks
that are easier to alter/discuss. There is no reason why this would not be
all in one step):
create table Products
(
ProductID integer primary key not null
, Attrib1 integer
, Attrib2 integer
,constraint Products_Attrib1 check (Attrib1 between 0 and 100)
,constraint Products_Attrib2 check (Attrib2 between 0 and 100)
)
;
go
insert into products (ProductID, Attrib1, Attrib2) values(1,12,34);
insert into products (ProductID, Attrib1, Attrib2) values(2,34,45);
insert into products (ProductID, Attrib1, Attrib2) values(3,75,98);
insert into products (ProductID, Attrib1, Attrib2) values(4,98,26);
insert into products (ProductID, Attrib1, Attrib2) values(5,56,44);
insert into products (ProductID, Attrib1, Attrib2) values(6,95,23);
go
/*
View to cross reference all products and retrieve the difference between
individual attributes
*/
Create view ProductXref as
Select a.productID as Product1
, b.productID as Product2
, abs(A.Attrib1 - b.Attrib1) as Score1
, abs(A.Attrib2 - b.Attrib2) as Score2
from Products A
cross join Products B
where A.ProductID <> B.ProductID
go
/*
View to cross calculate individual scores per attribute
arbitrary function used when score between 5 and 20
*/
Create view AttributeScores as
Select Product1
, Product2
,(Case when Score1 < 5 then 100
when Score1 between 5 and 20
Then 100-(score1*5)
Else 0 End) as Score1
,(Case when Score2 < 5 then 100
when Score2 between 5 and 20
Then 100-(score2*5)
Else 0 End) as Score2
from ProductXref
go
/*
View to calculate final scores.
Also returns individual scores for comparison purposes
*/
create View XrefScoring as
Select Product1
, Product2
, Score1
, Score2
, (Score1 + Score2)/2 as AvgScore
from AttributeScores
go
/*
select final results
*/
select Product1
, Product2
, Score1
, Score2
, AvgScore
from XrefScoring
where AvgScore > 0
and Product1 = 4
"Patrik" <corneliusson@.gmail.com> wrote in message
news:1148889857.588763.316200@.j55g2000cwa.googlegroups.com...
> Hello!
> I need to find the best way to find items that are similar to each
> other depending on several attributes.
> Each item has 100 attributes with a numeric (float) value between 0 and
> 100.
> Items that are similar have the least difference attribute value for
> all 100 attribtes.
> So what we need to do is calculate the difference (of all attributes)
> between the item that we want to find similiarities of and the other
> items in the database. But we cannot use SUM() over all attributes
> because we need to do some calculation on most (if not all) attributes.
> For example, if the difference is more than 20 we should consider that
> as the item does not match very well on that attribute, therefore
> return 0 (zero). If the difference is less then 5 we consider that as
> very good match and therefore return 100 points in similiary. If its
> between 5 and 20 then we return a score based on a formula we have.
> We will also need to group these items into categories. For example
> category1 can be calculated like
> (ScoreOfAttribute1+ScoreOfAttribute2+Sco
reOfAttribute8)/3. We divide by
> 3 since there are 3 attributes in that category.
> I have tried a couple of ways to implement this.
> Example one:
> Items
> - ItemId
> - Name
> - Price
> Attributes
> - AttributeId
> - Description
> AttributeValues
> - ItemId
> - AttributeId
> - AttributeValue
>
> Example two:
> Items
> - ItemId
> - Name
> - Price
> AttributeValue
> - AttributeId
> - Attribute1Value
> - Attribute2Value
> - .
> - Attribute100Value
> I have also carefully choosen the indexes and most of the time they
> will be used but there will always be some table scan since we need to
> do calculations to find the items with most simlarity, based on my
> "similary calculations".
> So now I need to get some new ideas on how to solve this problem.
> I have been thinking of a table with precalculated points but that
> table will probably get very big. We have over 150000 items that will
> need to be compared with each other.
> I have been thinking if it would be possible to sum all attribute
> values and store that and then divide by the number of attributes, and
> then we could just calculate the difference of the best 500 hits and
> get the top 100 from those? This will probably be quicker but it may
> also result in faults, since we cannot be sure that the most similar
> item is within this top 500.
> So if you have any ideas on this problem please write!
> The table design can be changed. The performance on the SELECT is the
> most important and the UPDATE/DELETE/INSERT is acceptable to take some
> time.
> What we are looking for here is to get the result (top 100 most similar
> items (out of 150000 items with 100 attributes)) within 1 second.
>|||You got everything right (except that we use decimal(5,2) instead of
integers for the values of the attributes), so I dont think there is no
need for me to write more example code.
Do you think its possible to speed up this query some more? Maybe make
a clustered view or something?
Got any ideas?|||You improve this a bit with a CASE expression done at the same time as
the cross join.
SELECT P1.product_id, P2.product_id,
CASE WHEN ABS(P1.a1 - P2.a1) <= 5
THEN ABS(P1.a1 - P2.a1) ELSE 999.99 END
+ CASE WHEN ABS(P1.a2 - P2.a2) <= 5
THEN ABS(P1.a2 - P2.a2) ELSE 999.99 END
.
+ CASE WHEN ABS(P1.a100 - P2.a100) <= 5
THEN ABS(P1.a100 - P2.a100) ELSE 999.99 END
AS score
FROM Products AS P1, Products AS P2
WHERE P1.product_id <> P2.product_id
AND CASE WHEN ABS(P1.a1 - P2.a1) <= 5
THEN ABS(P1.a1 - P2.a1) ELSE 999.99 END
+ CASE WHEN ABS(P1.a2 - P2.a2) <= 5
THEN ABS(P1.a2 - P2.a2) ELSE 999.99 END
.
+ CASE WHEN ABS(P1.a100 - P2.a100) <= 5
THEN ABS(P1.a100 - P2.a100) ELSE 999.99 END
< @.magic score;|||Yup, you would probably do them at the same time, but I did it in more than
one step just to make it easier to follow. Where math is concerned I prefer
to break things down into smaller pieces for illustration purposes. It
makes my head hurt less.
"--CELKO--" <jcelko212@.earthlink.net> wrote in message
news:1149030171.071254.57080@.i40g2000cwc.googlegroups.com...
> You improve this a bit with a CASE expression done at the same time as
> the cross join.
> SELECT P1.product_id, P2.product_id,
> CASE WHEN ABS(P1.a1 - P2.a1) <= 5
> THEN ABS(P1.a1 - P2.a1) ELSE 999.99 END
> + CASE WHEN ABS(P1.a2 - P2.a2) <= 5
> THEN ABS(P1.a2 - P2.a2) ELSE 999.99 END
> ..
> + CASE WHEN ABS(P1.a100 - P2.a100) <= 5
> THEN ABS(P1.a100 - P2.a100) ELSE 999.99 END
> AS score
> FROM Products AS P1, Products AS P2
> WHERE P1.product_id <> P2.product_id
> AND CASE WHEN ABS(P1.a1 - P2.a1) <= 5
> THEN ABS(P1.a1 - P2.a1) ELSE 999.99 END
> + CASE WHEN ABS(P1.a2 - P2.a2) <= 5
> THEN ABS(P1.a2 - P2.a2) ELSE 999.99 END
> ..
> + CASE WHEN ABS(P1.a100 - P2.a100) <= 5
> THEN ABS(P1.a100 - P2.a100) ELSE 999.99 END
> < @.magic score;
>|||What are the indexes you have at the moment?
How long is the query taking to run?
Will you always be looking for comparisons for a single product?
I don't think we can create an indexed view because the view contains a self
join. I don't know if this is a problem in SQL 2005 or not, but it is in
2000. If you need it to be really fast, you could create a reporting table
that is updated every so often (w

ly, daily, hourly, etc) but I would use
that only as a last resort.
"Patrik" <corneliusson@.gmail.com> wrote in message
news:1149024530.832795.257200@.j55g2000cwa.googlegroups.com...
> You got everything right (except that we use decimal(5,2) instead of
> integers for the values of the attributes), so I dont think there is no
> need for me to write more example code.
> Do you think its possible to speed up this query some more? Maybe make
> a clustered view or something?
> Got any ideas?
>|||Right now since we have a table that looks like this.
ProductId
Attribute1
Attribute2
...
Attribute100
There is only one index (clustered index) on the ProductId column.
Yes I will always compare just a single product against the others.
When the scope of items are 1000 the query takes <2 seconds. When the
scope of items are 13000 the query takes 14 seconds.
Jim Underwood wrote:
> What are the indexes you have at the moment?
> How long is the query taking to run?
> Will you always be looking for comparisons for a single product?
> I don't think we can create an indexed view because the view contains a se
lf
> join. I don't know if this is a problem in SQL 2005 or not, but it is in
> 2000. If you need it to be really fast, you could create a reporting tabl
e
> that is updated every so often (w

ly, daily, hourly, etc) but I would us
e
> that only as a last resort.
> "Patrik" <corneliusson@.gmail.com> wrote in message
> news:1149024530.832795.257200@.j55g2000cwa.googlegroups.com...|||"Patrik" <corneliusson@.gmail.com> wrote in message
news:1149111797.665216.285050@.i40g2000cwc.googlegroups.com...
> When the scope of items are 1000 the query takes <2 seconds. When the
> scope of items are 13000 the query takes 14 seconds.
You mean if you have 13000 rows in your table the query takes 14 seconds to
find all the matches for one product? This seems like an awfully long time
for 13000 records. In this case, since we are checking every record, we
can't make use of the index, but I would expect 13000 records to be
processed in a second or two.
I tested on my system with 15000 products (100 attributes each) and was able
to get results back in about 2 seconds. I am running SQL Server 2000 on a
PC with Windows XP installed. The pc has 1 gig of ram and one 3.4 Ghz
processor. Several other services are installed and running on the machine,
so it is not a dedicated db server.|||Maybe it takes longer because I do alot of calculations?
For example:
((((Score16+Score17+Score18)/3)/2) + (Score5/2))*0.25 As Something
Jim Underwood skrev:
> "Patrik" <corneliusson@.gmail.com> wrote in message
> news:1149111797.665216.285050@.i40g2000cwc.googlegroups.com...
> You mean if you have 13000 rows in your table the query takes 14 seconds t
o
> find all the matches for one product? This seems like an awfully long tim
e
> for 13000 records. In this case, since we are checking every record, we
> can't make use of the index, but I would expect 13000 records to be
> processed in a second or two.
> I tested on my system with 15000 products (100 attributes each) and was ab
le
> to get results back in about 2 seconds. I am running SQL Server 2000 on a
> PC with Windows XP installed. The pc has 1 gig of ram and one 3.4 Ghz
> processor. Several other services are installed and running on the machin
e,
> so it is not a dedicated db server.