Wednesday, March 28, 2012

finding duplicate values

Here's my situation: I work at a healthcare testing labratory, I have a table with all our tests, Dim_Test. The two columns i want to look at are: [Mnemonic] and [Test Name]. Mnemonic is the short test name and Test Name is a long name. The condition of the data is that there are a few duplicate mnemonics, but the the duplicates have different long test names. I know this is really messed up but thats our lab system for you, i have no control over that. I want to make a table that has all mnemonics that are duplicated, and the Long test name for each of the duplications. I have come up with a couple possible solutions but i dont know how to implement them or which would be the best course of action.

The First: Sort on the Mnemonic and delete duplicates, store that to a temp table. Then somehow compare Dim_test and the temp table to find the rows missing from the temp table and then store that to, say, temp2. So temp2 would contain all the data that was deleted during the deletion of duplicates on Dim_test. Then i could use the input Dim_test and do a lookup on table temp2 which would give me all mnemonics and test names with duplicate mnemonics.

The Second and third: Somehow get a dataset that has the distinct mnemonic and the count of each distinct mnemonic. Then i could use a conditional split to find out which mnemonics have a count of greater than one. then i can store that to a temp table and lookup on that. That temp table would be the same as temp2 in the first solution. The difference between the second and third solutions is that one uses an aggregate transformation and the other uses a pivot transformation.

I think the pivot transformation would work best if i pivoted on mnemonic but i dont know how to set it up. Please help. Thank you.

You don't need SSIS for this. This T-SQL will do the trick:

Code Snippet

use tempdb

go

CREATE Table [Dim_Test]

(

[Mnemonic] NVARCHAR (20),

[Test Name] NVARCHAR (200)

)

INSERT INTO [Dim_Test] VALUES ('AAA', 'This is a long test name 1')

INSERT INTO [Dim_Test] VALUES ('BBB', 'This is a long test name 2')

INSERT INTO [Dim_Test] VALUES ('CCC', 'This is a long test name 3')

INSERT INTO [Dim_Test] VALUES ('DDD', 'This is a long test name 4')

INSERT INTO [Dim_Test] VALUES ('EEE', 'This is a long test name 5')

INSERT INTO [Dim_Test] VALUES ('AAA', 'This is a long test name 6')

INSERT INTO [Dim_Test] VALUES ('FFF', 'This is a long test name 7')

INSERT INTO [Dim_Test] VALUES ('BBB', 'This is a long test name 8')

INSERT INTO [Dim_Test] VALUES ('AAA', 'This is a long test name 9')

GO

WITH DuplicatesCTE

AS

(

SELECT [Mnemonic]

FROM [Dim_Test]

GROUP BY [Mnemonic]

HAVING COUNT (*) > 1

)

SELECT [Dim_Test].*

FROM [Dim_Test]

INNER JOIN [DuplicatesCTE]

ON [Dim_Test].[Mnemonic] = [DuplicatesCTE].[Mnemonic]

If your source system is using SQL 2000 you will have to replace the CTE with a derived table, but the technique is the same.

Does this give you what you need?

|||select a.mnemonic, a.[test name] from dim_test a inner join
(select mnemonic, count(*) from dim_test group by mnemonic having count(*) > 1) b
on a.mnemonic = b.mnemonic

This should give you a list of all of the duplicate mnemonics with their [test name]s.

If you want to create a NEW table with this information add an "into tablename" clause like so:

select a.mnemonic, a.[test name] into tablename
from dim_test a inner join
(select mnemonic, count(*) from dim_test group by mnemonic having count(*) > 1) b
on a.mnemonic = b.mnemonic|||

Both of these seemed to work, i ended up using Matthew's solution. I should really learn more about the SQL language, i pretty much do everything in SSIS. Thank you both very much.

Matthew could you please help me compact this. Please excuse my ignorance of the basic SQL language. So all the data is in Dim_test and the part i left out before was that there is also data with duplicate Mnemonics and Test Names example:

VALUES ('AAA', 'This is a long test name 1')

VALUES ('AAA', 'This is a long test name 1')

So this is what i want to do in a Stored procedure, and I need to do it without changing any data in Dim_test:

1. Disregard all duplicates of Mnemonic AND [Test Name] in Dim_test

2. DELETE FROM DupMnemonics

3. Store the results of your query in DupMnemonics

Thank you so much.

This is the query i ended up using

WITH DuplicatesCTE AS (SELECT Mnemonic
FROM DupMnemonics
GROUP BY Mnemonic
HAVING (COUNT(*) > 1))
SELECT DupMnemonics_1.Mnemonic, DupMnemonics_1.[Test Name]
FROM DupMnemonics AS DupMnemonics_1 INNER JOIN
DuplicatesCTE AS DuplicatesCTE_1 ON DupMnemonics_1.Mnemonic = DuplicatesCTE_1.Mnemonic

|||An outer SELECT DISTINCT should eliminate those duplicates.|||

You can use a CTE to do this as well, when used in combination with the ROW_NUMBER ranking funcctiokn. Here's an example:

http://www.eggheadcafe.com/software/aspnet/29974931/removing-duplicate-rows-.aspx

|||

MatthewRoche wrote:

You can use a CTE to do this as well, when used in combination with the ROW_NUMBER ranking funcctiokn. Here's an example:

http://www.eggheadcafe.com/software/aspnet/29974931/removing-duplicate-rows-.aspx

Yeah, to the OP, there are so many ways to achieve this, you may want to move your discussions to the Transact-SQL forum. This is, after all, off topic in the SSIS forum... Wink|||

Is there a way to just move this thread over to T-SQL

Thank you all.

|||

heff89 wrote:

Is there a way to just move this thread over to T-SQL

Thank you all.

There is. I'll do that now for you.

No comments:

Post a Comment