Fragmented array: Find duplicates

I have a query that I used that was based on the following query that Hugo
gave me yesterday (I added CheckNumber in for the problem).
I had a problem with finding duplicate check numbers in the table
(voids,re-issued checks etc). It does show in my results, but out of
4300+checks it is hard to find all the duplicates for auditing purposes
while setting up my script.
How do I go about changing my query to show ONLY duplicates (or vice versa)?
Here is the data:
drop table EmployeePay
drop table PayDetail
CREATE TABLE [dbo].[EmployeePay] (
[EmployeePayID] [int] IDENTITY (1, 1),
[ClientID] [int] Not NULL ,
[CheckNumber] [int] Not Null,
[PayDate] [smalldatetime] Not NULL,
PRIMARY KEY (EmployeePayID)
) ON [PRIMARY]
CREATE TABLE [dbo].[PayDetail] (
[PayDetailID] [int] IDENTITY (1, 1),
[ClientID] [int] Not NULL ,
[PayDate] [smalldatetime] Not NULL ,
[Code] [varchar] (5) Not NULL ,
[Hours] [int] Not NULL ,
[Amount] [money] Not NULL,
PRIMARY KEY (PayDetailID)
) ON [PRIMARY]
insert EmployeePay values (1,1234,'01-01-05')
insert EmployeePay values (1,2255,'02-03-05')
insert EmployeePay values (2,1331,'01-01-05')
insert EmployeePay values (2,3011,'03-06-05')
insert EmployeePay values (2,1331,'05-06-05')
insert PayDetail values (1,'01-01-05','5',20,200.50)
insert PayDetail values (1,'01-01-05','6',25,10.50)
insert PayDetail values (1,'01-01-05','13',50,120.25)
insert PayDetail values (1,'02-03-05','5',5,110)
insert PayDetail values (1,'02-03-05','9',18,250.50)
insert PayDetail values (2,'01-01-05','5',50,120.25)
insert PayDetail values (2,'01-01-05','44',10,320.32)
insert PayDetail values (2,'01-01-05','32',50,120.25)
insert PayDetail values (2,'03-06-05','46',18,235.75)
insert PayDetail values (2,'05-06-05','5',-50,-120.25)
insert PayDetail values (2,'05-06-05','44',-10,-320.32)
insert PayDetail values (2,'05-06-05','32',-50,-120.25)
and the script I am using:
CREATE VIEW PayDetailWithRank
AS
SELECT pd1.ClientID, pd1.PayDate, pd1.Code, pd1.Hours, pd1.Amount,
(SELECT COUNT(*)
FROM PayDetail AS pd2
WHERE pd2.ClientID = pd1.ClientID
AND pd2.PayDate = pd1.PayDate
AND pd2.Code < pd1.Code) AS Rank
FROM PayDetail AS pd1
go
SELECT e.ClientID, e.PayDate, e.CheckNumber,
d1.Code, d1.Hours, d1.Amount,
d2.Code, d2.Hours, d2.Amount,
d3.Code, d3.Hours, d3.Amount
FROM EmployeePay AS e
INNER JOIN PayDetailWithRank AS d1
ON d1.ClientID = e.ClientID
AND d1.PayDate = e.PayDate
AND d1.Rank % 3 = 0
LEFT JOIN PayDetailWithRank AS d2
ON d2.ClientID = e.ClientID
AND d2.PayDate = e.PayDate
AND d2.Rank / 3 = d1.Rank / 3
AND d2.Rank % 3 = 1
LEFT JOIN PayDetailWithRank AS d3
ON d3.ClientID = e.ClientID
AND d3.PayDate = e.PayDate
AND d3.Rank / 3 = d1.Rank / 3
AND d3.Rank % 3 = 2
ORDER BY e.ClientID, e.PayDate, d1.Rank
go
DROP VIEW PayDetailWithRank
go
This will give you 1 duplicate check. Can I change this to show only
duplicate, somehow?
Thanks,
TomNever mind.
I figure out how. I have to modify the statement slightly
I need to change the Order by to Group By and add a "Having Count(*) > 1"
clause.
Tom
"tshad" <tscheiderich@.ftsolutions.com> wrote in message
news:%23tVBdqhuFHA.1472@.TK2MSFTNGP15.phx.gbl...
>I have a query that I used that was based on the following query that Hugo
>gave me yesterday (I added CheckNumber in for the problem).
> I had a problem with finding duplicate check numbers in the table
> (voids,re-issued checks etc). It does show in my results, but out of
> 4300+checks it is hard to find all the duplicates for auditing purposes
> while setting up my script.
> How do I go about changing my query to show ONLY duplicates (or vice
> versa)?
> Here is the data:
> drop table EmployeePay
> drop table PayDetail
> CREATE TABLE [dbo].[EmployeePay] (
> [EmployeePayID] [int] IDENTITY (1, 1),
> [ClientID] [int] Not NULL ,
> [CheckNumber] [int] Not Null,
> [PayDate] [smalldatetime] Not NULL,
> PRIMARY KEY (EmployeePayID)
> ) ON [PRIMARY]
> CREATE TABLE [dbo].[PayDetail] (
> [PayDetailID] [int] IDENTITY (1, 1),
> [ClientID] [int] Not NULL ,
> [PayDate] [smalldatetime] Not NULL ,
> [Code] [varchar] (5) Not NULL ,
> [Hours] [int] Not NULL ,
> [Amount] [money] Not NULL,
> PRIMARY KEY (PayDetailID)
> ) ON [PRIMARY]
> insert EmployeePay values (1,1234,'01-01-05')
> insert EmployeePay values (1,2255,'02-03-05')
> insert EmployeePay values (2,1331,'01-01-05')
> insert EmployeePay values (2,3011,'03-06-05')
> insert EmployeePay values (2,1331,'05-06-05')
> insert PayDetail values (1,'01-01-05','5',20,200.50)
> insert PayDetail values (1,'01-01-05','6',25,10.50)
> insert PayDetail values (1,'01-01-05','13',50,120.25)
> insert PayDetail values (1,'02-03-05','5',5,110)
> insert PayDetail values (1,'02-03-05','9',18,250.50)
> insert PayDetail values (2,'01-01-05','5',50,120.25)
> insert PayDetail values (2,'01-01-05','44',10,320.32)
> insert PayDetail values (2,'01-01-05','32',50,120.25)
> insert PayDetail values (2,'03-06-05','46',18,235.75)
> insert PayDetail values (2,'05-06-05','5',-50,-120.25)
> insert PayDetail values (2,'05-06-05','44',-10,-320.32)
> insert PayDetail values (2,'05-06-05','32',-50,-120.25)
>
> and the script I am using:
> CREATE VIEW PayDetailWithRank
> AS
> SELECT pd1.ClientID, pd1.PayDate, pd1.Code, pd1.Hours, pd1.Amount,
> (SELECT COUNT(*)
> FROM PayDetail AS pd2
> WHERE pd2.ClientID = pd1.ClientID
> AND pd2.PayDate = pd1.PayDate
> AND pd2.Code < pd1.Code) AS Rank
> FROM PayDetail AS pd1
> go
> SELECT e.ClientID, e.PayDate, e.CheckNumber,
> d1.Code, d1.Hours, d1.Amount,
> d2.Code, d2.Hours, d2.Amount,
> d3.Code, d3.Hours, d3.Amount
> FROM EmployeePay AS e
> INNER JOIN PayDetailWithRank AS d1
> ON d1.ClientID = e.ClientID
> AND d1.PayDate = e.PayDate
> AND d1.Rank % 3 = 0
> LEFT JOIN PayDetailWithRank AS d2
> ON d2.ClientID = e.ClientID
> AND d2.PayDate = e.PayDate
> AND d2.Rank / 3 = d1.Rank / 3
> AND d2.Rank % 3 = 1
> LEFT JOIN PayDetailWithRank AS d3
> ON d3.ClientID = e.ClientID
> AND d3.PayDate = e.PayDate
> AND d3.Rank / 3 = d1.Rank / 3
> AND d3.Rank % 3 = 2
> ORDER BY e.ClientID, e.PayDate, d1.Rank
> go
> DROP VIEW PayDetailWithRank
> go
>
> This will give you 1 duplicate check. Can I change this to show only
> duplicate, somehow?
> Thanks,
> Tom
>|||Hi
Changing the first left outer join to be an inner join will ensure at least
one duplicate.
SELECT e.ClientID, e.PayDate, e.CheckNumber,
d1.Code, d1.Hours, d1.Amount, d1.rank,
d2.Code, d2.Hours, d2.Amount, d2.rank,
d3.Code, d3.Hours, d3.Amount, d3.rank
FROM EmployeePay AS e
JOIN PayDetailWithRank AS d1
ON d1.ClientID = e.ClientID
AND d1.PayDate = e.PayDate
AND d1.Rank % 3 = 0
JOIN PayDetailWithRank AS d2
ON d2.ClientID = e.ClientID
AND d2.PayDate = e.PayDate
AND d2.Rank / 3 = d1.Rank / 3
AND d2.Rank % 3 = 1
LEFT JOIN PayDetailWithRank AS d3
ON d3.ClientID = e.ClientID
AND d3.PayDate = e.PayDate
AND d3.Rank / 3 = d1.Rank / 3
AND d3.Rank % 3 = 2
ORDER BY e.ClientID, e.PayDate, d1.Rank
What are you expecting if there are 4 duplicates?
John
"tshad" <tscheiderich@.ftsolutions.com> wrote in message
news:%23tVBdqhuFHA.1472@.TK2MSFTNGP15.phx.gbl...
>I have a query that I used that was based on the following query that Hugo
>gave me yesterday (I added CheckNumber in for the problem).
> I had a problem with finding duplicate check numbers in the table
> (voids,re-issued checks etc). It does show in my results, but out of
> 4300+checks it is hard to find all the duplicates for auditing purposes
> while setting up my script.
> How do I go about changing my query to show ONLY duplicates (or vice
> versa)?
> Here is the data:
> drop table EmployeePay
> drop table PayDetail
> CREATE TABLE [dbo].[EmployeePay] (
> [EmployeePayID] [int] IDENTITY (1, 1),
> [ClientID] [int] Not NULL ,
> [CheckNumber] [int] Not Null,
> [PayDate] [smalldatetime] Not NULL,
> PRIMARY KEY (EmployeePayID)
> ) ON [PRIMARY]
> CREATE TABLE [dbo].[PayDetail] (
> [PayDetailID] [int] IDENTITY (1, 1),
> [ClientID] [int] Not NULL ,
> [PayDate] [smalldatetime] Not NULL ,
> [Code] [varchar] (5) Not NULL ,
> [Hours] [int] Not NULL ,
> [Amount] [money] Not NULL,
> PRIMARY KEY (PayDetailID)
> ) ON [PRIMARY]
> insert EmployeePay values (1,1234,'01-01-05')
> insert EmployeePay values (1,2255,'02-03-05')
> insert EmployeePay values (2,1331,'01-01-05')
> insert EmployeePay values (2,3011,'03-06-05')
> insert EmployeePay values (2,1331,'05-06-05')
> insert PayDetail values (1,'01-01-05','5',20,200.50)
> insert PayDetail values (1,'01-01-05','6',25,10.50)
> insert PayDetail values (1,'01-01-05','13',50,120.25)
> insert PayDetail values (1,'02-03-05','5',5,110)
> insert PayDetail values (1,'02-03-05','9',18,250.50)
> insert PayDetail values (2,'01-01-05','5',50,120.25)
> insert PayDetail values (2,'01-01-05','44',10,320.32)
> insert PayDetail values (2,'01-01-05','32',50,120.25)
> insert PayDetail values (2,'03-06-05','46',18,235.75)
> insert PayDetail values (2,'05-06-05','5',-50,-120.25)
> insert PayDetail values (2,'05-06-05','44',-10,-320.32)
> insert PayDetail values (2,'05-06-05','32',-50,-120.25)
>
> and the script I am using:
> CREATE VIEW PayDetailWithRank
> AS
> SELECT pd1.ClientID, pd1.PayDate, pd1.Code, pd1.Hours, pd1.Amount,
> (SELECT COUNT(*)
> FROM PayDetail AS pd2
> WHERE pd2.ClientID = pd1.ClientID
> AND pd2.PayDate = pd1.PayDate
> AND pd2.Code < pd1.Code) AS Rank
> FROM PayDetail AS pd1
> go
> SELECT e.ClientID, e.PayDate, e.CheckNumber,
> d1.Code, d1.Hours, d1.Amount,
> d2.Code, d2.Hours, d2.Amount,
> d3.Code, d3.Hours, d3.Amount
> FROM EmployeePay AS e
> INNER JOIN PayDetailWithRank AS d1
> ON d1.ClientID = e.ClientID
> AND d1.PayDate = e.PayDate
> AND d1.Rank % 3 = 0
> LEFT JOIN PayDetailWithRank AS d2
> ON d2.ClientID = e.ClientID
> AND d2.PayDate = e.PayDate
> AND d2.Rank / 3 = d1.Rank / 3
> AND d2.Rank % 3 = 1
> LEFT JOIN PayDetailWithRank AS d3
> ON d3.ClientID = e.ClientID
> AND d3.PayDate = e.PayDate
> AND d3.Rank / 3 = d1.Rank / 3
> AND d3.Rank % 3 = 2
> ORDER BY e.ClientID, e.PayDate, d1.Rank
> go
> DROP VIEW PayDetailWithRank
> go
>
> This will give you 1 duplicate check. Can I change this to show only
> duplicate, somehow?
> Thanks,
> Tom
>|||>> had a problem with finding duplicate check numbers in the table (voids,r
e-issued checks etc). <<
I have a better question for you: why are you creating duplicate check
numbers? Do you think that perhaps not having a proper relational key
could be part of the problem?
If you will learn to write proper DDL, you will save a lot of complex
DML.
Also, look up the ISO-8601 format for dates and times.|||"--CELKO--" <jcelko212@.earthlink.net> wrote in message
news:1126814005.385526.122020@.z14g2000cwz.googlegroups.com...
> I have a better question for you: why are you creating duplicate check
> numbers? Do you think that perhaps not having a proper relational key
> could be part of the problem?
As I had mentioned, this is not my database. I built a table to mimic that
tables I am extracting information from to create a CSV file that I can use
to import into our system. I am not creating Duplicate Check numbers - that
is what is in their system and I have to deal with it. I have over 5000
checks and as it turned out 5 duplicate check numbers. I used my group
by/having solutions to find them. As I mentioned, they were voids (hence
the duplicate check number) as well as a check that was re-issued (we don't
know why yet).
Are you suggesting I should spend my time creating a PROPER database schema,
even if it doesn't mirror what I am trying to accomplish?
And BTW, you have pounded incessantly about the evils of the IDENTITY.
I for one am grateful they used it, as you would have seen in my other post.
Hugos solution with the Views worked great, but hadn't taken into account
multiple codes on one check (which there would have been no way for him to
know this). I found that their tables used identities and that was the only
field I could replace the Code field with to make it work. It had to be
sequential and unique - Gaps would not have been an issue.
Tom
> If you will learn to write proper DDL, you will save a lot of complex
> DML.
> Also, look up the ISO-8601 format for dates and times.
>|||On Thu, 15 Sep 2005 13:40:38 -0700, tshad wrote:
(snip)
>Hugos solution with the Views worked great, but hadn't taken into account
>multiple codes on one check (which there would have been no way for him to
>know this). I found that their tables used identities and that was the onl
y
>field I could replace the Code field with to make it work. It had to be
>sequential and unique - Gaps would not have been an issue.
Hi Tom,
It would also have been solveable without the identity column. If there
was no single column to make each row unique within Client/Paydate, I'd
have used a combination of two columns. Or three. Or ... well, you get
the point. The code would have been more messy, but it would have
worked.
The only thing I would not have been able to handle were complete
duplicates (i.e. two or more rows with the same value in each of their
columns). But if you had those, you'd have much bigger problems to solve
than generating a kind-of-crosstab report. <g>
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||"Hugo Kornelis" <hugo@.pe_NO_rFact.in_SPAM_fo> wrote in message
news:5oqji11o59rufd33auh0psimi867s65rj7@.
4ax.com...
> On Thu, 15 Sep 2005 13:40:38 -0700, tshad wrote:
> (snip)
> Hi Tom,
> It would also have been solveable without the identity column. If there
> was no single column to make each row unique within Client/Paydate, I'd
> have used a combination of two columns. Or three. Or ... well, you get
> the point. The code would have been more messy, but it would have
> worked.
I agree.
But in this case, there was no other set of columns (that I could make out)
that would have solved the problem. I am not sure you could have made
multiple columns work with a surrogate key, in this case. All the tables
(earnings, taxes and deductions) all could have multiple records with the
same code, date, employee number (SSN) etc on the same check. That was what
I was looking for, I just happen to notice that they have an identity field
there and it solved the problem.
Thanks,
Tom
> The only thing I would not have been able to handle were complete
> duplicates (i.e. two or more rows with the same value in each of their
> columns). But if you had those, you'd have much bigger problems to solve
> than generating a kind-of-crosstab report. <g>
> Best, Hugo
> --
> (Remove _NO_ and _SPAM_ to get my e-mail address)

Wednesday, March 7, 2012

Find duplicates

No comments:

Post a Comment

Fragmented array

Blog Archive

About Me