Thursday, March 29, 2012

Finding Duplicates

I am trying to complete an insert from query but the problem is I have duplicates, so I'm getting an error message. So to correct it I am creating a Find Duplicates statement in the Query analyzer but Its not working can someone tell me whats wrong with this statement (by the way I'm in SQL 2000 Server)

thank you

SELECT EmployeeGamingLicense [TM#]AS [TM# Field], Count([TM#])AS NumberOfDups
FROM TERMINATION
GROUP BY [TM#]
HAVING Count([TM#])>1;
GOHey Desiree...been a while...

What's this:

EmployeeGamingLicense [TM#]AS [TM# Field],

Doesn't look like one column...how about the DDL for the TERMINATION table...

Or how about

SELECT [TM#], Count([TM#]) AS NumberOfDups
FROM TERMINATION
GROUP BY [TM#]
HAVING Count([TM#])>1|||HI BRETT!!! :) HOw ya Been?

Your right totally over looked that, thank you. Works perfect now. Got another question though, what if I wanted to find duplicate records from one table to another. I would like to be able to see if I can find matching records in two tables One is the TERMINATION.tbl the other EmployeeGamingLicense.tbl. Can I make this into a Stored procedure?

SELECT EmployeeGamingLicense.[TM#], EmployeeGamingLicense.LASTNAME, EmployeeGamingLicense.FIRSTNAME, EmployeeGamingLicense.[SSN#]
FROM EmployeeGamingLicense LEFT JOIN TERMINATION ON EmployeeGamingLicense.[TM#] = TERMINATION.[TM #]
WHERE (((TERMINATION.[TM #]) Is Not Null));|||what if I wanted to find duplicate records from one table to another. I would like to be able to see if I can find matching records in two tables...

I'd have to ask you what that means...

For example...in your previous query, you found dups...

Do you want to see the rows in the License table where you have dups in TERMINATION?

Do you want to see where you have dups because of a join between the 2?

I'm not sure...|||I'm good my statement worked I found the duplicates I needed to find. The thing is I'm Inheriting databases and I need to find the errors with them before I create the Gui for the client. Thank you Brett :)|||So you have to audit the entire db that has a poorly defined schema...

good luck

Any dates or numerics in varchar fields?

Once you clean up the dups...do you plan to put unique constarints on?|||No the dates are in (datetype: smalledatetime) and the Numbers like SSN# is in (datetype: Int) :)|||A little off topic but what do you do with SSN's that begin with 0?|||Anything that does not get math applied to it, is not a number in my book...

You want to constrain it to be numeric...that's fine...otherwise it's char or varchar|||is that ok though for them to be Int and smalldatetime, or do you suggest char or nvarchar|||I agree, we use char(9) in our db's but I've had to fight off developers that want SSN and TIN fields as int datatypes. I hate to have to determine if I need to add leading zeros back to SSN's that are int datatypes or if the SSN is truly corrupt/bad data.|||Ok this is defintely one of those learning processes I desperatley need. I love the feed back forces me to analyze the datatypes more deeply especially for future purposes. Ok so you would reccommend char instead.|||Ok Interger data consist of negative and positive whole numbers such as -15, 0, 5, and 2,509. Ok so INT base data type is a Number in the range from -2,147,483,648 through 2,147483.647 only requires 4 bytes of storage per value). Ok and char can consist of numberics, and can consist of up to 8 KB|||Dates should defenitley stay as dates...why small though?

Is space an issue?

Just a rule of thumb....

Does math need to applied? No? Then it's char or varchar...

Now here's another one...

Are you're dates nullable or not nullable?|||A little off topic but what do you do with SSN's that begin with 0?Worse yet, what do you do for the wierdo's among us with letters in their SSN?

-PatP|||What is that suppose to mean?|||No good question didnt even think about that No space is not an issue so it doesnt have to be smalldatetime can just be datetime datatype instead.

THANK YOU :)

I Love this really makes you think and plan ahead, and with database you need to plan ahead eheheh|||Is anyone as in love with BOOKS ONLINE as I am. :)
and this FORUM OF COURSE|||Just make everything nvarchar(4000) especially since space isn't an issue. That way every column is the same and easier to manage. Better yet text works well too.|||Just make everything nvarchar(4000) especially since space isn't an issue. That way every column is the same and easier to manage. Better yet text works well too.

Now Peter...|||SSNs should be char(9). Listen to Der Kaisermann.

(OK, char(11) would be a second choice...)|||SSNs should be char(9). Listen to Der Kaisermann.

(OK, char(11) would be a second choice...)If you are only storing the kind of SSN/TIN numbers used today, that might be true. If you have to store all of the SSN values, you'll need a VARCHAR(12).

For most purposes, you can get by with only the 9 digit SSN values used today, and an indicator for non-standard ones that you can then describe in more detail for the exceptions in some kind of "notes" column.

-PatP|||What is that suppose to mean?

OK, examples?...but my spider sense says you're right...

Only because peoplesoft has National Id as varchar(12)|||The commonly used SSN values today are 999-99-9999

Immigrant families, circa 1948-1952 are formatted like 999-99-9999#A9, with the first nine digits being the same for all of them.

Families of business owners, circa 1963-1965 were formatted 999-99-9999C99, with the first nine digits being the business owner's SSN.

Although rare, tax paying non-residents are formatted as either aa9-000-99-9999, 999-00-9999AA9, or a few variants. These don't follow much of any rules that I know of, although there might be some rhyme or reason under the surface.

There aren't a bunch of these "problem child" values, but there are enough that they can cause massive headaches.

Then there are the real pain-in-the-patoot types that have earned significant amounts under one of those "oddball" formats, then were given another one of the commonly used variety (999-99-9999) for "administrative convenience" and established enough earnings/etc that neither could be rescinded. Those folks have the distinction of having two active, legitimate SSNs, and may even be paying taxes on both of them at the same time! They can ball up just about anything, since even the packages that support the nonstandard formats have no real way to establish a relationship between them!

-PatP|||Wow, I had to deal with SSN's, but not in the case where a person has more than 1 at any given time... In my case every newborn is given a fake SSN (usually mother's SSN-1) until a name is given to a child (boggles my mind when a mother is carrying a baby for 9 months and at the time of birth has no idea how to call it!!!!!!)|||Pat, I do NOT want to play Trivial Pursuit with you.|||Pat, I do NOT want to play Trivial Pursuit with you.It seems like nobody wants to play with me!

I've got a bunch of great Trivial Pursuit stories. I once won a game where I went 5 for 5, then spent something like 15 turns trying to get the sports one (in a room of sports-aholics).

-PatP|||It seems like nobody wants to play with me!
-PatP

Not even the girlfreind?|||Brett, you don't play Trivial Pursuit with your girlfriend. You play Twister.|||The commonly used SSN values today are 999-99-9999

Immigrant families, circa 1948-1952 are formatted like 999-99-9999#A9, with the first nine digits being the same for all of them.

-PatP

...and here I thought it was something Canadien?!?|||No, Canadian SSNs use the metric system. Everybody knows that.|||If you are only storing the kind of SSN/TIN numbers used today, that might be true. If you have to store all of the SSN values, you'll need a VARCHAR(12).

For most purposes, you can get by with only the 9 digit SSN values used today, and an indicator for non-standard ones that you can then describe in more detail for the exceptions in some kind of "notes" column.

-PatP
I'm a bit confused. My table contains SSNs (all unique) and i've indexed the column, and is often used in WHERE clauses. It's said of that indexes shall not be created on FREE-FORM text columns (varchar usually). Some bad examples in this context are Flag column and SSN column. But since SSN's unique surely one can index it.

CHAR(12) would be better or VARCHAR(12)?

Thanx!|||What? You still here? What were we talking about, again?

Oh yeah. I remember.

"Thou Shalt Not Create Indexes On Character Fields"? I know of no such commandment. They aren't as efficient as indexes on, say, integer columns, but its still much better than not having an index on your character field.

But I'd use varchar unless you want to enforce the length of the string. That's why I originally suggested char(9) or char(11), but as Pat pointed out not ALL social security numbers are the same length.|||WHAT?

I thought you used IDENTITY Columns in every table?|||sorry to for the stupid question but what do you mean By Free-Form what is exactly is that?

Inquisitive by nature :)|||I cant believe we are still on this you guys are so funny

I love it :)|||What? Me use IDENTITY?

I NEVER use IDENTITY columns.

I'm the GUID GUY, remember?

...and by Free-form, he means that users can put whatever text they want in the field. It's not populated from a drop-down list, for instance.|||There is a hard limit to the size of an index key of just over 900 bytes. So you can not index a varchar(1000) field, if you expect to be storing values that are 1000 characters long. If all of the values happen to be less than the limit (I do not know it off hand. I know, bad DBA), then you can get away with it. Of course, the first 1000 character value to come along will hurt you bad.|||What? You still here? What were we talking about, again?

Oh yeah. I remember.

"Thou Shalt Not Create Indexes On Character Fields"? I know of no such commandment. They aren't as efficient as indexes on, say, integer columns, but its still much better than not having an index on your character field.

But I'd use varchar unless you want to enforce the length of the string. That's why I originally suggested char(9) or char(11), but as Pat pointed out not ALL social security numbers are the same length.
Sorry, bear with me plz!
Shouldn't one use an index on a column even if it's unique. SSNs are usually used in the WHERE clauses in applications(in my application just like emp.id). Users in app. access individual records on the basis on SSN. If a table has millions of records, Index Scan or Table Scan?
Stupid question, it is. but doesn't bother me. Your answer may help in clearing the concepts. :cool:|||You may get an Index Seek. But even Indes Scan is better than Table Scan in your situation, because there are much fewer index pages for SSN index of char(9) vs. data pages for your table.|||Sorry, bear with me plz!
Shouldn't one use an index on a column even if it's unique. SSNs are usually used in the WHERE clauses in applications(in my application just like emp.id). Users in app. access individual records on the basis on SSN. If a table has millions of records, Index Scan or Table Scan?
Stupid question, it is. but doesn't bother me. Your answer may help in clearing the concepts. :cool:
Well, if it's truly unique it should have a unique index or unique constraint(which creates a unique index) on the column. Check out this article about index rules in general http://www.sql-server-performance.com/optimizing_indexes.asp|||You may get an Index Seek. But even Indes Scan is better than Table Scan in your situation, because there are much fewer index pages for SSN index of char(9) vs. data pages for your table.

Getting off the floor and into my barstool...umm office chair

What?

You're not suggesting a scan is a good thing? Are you?|||You're not suggesting a scan is a good thing? Are you?An index scan versus a table scan... It could be a good thing, in relative terms.

-PatP

No comments:

Post a Comment