Wednesday, March 28, 2012

Finding Correct Primary Key

Hello,
I could not reach a conclusion on this. Maybe someone helps me to understand
better.
I have a consolidated text extract from other systems and want to upload
the data to an SQL table.
Below is the fields in extract.
OrderNo
LineNumber
SellingCompany
PurcCtry
FinalizeDt
ItemNo
CustomerID
SalesRep
Quantity
Sales_USD
At first analysis, I thought OrderNo and LineNumber are the PK of the table
and designed the table.
When I receive the data I realized that since the data is coming from
different sources. Some of them has duplicate Orderno LineNumber
combination.
Now my data has two different of type of data one has unique key of Orderno
LineNumber combination. On the other hand second part of data needs OrderNo
LineNumber ,FinalizeDt,ItemNo and CustomerID for a unique key.
Question is : Should I keep these data in two different table, or should I
keep them in a single table by adding more columns on PK ?
What is the affect of defining larger PK even it is not needed for some of
the data?
My table has data has 1.5 million record and increasing 75000 each month.
70 % of data has Unique Orderno Line Number
Data is used by sql server reporting services, I have also some plans to use
analysis server in the future.
Thanks in advance.
erdal,You can use up to 16 columns for a composite primary key, however, make sure
you only use as many as you actually need, based on your data.
If you expect frequent and large increases in table size, make the primary
key nonclustered, and find a more appropriate candidate for the clustered
index. An ideal candidate for this kind of situation is a column with
non-nullable unique values, that increase for each new inserted row (e.g. an
identity column).
Using several tables is an option, yet it may complicate querying for the
reporting purposes.
ML|||> Question is : Should I keep these data in two different table, or should I
> keep them in a single table by adding more columns on PK ?
It seems to me that the data from the 2 sources represents different
entities so you should store these in different tables.

> On the other hand second part of data needs OrderNo
> LineNumber ,FinalizeDt,ItemNo and CustomerID for a unique key.
You need to do some analysis to determine the appropriate PK. What does the
data from this data source actually represent (e.g. order details history or
different order entry systems)? Simply adding columns until you come up
with a unique combination isn't the right approach. Data may be unique
based on your current data sample but not necessarily in the future.
Perhaps the data is bad and you are only masking an underlying problem.
Maybe some data scrubbing is needed before import. These are all issues
that need to be addressed in order to make intelligent schema design
decisions.
Hope this helps.
Dan Guzman
SQL Server MVP
"Erdal Akbulut" <erdalim21@.yahoo.com> wrote in message
news:ukME7Fv1FHA.2348@.TK2MSFTNGP15.phx.gbl...
> Hello,
> I could not reach a conclusion on this. Maybe someone helps me to
> understand
> better.
> I have a consolidated text extract from other systems and want to upload
> the data to an SQL table.
> Below is the fields in extract.
> OrderNo
> LineNumber
> SellingCompany
> PurcCtry
> FinalizeDt
> ItemNo
> CustomerID
> SalesRep
> Quantity
> Sales_USD
>
> At first analysis, I thought OrderNo and LineNumber are the PK of the
> table
> and designed the table.
> When I receive the data I realized that since the data is coming from
> different sources. Some of them has duplicate Orderno LineNumber
> combination.
> Now my data has two different of type of data one has unique key of
> Orderno
> LineNumber combination. On the other hand second part of data needs
> OrderNo
> LineNumber ,FinalizeDt,ItemNo and CustomerID for a unique key.
> Question is : Should I keep these data in two different table, or should I
> keep them in a single table by adding more columns on PK ?
> What is the affect of defining larger PK even it is not needed for some of
> the data?
> My table has data has 1.5 million record and increasing 75000 each month.
> 70 % of data has Unique Orderno Line Number
> Data is used by sql server reporting services, I have also some plans to
> use
> analysis server in the future.
> Thanks in advance.
>
> erdal,
>|||On Sat, 22 Oct 2005 13:15:06 +0300, Erdal Akbulut wrote:

>Hello,
>I could not reach a conclusion on this. Maybe someone helps me to understan
d
>better.
>I have a consolidated text extract from other systems and want to upload
>the data to an SQL table.
>Below is the fields in extract.
>OrderNo
>LineNumber
>SellingCompany
>PurcCtry
>FinalizeDt
>ItemNo
>CustomerID
>SalesRep
>Quantity
>Sales_USD
>
>At first analysis, I thought OrderNo and LineNumber are the PK of the table
>and designed the table.
>When I receive the data I realized that since the data is coming from
>different sources. Some of them has duplicate Orderno LineNumber
>combination.
(snip)
>Question is : Should I keep these data in two different table, or should I
>keep them in a single table by adding more columns on PK ?
Hi Erdal,
It seems to me that you're approaching this from the wrong way. You
should start with how the data is supposed to be related in the
business.
Not knowing your business and making some wild assumptions about the
meaning of some columns names, my first guess is that (OrderNo,
LineNumber) might be a good candidate key for this table - but that's
noway sure! If your business assigns one and only one salesrep toe ach
customer, than SalesRep should be in the Customers table and can be
removed from this table. Similar, if each customer makes all it's
purchases in one country, then PurcCtry should be moved to the Customers
table as well. There might be more similar dependencies between the
columns that only someone with knowledge of the business model of your
company can find.
WRT the duplicated order numbers: if these sources each have their own
system for assigning order numbers (as might be the case after the
merger of two companies, or in a company with independent branches) and
duplicates are to be expected, then you should add a column to indicate
the source (not the data source, but the source of the order - e.g. the
name or ID of the branch where the order comes from), and add that
column to the primary key. OTOH, if the order numbers really should be
unique in the complete collection of data, then you'll have to hunt down
the reason why duplicates were entered in the first place, fix the
problem, clean up the erronous data before proceeding with this
consolidation project.
(previously snipped text:)
>Now my data has two different of type of data one has unique key of Orderno
>LineNumber combination. On the other hand second part of data needs OrderNo
>LineNumber ,FinalizeDt,ItemNo and CustomerID for a unique key.
I'm not sure if I understand you right. Maybe you're just trying to find
a set of columns that would technically (but not logically) qualify as
primary key - in that case see above: either add a column or fix the
real problem.
But if you're saying that the business model logically implies that the
data from one source has (OrderNo, LineNumber, FinalizeDt, ItemNo,
CustomerID) as the "real" primary key, then
a) You have a very unusual business model (but of course, mine is not to
judge), and
b) You can't combine the data from the two sources into one table, since
they have a different logical structure - it would be like trying to
combine ledger entries and personnel records into one table.

>What is the affect of defining larger PK even it is not needed for some of
>the data?
You'll get corrupted data in your table. Would you consider defining the
primary key on your Personnel table as (EmployeeID, EmpName)? I'd hope
not - since it's not the question IF you'll end up with duplicated data,
but WHEN you'll get it. And the answer is "sooner than you think".
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||> When I receive the data I realized that since the data is coming from
> different sources. Some of them has duplicate Orderno LineNumber
> combination.
>
It sounds as if what you need is something like SourceID, OrderNo,
LineNumber. If SourceID is not included in the extract files, then populate
it during the transform/load step.
.|||Thanks for the answers,
Let me try to clarify the issue.
we have 7 companies running on the same application, and 2 more using
different systems. All applications run on a mainframe environment and the
data is also consolidated in mainframe.
For the first 7 companies. OrderNo and Orderline are the candidates. No
problem at this part.
For the other 2 companies, we do not have the data in orderline level. The
data is summed by item, customer etc by month, and they have dummy ordenos
as 999999 and line no as 001
Currently I have two tables. one for the first 7 companies having Orderno
and Linenumber as PK. and another table for the other 2 companies having PK
with multiple columns.
I was thinking to merge the data in one table and define a larger PK having
PK columns from first and second table.
Seems it is not recommended in this case.
regards,
"Erdal Akbulut" <erdalim21@.yahoo.com> wrote in message
news:ukME7Fv1FHA.2348@.TK2MSFTNGP15.phx.gbl...
> Hello,
> I could not reach a conclusion on this. Maybe someone helps me to
understand
> better.
> I have a consolidated text extract from other systems and want to upload
> the data to an SQL table.
> Below is the fields in extract.
> OrderNo
> LineNumber
> SellingCompany
> PurcCtry
> FinalizeDt
> ItemNo
> CustomerID
> SalesRep
> Quantity
> Sales_USD
>
> At first analysis, I thought OrderNo and LineNumber are the PK of the
table
> and designed the table.
> When I receive the data I realized that since the data is coming from
> different sources. Some of them has duplicate Orderno LineNumber
> combination.
> Now my data has two different of type of data one has unique key of
Orderno
> LineNumber combination. On the other hand second part of data needs
OrderNo
> LineNumber ,FinalizeDt,ItemNo and CustomerID for a unique key.
> Question is : Should I keep these data in two different table, or should I
> keep them in a single table by adding more columns on PK ?
> What is the affect of defining larger PK even it is not needed for some of
> the data?
> My table has data has 1.5 million record and increasing 75000 each month.
> 70 % of data has Unique Orderno Line Number
> Data is used by sql server reporting services, I have also some plans to
use
> analysis server in the future.
> Thanks in advance.
>
> erdal,
>|||On Mon, 24 Oct 2005 18:30:15 +0300, Erdal Akbulut wrote:

>Thanks for the answers,
>Let me try to clarify the issue.
>we have 7 companies running on the same application, and 2 more using
>different systems. All applications run on a mainframe environment and the
>data is also consolidated in mainframe.
>For the first 7 companies. OrderNo and Orderline are the candidates. No
>problem at this part.
>For the other 2 companies, we do not have the data in orderline level. The
>data is summed by item, customer etc by month, and they have dummy ordenos
>as 999999 and line no as 001
>
>Currently I have two tables. one for the first 7 companies having Orderno
>and Linenumber as PK. and another table for the other 2 companies having P
K
>with multiple columns.
>I was thinking to merge the data in one table and define a larger PK having
>PK columns from first and second table.
>Seems it is not recommended in this case.
Hi Erdal,
Seems to me that you have two options to chose from:
A) Use two tables. One keyed on (OrderNo, LineNumber) to store the data
from the first 7 companies, and the second one with a smaller amount of
columns, and keyed on (Item, Customer, etc, Month) for the last 2
companies. Don't introducce fake order and line numbers in the last
table, they'll only confuse people.
B) Use a staging table to accept the data from the first 7 companies.
Aggregate this data to the same level as the data you receive from the
other 2 companies. Then combine the data from the latter 2 companies
with the aggregated data from the first 7 companies into one table.
Which of these solutions suits your problem best is for you to decide,
since I don't know what you're planning to do with the data, once it's
in the database.
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)

No comments:

Post a Comment