Showing posts with label stumped. Show all posts
Showing posts with label stumped. Show all posts

Friday, March 9, 2012

Find Missing data from table

Here is an issue that has me stumped for the past few days. I have a
table called MerchTran. Among various columns, the relevant columns for
this issue are:

FileDate datetime
, SourceTable varchar(25)

SQL statement:
SELECT DISTINCT
FileDate
, SourceTable
FROM
MerchTran
ORDER BY
FileDate
, SourceTable

Data looks like this:
FileDate DataSource
-----------
2005-02-13 00:00:00.000S1
2005-02-13 00:00:00.000S2
2005-02-13 00:00:00.000S3
2005-02-14 00:00:00.000S1
2005-02-14 00:00:00.000S2
2005-02-14 00:00:00.000S3
2005-02-15 00:00:00.000S2
2005-02-15 00:00:00.000S3
2005-02-16 00:00:00.000S1
2005-02-16 00:00:00.000S2
2005-02-16 00:00:00.000S3
2005-02-17 00:00:00.000S1
2005-02-17 00:00:00.000S2
2005-02-18 00:00:00.000S1
2005-02-18 00:00:00.000S2
2005-02-18 00:00:00.000S3
2005-02-19 00:00:00.000S1
2005-02-19 00:00:00.000S3

We run a daily process that inserts data in to this table everyday for
all 3 sources S1, S2, S3

Notice how some data is missing indicating the import process for that
particular source failed.
Example: Missing record
2005-02-15 00:00:00.000S1
2005-02-17 00:00:00.000S3
2005-02-19 00:00:00.000S2

Can someone please help me with a SQL Statement that will return me the
3 missing records as above.

Thanks in advance for all your help!

DBA in distress!
Vishal[posted and mailed, please reply in news]

SQLJunkie (vsinha73@.gmail.com) writes:
> Here is an issue that has me stumped for the past few days. I have a
> table called MerchTran. Among various columns, the relevant columns for
> this issue are:
>...
> We run a daily process that inserts data in to this table everyday for
> all 3 sources S1, S2, S3
> Notice how some data is missing indicating the import process for that
> particular source failed.
> Example: Missing record
> 2005-02-15 00:00:00.000 S1
> 2005-02-17 00:00:00.000 S3
> 2005-02-19 00:00:00.000 S2
> Can someone please help me with a SQL Statement that will return me the
> 3 missing records as above.

The below assumes that daily means daily and not only Monday to Friday.
It will catch if a date is missing entirely, although if the first or
last day is missing.

CREATE TABLE demo (date datetime NOT NULL,
src char(2) NOT NULL,
CONSTRAINT pk_demo PRIMARY KEY (date, src))
go
-- Creates a numbers table in a somewhat casual way that
-- is not guaranteed to work. The MAXDOP turns of parallelism
-- to improve our chances.
SELECT TOP 8000 n = identity(int, 1, 1)
INTO numbers
FROM Northwind..Orders a
CROSS JOIN Northwind..Orders b
OPTION (MAXDOP 1)
go
INSERT demo (date, src)
SELECT '2005-02-13 00:00:00.000', 'S1' UNION
SELECT '2005-02-13 00:00:00.000', 'S2' UNION
SELECT '2005-02-13 00:00:00.000', 'S3' UNION
SELECT '2005-02-14 00:00:00.000', 'S1' UNION
SELECT '2005-02-14 00:00:00.000', 'S2' UNION
SELECT '2005-02-14 00:00:00.000', 'S3' UNION
SELECT '2005-02-15 00:00:00.000', 'S2' UNION
SELECT '2005-02-15 00:00:00.000', 'S3' UNION
SELECT '2005-02-16 00:00:00.000', 'S1' UNION
SELECT '2005-02-16 00:00:00.000', 'S2' UNION
SELECT '2005-02-16 00:00:00.000', 'S3' UNION
SELECT '2005-02-17 00:00:00.000', 'S1' UNION
SELECT '2005-02-17 00:00:00.000', 'S2' UNION
SELECT '2005-02-18 00:00:00.000', 'S1' UNION
SELECT '2005-02-18 00:00:00.000', 'S2' UNION
SELECT '2005-02-18 00:00:00.000', 'S3' UNION
SELECT '2005-02-19 00:00:00.000', 'S1' UNION
SELECT '2005-02-19 00:00:00.000', 'S3'
go
SELECT *
FROM (SELECT date = dateadd(DAY, n - 1, mindate)
FROM numbers n
CROSS JOIN (SELECT mindate = MIN(date),
maxdate = MAX(date)
FROM demo) d
WHERE n BETWEEN 1 AND datediff(DAY, mindate, maxdate) + 1) AS a
CROSS JOIN (SELECT src = 'S1' UNION SELECT 'S2' UNION SELECT 'S3') AS s
WHERE NOT EXISTS (SELECT *
FROM demo
WHERE demo.date = a.date
AND demo.src = s.src)
go
DROP TABLE numbers
DROP TABLE demo

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Thanks for the quick response Erland. I see what you are trying to do,
but I have data in the table starting 2003. Would be a lot of work if I
have to insert 3 records in table numbers for everyday (including
saturday and sunday) starting 2003. But I will try this nonetheless.

Thanks!

Vishal|||SQLJunkie (vsinha73@.gmail.com) writes:
> Thanks for the quick response Erland. I see what you are trying to do,
> but I have data in the table starting 2003. Would be a lot of work if I
> have to insert 3 records in table numbers for everyday (including
> saturday and sunday) starting 2003. But I will try this nonetheless.

Numbers need to have as many rows as there days in the timespan.

In fact, it does have to be a numbers table. I used a numbers, because
1) it was easier to compose one on the fly.
2) a numbers table is applicable other problems as well.
However, in our system we don't have a numbers - but we have a dates table,
and here is how we fill it in a safe way:

CREATE TABLE dates (
thedate aba_date NOT NULL,
CONSTRAINT pk_dates PRIMARY KEY (thedate)
)

-- Get a temptable with numbers. This is a cheap, but not 100% reliable.
-- Whence the query hint and all the checks.
SELECT TOP 80001 n = IDENTITY(int, 0, 1)
INTO #numbers
FROM sysobjects o1
CROSS JOIN sysobjects o2
CROSS JOIN sysobjects o3
CROSS JOIN sysobjects o4
OPTION (MAXDOP 1)
go
-- Make sure we have unique numbers.
CREATE UNIQUE CLUSTERED INDEX num_ix ON #numbers (n)
go
-- Verify that table does not have gaps.
IF (SELECT COUNT(*) FROM #numbers) = 80001 AND
(SELECT MIN(n) FROM #numbers) = 0 AND
(SELECT MAX(n) FROM #numbers) = 80000
BEGIN
DECLARE @.msg varchar(255)

-- Insert the dates:
INSERT dates (thedate)
SELECT dateadd(DAY, n, '19800101')
FROM #numbers
WHERE dateadd(DAY, n, '19800101') < '21500101'

SELECT @.msg = 'Inserted ' + ltrim(str(@.@.rowcount)) + ' rows into
#numbers'
PRINT @.msg
END
ELSE
RAISERROR('#numbers is not contiguos from 0 to 80001!', 16, -1)
go
DROP TABLE #numbers

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||CREATE TABLE Readings
(collection_date DATETIME NOT NULL,
source CHAR(2) NOT NULL
CHECK(source IN ('S1', 'S2', 'S3')),
PRIMARY KEY (collection_date, source));

This is a quick way to see what you do have on the days with missing
data:

CREATE VIEW Shortdates(collection_date, source)
AS
SELECT collection_date, source
FROM Readings
GROUP BY collection_date
HAVING COUNT(*) < 3;

We can now use this view or make it into a derived table. We will also
another table, which probably exists already.

CREATE TABLE Sources (sources CHAR(2) NOT NULL PRIMARY KEY, ..);

SELECT R1.collection_date, S1.source
FROM Readings AS R1,
Sources AS S1
WHERE S1.source
NOT IN (SELECT R2.source
FROM Shortdates AS R2 -- or use a derived table
WHERE R1.collection_date
= R2.collection_date);

-- Untested

Friday, February 24, 2012

Find 90th percentile scores for each student

I am stumped on a set-based approach for this one.

A cursor approach is straightforward enough, but i want to avoid that.

Here's my table:

create table StudentScores
(
id int primary key identity(1,1),
student_id int not null,
score int not null
)

with some sample data:

insert into StudentScores (student_id, score)
select 1, 10 union all
select 1, 29 union all
select 1, 50 union all
select 1, 53 union all
select 1, 45 union all
select 1, 10 union all
select 1, 29 union all
select 1, 50 union all
select 1, 53 union all
select 1, 45 union all
select 1, 88 union all
select 2, 23 union all
select 2, 54 union all
select 2, 55 union all
select 2, 34 union all
select 2, 56 union all
select 2, 78 union all
select 2, 23 union all
select 2, 54 union all
select 2, 55 union all
select 2, 34 union all
select 2, 56 union all
select 2, 78 union all
select 2, 23 union all
select 2, 54 union all
select 2, 55 union all
select 2, 34 union all
select 2, 56 union all
select 2, 78 union all
select 2, 98

What I want is, for each student, what is their 90th percentile score?

For a given single student, one possibility would be:

declare @.studentid int
set @.studentid = 2
select top 1 @.studentid as student_id, a.score as [90th percentile score]
from
(
select top 90 percent score from StudentScores
where student_id = @.studentid order by score asc
) as a
order by a.score desc

But I want this for all students, and not use a cursor.

Any ideas?

Thanks!So ... do you want a single maximum score fro each student that is in the top 90th percentile, or all scores that are in the 90th percentile?

If you derive a table of 90th percentile points, then maybe the outer query can bring back the student_id and max score.

Not perfect, but maybe a start:


select StudentScores.student_id, max(StudentScores.score) score, s.[90th percentile score]
from StudentScores
inner join (
select student_id, (sum(score) * 9) / 100 [90th percentile score]
from StudentScores
group by student_id) s ON s.Student_id = StudentScores.student_id
where StudentScores.score > s.[90th percentile score]
group by StudentScores.student_id, StudentScores.score, s.[90th percentile score]
order by StudentScores.student_id desc|||What I want is this:

1. just to be concrete, say each student has 100 scores.
2. for each student, order the student's scores from highest to lowest.
3. pick the 10th score down the list for each student.

If each student had 200 scores, you would pick the 20th score down the list. That's what I mean by 90th percentile score.

Does that make sense?

tomh53 - It's not immediately clear to me that your query produces this. I'll study it more and see. :)|||Perhaps something like this:

select distinct student_id, score
from StudentScores ss
where ss.score = (
select min(score)
from StudentScores ss2
where ss2.studenti_id = ss.studentid and score in (
select top 11 percent score
from StudentScores
where student_id = ss2.studentid
)
)

I guess that top 11 percent would give the the 90th percent as the min of the group, or does that not work?

R|||The article linked below sheds some light on percentiles in tsql:

http://www.sqlteam.com/item.asp?ItemID=16480

Good luck.|||lytri,

your solution worked, except you forgot to add an order by clause in the innermost query. Other than that little change it's perfect.

thanks!|||Ahh yes, that would probably help wouldn't it...

Well, glad I could take a break from work and help someone out :)