Help me de-duplicate a provided dataset

Question

1.00/5 (1 vote)

See more:

De-duplicate the provided dataset

Records will be considered duplicates if they meet all of the following conditions:

a. Last Name exact match
b. First Name fuzzy / similar match (points for creativity here)
c. Any exact match of one or more of the following:
1. Email Address
2. Full mailing Address
3. Phone Number

Once the ?? number of duplicate records are identified, they need to be merged into a single record per group, and the data merged in such a way that we have the most complete set of attributes as possible.

Example: a. If two duplicate records share an email address, but only one has a full mailing address, the resultant merged record should have both the email and the mailing address.
b. If two duplicate records have different values for one of the following, the merged record should use the more recent attribute as identified by the ModifiedOn and/or CreatedOn timestamp values

First Name
Email Address
Full Mailing Address
Phone Number
The resulting de-duplicated “Master” record needs to be appended to the source dataset, given a unique integer ID (you can seed this however you like), and then that new identifier assigned as the ParentID of the child duplicated source records.

Save and return the (now larger) dataset as a .csv file

The csv file with initial data hasthe below columns :
ID CreatedOn ModifiedOn Customer_LastName Customer_FirstName Customer_AddressLine1 Customer_City Customer_State Customer_Zip Customer_HomePhone Customer_InternetEmail

What I have tried:

> Tried parsing the csv file which contains the data into data table and filtered based on the requirements.
> Tried importing data to SQL and using the ADO.net to filter out the queries.

Posted 18-Feb-19 8:36am

Member 10183768

Updated 21-Feb-19 2:14am

Add a Solution

Comments

ZurdoDev 18-Feb-19 15:05pm

What is your question?

Member 10183768 18-Feb-19 15:08pm

What logic should I use in T-SQL to get the result ?

Member 10183768 18-Feb-19 15:08pm

What logic should I use in T-SQL to get the result ?

ZurdoDev 18-Feb-19 15:12pm

The logic is already there. You now need to translate it into sql.

Member 10183768 18-Feb-19 15:14pm

Can you help me translate in SQL ? I am writing a .net console application to read the data to datatable but I don't think that's a good approach.

ZurdoDev 18-Feb-19 15:18pm

Depends. I won't do it all for you. I've got my own work to do. Where are you stuck?

Member 10183768 18-Feb-19 15:25pm

Sorry if my expectation is wrong ! I was hoping someone could write the whole part.

ZurdoDev 18-Feb-19 15:26pm

Not likely to happen.

MadMyche 18-Feb-19 19:55pm

Are you talkin 20 records or 2000?

Member 10183768 19-Feb-19 1:01am

60000

Pete O'Hanlon 18-Feb-19 15:06pm

What code do you have? What are you stuck on?

#realJSOP 19-Feb-19 15:01pm

He's stuck on "Your homework assignment for tonight is..."

Pete O'Hanlon 20-Feb-19 7:25am

I wish I could award a vote here but have a virtual 5 for that.

Santosh kumar Pithani 18-Feb-19 23:34pm

Writing query is not problem but keep sample data.Below query will help you ..all the best.

:WITH CTE AS(
select row_number()OVER(Partition by EmailAddress ORDER BY ISNULL(ModifiedOn,CreatedOn) DESC) AS RN ,* FROM Table_NAME

WHERE ([FullmailingAddress] IS NOT NULL AND [LastName]=@LastName AND [FirstName] LIKE @FirstName+'%') AND
([EmailAddress]=@EmailAddress OR [FullmailingAddress]=@FullmailingAddress OR [PhoneNumber]=@PhoneNumber])

)

select * FROM CTE WHERE RN=1;

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

W∴ Balboos, *GHB* · Answer 1 · 2019-02-21T02:14:00

Solution 1

Here's something - about as close to an answer as you'll get - but you'll need to figure out how to use it.

TSQL: lookup how to use DISTINCT

After you see what it gives you figure out how to make use of it.

Posted 21-Feb-19 2:14am

W∴ Balboos, GHB