De-duplicate the provided dataset
Records will be considered duplicates if they meet all of the following conditions:
a. Last Name exact match
b. First Name fuzzy / similar match (points for creativity here)
c. Any exact match of one or more of the following:
1. Email Address
2. Full mailing Address
3. Phone Number
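The three conditions above can be sketched as a pairwise test. This is only one possible reading, not a definitive implementation: the helper names, the 0.8 similarity threshold, the prefix rule for nicknames, and the case-insensitive, whitespace-trimmed comparison are all assumptions, and the fuzzy match uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

ADDRESS_FIELDS = ("Customer_AddressLine1", "Customer_City",
                  "Customer_State", "Customer_Zip")

def _norm(value):
    # Assumption: "exact match" ignores case and surrounding/repeated spaces.
    return " ".join((value or "").lower().split())

def _digits(value):
    # Assumption: phone numbers are compared on their digits only.
    return "".join(ch for ch in (value or "") if ch.isdigit())

def first_names_similar(a, b, threshold=0.8):
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return False
    # A prefix match catches short forms such as "Rob" vs "Robert".
    if a.startswith(b) or b.startswith(a):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

def full_address(rec):
    parts = [_norm(rec.get(f)) for f in ADDRESS_FIELDS]
    # Only a completely filled address counts as a "full mailing address".
    return "|".join(parts) if all(parts) else ""

def is_duplicate(r1, r2):
    last1, last2 = _norm(r1.get("Customer_LastName")), _norm(r2.get("Customer_LastName"))
    if not last1 or last1 != last2:
        return False
    if not first_names_similar(r1.get("Customer_FirstName"), r2.get("Customer_FirstName")):
        return False
    pairs = [
        (_norm(r1.get("Customer_InternetEmail")), _norm(r2.get("Customer_InternetEmail"))),
        (full_address(r1), full_address(r2)),
        (_digits(r1.get("Customer_HomePhone")), _digits(r2.get("Customer_HomePhone"))),
    ]
    # At least one contact attribute must be present on both sides and equal.
    return any(x and x == y for x, y in pairs)
```

Empty values never count as a match, so two records that are both missing an email are not treated as sharing one.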
Once the ?? number of duplicate records are identified, they need to be merged into a single record per group, with the data merged in such a way that the result has the most complete set of attributes possible.
Example: a. If two duplicate records share an email address, but only one has a full mailing address, the resultant merged record should have both the email and the mailing address.
b. If two duplicate records have different values for one of the following, the merged record should use the more recent value, as identified by the ModifiedOn and/or CreatedOn timestamp values:
First Name
Email Address
Full Mailing Address
Phone Number
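A minimal sketch of the merge rules in the example above. It assumes the timestamps are ISO 8601 strings that `datetime.fromisoformat` can parse, that ModifiedOn is preferred and CreatedOn is the fallback, and that the four address columns are taken together from one record rather than mixed. The names `merge_group`, `last_touched`, and `ATTRIBUTE_GROUPS` are made up for illustration.

```python
from datetime import datetime

ADDRESS = ("Customer_AddressLine1", "Customer_City",
           "Customer_State", "Customer_Zip")
ATTRIBUTE_GROUPS = [("Customer_FirstName",), ("Customer_InternetEmail",),
                    ADDRESS, ("Customer_HomePhone",)]

def last_touched(rec):
    # Assumption: ModifiedOn wins; CreatedOn is used when ModifiedOn is empty.
    stamp = rec.get("ModifiedOn") or rec.get("CreatedOn")
    return datetime.fromisoformat(stamp) if stamp else datetime.min

def _filled(rec, fields, need=all):
    return need((rec.get(f) or "").strip() for f in fields)

def merge_group(records):
    """Merge one group of duplicates into a single dict of attributes."""
    newest_first = sorted(records, key=last_touched, reverse=True)
    merged = {"Customer_LastName": newest_first[0].get("Customer_LastName", "")}
    for group in ATTRIBUTE_GROUPS:
        # Prefer the most recent record where the whole group is filled in.
        source = next((r for r in newest_first if _filled(r, group)), None)
        if source is None:
            # Otherwise fall back to the most recent partially filled record.
            source = next((r for r in newest_first if _filled(r, group, any)), None)
        for field in group:
            merged[field] = (source.get(field) or "") if source else ""
    return merged
```

Because the records are walked newest first, a conflicting value is resolved in favour of the more recent record, while a value that exists on only one record is still carried over.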
The resulting de-duplicated “Master” record needs to be appended to the source dataset, given a unique integer ID (you can seed this however you like), and then that new identifier assigned as the ParentID of the child duplicated source records.
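One way to turn pairwise matches into groups and attach the master records is a small union-find pass, sketched below. The function name `append_masters` and its parameters are hypothetical; `is_duplicate` and `merge` are whatever pairwise test and merge function you settle on. It assumes the existing ID values are integers and, unless a seed is given, starts new IDs at the current maximum plus one.

```python
def append_masters(rows, is_duplicate, merge, id_seed=None):
    """Group duplicates, append one master per group, set ParentID on children.

    rows is a list of dicts and is modified in place (ParentID is added).
    """
    parent = list(range(len(rows)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; matches are transitive (A~B and B~C form one group).
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if is_duplicate(rows[i], rows[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)

    if id_seed is None:
        id_seed = max((int(r["ID"]) for r in rows), default=0) + 1
    next_id = id_seed

    masters = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a record with no duplicates gets no master
        master = dict(merge(members), ID=str(next_id), ParentID="")
        for member in members:
            member["ParentID"] = str(next_id)
        masters.append(master)
        next_id += 1

    for row in rows:
        row.setdefault("ParentID", "")
    return rows + masters
```

The all-pairs loop is quadratic in the number of rows. Since the last name must match exactly, bucketing rows by last name first and only comparing within a bucket would cut the work considerably on a large file.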
Save and return the (now larger) dataset as a .csv file.
The CSV file with the initial data has the following columns:
ID, CreatedOn, ModifiedOn, Customer_LastName, Customer_FirstName, Customer_AddressLine1, Customer_City, Customer_State, Customer_Zip, Customer_HomePhone, Customer_InternetEmail
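For reading the source file and writing the enlarged dataset back out, a sketch with the standard `csv` module is shown here. It assumes a comma-delimited file with a header row that uses the column names listed above; the extra ParentID column and the function names `read_rows` and `write_rows` are assumptions for illustration.

```python
import csv

COLUMNS = ["ID", "CreatedOn", "ModifiedOn", "Customer_LastName",
           "Customer_FirstName", "Customer_AddressLine1", "Customer_City",
           "Customer_State", "Customer_Zip", "Customer_HomePhone",
           "Customer_InternetEmail"]

def read_rows(path):
    # utf-8-sig strips a byte-order mark if the file was exported with one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

def write_rows(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=COLUMNS + ["ParentID"],
            restval="",             # missing columns are written as empty
            extrasaction="ignore",  # unknown keys are skipped, not an error
        )
        writer.writeheader()
        writer.writerows(rows)
```

Each row comes back as a dict keyed by column name, so the duplicate test and the merge step can work on plain dictionaries without a DataTable or a database.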
What I have tried:
> Tried parsing the CSV file that contains the data into a DataTable and filtering it based on the requirements.
> Tried importing the data into SQL and using ADO.NET to run queries that filter it.