Efficient Data Migration: Merging and Deduplicating Customer Records
This challenge focuses on designing efficient SQL queries to handle a common data migration task: merging customer data from an old system into a new system. You'll need to handle potential duplicate records, ensuring data integrity and avoiding redundant entries in the target database. This is crucial for maintaining a clean and reliable customer database.
Problem Description
You are tasked with migrating customer data from a legacy old_customers table to a new new_customers table. The old_customers table may contain duplicate customer entries, identified by a combination of email and phone_number. The new_customers table should contain only unique customer records. When duplicates are found, you need to decide which record to prioritize for migration and how to handle associated data from the old system.
Your goal is to write SQL queries that:
- Identify duplicate customer records in `old_customers` (a query sketch follows this list).
- For each group of duplicates, select a "master" record based on a defined priority.
- Insert the selected master records into `new_customers`, ensuring no duplicates are created.
- Optionally, update an `archive_customers` table with the non-master duplicates.
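As a starting point, a minimal sketch of the duplicate-identification step, using the table and column names defined below in the examples:

```sql
-- Find every email/phone_number combination that appears more than once.
-- Rows where either field is NULL are excluded: under the strict
-- "combination" rule, a NULL does not form an exact duplicate.
SELECT email, phone_number, COUNT(*) AS dup_count
FROM old_customers
WHERE email IS NOT NULL
  AND phone_number IS NOT NULL
GROUP BY email, phone_number
HAVING COUNT(*) > 1;
```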
Key requirements:
- Uniqueness in `new_customers` should be enforced based on `email` and `phone_number` (see the constraint sketch after this list).
- A specific priority rule must be applied to select the master record from duplicates.
- The migration process should be efficient, especially for large datasets.
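One way to enforce that uniqueness at the schema level is a unique constraint on the pair. This is a sketch, not a requirement of the problem; note that most databases treat NULLs as distinct in unique indexes, so rows with a NULL field may need separate handling:

```sql
-- Reject any second row with the same non-NULL email/phone_number pair.
ALTER TABLE new_customers
    ADD CONSTRAINT uq_new_customers_email_phone UNIQUE (email, phone_number);
```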
Expected behavior:
- Only one record per unique `email`/`phone_number` combination should exist in `new_customers`.
- The record selected as the master should be the one that best represents the customer according to the priority rules.
- Non-master duplicate records should be archived.
Edge cases to consider:
- Customers with a missing `email` or `phone_number`.
- Customers with multiple duplicate entries.
- Customers present in `old_customers` but already existing in `new_customers` (though this scenario is simplified in the examples by assuming `new_customers` is initially empty for the migration scope).
Examples
Example 1:
Assume the `old_customers` table structure:
`customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME)`
Assume the `new_customers` table structure:
`customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME)`
Assume the `archive_customers` table structure:
`archive_id (INT), original_customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME), reason_for_archive (VARCHAR)`
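For reference, a minimal DDL sketch of these three tables (the VARCHAR lengths and key choices are assumptions, not given by the problem):

```sql
CREATE TABLE old_customers (
    customer_id       INT PRIMARY KEY,
    name              VARCHAR(255),
    email             VARCHAR(255),
    phone_number      VARCHAR(32),
    registration_date DATETIME
);

CREATE TABLE new_customers (
    customer_id       INT PRIMARY KEY,
    name              VARCHAR(255),
    email             VARCHAR(255),
    phone_number      VARCHAR(32),
    registration_date DATETIME
);

CREATE TABLE archive_customers (
    archive_id           INT PRIMARY KEY,  -- assumed to be generated during migration
    original_customer_id INT,
    name                 VARCHAR(255),
    email                VARCHAR(255),
    phone_number         VARCHAR(32),
    registration_date    DATETIME,
    reason_for_archive   VARCHAR(64)
);
```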
Input: old_customers table:
| customer_id | name | email | phone_number | registration_date |
|---|---|---|---|---|
| 1 | Alice Smith | alice@example.com | 111-222-3333 | 2023-01-15 10:00:00 |
| 2 | Bob Johnson | bob@example.com | 444-555-6666 | 2023-02-20 11:30:00 |
| 3 | Alice Smith | alice@example.com | 111-222-3333 | 2023-01-20 09:00:00 |
| 4 | Charlie Brown | charlie@example.com | 777-888-9999 | 2023-03-10 14:00:00 |
| 5 | Alice Smith | alice.s@example.com | 111-222-3333 | 2023-01-18 12:00:00 |
Priority Rule: If multiple records share the same email and phone_number, prioritize the record with the earliest registration_date. If registration_date is also the same, prioritize the one with the lower customer_id.
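Expressed as a window ranking over the schema above, this rule looks roughly like the following sketch:

```sql
-- dup_rank = 1 marks the master record within each email/phone_number group.
SELECT customer_id, name, email, phone_number, registration_date,
       ROW_NUMBER() OVER (
           PARTITION BY email, phone_number
           ORDER BY registration_date ASC, customer_id ASC
       ) AS dup_rank
FROM old_customers;
```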
Output: new_customers table:
| customer_id | name | email | phone_number | registration_date |
|---|---|---|---|---|
| 1 | Alice Smith | alice@example.com | 111-222-3333 | 2023-01-15 10:00:00 |
| 2 | Bob Johnson | bob@example.com | 444-555-6666 | 2023-02-20 11:30:00 |
| 4 | Charlie Brown | charlie@example.com | 777-888-9999 | 2023-03-10 14:00:00 |
Explanation:
- Records with `customer_id` 1 and 3 are duplicates (same email and phone number). Record 1 has an earlier `registration_date`, so it is chosen as the master.
- Record 5 has a different email but the same phone number as records 1 and 3. Because duplicates are identified by the combination of `email` AND `phone_number`, record 5 is not considered a duplicate of 1 or 3.
- Records 2 and 4 are unique.
Output: archive_customers table:
| archive_id | original_customer_id | name | email | phone_number | registration_date | reason_for_archive |
|---|---|---|---|---|---|---|
| 1 | 3 | Alice Smith | alice@example.com | 111-222-3333 | 2023-01-20 09:00:00 | Duplicate |
Explanation:
`customer_id` 3 was a duplicate of `customer_id` 1 and was not chosen as the master, so it is archived.
Example 2 (Edge Case: Missing Data):
Input: old_customers table:
| customer_id | name | email | phone_number | registration_date |
|---|---|---|---|---|
| 10 | David Lee | david@example.com | NULL | 2023-04-01 09:00:00 |
| 11 | David Lee | NULL | 999-888-7777 | 2023-04-05 10:00:00 |
| 12 | David Lee | david@example.com | 999-888-7777 | 2023-04-03 11:00:00 |
Priority Rule: Same as Example 1.
Output: new_customers table:
| customer_id | name | email | phone_number | registration_date |
|---|---|---|---|---|
| 12 | David Lee | david@example.com | 999-888-7777 | 2023-04-03 11:00:00 |
Explanation:
- Records 10, 11, and 12 all refer to the same "David Lee" and share identifying information, even though two of them are incomplete: record 10 has a NULL `phone_number`, record 11 has a NULL `email`, and record 12 has both fields populated.
- Under a strict reading of "combination of `email` and `phone_number`", a record with a NULL in either field forms a different combination and would not match. For a practical migration, however, the intent is that all three records refer to the same David Lee, so they are grouped together.
- Record 12 is chosen as the master because it is the only record with both `email` and `phone_number` populated (and therefore trivially the earliest `registration_date` among such records). A more robust real-world approach might apply explicit NULL-handling rules or imputation; for this challenge, records with both fields non-NULL are preferred when selecting the master.
Output: archive_customers table:
| archive_id | original_customer_id | name | email | phone_number | registration_date | reason_for_archive |
|---|---|---|---|---|---|---|
| 1 | 10 | David Lee | david@example.com | NULL | 2023-04-01 09:00:00 | Duplicate |
| 2 | 11 | David Lee | NULL | 999-888-7777 | 2023-04-05 10:00:00 | Duplicate |
Explanation:
- Records 10 and 11 were duplicates of record 12 and were not selected as masters, so they are archived.
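One simplified way to implement the grouping this example implies (a sketch, not the only valid approach) is to treat records with both fields populated as candidate masters and attach each incomplete record to a complete record that shares its populated field. This assumes each incomplete record matches at most one complete record:

```sql
-- Link records missing email or phone_number to a "complete" master
-- that shares whichever field is populated.
SELECT o.customer_id AS incomplete_id,
       c.customer_id AS master_id
FROM old_customers o
JOIN old_customers c
  ON  c.email IS NOT NULL
  AND c.phone_number IS NOT NULL
  AND (o.email = c.email OR o.phone_number = c.phone_number)
WHERE (o.email IS NULL OR o.phone_number IS NULL);
```

On the Example 2 data, this pairs records 10 and 11 with record 12, which is then kept as the master.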
Constraints
- The `old_customers` table can contain up to 1,000,000 records.
- The `new_customers` table will be populated with unique records from `old_customers`.
- The queries should complete within a reasonable downtime window (ideally sub-minute for 1M records on standard hardware); an indexing sketch follows this list.
- `email` and `phone_number` can be NULL.
- `customer_id` is unique within `old_customers`.
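To keep the ranking pass fast at the 1M-record scale, a supporting index is worth considering (a sketch; the right index depends on your database's optimizer):

```sql
-- Lets the engine group by (email, phone_number) and order by
-- (registration_date, customer_id) without a separate full sort.
CREATE INDEX idx_old_customers_dedup
    ON old_customers (email, phone_number, registration_date, customer_id);
```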
Notes
- Consider using window functions (like `ROW_NUMBER()` or `RANK()`) to efficiently rank duplicates; a full migration sketch follows these notes.
- Think about how to handle NULL values in `email` and `phone_number` when defining duplicates and priority. For this challenge, assume that a record with a NULL `email` or `phone_number` does not form a duplicate with a record where both are populated (the "both NULL" case is not covered by the primary deduplication rule). In practice, as in Example 2, records that plausibly refer to the same individual should be grouped, with records where both fields are non-NULL preferred as masters; a production migration may need more sophisticated matching.
- `registration_date` is the primary tie-breaking criterion among duplicates with identical `email` and `phone_number`. If `registration_date` is also identical, use `customer_id` (the lower ID is preferred).
- You will need to write SQL statements to:
  - Select the records to be inserted into `new_customers`.
  - Insert these records into `new_customers`.
  - Select the records to be archived.
  - Insert these records into `archive_customers`.
- The goal is to achieve this with a minimal number of table scans and efficient joins/subqueries.
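Putting the notes together, a minimal end-to-end sketch of the strict-rule migration (standard SQL; CTE-plus-INSERT syntax varies slightly by database, and the NULL-linking step from Example 2 would be layered on separately):

```sql
-- Step 1: insert one master per email/phone_number group into new_customers.
WITH ranked AS (
    SELECT customer_id, name, email, phone_number, registration_date,
           ROW_NUMBER() OVER (
               PARTITION BY email, phone_number
               ORDER BY registration_date ASC, customer_id ASC
           ) AS dup_rank
    FROM old_customers
)
INSERT INTO new_customers (customer_id, name, email, phone_number, registration_date)
SELECT customer_id, name, email, phone_number, registration_date
FROM ranked
WHERE dup_rank = 1;

-- Step 2: archive the non-masters. archive_id is assumed to be
-- auto-generated (e.g., an identity column), so it is omitted here.
WITH ranked AS (
    SELECT customer_id, name, email, phone_number, registration_date,
           ROW_NUMBER() OVER (
               PARTITION BY email, phone_number
               ORDER BY registration_date ASC, customer_id ASC
           ) AS dup_rank
    FROM old_customers
)
INSERT INTO archive_customers
    (original_customer_id, name, email, phone_number, registration_date, reason_for_archive)
SELECT customer_id, name, email, phone_number, registration_date, 'Duplicate'
FROM ranked
WHERE dup_rank > 1;
```

Each statement needs only one scan of `old_customers` plus the ranking sort, which keeps the total work close to the minimum these notes ask for.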