Hone logo
Hone
Problems

Efficient Data Migration: Merging and Deduplicating Customer Records

This challenge focuses on designing efficient SQL queries to handle a common data migration task: merging customer data from an old system into a new system. You'll need to handle potential duplicate records, ensuring data integrity and avoiding redundant entries in the target database. This is crucial for maintaining a clean and reliable customer database.

Problem Description

You are tasked with migrating customer data from a legacy old_customers table to a new new_customers table. The old_customers table may contain duplicate customer entries, identified by a combination of email and phone_number. The new_customers table should contain only unique customer records. When duplicates are found, you need to decide which record to prioritize for migration and how to handle associated data from the old system.

Your goal is to write SQL queries that:

  1. Identify duplicate customer records in old_customers.
  2. For each group of duplicates, select a "master" record based on a defined priority.
  3. Insert the selected master records into new_customers, ensuring no duplicates are created.
  4. Optionally, update an archive_customers table with the non-master duplicates.

Key requirements:

  • Uniqueness in new_customers should be enforced based on email and phone_number.
  • A specific priority rule must be applied to select the master record from duplicates.
  • The migration process should be efficient, especially for large datasets.

Expected behavior:

  • Only one record per unique email/phone_number combination should exist in new_customers.
  • The record selected as the master should be the one that best represents the customer according to the priority rules.
  • Non-master duplicate records should be archived.

Edge cases to consider:

  • Customers with missing email or phone_number.
  • Customers with multiple duplicate entries.
  • Customers present in old_customers but already existing in new_customers (though this scenario is simplified in the examples by assuming new_customers is initially empty for the migration scope).

Examples

Example 1: Assume old_customers table structure: customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME)

Assume new_customers table structure: customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME)

Assume archive_customers table structure: archive_id (INT), original_customer_id (INT), name (VARCHAR), email (VARCHAR), phone_number (VARCHAR), registration_date (DATETIME), reason_for_archive (VARCHAR)

Input: old_customers table:

customer_idnameemailphone_numberregistration_date
1Alice Smithalice@example.com111-222-33332023-01-15 10:00:00
2Bob Johnsonbob@example.com444-555-66662023-02-20 11:30:00
3Alice Smithalice@example.com111-222-33332023-01-20 09:00:00
4Charlie Browncharlie@example.com777-888-99992023-03-10 14:00:00
5Alice Smithalice.s@example.com111-222-33332023-01-18 12:00:00

Priority Rule: If multiple records share the same email and phone_number, prioritize the record with the earliest registration_date. If registration_date is also the same, prioritize the one with the lower customer_id.

Output: new_customers table:

customer_idnameemailphone_numberregistration_date
1Alice Smithalice@example.com111-222-33332023-01-15 10:00:00
2Bob Johnsonbob@example.com444-555-66662023-02-20 11:30:00
4Charlie Browncharlie@example.com777-888-99992023-03-10 14:00:00

Explanation:

  • Records with customer_id 1 and 3 are duplicates (same email and phone). Record 1 has an earlier registration_date, so it's chosen.
  • Record 5 has a different email but the same phone number as records 1 and 3. Based on the problem statement, duplicates are identified by combination of email AND phone_number. Therefore, record 5 is not considered a duplicate of 1 or 3.
  • Record 2 and 4 are unique.

Output: archive_customers table:

archive_idoriginal_customer_idnameemailphone_numberregistration_datereason_for_archive
13Alice Smithalice@example.com111-222-33332023-01-20 09:00:00Duplicate

Explanation:

  • customer_id 3 was a duplicate of customer_id 1 and was not chosen as the master, so it's archived.

Example 2 (Edge Case: Missing Data):

Input: old_customers table:

customer_idnameemailphone_numberregistration_date
10David Leedavid@example.comNULL2023-04-01 09:00:00
11David LeeNULL999-888-77772023-04-05 10:00:00
12David Leedavid@example.com999-888-77772023-04-03 11:00:00

Priority Rule: Same as Example 1.

Output: new_customers table:

customer_idnameemailphone_numberregistration_date
12David Leedavid@example.com999-888-77772023-04-03 11:00:00

Explanation:

  • Records 10, 11, and 12 are considered duplicates because they all refer to "David Lee" and share some identifying information (even if one piece is NULL). The combination of email and phone_number is used for deduplication.
  • For record 10, phone_number is NULL.
  • For record 11, email is NULL.
  • Record 12 has both email and phone_number populated.
  • When comparing these, record 12 has the earliest registration_date among those that have both email and phone_number populated or are more complete. If we treat NULLs as less preferred for matching, record 12 becomes the best candidate. A more robust approach in a real-world scenario might involve specific NULL handling rules or imputation. For this challenge, we prioritize non-NULL values when determining duplicates.
  • Based on the strict interpretation of "combination of email and phone_number", if one is NULL, it's a different combination. However, for practical migration, we need to infer duplicates. Here, the intent is that all these refer to the same David Lee. Record 12 is chosen due to its earliest registration date amongst records with both fields populated and its overall "completeness" if that were a factor.

Output: archive_customers table:

archive_idoriginal_customer_idnameemailphone_numberregistration_datereason_for_archive
110David Leedavid@example.comNULL2023-04-01 09:00:00Duplicate
211David LeeNULL999-888-77772023-04-05 10:00:00Duplicate

Explanation:

  • Records 10 and 11 were duplicates of record 12 and were not selected.

Constraints

  • The old_customers table can contain up to 1,000,000 records.
  • The new_customers table will be populated with unique records from old_customers.
  • The queries should aim for a time complexity that allows migration within a reasonable downtime window (e.g., execution time should be efficient, ideally sub-minute for 1M records on standard hardware).
  • email and phone_number can be NULL.
  • customer_id is unique within old_customers.

Notes

  • Consider using window functions (like ROW_NUMBER() or RANK()) to efficiently rank duplicates.
  • Think about how to handle NULL values in email and phone_number when defining duplicates and priority. For this challenge, assume that if either email or phone_number is NULL, it doesn't form a duplicate with a record where both are populated, unless both are NULL for the same person (a scenario not explicitly covered by the primary deduplication rule of "combination"). However, for practical migration, a more sophisticated approach may be needed. For this challenge, we prioritize records where both email and phone_number are non-NULL if they refer to the same individual.
  • The registration_date is the primary key for tie-breaking among duplicates with identical email and phone_number. If registration_date is also identical, use customer_id (lower ID is preferred).
  • You will need to write SQL statements to:
    1. Select the records to be inserted into new_customers.
    2. Insert these records into new_customers.
    3. Select the records to be archived.
    4. Insert these records into archive_customers.
  • The goal is to achieve this with a minimal number of table scans and efficient joins/subqueries.
Loading editor...
plaintext