Hone logo
Hone
Problems

Implementing Slowly Changing Dimensions Type 2 in SQL

In data warehousing, it's crucial to track historical changes in dimension attributes over time. Slowly Changing Dimensions (SCD) are a common technique to handle this. Specifically, implementing SCD Type 2 (SCD Type 2) aims to preserve the complete history of data by creating new records for each change, while marking current and expired records. This challenge focuses on simulating this process using SQL.

Problem Description

You are tasked with designing and implementing a process to load data into a dimension table that supports SCD Type 2. This means that when an attribute of a dimension record changes, the existing record should be marked as expired, and a new record should be inserted to represent the updated state.

Key Requirements:

  1. Insert New Records: When a new dimension member arrives, insert it as the current record.
  2. Update Existing Records: When an existing dimension member's relevant attributes change:
    • Mark the current record as expired by setting an end_date to the current date (or a suitable past date like CURRENT_DATE - 1 day).
    • Insert a new record representing the updated state, with the start_date set to the current date and end_date set to a future sentinel value (e.g., '9999-12-31').
  3. Track History: Ensure that each distinct state of a dimension member is uniquely identifiable and traceable.
  4. Identify Current Records: Provide a mechanism to easily identify the most current version of any dimension member.

Expected Behavior:

The dimension table should contain multiple rows for a single logical dimension member if its relevant attributes have changed over time. Each row should represent a specific time period during which that version of the dimension member was active.

Important Considerations:

  • Primary Key: The dimension table should have a surrogate key for each row (the actual record in the table), distinct from the natural key of the dimension member.
  • Natural Key: A natural key (or business key) will be used to identify the logical dimension member.
  • SCD Attributes: You'll need to define which attributes are considered "SCD attributes" that trigger a new version when changed.
  • Date Tracking: The start_date and end_date columns are crucial for defining the validity period of each dimension record.

Examples

Example 1: Loading initial data

Assume an Employees dimension table with the following structure:

EmployeeKey (Surrogate PK) | EmployeeID (Natural Key) | Name   | Department | StartDate  | EndDate
--------------------------|--------------------------|--------|------------|------------|------------
1                         | 101                      | Alice  | Sales      | 2023-01-01 | 9999-12-31
2                         | 102                      | Bob    | IT         | 2023-01-01 | 9999-12-31

And a source table SourceEmployees with new data:

EmployeeID | Name   | Department
-----------|--------|------------
101        | Alice  | Sales
102        | Bob    | IT
103        | Charlie| HR

Input: Employees table (as above) and SourceEmployees table (as above). CURRENT_DATE is '2023-01-01'.

Output: Employees table after loading:

EmployeeKey | EmployeeID | Name   | Department | StartDate  | EndDate
------------|------------|--------|------------|------------|------------
1           | 101        | Alice  | Sales      | 2023-01-01 | 9999-12-31
2           | 102        | Bob    | IT         | 2023-01-01 | 9999-12-31
3           | 103        | Charlie| HR         | 2023-01-01 | 9999-12-31

Explanation: Employee 103 is new, so a new record is inserted with StartDate as CURRENT_DATE and EndDate as the sentinel. Existing records remain unchanged as no SCD attributes have changed.

Example 2: Updating an existing record

Assume Employees table as of '2023-01-02':

EmployeeKey | EmployeeID | Name   | Department | StartDate  | EndDate
------------|------------|--------|------------|------------|------------
1           | 101        | Alice  | Sales      | 2023-01-01 | 9999-12-31
2           | 102        | Bob    | IT         | 2023-01-01 | 9999-12-31
3           | 103        | Charlie| HR         | 2023-01-01 | 9999-12-31

And SourceEmployees table after an update:

EmployeeID | Name   | Department
-----------|--------|------------
101        | Alice  | Marketing
102        | Bob    | IT

Input: Employees table (as above) and SourceEmployees table (as above). CURRENT_DATE is '2023-01-02'. SCD attributes: Department.

Output: Employees table after loading:

EmployeeKey | EmployeeID | Name   | Department | StartDate  | EndDate
------------|------------|--------|------------|------------|------------
1           | 101        | Alice  | Sales      | 2023-01-01 | 2023-01-01
4           | 101        | Alice  | Marketing  | 2023-01-02 | 9999-12-31
2           | 102        | Bob    | IT         | 2023-01-01 | 9999-12-31
3           | 103        | Charlie| HR         | 2023-01-01 | 9999-12-31

Explanation: Employee 101's Department has changed from 'Sales' to 'Marketing'. The existing record for Alice (EmployeeKey 1) is updated to set its EndDate to CURRENT_DATE - 1 day (which is '2023-01-01' in this example). A new record for Alice (EmployeeKey 4) is inserted with Department as 'Marketing', StartDate as CURRENT_DATE ('2023-01-02'), and EndDate as the sentinel.

Example 3: Handling records that are not changing

Assume Employees table as of '2023-01-03':

EmployeeKey | EmployeeID | Name   | Department | StartDate  | EndDate
------------|------------|--------|------------|------------|------------
1           | 101        | Alice  | Sales      | 2023-01-01 | 2023-01-01
4           | 101        | Alice  | Marketing  | 2023-01-02 | 9999-12-31
2           | 102        | Bob    | IT         | 2023-01-01 | 9999-12-31
3           | 103        | Charlie| HR         | 2023-01-01 | 9999-12-31

And SourceEmployees table:

EmployeeID | Name   | Department
-----------|--------|------------
101        | Alice  | Marketing
102        | Bob    | IT
103        | Charlie| HR

Input: Employees table (as above) and SourceEmployees table (as above). CURRENT_DATE is '2023-01-03'. SCD attributes: Department.

Output: Employees table after loading:

EmployeeKey | EmployeeID | Name   | Department | StartDate  | EndDate
------------|------------|--------|------------|------------|------------
1           | 101        | Alice  | Sales      | 2023-01-01 | 2023-01-01
4           | 101        | Alice  | Marketing  | 2023-01-02 | 9999-12-31
2           | 102        | Bob    | IT         | 2023-01-01 | 9999-12-31
3           | 103        | Charlie| HR         | 2023-01-01 | 9999-12-31

Explanation: All EmployeeIDs present in SourceEmployees have their current records in Employees that match the Name and Department. No changes require updates or new inserts.

Constraints

  • The dimension table will contain at least the columns: SurrogateKey (integer, auto-incrementing), NaturalKey (string/integer), Attribute1, Attribute2, ..., StartDate (date), EndDate (date).
  • The NaturalKey uniquely identifies a logical dimension member.
  • StartDate will always be less than or equal to EndDate.
  • EndDate will be the sentinel value '9999-12-31' for currently active records.
  • The CURRENT_DATE used for processing will be a specific, consistent date for each batch load.
  • You will be provided with a SourceTable and an existing DimensionTable. Your task is to write the logic (e.g., SQL statements) to update the DimensionTable based on the SourceTable.

Notes

  • Consider using a MERGE statement (or equivalent constructs depending on your specific SQL dialect) for efficiency.
  • The definition of "SCD attributes" is critical. For this challenge, assume the relevant SCD attributes are all columns in the source table except for the natural key.
  • Think about how to efficiently find records that need to be expired and records that need to be inserted.
  • You are not required to implement the auto-incrementing surrogate key mechanism itself; assume it's handled by the database system. Your focus is on the logic of handling dimension changes.
  • For simplicity, assume no deletions occur in the source data.
Loading editor...
plaintext