Designing an Efficient ETL Pipeline for Customer Order Data

This challenge focuses on designing an efficient Extract, Transform, and Load (ETL) pipeline using SQL to consolidate and enrich customer order data from multiple sources. Building robust ETL pipelines is crucial for data warehousing and business intelligence, enabling organizations to derive valuable insights from disparate data sources. Your task is to design a SQL-based pipeline that extracts data, performs necessary transformations, and loads it into a target table.

Problem Description

You are tasked with designing an ETL pipeline to combine customer order data from two sources: a raw_orders table and a customer_details table. The raw_orders table contains raw order information, while the customer_details table holds customer-specific information. The goal is to create a consolidated enriched_orders table containing order details combined with customer information, and calculated metrics.

What needs to be achieved:

Extract: Retrieve data from raw_orders and customer_details tables.
Transform:
- Join the data based on customer_id.
- Calculate the total order value by multiplying quantity and price from raw_orders.
- Create a new column customer_region by extracting the region from the customer_city field in customer_details (assume the region is the first word of the city).
Load: Insert the transformed data into the enriched_orders table.

Key Requirements:

The solution must be implemented using standard SQL.
The pipeline should be efficient, minimizing unnecessary operations.
The enriched_orders table should contain all the required columns.
Handle cases where a customer_id exists in raw_orders but not in customer_details (include these orders in the output with NULL values for customer details).
Handle cases where a customer_id exists in customer_details but not in raw_orders (these customers should not be included in the output).

Expected Behavior:

The final enriched_orders table should contain the following columns: order_id, customer_id, order_date, quantity, price, total_order_value, customer_name, customer_city, customer_region. The total_order_value should be calculated correctly, and the customer_region should be extracted as described.

Edge Cases to Consider:

Null values in any of the input columns.
Empty tables for either raw_orders or customer_details.
Invalid city names in customer_details that might cause errors when extracting the region. (Handle gracefully, e.g., set customer_region to NULL).
Large datasets – the solution should be reasonably performant.

Examples

Example 1:

raw_orders:
order_id | customer_id | order_date | quantity | price
---------|-------------|------------|----------|-------
1        | 101         | 2023-01-15 | 2        | 25.00
2        | 102         | 2023-02-20 | 1        | 50.00
3        | 101         | 2023-03-10 | 3        | 10.00

customer_details:
customer_id | customer_name | customer_city
-------------|---------------|---------------
101         | Alice Smith   | New York City
102         | Bob Johnson   | London UK
103         | Charlie Brown | Paris France

enriched_orders:
order_id | customer_id | order_date | quantity | price | total_order_value | customer_name | customer_city | customer_region
---------|-------------|------------|----------|-------|-------------------|---------------|---------------|---------------
1        | 101         | 2023-01-15 | 2        | 25.00 | 50.00             | Alice Smith   | New York City | New
2        | 102         | 2023-02-20 | 1        | 50.00 | 50.00             | Bob Johnson   | London UK     | London
3        | 101         | 2023-03-10 | 3        | 10.00 | 30.00             | Alice Smith   | New York City | New

Explanation: The data from raw_orders and customer_details is joined on customer_id. total_order_value is calculated, and customer_region is extracted from customer_city.

Example 2:

raw_orders:
order_id | customer_id | order_date | quantity | price
---------|-------------|------------|----------|-------
1        | 101         | 2023-01-15 | 2        | 25.00
2        | 104         | 2023-02-20 | 1        | 50.00

customer_details:
customer_id | customer_name | customer_city
-------------|---------------|---------------
101         | Alice Smith   | New York City
102         | Bob Johnson   | London UK

enriched_orders:
order_id | customer_id | order_date | quantity | price | total_order_value | customer_name | customer_city | customer_region
---------|-------------|------------|----------|-------|-------------------|---------------|---------------|---------------
1        | 101         | 2023-01-15 | 2        | 25.00 | 50.00             | Alice Smith   | New York City | New

Explanation: Order with customer_id 104 is excluded because there is no corresponding entry in customer_details.

Constraints

The solution must be implemented using standard SQL (no vendor-specific extensions).
The input tables raw_orders and customer_details can contain up to 1,000,000 rows each.
The customer_city field in customer_details has a maximum length of 255 characters.
The pipeline should complete within 30 seconds for the given dataset size.
The order_date column is of type DATE.
The quantity and price columns are of type INTEGER and DECIMAL respectively.

Notes

Consider using appropriate JOIN types (e.g., LEFT JOIN) to handle missing data.
Pay attention to data types when performing calculations.
Optimize the query for performance by using indexes where appropriate (though indexing is not part of the solution itself, consider it in your design).
The region extraction logic is simplified; a more robust solution might involve a lookup table or more sophisticated string parsing.
Assume the target table enriched_orders already exists with the specified columns and appropriate data types. You only need to provide the INSERT statement.
Pseudocode for the overall process:
1. SELECT from raw_orders and customer_details using LEFT JOIN on customer_id.
2. Calculate total_order_value as quantity * price.
3. Extract customer_region from customer_city (first word).
4. INSERT the resulting data into enriched_orders.