Designing an Efficient Data Aggregation Pipeline in SQL
This challenge focuses on designing an efficient Extract, Transform, Load (ETL) pipeline using SQL to aggregate and process data from multiple sources. You will be tasked with creating a solution that can handle large volumes of data, ensure data integrity, and provide accurate, summarized information for business intelligence purposes. This is a common and crucial task in data engineering.
Problem Description
You are tasked with building an ETL pipeline that extracts data from two source tables, orders and customers, transforms it by joining the tables and aggregating order details, and loads it into a new customer_order_summary table. The goal is to create a summary table that shows the total spending and number of orders for each customer, along with their registration date.
Key Requirements:
- Data Extraction: Read data from two distinct source tables:
ordersandcustomers. - Data Transformation:
- Join
ordersandcustomerstables on a commoncustomer_id. - For each customer, calculate the sum of
order_amountand count oforder_id. - Include the
registration_datefrom thecustomerstable. - Handle cases where a customer might exist in the
customerstable but have no orders.
- Join
- Data Loading: Insert the transformed data into a new table named
customer_order_summary. - Efficiency: The pipeline should be designed to perform efficiently, especially when dealing with a large number of records in the source tables.
Expected Behavior:
The customer_order_summary table should contain one row per unique customer. Each row should represent:
customer_id: The unique identifier for the customer.registration_date: The date the customer registered.total_spent: The sum of allorder_amountfor that customer.order_count: The total number of orders placed by that customer.
Edge Cases:
- Customers who have registered but have not placed any orders.
- Potentially large volumes of orders and customers.
Examples
Example 1:
Input orders Table:
| order_id | customer_id | order_amount | order_date |
|----------|-------------|--------------|------------|
| 101 | 1 | 50.00 | 2023-01-15 |
| 102 | 2 | 75.50 | 2023-01-16 |
| 103 | 1 | 120.00 | 2023-01-20 |
| 104 | 3 | 30.00 | 2023-01-21 |
| 105 | 1 | 25.00 | 2023-01-25 |
Input customers Table:
| customer_id | customer_name | registration_date |
|-------------|---------------|-------------------|
| 1 | Alice | 2022-11-01 |
| 2 | Bob | 2022-12-15 |
| 3 | Charlie | 2023-01-01 |
| 4 | David | 2023-02-10 |
Output customer_order_summary Table:
| customer_id | registration_date | total_spent | order_count |
|-------------|-------------------|-------------|-------------|
| 1 | 2022-11-01 | 195.00 | 3 |
| 2 | 2022-12-15 | 75.50 | 1 |
| 3 | 2023-01-01 | 30.00 | 1 |
| 4 | 2023-02-10 | 0.00 | 0 |
Explanation:
- Customer 1 (Alice) has 3 orders totaling 50 + 120 + 25 = 195.00.
- Customer 2 (Bob) has 1 order totaling 75.50.
- Customer 3 (Charlie) has 1 order totaling 30.00.
- Customer 4 (David) has no orders, so
total_spentis 0.00 andorder_countis 0. Theregistration_dateis still included.
Example 2:
Input orders Table: (Empty)
Input customers Table:
| customer_id | customer_name | registration_date |
|-------------|---------------|-------------------|
| 1 | Alice | 2022-11-01 |
| 2 | Bob | 2022-12-15 |
Output customer_order_summary Table:
| customer_id | registration_date | total_spent | order_count |
|-------------|-------------------|-------------|-------------|
| 1 | 2022-11-01 | 0.00 | 0 |
| 2 | 2022-12-15 | 0.00 | 0 |
Explanation:
- With no orders, all customers have 0
total_spentand 0order_count, but their registration dates are preserved.
Constraints
- The
orderstable can contain up to 10 million rows. - The
customerstable can contain up to 5 million rows. - The
customer_idin both tables is an integer. - The
order_amountis a decimal type. - The
registration_dateis a date type. - The solution must be written in standard SQL syntax.
- The pipeline should aim for a execution time of under 5 minutes on typical hardware.
Notes
- Consider the type of join that will correctly handle customers without orders.
- Think about how to aggregate data efficiently. Subqueries, Common Table Expressions (CTEs), or window functions might be useful.
- The problem asks for the design of the ETL pipeline, which implies creating the SQL query that performs the transformation and loading. You do not need to create separate
CREATE TABLEstatements for the source tables, but you should assume they exist with the described schema. - For customers with no orders, ensure their
total_spentis 0 andorder_countis 0.