Python ETL Pipeline for Customer Data Analysis
This challenge involves building a robust Extract, Transform, Load (ETL) pipeline in Python. You will process raw customer data from a file, clean and enrich it, and then load it into a structured format suitable for analysis. This is a fundamental skill for data engineers and analysts.
Problem Description
Your task is to create a Python script that performs an ETL process on a given dataset of customer information. The dataset contains raw entries that may have inconsistencies, missing values, and require transformations before being ready for analysis.
Key Requirements:
- Extract: Read customer data from a CSV file.
- Transform:
- Data Cleaning:
- Handle missing values: For 'age' and 'income', fill missing values with the mean of the respective column. For 'country', fill missing values with the most frequent country.
- Standardize 'email': Ensure all email addresses are lowercase.
- Clean 'phone_number': Remove any non-digit characters (e.g., spaces, hyphens, parentheses).
- Convert 'registration_date' to a datetime object.
- Data Enrichment:
- Create a 'customer_segment' based on 'age' and 'income':
- "Young Professional":
age< 30 andincome>= 50000 - "Established Professional": 30 <=
age< 50 andincome>= 75000 - "Senior Executive":
age>= 50 andincome>= 100000 - "Budget-Conscious":
income< 40000 - "General": All other cases
- "Young Professional":
- Create a 'customer_segment' based on 'age' and 'income':
- Data Formatting:
- Ensure 'age' is an integer.
- Format 'income' to two decimal places.
- Data Cleaning:
- Load: Save the transformed and enriched data to a new CSV file.
Expected Behavior:
The script should take an input CSV file path and an output CSV file path as arguments. It should then perform the described transformations and save the results to the output file.
Edge Cases to Consider:
- Input file not found.
- Empty input file.
- Columns missing from the input file.
- Data types that cannot be converted (e.g., non-numeric age). For simplicity in this challenge, assume valid numeric data for 'age' and 'income' if present, but robust error handling for these conversions is a good practice for real-world scenarios.
Examples
Example 1:
Input CSV (input_customers.csv):
customer_id,name,age,email,phone_number,country,income,registration_date
1,Alice Smith,28,ALICE.SMITH@EXAMPLE.COM,(123) 456-7890,USA,60000,2022-01-15
2,Bob Johnson,,bob.johnson@example.com,987-654-3210,Canada,45000,2021-05-20
3,Charlie Brown,35,charlie.b@example.com,+1-555-123-4567,USA,80000,2023-03-10
4,Diana Prince,45,DIANA.P@EXAMPLE.COM,5551112222,UK,,2020-11-01
5,Ethan Hunt,,ethan.h@example.com,001-800-555-0100,Australia,95000,2022-07-22
6,Fiona Glenanne,55,fiona.g@example.com,(555) 555-5555,USA,120000,2019-09-30
7,George Jetson,22,george.j@example.com,111-222-3333,,30000,2023-01-01
Output CSV (output_customers.csv):
customer_id,name,age,email,phone_number,country,income,registration_date,customer_segment
1,Alice Smith,28,alice.smith@example.com,1234567890,USA,60000.00,2022-01-15 00:00:00,Young Professional
2,Bob Johnson,33,bob.johnson@example.com,9876543210,Canada,45000.00,2021-05-20 00:00:00,General
3,Charlie Brown,35,charlie.b@example.com,15551234567,USA,80000.00,2023-03-10 00:00:00,Established Professional
4,Diana Prince,45,diana.p@example.com,5551112222,UK,79428.57,2020-11-01 00:00:00,Established Professional
5,Ethan Hunt,35,ethan.h@example.com,18005550100,Australia,95000.00,2022-07-22 00:00:00,Senior Executive
6,Fiona Glenanne,55,fiona.g@example.com,5555555555,USA,120000.00,2019-09-30 00:00:00,Senior Executive
7,George Jetson,22,george.j@example.com,1112223333,USA,30000.00,2023-01-01 00:00:00,Budget-Conscious
Explanation:
- Alice's email was lowercased, phone number cleaned. Her segment is "Young Professional".
- Bob's missing age was filled with the mean (calculated from other valid ages). His email was lowercased, phone cleaned. His segment is "General".
- Charlie's phone was cleaned. His segment is "Established Professional".
- Diana's missing income was filled with the mean (calculated from other valid incomes). Her email was lowercased, phone cleaned. Her segment is "Established Professional".
- Ethan's missing age was filled with the mean. His email was lowercased, phone cleaned. His segment is "Senior Executive".
- Fiona's email was lowercased, phone cleaned. Her segment is "Senior Executive".
- George's missing country was filled with the most frequent country ("USA"). His email was lowercased, phone cleaned. His segment is "Budget-Conscious".
- All ages are integers. Incomes are formatted to two decimal places. Registration dates are datetime objects.
Example 2:
Input CSV (input_empty.csv):
customer_id,name,age,email,phone_number,country,income,registration_date
Output CSV (output_empty.csv):
customer_id,name,age,email,phone_number,country,income,registration_date,customer_segment
Explanation: An empty input file results in an empty output file with the added 'customer_segment' column header.
Example 3: (Handling missing 'income' for mean calculation)
Input CSV (input_missing_income.csv):
customer_id,name,age,email,phone_number,country,income,registration_date
1,Alice Smith,28,ALICE@EXAMPLE.COM,(123) 456-7890,USA,,2022-01-15
2,Bob Johnson,30,bob@example.com,987-654-3210,Canada,40000,2021-05-20
Output CSV (output_missing_income.csv):
customer_id,name,age,email,phone_number,country,income,registration_date,customer_segment
1,Alice Smith,28,alice@example.com,1234567890,USA,40000.00,2022-01-15 00:00:00,General
2,Bob Johnson,30,bob@example.com,9876543210,Canada,40000.00,2021-05-20 00:00:00,General
Explanation:
- When calculating the mean for 'income', only the valid entry for Bob is used, resulting in a mean of 40000.
- Alice's missing income is filled with 40000. Her segment is "General".
- Bob's segment is "General".
Constraints
- The input data will be provided as a CSV file.
- The input CSV will have at least the following columns:
customer_id,name,age,email,phone_number,country,income,registration_date. Some of these may be empty. - The
ageandincomecolumns will contain numeric data if not empty. - The
registration_datecolumn will be in a format parseable by Python'sdatetimemodule. - Your solution should use standard Python libraries such as
csvanddatetime, or a library likepandasfor a more streamlined approach. - The script should be runnable from the command line with two arguments: input file path and output file path.
Notes
- Consider using the
pandaslibrary for a more efficient and readable solution, especially for data manipulation tasks. - For calculating the mean, ensure you handle cases where a column might be entirely empty. In such cases, filling with 0 or another sensible default would be appropriate (though for this specific challenge, we assume at least one valid entry for mean calculation where applicable, or handle it gracefully).
- The order of transformations is important. For example, cleaning phone numbers should happen before any potential validation or formatting that might rely on digits.
- Think about how to handle potential errors during file operations or data conversions. While not strictly enforced by a strict failure, a robust solution would log or report these issues.
- The 'customer_segment' column should be the last column in the output CSV.