Hone logo
Hone
Problems

Building a Simple ETL Pipeline in Python

This challenge asks you to build a basic Extract, Transform, and Load (ETL) pipeline in Python. ETL processes are fundamental in data engineering, used to move data from various sources, clean and transform it, and load it into a target destination for analysis or reporting. This exercise will help you understand the core concepts of ETL and practice working with data manipulation in Python.

Problem Description

You are tasked with creating an ETL pipeline that extracts data from a CSV file, transforms it by converting a specific column to a different data type and adding a new calculated column, and then loads the transformed data into a new CSV file. The input CSV file contains customer data with columns: customer_id, name, email, and total_spent. The total_spent column is currently stored as a string.

What needs to be achieved:

  1. Extract: Read the data from an input CSV file.
  2. Transform:
    • Convert the total_spent column from string to float. Handle potential ValueError exceptions if the string cannot be converted to a float (set these values to 0.0).
    • Create a new column called discounted_spent by applying a 10% discount to the total_spent (now a float).
  3. Load: Write the transformed data (including the new discounted_spent column) to a new CSV file.

Key Requirements:

  • The code should be modular and well-structured.
  • Error handling is crucial, especially when converting the total_spent column.
  • The output CSV file should have a header row with the column names: customer_id, name, email, total_spent, discounted_spent.

Expected Behavior:

The program should read the input CSV, perform the transformations as described, and write the transformed data to the output CSV. If a total_spent value cannot be converted to a float, it should be treated as 0.0. The discounted_spent column should accurately reflect a 10% discount on the total_spent.

Edge Cases to Consider:

  • Empty input CSV file.
  • CSV file with missing values in any column.
  • total_spent column containing non-numeric values that cannot be converted to floats.
  • CSV file with incorrect number of columns.

Examples

Example 1:

Input:
customer_id,name,email,total_spent
1,Alice,alice@example.com,100.00
2,Bob,bob@example.com,200.50
3,Charlie,charlie@example.com,abc
4,David,david@example.com,300
Output:
customer_id,name,email,total_spent,discounted_spent
1,Alice,alice@example.com,100.0,90.0
2,Bob,bob@example.com,200.5,180.45
3,Charlie,charlie@example.com,0.0,0.0
4,David,david@example.com,300.0,270.0

Explanation: The total_spent values are converted to floats. "abc" is converted to 0.0. A 10% discount is applied to each total_spent value to calculate discounted_spent.

Example 2:

Input:
customer_id,name,email,total_spent
1,Alice,alice@example.com,
2,Bob,bob@example.com,200.50
Output:
customer_id,name,email,total_spent,discounted_spent
1,Alice,alice@example.com,0.0,0.0
2,Bob,bob@example.com,200.5,180.45

Explanation: An empty string in total_spent is treated as 0.0.

Constraints

  • The input CSV file will be relatively small (less than 1MB).
  • The input CSV file will be comma-separated.
  • The code should be able to handle a reasonable number of rows (up to 1000).
  • The output CSV file should also be comma-separated.
  • The code should be written in Python 3.

Notes

  • You can use the csv module for reading and writing CSV files.
  • Consider using a try-except block to handle potential ValueError exceptions during the conversion of the total_spent column.
  • Think about how to handle missing values in the input CSV file gracefully. Treating them as 0.0 for total_spent is a reasonable approach.
  • Focus on creating a clear, readable, and maintainable solution. Modularity is key.
  • The discount calculation should be accurate to two decimal places.
Loading editor...
python