
Customer Churn Prediction Pipeline

This challenge asks you to build a complete machine learning pipeline in Python to predict customer churn for a telecommunications company. Customer churn, or the rate at which customers stop doing business with a company, is a critical metric. Accurately predicting churn allows businesses to proactively intervene and retain valuable customers.

Problem Description

You are provided with a dataset containing information about customers of a telecommunications company, including demographic data, service usage, and account details. Your task is to build a machine learning pipeline that can accurately predict which customers are likely to churn. The pipeline should include data loading, preprocessing, feature engineering, model training, and evaluation.

What needs to be achieved:

  1. Data Loading: Load the provided dataset into a Pandas DataFrame.
  2. Data Preprocessing: Handle missing values (if any) and encode categorical features using appropriate techniques (e.g., one-hot encoding).
  3. Feature Engineering: Create new features that might improve model performance. This could involve combining existing features or transforming them.
  4. Data Splitting: Split the data into training and testing sets.
  5. Model Training: Train a classification model (e.g., Logistic Regression, Random Forest, Support Vector Machine) on the training data.
  6. Model Evaluation: Evaluate the model's performance on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score, AUC).
  7. Pipeline Orchestration: Structure the entire process into a cohesive pipeline, making it reusable and maintainable.
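
The seven steps above can be sketched end-to-end with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a minimal illustration, not a reference solution: the synthetic DataFrame and its columns (`Tenure`, `MonthlyCharges`, `Contract`, `Churn`) merely mimic the real dataset's schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 1: load data -- a synthetic stand-in for the real CSV.
df = pd.DataFrame({
    "Tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13] * 10,
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65,
                       89.10, 29.75, 104.80, 56.15, 49.95] * 10,
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "One year",
                 "Two year", "Month-to-month"] * 10,
    "Churn": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1] * 10,
})

numeric = ["Tenure", "MonthlyCharges"]
categorical = ["Contract"]

# Steps 2-3: per-column-type preprocessing (imputation, scaling, encoding).
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Steps 5 and 7: preprocessing and model joined into a single pipeline.
pipe = Pipeline([("prep", preprocess),
                 ("clf", LogisticRegression(max_iter=1000))])

# Step 4: train/test split, stratified to preserve the churn rate.
X = df.drop(columns="Churn")
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Steps 5-6: fit on the training split, evaluate on the held-out split.
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
print(classification_report(y_test, pipe.predict(X_test)))
print("AUC:", roc_auc_score(y_test, proba))
```

Because all transformations are fitted inside the pipeline, the test set never leaks into preprocessing statistics such as imputation medians or scaler means.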

Key Requirements:

  • The solution must be written in Python.
  • Use appropriate libraries for data manipulation (Pandas), machine learning (Scikit-learn), and, optionally, visualization (Matplotlib/Seaborn) to better understand the data.
  • The code should be well-documented and easy to understand.
  • The pipeline should be modular, allowing for easy modification and experimentation with different data preprocessing techniques, feature engineering steps, and machine learning models.
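
One way to get the required modularity is to make the model a named pipeline step, so swapping estimators touches nothing else. A sketch on synthetic stand-in data (the step names and candidate models here are illustrative choices, not requirements):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; the real pipeline would use the churn DataFrame.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8],
                           random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Because the model is just a named step, swapping it is one line --
# the preprocessing in front of it is untouched.
results = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(n_estimators=100,
                                                    random_state=0))]:
    pipe.set_params(clf=clf)
    results[name] = cross_val_score(pipe, X, y, cv=5,
                                    scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {results[name]:.3f}")
```

The same `set_params` mechanism supports experimenting with preprocessing steps, which is what makes grid searches over whole pipelines possible.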

Expected Behavior:

The pipeline should take the raw dataset as input and output the following:

  • A trained machine learning model.
  • Evaluation metrics (accuracy, precision, recall, F1-score, AUC) on the testing set.
  • (Optional) Visualizations to help understand the data and model performance.
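
All of the listed metrics are available in `sklearn.metrics`; the labels and probabilities below are made up purely to demonstrate the calls:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical test-set labels, predicted labels, and churn probabilities.
y_true = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.1, 0.9, 0.2, 0.6, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # label agreement
    "precision": precision_score(y_true, y_pred),  # of predicted churners
    "recall": recall_score(y_true, y_pred),        # of actual churners
    "f1": f1_score(y_true, y_pred),                # precision/recall balance
    "auc": roc_auc_score(y_true, y_prob),          # ranking by probability
}
print(metrics)
```

Note that AUC is computed from predicted probabilities, not hard labels, so the trained model needs a `predict_proba` (or `decision_function`) method.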

Edge Cases to Consider:

  • Missing Values: The dataset might contain missing values. Handle these appropriately (e.g., imputation, removal).
  • Categorical Features: Categorical features need to be encoded numerically before being used in the model.
  • Imbalanced Classes: The churn rate might be low, leading to an imbalanced dataset. Consider techniques to address this (e.g., oversampling, undersampling, cost-sensitive learning).
  • Feature Scaling: Some models perform better when features are scaled. Consider scaling numerical features.
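
On the imbalance point: since only Pandas and Scikit-learn are allowed, oversampling libraries such as imbalanced-learn are off the table, but cost-sensitive learning is built in via `class_weight`. A sketch on synthetic data with roughly 5% positives:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~5% positives, mimicking a low churn rate.
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

recalls = {}
for weight in (None, "balanced"):
    # "balanced" reweights each class inversely to its frequency, so
    # misclassifying a rare churner costs more during training.
    clf = LogisticRegression(max_iter=1000, class_weight=weight)
    clf.fit(X_tr, y_tr)
    recalls[weight] = recall_score(y_te, clf.predict(X_te))
    print(f"class_weight={weight}: recall on churners = {recalls[weight]:.2f}")
```

The usual trade-off applies: `class_weight="balanced"` tends to raise recall on the minority class at some cost in precision, which is often acceptable when missing a churner is the expensive error.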

Examples

Example 1:

Input: A Pandas DataFrame with columns like 'CustomerID', 'Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'
Output: A trained Logistic Regression model, accuracy: 0.80, precision: 0.75, recall: 0.65, F1-score: 0.70, AUC: 0.78
Explanation: The pipeline loads the data, preprocesses it (one-hot encoding categorical features, handling missing values in 'TotalCharges'), splits it into training and testing sets, trains a Logistic Regression model, and evaluates its performance on the testing set.

Example 2:

Input: Same as Example 1, but with a higher proportion of churned customers (e.g., 20% instead of 5%).
Output: A trained Random Forest model, accuracy: 0.82, precision: 0.78, recall: 0.72, F1-score: 0.75, AUC: 0.82
Explanation: With a different class distribution, a Random Forest model might outperform Logistic Regression. The pipeline adapts to the class balance and reports the same set of evaluation metrics.

Example 3: (Edge Case - Missing Values)

Input: A Pandas DataFrame with missing values in the 'TotalCharges' column.
Output: A trained model (e.g., Logistic Regression), accuracy: 0.79, precision: 0.74, recall: 0.64, F1-score: 0.69, AUC: 0.77
Explanation: The pipeline handles the missing values in 'TotalCharges' by imputing them with the mean or median of the column. The model is then trained and evaluated as usual.
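
As a concrete sketch of this edge case: in the widely used Telco churn dataset, 'TotalCharges' arrives as strings, with blanks for brand-new customers, so the column must first be coerced to numeric before imputing. The toy frame below assumes that same quirk:

```python
import pandas as pd

# Toy frame mimicking the 'TotalCharges' quirk: string values, with
# blank entries for customers who have not yet been billed.
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "", "108.15"]})

# Coerce to numeric; unparseable entries (the blanks) become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Impute with the median, which is robust to the column's skew.
median = df["TotalCharges"].median()
df["TotalCharges"] = df["TotalCharges"].fillna(median)
print(df["TotalCharges"].tolist())
# -> [29.85, 108.15, 1889.5, 108.15, 108.15]
```

In the full pipeline the imputation statistic should come from the training split only (e.g. a `SimpleImputer` inside the pipeline), so the test set never influences it.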

Constraints

  • Dataset Size: The dataset will contain approximately 10,000 rows.
  • Feature Count: The dataset will have around 20 features.
  • Time Limit: The pipeline should execute within 5 minutes.
  • Memory Limit: The pipeline should use less than 2GB of memory.
  • Libraries: You are allowed to use Pandas, Scikit-learn, and Matplotlib/Seaborn. No external APIs or libraries are permitted.

Notes

  • Focus on building a robust and well-structured pipeline rather than achieving the absolute highest accuracy.
  • Experiment with different feature engineering techniques to see how they impact model performance.
  • Consider the trade-offs between different classification models (e.g., Logistic Regression vs. Random Forest).
  • Document your code clearly, explaining the purpose of each step in the pipeline.
  • Think about how you would handle new data coming in – how would you apply the trained model to predict churn for new customers?
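
On that last note: if every preprocessing step lives inside one fitted `Pipeline`, scoring new customers is a single call on a raw DataFrame with the same columns. A minimal sketch, with illustrative training data and column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Fit a tiny pipeline on illustrative data (columns are assumptions).
train = pd.DataFrame({
    "Tenure": [1, 40, 3, 55, 6, 30],
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
    "Churn": [1, 0, 1, 0, 1, 0],
})
prep = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"])],
    remainder="passthrough")
pipe = Pipeline([("prep", prep), ("clf", LogisticRegression())])
pipe.fit(train.drop(columns="Churn"), train["Churn"])

# New customers arrive as a raw DataFrame with the same columns; the
# fitted pipeline replays every preprocessing step before predicting.
new_customers = pd.DataFrame({
    "Tenure": [2, 48],
    "Contract": ["Month-to-month", "Two year"],
})
churn_prob = pipe.predict_proba(new_customers)[:, 1]
print(churn_prob)
```

`handle_unknown="ignore"` keeps the encoder from failing on categories never seen during training, and the fitted pipeline can be persisted with `joblib.dump` and reloaded for batch or online scoring.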