Feature Engineering for Customer Churn Prediction
Feature engineering is a crucial step in building effective machine learning models. It involves transforming raw data into features that better represent the underlying problem to the predictive models, leading to improved accuracy and interpretability. This challenge focuses on creating new, informative features from existing customer data to predict churn.
Problem Description
Your task is to implement a series of feature engineering techniques on a given dataset of customer information. The goal is to create new features that could potentially improve the performance of a churn prediction model. You will be provided with a Pandas DataFrame containing various customer attributes, such as demographics, usage patterns, and contract details.
What needs to be achieved:
- Create interaction features: Combine existing numerical features to capture synergistic effects.
- Create polynomial features: Generate higher-order terms of existing numerical features to capture non-linear relationships (see the PolynomialFeatures sketch after this list).
- Discretize numerical features: Group continuous numerical features into bins to handle non-linearities and outliers.
- Handle categorical features: Convert categorical features into numerical representations suitable for machine learning models.
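As a hint for the first two items, Scikit-learn's PolynomialFeatures can generate interaction and squared terms in a single call. The snippet below is a minimal sketch on a toy frame (the column names are borrowed from the examples later in this document), not a required approach:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

toy = pd.DataFrame({"Age": [25, 40], "MonthlyCharges": [50.0, 75.0]})

# degree=2 yields Age, MonthlyCharges, Age^2, Age*MonthlyCharges, MonthlyCharges^2;
# include_bias=False drops the constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = pd.DataFrame(
    poly.fit_transform(toy),
    columns=poly.get_feature_names_out(toy.columns),
)
print(expanded.columns.tolist())
# ['Age', 'MonthlyCharges', 'Age^2', 'Age MonthlyCharges', 'MonthlyCharges^2']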
Key requirements:
- You will need to use libraries like Pandas and Scikit-learn.
- The output should be a new Pandas DataFrame with the original features and the newly engineered features.
- The engineered features should be clearly named to indicate their origin and transformation.
Expected behavior:
The function should accept a Pandas DataFrame as input and return a modified Pandas DataFrame. The returned DataFrame should contain all original columns plus the newly generated features.
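One possible shape for such a function, shown as a hedged sketch (the name engineer_features, the median imputation, and the four equal-width Age bins are illustrative choices, not requirements):

import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # impute missing numerics first so later transforms never see NaN
    for col in ["Age", "MonthlyCharges", "TotalCharges"]:
        if col in out.columns:
            out[col] = out[col].fillna(out[col].median())
    # interaction and polynomial terms
    out["Age_x_MonthlyCharges"] = out["Age"] * out["MonthlyCharges"]
    out["MonthlyCharges_squared"] = out["MonthlyCharges"] ** 2
    # discretize Age into four equal-width bins
    out["Age_binned"] = pd.cut(out["Age"], bins=4)
    # one-hot encode categoricals, keeping the original columns
    cats = [c for c in ["Gender", "ContractType", "InternetService"] if c in out.columns]
    out = pd.concat([out, pd.get_dummies(out[cats], columns=cats, dtype=int)], axis=1)
    return out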
Edge cases to consider:
- Missing values in numerical and categorical columns.
- Columns with very few unique categorical values.
- Columns with a large range of numerical values.
Examples
Example 1: Basic Feature Creation
Input DataFrame:
CustomerID Age MonthlyCharges TotalCharges Gender Churn
0 1 25 50.0 1200.0 Male No
1 2 40 75.0 3000.0 Female No
2 3 30 60.0 600.0 Male Yes
3 4 55 80.0 4400.0 Female No
Output DataFrame (simplified for illustration, showing new columns):
CustomerID Age MonthlyCharges TotalCharges Gender Churn Age_x_MonthlyCharges MonthlyCharges_squared Age_binned
0 1 25 50.0 1200.0 Male No 1250.0 2500.0 (20, 30]
1 2 40 75.0 3000.0 Female No 3000.0 5625.0 (30, 40]
2 3 30 60.0 600.0 Male Yes 1800.0 3600.0 (20, 30]
3 4 55 80.0 4400.0 Female No 4400.0 6400.0 (50, 60]
Explanation:
- Age_x_MonthlyCharges: created by multiplying Age and MonthlyCharges.
- MonthlyCharges_squared: created by squaring MonthlyCharges.
- Age_binned: Age has been discretized into bins such as (20, 30] and (30, 40] (the actual bin edges depend on the discretization strategy).
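A sketch that reproduces the three new columns above, assuming fixed decade-wide bin edges for Age (any justifiable edges would do):

import pandas as pd

df = pd.DataFrame({
    "Age": [25, 40, 30, 55],
    "MonthlyCharges": [50.0, 75.0, 60.0, 80.0],
})
df["Age_x_MonthlyCharges"] = df["Age"] * df["MonthlyCharges"]
df["MonthlyCharges_squared"] = df["MonthlyCharges"] ** 2
# intervals are left-open/right-closed, e.g. 30 falls in (20, 30]
df["Age_binned"] = pd.cut(df["Age"], bins=[20, 30, 40, 50, 60])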
Example 2: Handling Categorical Features and Multiple Transformations
Input DataFrame:
CustomerID Tenure MonthlyCharges ContractType InternetService Churn
0 1 1 50.0 Month-to-month DSL No
1 2 12 75.0 One year Fiber optic No
2 3 5 60.0 Month-to-month DSL Yes
3 4 24 80.0 Two year Fiber optic No
4 5 3 55.0 Month-to-month DSL Yes
Output DataFrame (simplified, showing new columns):
CustomerID Tenure MonthlyCharges ContractType InternetService Churn Tenure_x_MonthlyCharges MonthlyCharges_poly2 ContractType_Month-to-month ContractType_One year ContractType_Two year InternetService_DSL InternetService_Fiber optic
0 1 1 50.0 Month-to-month DSL No 50.0 2500.0 1 0 0 1 0
1 2 12 75.0 One year Fiber optic No 900.0 5625.0 0 1 0 0 1
2 3 5 60.0 Month-to-month DSL Yes 300.0 3600.0 1 0 0 1 0
3 4 24 80.0 Two year Fiber optic No 1920.0 6400.0 0 0 1 0 1
4 5 3 55.0 Month-to-month DSL Yes 165.0 3025.0 1 0 0 1 0
Explanation:
- Tenure_x_MonthlyCharges: interaction feature (Tenure multiplied by MonthlyCharges).
- MonthlyCharges_poly2: polynomial feature (MonthlyCharges^2).
- Categorical columns (ContractType, InternetService) are one-hot encoded into new binary columns (e.g., ContractType_Month-to-month, InternetService_DSL).
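A minimal sketch of the one-hot step using pd.get_dummies; dtype=int keeps the 0/1 integers shown above, and concatenating back preserves the original columns:

import pandas as pd

df = pd.DataFrame({
    "ContractType": ["Month-to-month", "One year", "Month-to-month",
                     "Two year", "Month-to-month"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "Fiber optic", "DSL"],
})
cats = ["ContractType", "InternetService"]
dummies = pd.get_dummies(df[cats], columns=cats, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())
# ['ContractType', 'InternetService', 'ContractType_Month-to-month',
#  'ContractType_One year', 'ContractType_Two year',
#  'InternetService_DSL', 'InternetService_Fiber optic']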
Example 3: Handling Missing Values and Different Binning Strategies
Input DataFrame:
CustomerID Age MonthlyCharges TotalCharges Gender Churn
0 1 25 50.0 1200.0 Male No
1 2 40 NaN 3000.0 Female No
2 3 30 60.0 600.0 Male Yes
3 4 55 80.0 NaN Female No
4 5 NaN 70.0 1500.0 Male No
Output DataFrame (showing handling of NaNs and example binning):
CustomerID Age MonthlyCharges TotalCharges Gender Churn Age_x_MonthlyCharges MonthlyCharges_binned TotalCharges_binned
0 1 25 50.0 1200.0 Male No 1250.0 (40, 60] (1000, 2000]
1 2 40 65.0 3000.0 Female No 2600.0 (60, 80] (2500, 3500]
2 3 30 60.0 600.0 Male Yes 1800.0 (40, 60] (500, 1000]
3 4 55 80.0 1350.0 Female No 4400.0 (60, 80] (1000, 2000]
4 5 35 70.0 1500.0 Male No 2450.0 (60, 80] (1000, 2000]
Explanation:
- Missing MonthlyCharges in row 1 was imputed before feature creation (here with the column median, 65.0; mean or constant imputation would also be acceptable).
- Missing TotalCharges in row 3 was likewise imputed (median, 1350.0).
- Missing Age in row 4 was imputed (median, 35).
- MonthlyCharges and TotalCharges are discretized into bins. The binning strategy is yours to specify (e.g., pd.qcut, or pd.cut with chosen edges or a number of bins).
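A sketch matching the table above, assuming median imputation and the fixed bin edges shown (both choices are illustrative):

import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, 40.0, 30.0, 55.0, None],
    "MonthlyCharges": [50.0, None, 60.0, 80.0, 70.0],
    "TotalCharges": [1200.0, 3000.0, 600.0, None, 1500.0],
})
# median imputation: Age -> 35.0, MonthlyCharges -> 65.0, TotalCharges -> 1350.0
for col in ["Age", "MonthlyCharges", "TotalCharges"]:
    df[col] = df[col].fillna(df[col].median())
df["Age_x_MonthlyCharges"] = df["Age"] * df["MonthlyCharges"]
df["MonthlyCharges_binned"] = pd.cut(df["MonthlyCharges"], bins=[40, 60, 80])
df["TotalCharges_binned"] = pd.cut(df["TotalCharges"],
                                   bins=[500, 1000, 2000, 2500, 3500])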
Constraints
- The input DataFrame will have at least the columns CustomerID, Age, MonthlyCharges, TotalCharges, Gender, ContractType, InternetService, and Churn. Other columns may be present but are not the focus of this challenge.
- CustomerID is a unique identifier and should not be used for feature engineering.
- Churn is the target variable and should not be used to create features (unless explicitly stated for target encoding, which is not required here).
- The numerical columns (Age, MonthlyCharges, TotalCharges, and Tenure, if present) may contain missing values.
- Categorical columns (Gender, ContractType, InternetService) will contain string values.
- Your solution should be reasonably efficient for datasets of moderate size (e.g., up to 100,000 rows).
Notes
- Consider strategies for handling missing values before applying transformations. You might impute them with the mean, median, or a constant value, depending on the feature.
- For discretization, you can choose between equal-width binning (pd.cut) and quantile-based binning (pd.qcut). Quantile-based binning is often preferred because it places an approximately equal number of observations in each bin; a small comparison sketch follows this list.
- For categorical features, one-hot encoding is a common and effective method.
- When creating interaction or polynomial features, be mindful of feature names to avoid collisions and ensure clarity.
- The specific implementation details (e.g., number of bins, imputation strategy) are left to your discretion, but should be justifiable. Document your choices.
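To illustrate the trade-off mentioned in the discretization note, here is a small comparison sketch (the series values are arbitrary; 120.0 plays the role of an outlier):

import pandas as pd

charges = pd.Series([50.0, 65.0, 60.0, 80.0, 70.0, 120.0])

# equal-width: edges are evenly spaced, so an outlier can leave bins nearly empty
print(pd.cut(charges, bins=3).value_counts().sort_index())

# quantile-based: edges follow the data, so each bin holds roughly equal counts
print(pd.qcut(charges, q=3).value_counts().sort_index())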
- The goal is to demonstrate a good understanding of common feature engineering techniques.