
Feature Engineering for Customer Churn Prediction

Feature engineering is a crucial step in building effective machine learning models. It involves transforming raw data into features that better represent the underlying problem to the predictive models, leading to improved accuracy and interpretability. This challenge focuses on creating new, informative features from existing customer data to predict churn.

Problem Description

Your task is to implement a series of feature engineering techniques on a given dataset of customer information. The goal is to create new features that could potentially improve the performance of a churn prediction model. You will be provided with a Pandas DataFrame containing various customer attributes, such as demographics, usage patterns, and contract details.

What needs to be achieved:

  1. Create interaction features: Combine existing numerical features to capture synergistic effects.
  2. Create polynomial features: Generate higher-order terms of existing numerical features to capture non-linear relationships.
  3. Discretize numerical features: Group continuous numerical features into bins to handle non-linearities and outliers.
  4. Handle categorical features: Convert categorical features into numerical representations suitable for machine learning models.

Key requirements:

  • You will need to use libraries like Pandas and Scikit-learn.
  • The output should be a new Pandas DataFrame with the original features and the newly engineered features.
  • The engineered features should be clearly named to indicate their origin and transformation.

Expected behavior:

The function should accept a Pandas DataFrame as input and return a modified Pandas DataFrame. The returned DataFrame should contain all original columns plus the newly generated features.
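This contract can be sketched as a single function that copies the input and appends engineered columns. The name `engineer_features` and the specific transformations below are illustrative choices, not part of the specification:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with engineered feature columns appended."""
    out = df.copy()
    # Interaction feature, created only if both source columns exist.
    if {"Age", "MonthlyCharges"}.issubset(out.columns):
        out["Age_x_MonthlyCharges"] = out["Age"] * out["MonthlyCharges"]
    # Polynomial feature: second-order term of MonthlyCharges.
    if "MonthlyCharges" in out.columns:
        out["MonthlyCharges_squared"] = out["MonthlyCharges"] ** 2
    return out
```

Working on a copy keeps the original columns intact, which satisfies the requirement that the returned DataFrame contains all original columns plus the new ones.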

Edge cases to consider:

  • Missing values in numerical and categorical columns.
  • Columns with very few unique categorical values.
  • Columns with a large range of numerical values.

Examples

Example 1: Basic Feature Creation

Input DataFrame:
   CustomerID  Age  MonthlyCharges  TotalCharges  Gender  Churn
0           1   25            50.0        1200.0    Male     No
1           2   40            75.0        3000.0  Female     No
2           3   30            60.0         600.0    Male    Yes
3           4   55            80.0        4400.0  Female     No

Output DataFrame (simplified for illustration, showing new columns):
   CustomerID  Age  MonthlyCharges  TotalCharges  Gender Churn  Age_x_MonthlyCharges  MonthlyCharges_squared Age_binned
0           1   25            50.0        1200.0    Male    No                1250.0                  2500.0   (20, 30]
1           2   40            75.0        3000.0  Female    No                3000.0                  5625.0   (30, 40]
2           3   30            60.0         600.0    Male   Yes                1800.0                  3600.0   (20, 30]
3           4   55            80.0        4400.0  Female    No                4400.0                  6400.0   (50, 60]

Explanation:

  • Age_x_MonthlyCharges: Created by multiplying Age and MonthlyCharges.
  • MonthlyCharges_squared: Created by squaring MonthlyCharges.
  • Age_binned: Age has been discretized into equal-width bins with edges at 20, 30, 40, 50, and 60 (the actual bin edges depend on the chosen discretization strategy).
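The three new columns can be produced with basic pandas operations; the bin edges passed to pd.cut below are one illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 40, 30, 55],
    "MonthlyCharges": [50.0, 75.0, 60.0, 80.0],
})

# Interaction feature: elementwise product of two numeric columns.
df["Age_x_MonthlyCharges"] = df["Age"] * df["MonthlyCharges"]
# Polynomial feature: second-order term of a single column.
df["MonthlyCharges_squared"] = df["MonthlyCharges"] ** 2
# Discretization: explicit bin edges yield half-open intervals like (20, 30].
df["Age_binned"] = pd.cut(df["Age"], bins=[20, 30, 40, 50, 60])
```

pd.cut produces right-closed intervals by default, so an Age of exactly 30 lands in (20, 30].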

Example 2: Handling Categorical Features and Multiple Transformations

Input DataFrame:
   CustomerID  Tenure  MonthlyCharges    ContractType InternetService Churn
0           1       1            50.0  Month-to-month             DSL    No
1           2      12            75.0        One year     Fiber optic    No
2           3       5            60.0  Month-to-month             DSL   Yes
3           4      24            80.0        Two year     Fiber optic    No
4           5       3            55.0  Month-to-month             DSL   Yes

Output DataFrame (simplified, showing new columns):
   CustomerID  Tenure  MonthlyCharges  ContractType  InternetService  Churn  Tenure_x_MonthlyCharges  MonthlyCharges_poly2  ContractType_Month-to-month  ContractType_One year  ContractType_Two year  InternetService_DSL  InternetService_Fiber optic
0           1       1            50.0      Month-to-month     DSL           No                  50.0                 2500.0                          1                      0                      0                    1                           0
1           2      12            75.0      One year           Fiber optic   No                 900.0                 5625.0                          0                      1                      0                    0                           1
2           3       5            60.0      Month-to-month     DSL           Yes                300.0                 3600.0                          1                      0                      0                    1                           0
3           4      24            80.0      Two year           Fiber optic   No                1920.0                 6400.0                          0                      0                      1                    0                           1
4           5       3            55.0      Month-to-month     DSL           Yes                165.0                 3025.0                          1                      0                      0                    1                           0

Explanation:

  • Tenure_x_MonthlyCharges: Interaction feature.
  • MonthlyCharges_poly2: Polynomial feature (e.g., MonthlyCharges^2).
  • Categorical columns (ContractType, InternetService) are one-hot encoded into new binary columns (e.g., ContractType_Month-to-month, InternetService_DSL).
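The one-hot columns shown above follow the `<column>_<value>` naming convention that pd.get_dummies produces; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "ContractType": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "Fiber optic"],
})

# One-hot encode both categorical columns into 0/1 indicator columns.
dummies = pd.get_dummies(df, columns=["ContractType", "InternetService"], dtype=int)
```

Alternatively, scikit-learn's OneHotEncoder does the same job and is easier to reuse on unseen data, at the cost of a little more setup.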

Example 3: Handling Missing Values and Different Binning Strategies

Input DataFrame:
   CustomerID  Age  MonthlyCharges  TotalCharges  Gender  Churn
0           1   25            50.0        1200.0    Male     No
1           2   40             NaN        3000.0  Female     No
2           3   30            60.0         600.0    Male    Yes
3           4   55            80.0           NaN  Female     No
4           5   NaN            70.0        1500.0    Male     No


Output DataFrame (showing handling of NaNs and example binning):
   CustomerID   Age  MonthlyCharges  TotalCharges  Gender Churn  Age_x_MonthlyCharges MonthlyCharges_binned TotalCharges_binned
0           1  25.0            50.0        1200.0    Male    No                1250.0              (40, 60]        (1000, 2000]
1           2  40.0            65.0        3000.0  Female    No                2600.0              (60, 80]        (2500, 3500]
2           3  30.0            60.0         600.0    Male   Yes                1800.0              (40, 60]         (500, 1000]
3           4  55.0            80.0        1350.0  Female    No                4400.0              (60, 80]        (1000, 2000]
4           5  35.0            70.0        1500.0    Male    No                2450.0              (60, 80]        (1000, 2000]

Explanation:

  • Missing MonthlyCharges in row 1 was imputed with the column median (65.0) before feature creation.
  • Missing TotalCharges in row 3 was imputed with the column median (1350.0).
  • Missing Age in row 4 was imputed with the column median (35.0).
  • MonthlyCharges and TotalCharges are discretized into bins. The binning strategy is up to you (e.g., pd.qcut, or pd.cut with a chosen number of bins).
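A minimal sketch of this pipeline, using median imputation followed by quantile binning; both the imputation strategy and the bin count of 3 are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({
    "MonthlyCharges": [50.0, None, 60.0, 80.0, 70.0],
    "TotalCharges": [1200.0, 3000.0, 600.0, None, 1500.0],
})

# Impute missing numeric values with each column's median
# (robust to the large range of TotalCharges).
for col in ["MonthlyCharges", "TotalCharges"]:
    df[col] = df[col].fillna(df[col].median())

# Quantile-based binning: roughly equal counts per bin.
df["TotalCharges_binned"] = pd.qcut(df["TotalCharges"], q=3)
```

Imputing before binning matters: pd.cut and pd.qcut assign NaN inputs to no bin, so unimputed rows would end up with missing bin labels.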

Constraints

  • The input DataFrame will have at least the columns CustomerID, Age, MonthlyCharges, TotalCharges, Gender, ContractType, InternetService, and Churn. Other columns may be present but are not the focus of this challenge.
  • CustomerID is a unique identifier and should not be used for feature engineering.
  • Churn is the target variable and should not be used to create features (unless explicitly stated for target encoding, which is not required here).
  • The numerical columns (Age, MonthlyCharges, TotalCharges, and Tenure, if present) may contain missing values.
  • Categorical columns (Gender, ContractType, InternetService) will contain string values.
  • Your solution should aim to be reasonably efficient for datasets of moderate size (e.g., up to 100,000 rows).

Notes

  • Consider strategies for handling missing values before applying transformations. You might impute them with the mean, median, or a constant value, depending on the feature.
  • For discretization, you can choose between equal-width binning (pd.cut) or quantile-based binning (pd.qcut). Quantile-based binning is often preferred as it ensures an approximately equal number of observations in each bin.
  • For categorical features, one-hot encoding is a common and effective method.
  • When creating interaction or polynomial features, be mindful of feature names to avoid collisions and ensure clarity.
  • The specific implementation details (e.g., number of bins, imputation strategy) are left to your discretion, but should be justifiable. Document your choices.
  • The goal is to demonstrate a good understanding of common feature engineering techniques.
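The difference between the two binning strategies mentioned in the notes is easiest to see on skewed data; a small illustration (the sample values are made up):

```python
import pandas as pd

charges = pd.Series([10.0, 12.0, 13.0, 15.0, 90.0, 100.0])

# Equal-width bins: the two large values stretch the range,
# crowding most observations into the lowest bin.
width_bins = pd.cut(charges, bins=3)

# Quantile bins: roughly equal counts per bin despite the skew.
quantile_bins = pd.qcut(charges, q=3)
```

With equal-width binning the middle bin here ends up empty, whereas quantile binning puts two observations in each bin, which is why pd.qcut is often the safer default for long-tailed features like TotalCharges.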