Hone logo
Hone
Problems

Feature Engineering for Customer Churn Prediction

This challenge focuses on using advanced SQL techniques to prepare data for a machine learning model that predicts customer churn. You will be tasked with creating new, informative features from existing customer data, mimicking a crucial step in the machine learning pipeline. This is essential for improving model accuracy and understanding customer behavior.

Problem Description

Your goal is to create a dataset enriched with engineered features that can be used to train a machine learning model to predict customer churn. You are given a Customers table and a Transactions table. You need to join these tables and derive several new features for each customer.

Key Requirements:

  1. Customer Segmentation: Categorize customers based on their total spending.
  2. Recency Feature: Calculate the number of days since a customer's last transaction.
  3. Frequency Feature: Count the total number of transactions per customer.
  4. Monetary Feature: Calculate the average transaction amount per customer.
  5. Engagement Score: Create a composite score that combines recency, frequency, and monetary values. This score should reward more recent, frequent, and high-spending customers.
  6. Churn Indicator: Based on the ChurnedDate in the Customers table, create a binary indicator (IsChurned) where 1 means the customer has churned and 0 means they have not.

Expected Behavior:

The output should be a table where each row represents a unique customer, along with the original customer information and the newly engineered features.

Edge Cases:

  • Customers with no transactions: These customers should still be included in the output. Their recency, frequency, and monetary values should reflect this absence of activity (e.g., nulls or appropriate default values, and a churn status).
  • Customers who churned on their first transaction date (if such a scenario is possible and ChurnedDate is populated).

Examples

Example 1:

Input Tables:

Customers table:

CustomerIDSignUpDateChurnedDate
1012023-01-15NULL
1022023-02-012023-03-10
1032023-03-20NULL

Transactions table:

TransactionIDCustomerIDTransactionDateAmount
11012023-01-2050.00
21012023-02-1575.00
31022023-02-20100.00
41012023-03-0160.00
51032023-04-05200.00

Output Table (Sample - CurrentDate assumed to be 2023-04-15):

CustomerIDSignUpDateChurnedDateIsChurnedTotalSpendingTransactionCountAvgTransactionAmountDaysSinceLastTransactionCustomerSegmentEngagementScore
1012023-01-15NULL0185.00361.6745High95
1022023-02-012023-03-101100.001100.0036Medium70
1032023-03-20NULL0200.001200.0010High85

Explanation:

  • IsChurned for Customer 101 and 103 is 0 because ChurnedDate is NULL. For Customer 102, it's 1 because ChurnedDate is populated.
  • TotalSpending is the sum of Amount for each customer.
  • TransactionCount is the count of transactions for each customer.
  • AvgTransactionAmount is TotalSpending / TransactionCount.
  • DaysSinceLastTransaction is calculated as CurrentDate - TransactionDate of the latest transaction for each customer.
  • CustomerSegment is determined by TotalSpending (e.g., High > 150, Medium 50-150, Low < 50). This is an illustrative example of segmentation.
  • EngagementScore is a hypothetical combination: (365 - DaysSinceLastTransaction) * 0.4 + TransactionCount * 0.3 + AvgTransactionAmount * 0.3. (Actual formula might vary, this is for illustration).

Example 2:

Input Tables:

Customers table:

CustomerIDSignUpDateChurnedDate
1042023-05-01NULL
1052023-05-012023-05-15

Transactions table:

TransactionIDCustomerIDTransactionDateAmount
61042023-05-0530.00

Output Table (Sample - CurrentDate assumed to be 2023-05-20):

CustomerIDSignUpDateChurnedDateIsChurnedTotalSpendingTransactionCountAvgTransactionAmountDaysSinceLastTransactionCustomerSegmentEngagementScore
1042023-05-01NULL030.00130.0015Low60
1052023-05-012023-05-1510.000NULLNULLLow0

Explanation:

  • Customer 105 has no transactions, so TotalSpending, TransactionCount, AvgTransactionAmount, and DaysSinceLastTransaction are handled appropriately (0, NULL, NULL). The EngagementScore is also 0.

Constraints

  • CustomerID is unique within the Customers table.
  • TransactionID is unique within the Transactions table.
  • TransactionDate and ChurnedDate are valid date formats.
  • Amount is a non-negative numeric value.
  • You may assume a CurrentDate variable or function is available for calculating recency. For testing, you can hardcode a specific date.
  • The number of customers can be up to 1,000,000.
  • The number of transactions can be up to 100,000,000.
  • The SQL query should execute within 30 seconds on a typical database setup.

Notes

  • Consider using window functions or aggregate functions with GROUP BY for efficient calculation of aggregated metrics.
  • Think about how to handle missing values gracefully, especially for customers with no transactions.
  • The specific definitions of CustomerSegment and the EngagementScore formula can be tailored. For this challenge, define clear logic for them. For example, CustomerSegment could be defined by terciles of TotalSpending. For EngagementScore, consider a weighted sum of normalized features.
  • This exercise aims to build a feature set. The actual churn prediction model is outside the scope of this challenge.
Loading editor...
plaintext