Feature Engineering for Customer Churn Prediction
This challenge focuses on using advanced SQL techniques to prepare data for a machine learning model that predicts customer churn. You will be tasked with creating new, informative features from existing customer data, mimicking a crucial step in the machine learning pipeline. This is essential for improving model accuracy and understanding customer behavior.
Problem Description
Your goal is to create a dataset enriched with engineered features that can be used to train a machine learning model to predict customer churn. You are given a Customers table and a Transactions table. You need to join these tables and derive several new features for each customer.
Key Requirements:
- Customer Segmentation: Categorize customers based on their total spending.
- Recency Feature: Calculate the number of days since a customer's last transaction.
- Frequency Feature: Count the total number of transactions per customer.
- Monetary Feature: Calculate the average transaction amount per customer.
- Engagement Score: Create a composite score that combines recency, frequency, and monetary values. This score should reward more recent, frequent, and high-spending customers.
- Churn Indicator: Based on the
ChurnedDatein theCustomerstable, create a binary indicator (IsChurned) where 1 means the customer has churned and 0 means they have not.
Expected Behavior:
The output should be a table where each row represents a unique customer, along with the original customer information and the newly engineered features.
Edge Cases:
- Customers with no transactions: These customers should still be included in the output. Their recency, frequency, and monetary values should reflect this absence of activity (e.g., nulls or appropriate default values, and a churn status).
- Customers who churned on their first transaction date (if such a scenario is possible and
ChurnedDateis populated).
Examples
Example 1:
Input Tables:
Customers table:
| CustomerID | SignUpDate | ChurnedDate |
|---|---|---|
| 101 | 2023-01-15 | NULL |
| 102 | 2023-02-01 | 2023-03-10 |
| 103 | 2023-03-20 | NULL |
Transactions table:
| TransactionID | CustomerID | TransactionDate | Amount |
|---|---|---|---|
| 1 | 101 | 2023-01-20 | 50.00 |
| 2 | 101 | 2023-02-15 | 75.00 |
| 3 | 102 | 2023-02-20 | 100.00 |
| 4 | 101 | 2023-03-01 | 60.00 |
| 5 | 103 | 2023-04-05 | 200.00 |
Output Table (Sample - CurrentDate assumed to be 2023-04-15):
| CustomerID | SignUpDate | ChurnedDate | IsChurned | TotalSpending | TransactionCount | AvgTransactionAmount | DaysSinceLastTransaction | CustomerSegment | EngagementScore |
|---|---|---|---|---|---|---|---|---|---|
| 101 | 2023-01-15 | NULL | 0 | 185.00 | 3 | 61.67 | 45 | High | 95 |
| 102 | 2023-02-01 | 2023-03-10 | 1 | 100.00 | 1 | 100.00 | 36 | Medium | 70 |
| 103 | 2023-03-20 | NULL | 0 | 200.00 | 1 | 200.00 | 10 | High | 85 |
Explanation:
IsChurnedfor Customer 101 and 103 is 0 becauseChurnedDateis NULL. For Customer 102, it's 1 becauseChurnedDateis populated.TotalSpendingis the sum ofAmountfor each customer.TransactionCountis the count of transactions for each customer.AvgTransactionAmountisTotalSpending/TransactionCount.DaysSinceLastTransactionis calculated asCurrentDate-TransactionDateof the latest transaction for each customer.CustomerSegmentis determined byTotalSpending(e.g., High > 150, Medium 50-150, Low < 50). This is an illustrative example of segmentation.EngagementScoreis a hypothetical combination:(365 - DaysSinceLastTransaction) * 0.4 + TransactionCount * 0.3 + AvgTransactionAmount * 0.3. (Actual formula might vary, this is for illustration).
Example 2:
Input Tables:
Customers table:
| CustomerID | SignUpDate | ChurnedDate |
|---|---|---|
| 104 | 2023-05-01 | NULL |
| 105 | 2023-05-01 | 2023-05-15 |
Transactions table:
| TransactionID | CustomerID | TransactionDate | Amount |
|---|---|---|---|
| 6 | 104 | 2023-05-05 | 30.00 |
Output Table (Sample - CurrentDate assumed to be 2023-05-20):
| CustomerID | SignUpDate | ChurnedDate | IsChurned | TotalSpending | TransactionCount | AvgTransactionAmount | DaysSinceLastTransaction | CustomerSegment | EngagementScore |
|---|---|---|---|---|---|---|---|---|---|
| 104 | 2023-05-01 | NULL | 0 | 30.00 | 1 | 30.00 | 15 | Low | 60 |
| 105 | 2023-05-01 | 2023-05-15 | 1 | 0.00 | 0 | NULL | NULL | Low | 0 |
Explanation:
- Customer 105 has no transactions, so
TotalSpending,TransactionCount,AvgTransactionAmount, andDaysSinceLastTransactionare handled appropriately (0, NULL, NULL). TheEngagementScoreis also 0.
Constraints
CustomerIDis unique within theCustomerstable.TransactionIDis unique within theTransactionstable.TransactionDateandChurnedDateare valid date formats.Amountis a non-negative numeric value.- You may assume a
CurrentDatevariable or function is available for calculating recency. For testing, you can hardcode a specific date. - The number of customers can be up to 1,000,000.
- The number of transactions can be up to 100,000,000.
- The SQL query should execute within 30 seconds on a typical database setup.
Notes
- Consider using window functions or aggregate functions with
GROUP BYfor efficient calculation of aggregated metrics. - Think about how to handle missing values gracefully, especially for customers with no transactions.
- The specific definitions of
CustomerSegmentand theEngagementScoreformula can be tailored. For this challenge, define clear logic for them. For example,CustomerSegmentcould be defined by terciles ofTotalSpending. ForEngagementScore, consider a weighted sum of normalized features. - This exercise aims to build a feature set. The actual churn prediction model is outside the scope of this challenge.