Churn Prediction
Project Overview
The project aims to develop a predictive model of customer churn for an energy supplier.
Churn prediction is crucial as it can help the company identify at-risk customers and implement strategies to improve customer retention, ultimately leading to better customer satisfaction and increased profitability.
The task is to go through the model development process, starting from defining the business problem to the model implementation recommendation.
Outline
Part 1: Defining the business problem and research design
Management Dilemma
Management Question
Research Question
Sub-Research Questions
Part 2: Data preprocessing
Data Cleaning: Addressing missing values, outliers, and inconsistencies.
Data Transformation: Standardization or normalization of variables, encoding categorical variables, and engineering new features if necessary.
Part 3: Exploratory data analysis (EDA)
Summary Statistics
Univariate Analysis
Multivariate analysis
Initial Hypothesis Testing
Part 4: Modeling
Model Specification
Model Estimation
Model Validation
Part 5: Results, Conclusions, and Managerial Implications
Key Findings
Recommendations
Data Descriptions
The data consists of 20,000 customer records, with an even split between churned and retained customers. The dataset includes 14 variables potentially influencing churn, such as demographic data, contract details, energy usage, and other customer-related information.
Customer_ID: a unique customer identification number.
Gender: a dummy variable indicating if the customer who signed the contract is male (0) or female (1).
Age: the age of the customer in years.
Income: the monthly income of the customer’s household in dollars.
Relation_length: the number of months the customer has been with the firm.
Contract_length: the number of months the customer still has a contract with the firm. A value of 0 means the customer has a flexible contract and can leave at any time without paying a fine; if the remaining contract is more than zero months, the customer can still leave but has to pay a fine on leaving.
Start_channel: indicates whether the contract was taken out by the customer on the firm’s website (“Online”) or by calling the firm (“Phone”).
Email_list: indicates whether the customer’s email address is known to the firm (1=yes, 0=no).
Home_age: the age of the home of the customer in years.
Home_label: energy label of the home of the customer, ranging from A (good) to G (bad).
Electricity_usage: the yearly electricity usage in kWh.
Gas_usage: the yearly gas usage in cubic meters.
Province: the province where the customer is living.
Churn: a dummy variable indicating if the customer has churned (1) or not (0).
Part 1: Defining the business problem and research design
To address the energy supplier's concern about how it can increase customer retention and prevent churn, the supplier must identify which customers are at risk of churning, as losing customers can significantly impact revenue and market share.
We start with a well-defined business question that targets the core concern of the energy provider:
How can the identification of customers at risk of churning be improved to facilitate retention strategies?
The overarching research question is designed to guide the investigative process:
What are the key predictors of customer churn for an energy provider, and how can they be quantified to forecast the likelihood of churn?
To comprehensively approach the research question, we break it down into targeted sub-questions, as illustrated in the table below. These questions systematically dissect the problem into manageable segments, focusing on specific factors such as energy usage patterns, contract details, customer demographics, etc.
Each sub-question is followed by a hypothesis grounded in empirical data and previous literature. Developing these hypotheses sets the stage for rigorous statistical testing and data analysis, improving the reliability and validity of our predictive model.
Part 2: Data Preprocessing - Data Wrangling & Cleaning
Pre-processing involved verifying data integrity, handling anomalies, missing values, and outliers, and ensuring appropriate data transformation for modeling purposes.
In the data preprocessing phase, we simplify the variables to accommodate the range of models planned for this study. The objective is to retain a broad set of potentially relevant predictors while keeping the dataset clear and easy to model, so we streamlined the data to minimize unnecessary complexity.
Initial inspection reveals a total of 20,000 observations and 14 variables in the dataset, with no missing values (NAs).
Categorical (Factor) Variables
In our data preprocessing workflow, we transformed binary variables into factors to facilitate their integration into our models. Specifically, the 'Churn' variable, indicating customer attrition status with 0 for 'no churn' and 1 for 'churned', was converted to a factor. Similarly, 'Gender' was encoded as a factor where 0 represents 'Male' and 1 represents 'Female'. The 'Email_list' variable, denoting the presence of an email address, with 0 for 'no' and 1 for 'yes', and 'Start_channel', indicating the initiation channel, with 0 for 'Phone' and 1 for 'Online', were also converted to factors.
For variables with more than two categories, such as 'Province' and 'Home_label', we adopted a factorization approach where each unique category was assigned a distinct identifier. 'Province', originally containing 12 distinct names, was transformed into a factor with levels 1 to 12, corresponding to each unique province. The 'Home_label', representing the energy efficiency rating from A to G, was converted to a factor with numerical levels ranging from 1 to 7, each mapping to a respective label from A to G. This conversion not only simplifies the dataset but also primes it for the application of various predictive modeling techniques.
Following this factorization, we applied one-hot encoding to these variables to make them more useful to machine learning algorithms. One-hot encoding transforms factor levels into binary columns suitable for a wide range of algorithms. This matters for logistic regression, which would otherwise treat numeric category codes as ordered, evenly spaced values, and it helps decision trees by providing clear binary features for splits. By one-hot encoding variables like 'Province' and 'Home_label', we avoid imposing an ordinal interpretation, enabling our models to better capture the intricacies of the data and improve predictive accuracy.
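A minimal sketch of this encoding step in R, assuming the raw data frame is named `energy_data` and using caret's dummyVars() for the one-hot step (the exact helper used in the notebook may differ):

```r
library(caret)  # dummyVars() for one-hot encoding

# Binary variables become two-level factors
energy_data$Churn         <- factor(energy_data$Churn)  # 0 = retained, 1 = churned
energy_data$Gender        <- factor(energy_data$Gender, levels = c(0, 1), labels = c("Male", "Female"))
energy_data$Email_list    <- factor(energy_data$Email_list, levels = c(0, 1), labels = c("No", "Yes"))
energy_data$Start_channel <- factor(energy_data$Start_channel, levels = c("Phone", "Online"))

# Multi-category variables get one level per unique category
energy_data$Province   <- factor(energy_data$Province)  # 12 provinces
energy_data$Home_label <- factor(energy_data$Home_label, levels = c("A", "B", "C", "D", "E", "F", "G"))

# One-hot encode the multi-level factors to avoid an ordinal interpretation
dummies <- dummyVars(~ Province + Home_label, data = energy_data)
onehot  <- predict(dummies, newdata = energy_data)
energy_data <- cbind(energy_data[setdiff(names(energy_data), c("Province", "Home_label"))], onehot)
```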
Numerical Variables
In the preprocessing of numerical variables, our primary focus was on outlier management and normalization to ensure model robustness and accuracy.
1) Outlier Management:
We addressed outliers by capping extreme values, which can disproportionately influence model outcomes. For instance, ages above 85 were capped at 85, reflecting a reasonable upper limit in the context of our data. Income levels above 50,000 were adjusted to a monthly scale to align with most of the data distribution. Additionally, for electricity and gas usage, we imposed upper limits of 6000 kWh and 2900 cubic meters, respectively, to mitigate the impact of anomalously high consumption figures. These steps help to prevent the influence of outliers on the learning process.
2) Log Transformation: Skewed distributions were normalized through log transformation, a common technique to stabilize variance and reduce skewness. This was applied to age and income to produce more symmetric distributions. Adding 1 to income before taking the logarithm avoids mathematical issues with zero values.
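A sketch of these adjustments on the same `energy_data` frame; note that the annual-to-monthly rescaling (division by 12) for incomes above 50,000 is an illustrative reading of the adjustment described above, not a confirmed detail:

```r
# Cap extreme values at the limits described above
energy_data$Age               <- pmin(energy_data$Age, 85)
energy_data$Electricity_usage <- pmin(energy_data$Electricity_usage, 6000)  # kWh
energy_data$Gas_usage         <- pmin(energy_data$Gas_usage, 2900)          # cubic meters

# Incomes above 50,000 are assumed to be annual figures and rescaled to monthly
energy_data$Income <- ifelse(energy_data$Income > 50000,
                             energy_data$Income / 12,
                             energy_data$Income)

# Log transforms to reduce skewness; log1p(x) = log(x + 1) avoids log(0)
energy_data$LogAge    <- log(energy_data$Age)
energy_data$LogIncome <- log1p(energy_data$Income)
```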
Note: The figure below shows the data distribution before and after adjustment.
3) Standardization (z-score normalization):
After addressing outliers and skewness, we proceeded to standardize our numerical variables to ensure that each variable contributes equally to the analysis by giving them a standard scale without distorting differences in the ranges of values. This is especially important for models that are sensitive to the scale of the data, such as logistic regression and neural networks, where gradient descent can converge more quickly if all variables are on the same scale.
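A minimal sketch of the standardization step; the list of numeric columns is an assumption based on the variables described earlier:

```r
# Z-score standardization of the numeric predictors (mean 0, sd 1)
num_vars <- c("LogAge", "LogIncome", "Relation_length", "Contract_length",
              "Home_age", "Electricity_usage", "Gas_usage")
finalize_data <- energy_data
finalize_data[num_vars] <- scale(finalize_data[num_vars])
```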
The culmination of these steps resulted in a refined dataset named ‘finalize_data’, which combines the adjustments made to both factor and numerical variables, setting the stage for effective model training and validation.
Part 3: Exploratory Data Analysis
The Exploratory Data Analysis (EDA) phase of our project provides an overview of trends and patterns within the data and critical insights into the distinctions between churned and retained customers. Our visual investigation, illustrated in the figure below, presents a comparative analysis of critical variables for the churn (red) and non-churn (blue) groups.
The density plots highlight distinct behavioral patterns among churned and retained customers. Specifically, we see a trend where customers with short 'Contract Lengths', who face no penalty for leaving, exhibit a higher churn rate. To streamline our analysis, we've transformed this variable into a binary 'PenaltyFee' indicator, where '0' signifies no penalty, and '1' represents a potential fee upon departure.
Further, the 'Relation Length' density suggests a greater churn likelihood among newer customers compared to those with a longer tenure. This aligns with the intuition that longer-standing customers may have stronger loyalty or higher satisfaction levels.
Additionally, the consumption data for 'Electricity_Usage' and 'Gas_Usage' also provide insights into churn behavior, potentially serving as indicators of churn risk. Patterns in these variables may reflect underlying factors influencing a customer's decision to stay with or leave the company.
The stacked bar charts for categorical variables such as 'Gender', 'Start_channel', 'Email_list', 'Province', and 'Home_label' reveal further layers of insight. Particularly evident is the trend in 'Home_label', where the data suggests a correlation between energy efficiency and churn rate. Homes with a poorer energy label, as visualized in the bar plot from A to G, tend to have a higher propensity for churn. This implies that customers living in less energy-efficient homes are more inclined to discontinue their services, highlighting the need for targeted retention strategies in these segments.
Part 4: Modeling & Results
At the core of our analysis, we focus on creating models that help us understand customer behavior and predict who might leave the service (churn/attrition). We use various algorithms to achieve this goal, from simple and clear approaches such as logistic regression to more complex models like neural networks.
Each tool has its own way of looking at the data to find patterns. Along with these, we also use decision trees including CART, CHAID, and C50, as well as other methods such as SVM with its variants (linear, polynomial, radial, and sigmoid) and ensemble methods like bagging and boosting, which combine several models for better predictions.
Our goal is to find the best way to use our data to make accurate predictions.
Model 1: Baseline Model (Logistic Regression)
Our baseline model employs logistic regression, a fundamental approach for binary classification. The main benefits of this model lie in interpretability and computational efficiency, allowing us to establish a benchmark for performance. While it assumes linear relationships between predictors and the log-odds of the outcome, which can be a limitation for complex datasets, it serves as a crucial starting point for our analysis. All predictors in the model are chosen based on existing literature or studies.
The following shows how we train the model; all predictor variables show significant results. The model offers a solid foundation with a good balance between the hit rate, top decile lift, and GINI coefficient. However, it may not capture complex nonlinear relationships as effectively as other models.
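A sketch of the baseline fit and the three evaluation metrics, assuming `finalize_data` has been split into `train_data` and `test_data` with Churn coded 0/1; the 0.5 classification cutoff and the GINI-from-AUC computation are our assumptions:

```r
library(pROC)  # auc() for the GINI coefficient

base_model <- glm(Churn ~ ., data = train_data, family = binomial)
summary(base_model)  # coefficient estimates and significance

pred_prob <- predict(base_model, newdata = test_data, type = "response")
actual    <- as.numeric(as.character(test_data$Churn))  # back to 0/1

# Hit rate: share of correct classifications at a 0.5 cutoff
hit_rate <- mean((pred_prob > 0.5) == actual)

# Top decile lift: churn rate in the top 10% of scores vs. the overall churn rate
top10 <- pred_prob >= quantile(pred_prob, 0.9)
tdl   <- mean(actual[top10]) / mean(actual)

# GINI coefficient: 2 * AUC - 1
gini <- 2 * as.numeric(auc(actual, pred_prob)) - 1
```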
Hit rate: 0.7352, Top Decile Lift: 1.828198, GINI Coefficient: 0.6247152
Model 2: Stepwise Regression
In the stepwise regression model, we utilize both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Stepwise regression identifies a model that balances fit and simplicity. The automated selection process refines our model by including only significant variables, thereby minimizing the risk of overfitting. Its efficiency is counterbalanced by potential instability and the inability to capture complex interactions without manual intervention.
The following shows how we train the model; the top five important variables generated by this model are LogIncome, LogAge, Electricity_usage, Gas_usage, and Relation_length, most of which are in line with the baseline model.
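A sketch of the selection step, building on the `base_model` fitted above; in R's step(), k = 2 corresponds to AIC and k = log(n) to BIC:

```r
# Stepwise selection in both directions; k controls the penalty per parameter
step_aic <- step(base_model, direction = "both", k = 2, trace = 0)
step_bic <- step(base_model, direction = "both", k = log(nrow(train_data)), trace = 0)
summary(step_aic)  # retained predictors after AIC selection
```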
Stepwise regression with AIC selection further refines predictor selection, slightly improving upon the baseline model's metrics. This suggests that the iterative process of adding and removing predictors based on statistical significance can enhance predictive performance.
Hit rate: 0.752, Top Decile Lift: 1.844522, GINI Coefficient: 0.6524961
The table above summarizes important variables from the stepwise model. The model emphasizes key factors that influence customer churn.
The presence of a penalty fee strongly deters churn, implying that financial disincentives are effective in retaining customers. Higher electricity and gas usage are linked to a greater churn likelihood, implying potential service satisfaction or pricing issues.
Longer customer relationships correlate with reduced churn, underscoring the value of customer loyalty and sustained service satisfaction. Higher income levels are associated with lower churn, suggesting that more affluent customers are less price-sensitive or more satisfied with their services. Variables such as email list subscription, province, initial contact channel, home energy label, and customer age play smaller yet significant roles in predicting churn.
These insights suggest targeted strategies to improve customer retention, focusing on service usage patterns, financial incentives, and enhancing customer engagement and satisfaction over time.
Model 3: Decision Tree
Decision trees segment the dataset into homogeneous subsets, offering an intuitive and interpretable structure, and they handle non-linear relationships and feature interactions well. We adopt three variants of decision trees for this project, CART, CHAID, and entropy-based C5.0, each with its own method for building a tree.
3.1 CART
The CART algorithm simplifies decision-making by using the Gini index to perform binary splits, making it a solid choice for data that fits well with such splits. Despite its straightforwardness, CART may overfit the data, which is why we sometimes trim the tree back, a process known as pruning.
The following tree and table summarize variable importance generated by the CART model.
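A minimal CART sketch with the rpart package; the initial cp value and the pruning rule (lowest cross-validated error) are illustrative assumptions:

```r
library(rpart)
library(rpart.plot)

# Grow a deliberately deep tree, then prune back
cart_model <- rpart(Churn ~ ., data = train_data, method = "class",
                    control = rpart.control(cp = 0.001))
best_cp     <- cart_model$cptable[which.min(cart_model$cptable[, "xerror"]), "CP"]
cart_pruned <- prune(cart_model, cp = best_cp)

rpart.plot(cart_pruned)          # the fitted tree
cart_pruned$variable.importance  # variable importance scores
```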
Hit rate: 0.7178, Top Decile Lift: 1.505815, GINI Coefficient: 0.520387
The CART model's analysis underscores 'PenaltyFee' as the most critical factor for churn prediction, indicating that financial aspects are pivotal in customer retention decisions. Subsequent splits on 'Electricity_usage' and 'Gas_usage' suggest that customers' consumption patterns are significant indicators of churn risk.
The appearance of 'Relation_length' indicates the impact of customer tenure on loyalty. The variable importance scores corroborate the tree's splits, with 'Electricity_usage' emerging as a top predictor alongside 'PenaltyFee' and 'Gas_usage'.
These findings suggest that managing penalty fees, monitoring consumption patterns, and fostering long-term customer relationships are key strategies to mitigate churn.
3.2 CHAID
CHAID, on the other hand, creates more complex trees through multi-way splits, made possible by chi-squared tests. This approach is particularly good for dealing with categorical data and uncovering the more complex relationships between variables. However, CHAID requires a larger dataset to work effectively since it relies on statistical tests that need a substantial number of samples.
The following tree and table summarize variable importance generated by the CHAID model.
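A sketch with the CHAID package (hosted on R-Forge), which requires all predictors to be factors, so numeric variables are binned first; the five-bin discretization and significance levels are assumptions:

```r
# install.packages("CHAID", repos = "http://R-Forge.R-project.org")
library(CHAID)

# CHAID needs categorical predictors, so numeric columns are binned
chaid_train <- train_data
num_cols <- names(chaid_train)[sapply(chaid_train, is.numeric)]
chaid_train[num_cols] <- lapply(chaid_train[num_cols], function(x) cut(x, breaks = 5))

chaid_model <- chaid(Churn ~ ., data = chaid_train,
                     control = chaid_control(alpha2 = 0.05, alpha4 = 0.05))
plot(chaid_model)  # the multi-way tree
```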
Hit rate: 0.7244, Top Decile Lift: 1.828198, GINI Coefficient: 0.5939995
The CHAID decision tree model reveals 'Home_label' and 'PenaltyFee' as primary splitting variables, indicating their substantial role in predicting churn. The tree complexity, with its multiple branches, points to the interaction between different variables such as 'Email_list', 'Start_channel', and 'Gender', affecting churn.
The variable importance table aligns with the tree, showing 'PenaltyFee' as a top factor. Other significant variables include 'Electricity_usage', 'Gas_usage', and 'LogIncome', suggesting that customer engagement levels and financial status are key indicators of churn risk.
It is worth noting that in the CHAID decision tree model, 'Home_label' emerges as a primary splitter, indicating its statistical significance in distinguishing churn. However, the variable importance table prioritizes 'Electricity_usage' over 'Home_label', showcasing its consistent contribution to the model's accuracy across various splits. This difference emphasizes the need to consider both the individual statistical significance and the cumulative impact of variables.
3.3 Entropy-based C50
Entropy-based C50 (known as C5.0) is an extension of the decision tree algorithm that employs entropy to construct a decision tree.
Entropy is a measure of disorder or uncertainty, and the goal of the C5.0 algorithm is to split the dataset into subsets with less entropy (i.e., more homogeneity) concerning the target variable, which in our case is churn.
Unlike other models, the C5.0 algorithm uses information gain as the criterion for making splits, which is inherently different from methods like the Gini impurity used in CART models. This often leads to different tree structures and can affect which variables are deemed most important. The following table summarizes variable importance generated by the Entropy C50 (boosting) model.
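A sketch of the three C5.0 variants with the C50 package; trials = 10 turns on boosting, and the exact number of trials is an assumption:

```r
library(C50)

c50_simple <- C5.0(Churn ~ ., data = train_data)                # single tree
c50_rules  <- C5.0(Churn ~ ., data = train_data, rules = TRUE)  # rule-based variant
c50_boost  <- C5.0(Churn ~ ., data = train_data, trials = 10)   # boosting, 10 iterations

C5imp(c50_boost, metric = "usage")  # attribute usage (%), as discussed below
```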
Hit rate: 0.7482, Top Decile Lift: 1.860845, GINI Coefficient: 0.644975
In our churn prediction project, we explored three variants of the Entropy C50 model: a simple model, a rule-based model, and a boosting model.
Boosting is a technique where multiple models are trained sequentially with each model learning from the errors of the previous ones. This approach aims to improve the predictive performance, especially on a dataset with complex variable interactions.
The Entropy C50 boosting model stood out as the most effective approach among the variants we tested. It demonstrated a higher Top Decile Lift and GINI coefficient compared to the simple and rule-based models.
The Entropy C50 model's findings (Table above) indicate that gas consumption patterns and penalty fees are the most critical factors in predicting churn, each with a 100% attribute usage rate. This highlights the significance of behavioral and contractual elements in customer retention.
Other notable predictors include the home energy label ('Home_label') and electricity usage, suggesting that utility consumption is a key indicator of churn risk.
The model also values the length of customer relationships and income levels, which are less dominant but still relevant factors.
Model 4: Bagging & Boosting (Ensemble Technique)
In our exploration of ensemble methods, we begin with bagging and boosting.
4.1 Bagging
Bagging (Bootstrap Aggregating) reduces the variance of our predictive model by generating numerous versions of our predictor and combining them. This is achieved by creating multiple subsets of our original dataset, allowing for different models to be trained independently. The final output is an average of these models, leading to a more stable and accurate prediction.
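A minimal sketch using caret's "treebag" method (bagged classification trees via the ipred package); the number of bootstrap replicates and the cross-validation setup are assumptions:

```r
library(caret)

bag_model <- train(Churn ~ ., data = train_data,
                   method = "treebag", nbagg = 50,
                   trControl = trainControl(method = "cv", number = 5))

varImp(bag_model)  # variable importance scores discussed below
```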
Hit rate: 0.7462, Top Decile Lift: 1.893491, GINI Coefficient: 0.6323962
The variable importance results from the bagging model highlight 'Electricity_usage' and 'Gas_usage' as the most significant predictors of churn, with importance scores of 100.00 and 96.008, respectively. These variables likely reflect customer engagement and satisfaction levels, as utility usage can strongly indicate customer behavior and preferences.
Income and relationship length with the company ('LogIncome' and 'Relation_length') also emerge as influential, with scores of 77.081 and 73.439. This suggests that financial status and tenure with the service could influence a customer's decision to churn.
Other variables such as 'Home_age' and 'LogAge' have moderate importance, indicating that the age of the customer's home and the customer themselves may have some bearing on churn, though less so than usage and income.
Interestingly, 'PenaltyFee' has a lower importance score of 21.953, which could imply that while fees associated with service termination are relevant, they are not as decisive in predicting churn as usage patterns and customer demographics.
However, some caution must be exercised, as 'PenaltyFee' is an important predictor in other models but is not ranked as the most significant predictor in the bagging model. The reason could lie in variable interactions: bagging may dilute the significance of 'PenaltyFee' through its interaction with other variables. If 'PenaltyFee' is part of complex interactions that are not easily isolated, its standalone influence might be underestimated.
4.2 XGBoost
XGBoost (Extreme Gradient Boosting) is a sophisticated ensemble technique that builds sequential decision trees, with each one correcting the mistakes of its predecessor to minimize predictive errors.
This approach and its ability to handle large datasets efficiently make XGBoost a preferred choice for complex modeling tasks. Its strengths lie in its speed, robust regularization that prevents overfitting, an in-built cross-validation mechanism, and its capability to manage missing data and provide feature importance scores. However, XGBoost can still be computationally intensive on large datasets, and it has lower interpretability than simpler models.
The following plot summarizes variable importance generated by the XGBoost model, with corresponding tables.
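A sketch of the XGBoost fit; xgboost expects a numeric matrix, so predictors are expanded with model.matrix(), and the hyperparameters shown are illustrative values rather than the notebook's tuned settings:

```r
library(xgboost)

# xgboost needs a numeric matrix; expand factors into dummy columns
X_train <- model.matrix(Churn ~ . - 1, data = train_data)
y_train <- as.numeric(as.character(train_data$Churn))

xgb_model <- xgboost(data = X_train, label = y_train,
                     objective = "binary:logistic",
                     nrounds = 200, eta = 0.1, max_depth = 4, verbose = 0)

xgb.importance(model = xgb_model)  # gain-based variable importance
```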
Hit rate: 0.755, Top Decile Lift: 1.930218, GINI Coefficient: 0.6693015
The XGBoost model's variable importance analysis reveals critical predictors of customer churn. Notably, 'PenaltyFee0' (Customers incur no penalty fee if they decide to leave), 'Electricity_usage', and 'Gas_usage' emerge as the most influential variables, suggesting that fee structures and energy consumption patterns are pivotal in forecasting churn.
The findings also indicate that the length of the customer relationship and income levels play a considerable role in churn prediction, pointing to the value of fostering long-term relationships and understanding the financial contexts of customers. Variables with lower importance, such as certain provinces, imply that we can shift our focus away from region-specific strategies.
Model 5: Random Forest (Ensemble Technique)
Random forest combines the simplicity of decision trees with the flexibility of ensemble learning. By building a large number of trees and averaging their predictions, Random Forest reduces the overfitting risk of a single decision tree, thereby enhancing overall accuracy.
The algorithm introduces randomness in the selection of features when splitting a node, which ensures the diversity among the trees in the forest. This diversity, in turn, results in a more robust model that can handle a wide variety of data without the risk of fitting too closely to our training data. Despite the increased complexity and difficulty in model interpretation, Random Forest remains a promising approach for its ability to generalize and provide reliable predictions.
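A minimal sketch with the randomForest package; importance = TRUE stores both MeanDecreaseAccuracy and MeanDecreaseGini discussed below, and the tree count is an assumption:

```r
library(randomForest)

rf_model <- randomForest(Churn ~ ., data = train_data,
                         ntree = 500, importance = TRUE)

varImpPlot(rf_model)  # MeanDecreaseAccuracy and MeanDecreaseGini plots
importance(rf_model)  # the underlying importance table
```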
Hit rate: 0.7528, Top Decile Lift: 1.917976, GINI Coefficient: 0.6567712
The plot above shows MeanDecreaseAccuracy - how much accuracy each variable contributes to the model. A higher value indicates that a variable is more important for the model's accuracy. 'PenaltyFee', 'Electricity_usage', and 'Gas_usage' are the top three variables that decrease the accuracy the most when they are removed or shuffled. This suggests that these features are critical predictors of churn.
MeanDecreaseGINI measures each feature's contribution to the homogeneity of the nodes and leaves in the model. Higher values mean the variable is better at splitting the data into pure subsets. Similar to the MeanDecreaseAccuracy plot, 'PenaltyFee', 'Electricity_usage', and 'Gas_usage' are significant in creating pure nodes, indicating their strong predictive power.
Overall, Random Forest analysis has pinpointed the key factors influencing customer churn: 'PenaltyFee', 'Electricity_usage', and 'Gas_usage'. These stand out as the most potent indicators, with 'PenaltyFee' as the top predictor.
This suggests customers might stay to avoid fees, or leave when there is no financial burden. Usage patterns of electricity and gas also play a significant role, possibly reflecting customer satisfaction levels.
Other factors like how long a customer has been with the company ('Relation_length') and their income ('LogIncome') also matter, but they're not as impactful. The least influential factors were related to specific provinces, which seem to have little effect on churn.
To summarize, penalty fees and usage patterns can influence customer retention. Communication strategies can be tailored to these insights, and nurturing long-term customer relationships could also help reduce churn.
Model 6: Support Vector Machine (SVM)
Support Vector Machines (SVM) are a set of supervised learning methods used for classification, regression, and outlier detection.
An SVM model represents examples as points in space, mapped in such a way that examples of separate categories are divided by a clear gap that is as wide as possible.
In our project, we tested four SVM variants (linear, polynomial, radial, and sigmoid kernels) using variables identified as significant in previous models.
The linear kernel is suited for datasets that a hyperplane can separate, while polynomial and radial kernels can navigate more intricate structures. The sigmoid kernel, inspired by neural activation functions, adds another layer of versatility.
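A sketch of the four kernel variants with e1071; the kernel hyperparameters are left at package defaults, which is an assumption:

```r
library(e1071)

kernels    <- c("linear", "polynomial", "radial", "sigmoid")
svm_models <- lapply(kernels, function(k)
  svm(Churn ~ ., data = train_data, kernel = k, probability = TRUE))
names(svm_models) <- kernels

# Churn probabilities from, e.g., the linear kernel on the test set
pred     <- predict(svm_models$linear, newdata = test_data, probability = TRUE)
svm_prob <- attr(pred, "probabilities")[, "1"]
```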
The linear kernel SVM emerged as the superior performer, excelling in hit rate, top decile lift, and GINI coefficient. This kernel functions well when there is a linear relationship between the variables and the outcome.
Compared to other models, the linear SVM's high hit rate indicates a strong ability to correctly classify customers who are likely to churn. The top decile lift suggests that the model is effective at ranking customers by their probability of churning, which is valuable for targeting interventions. The favorable GINI coefficient points to the model's good discriminatory power.
In conclusion, the linear SVM's better performance may be due to its ability to find the optimal separating hyperplane.
The figure below shows the result from the SVM model with a linear kernel.
Hit rate: 0.7376, Top Decile Lift: 1.820037, GINI Coefficient: 0.623836
The variable importance plot derived from the linear-kernel SVM underscores 'PenaltyFee' as the most impactful variable in churn prediction, followed by 'Electricity_usage' and 'Gas_usage', in line with many other models.
Model 7: ANN (Artificial Neural Network)
The ANN is a powerful model capable of capturing intricate patterns within the data. Its strength lies in modeling complex relationships through layers of interconnected nodes. Although it requires extensive data for training and is often criticized for its "black box" nature, the ANN's ability to learn from non-linear and high-dimensional data is unparalleled.
The Artificial Neural Network (ANN) model demonstrated limited predictive power in this project. The model's evenly distributed predictions across deciles, with little concentration in the top deciles, indicate a performance not much better than random guessing. This suggests a lack of strong differentiation in identifying customers likely to churn.
Using a simplified network structure with fewer hidden layers, we aimed for a more efficient computational process. However, even with a focused input of four critical variables—Penalty Fee, Income, Electricity, and Gas Usage—the hit rate only reached 51.74%, falling short of expectations.
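A sketch of the reduced network with the nnet package (a single hidden layer); the size and decay settings are illustrative assumptions, and the formula relies on the engineered PenaltyFee and LogIncome features created earlier:

```r
library(nnet)

# Single hidden layer with 4 units; decay adds weight regularization
ann_model <- nnet(Churn ~ PenaltyFee + LogIncome + Electricity_usage + Gas_usage,
                  data = train_data, size = 4, decay = 0.01,
                  maxit = 500, trace = FALSE)

ann_prob <- predict(ann_model, newdata = test_data, type = "raw")  # churn probabilities
```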
The ANN model's shortcomings include its computational intensity, making it impractical for extensive parameter tuning, and its opaque "black-box" nature, which impedes interpretability. Given these factors, the ANN model is considered the least suitable for our objectives of identifying and understanding customer churn, leading to its exclusion from further comparative analysis.
Part 5: Conclusions and Managerial Implications
The table below summarizes the performance of each model regarding hit rate, top decile lift, and GINI coefficient, which are evaluation metrics suited to classification problems.
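Compiled from the metrics reported in Part 4 (the ANN model, with a hit rate of 51.74%, was excluded from the comparison as noted above):

| Model | Hit Rate | Top Decile Lift | GINI Coefficient |
|---|---|---|---|
| Logistic regression (baseline) | 0.7352 | 1.828198 | 0.6247152 |
| Stepwise regression (AIC) | 0.752 | 1.844522 | 0.6524961 |
| CART | 0.7178 | 1.505815 | 0.520387 |
| CHAID | 0.7244 | 1.828198 | 0.5939995 |
| Entropy C50 (boosting) | 0.7482 | 1.860845 | 0.644975 |
| Bagging | 0.7462 | 1.893491 | 0.6323962 |
| XGBoost | 0.755 | 1.930218 | 0.6693015 |
| Random forest | 0.7528 | 1.917976 | 0.6567712 |
| SVM (linear kernel) | 0.7376 | 1.820037 | 0.623836 |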
The analysis of various predictive models for churn prediction reveals distinct insights and effectiveness in identifying likely churners and the influential factors. The key results indicate that models with higher hit rates, top decile lift, and GINI coefficients are more effective at distinguishing between churners and non-churners. Across models, variables such as 'PenaltyFee,' 'Electricity_usage,' 'Gas_usage,' and 'Relation_length' consistently emerge as significant predictors, highlighting their importance across different algorithmic approaches.
Differences in results can be attributed to how each model processes data and identifies patterns. For instance, decision trees inherently perform feature selection and provide a clear hierarchy of feature importance, which can differ from models like SVM that may weigh variables differently based on the chosen kernel and the separation of data in the feature space.
Considering the project's goal to accurately identify customers at risk of churn and understand the driving factors, the ensemble models, particularly XGBoost and Random Forest, stand out due to their robust performance across all evaluation metrics. They not only deliver high predictive accuracy but also offer insights into variable importance, which aids interpretability—a crucial aspect for developing actionable strategies. Their computational efficiency and ability to handle large datasets with numerous features make them practical for operational use.
From a managerial perspective, the insights from these models can inform targeted interventions. For example, customers with high 'PenaltyFee' are less likely to churn, suggesting that retention efforts could focus on reviewing fee structures. High usage metrics indicate potential dissatisfaction, signaling areas for service improvement. Strategies can be tailored based on these insights to proactively engage customers, enhance satisfaction, and ultimately reduce churn rates. The chosen model should facilitate these goals by providing a balance of accuracy, speed, and clarity on the key drivers of churn, enabling the business to act effectively on the predictive insights.