Telecom Customer Churn Prediction
Supervised Learning Capstone Project
In this notebook, telecom customer data was read in to determine whether customer churn can be predicted. As shown below, both random forest and logistic regression modelling yielded similar results with accuracies of ~80% on the test set data.
One key insight from the data was also that customers with month-to-month contracts are more likely to churn than other customers. In this subset of customers, the shorter tenure a customer has the higher they are to churn.
Note that the code for this project can also be found at the following Github repository: here
Please note that this project was done as part of the 2021 Python for Machine Learning & Data Science Masterclass on Udemy
Summary of Results
Based on the models run, customer churn can be predicted with ~79% accuracy via a random forest or logistic regression model.
From our EDA, it appears that contract type in particular can be important in predicting churn. Specifically, customers who are on a month to month plan are more likely to churn than other contract types, and especially those who have had plans for 0-12 months.
In the future, a company could use models to predict whether a customer is likely to churn, enact an intervention strategy to prevent churn, and optimize business strategy to proactively minimize churn.
Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
EDA
Load Data
First I load in the relevant data.
df = pd.read_csv('./DATA/Telco-Customer-Churn.csv')
Data Structure And Information Exploration
After loading in the data I look at the data structure, check for null values to determine whether imputation/deletion is required, and view column descriptive statistics to get a high-level summary of the quantitative data.
df.head()
Data Structure And Information Exploration
After loading in the data I look at the data structure, check for null values to determine whether imputation/deletion is required, and view column descriptive statistics to get a high-level summary of the quantitative data.
df.head()
customerID | gender | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | … | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | No | 1 | No | No phone service | DSL | No | … | 29.85 | 29.85 | No |
1 | Male | No | 34 | Yes | No DSL | Yes | Yes | … | 56.95 | 1889.50 | No |
2 | Male | No | 2 | Yes | No DSL | Yes | Yes | … | 53.85 | 108.15 | Yes |
3 | Male | No | 45 | No | No phone service | DSL | Yes | … | 42.30 | 1840.75 | No |
4 | Female | No | 2 | Yes | No Fiber optic | No | Yes | … | 70.70 | 151.65 | Yes |
df.info()
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | customerID | 7032 non-null | object |
1 | gender | 7032 non-null | object |
2 | SeniorCitizen | 7032 non-null | int64 |
3 | Partner | 7032 non-null | object |
4 | Dependents | 7032 non-null | object |
5 | tenure | 7032 non-null | int64 |
6 | PhoneService | 7032 non-null | object |
7 | MultipleLines | 7032 non-null | object |
8 | InternetService | 7032 non-null | object |
9 | OnlineSecurity | 7032 non-null | object |
10 | OnlineBackup | 7032 non-null | object |
11 | DeviceProtection | 7032 non-null | object |
12 | TechSupport | 7032 non-null | object |
13 | StreamingTV | 7032 non-null | object |
14 | StreamingMovies | 7032 non-null | object |
15 | Contract | 7032 non-null | object |
16 | PaperlessBilling | 7032 non-null | object |
17 | PaymentMethod | 7032 non-null | object |
18 | MonthlyCharges | 7032 non-null | float64 |
19 | TotalCharges | 7032 non-null | float64 |
20 | Churn | 7032 non-null | object |
Based on the above, I see that there are no null values in the data and thus no imputation/deletion is needed.
df.describe()
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
---|---|---|---|---|
count | 7032.000000 | 7032.000000 | 7032.000000 | 7032.000000 |
mean | 0.162400 | 32.421786 | 64.798208 | 2283.300441 |
std | 0.368844 | 24.545260 | 30.085974 | 2266.771362 |
min | 0.000000 | 1.000000 | 18.250000 | 18.800000 |
25% | 0.000000 | 9.000000 | 35.587500 | 401.450000 |
50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
75% | 0.000000 | 55.000000 | 89.862500 | 3794.737500 |
max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
Visual Exploration
Now that I have a better sense of data structure and overall statistics, I can explore the data visually.
First I will drop the customerID column since this is unique for each row and not useful for classification or visualization.
df.drop('customerID', axis=1, inplace=True)
Next I will generate a countplot of customer churn to see whether the target data is imbalanced.
plt.figure(figsize=(15,5))
sns.countplot(data=df, x='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.ylabel("Count")
plt.xlabel("Churn Status")
Based on the above plot, I see that the data is imbalanced, with ~2.5x No’s than Yes’
Next I will plot customer churn vs TotalCharges via a violin plot. This can help me understand the distribution of the target variable and if there are any trends or target areas for further analysis.
plt.figure(figsize=(15,5))
sns.violinplot(data=df,y='TotalCharges',x='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Churn Status")
plt.ylabel("Total Charges ($)")
In the above, I see that there is a jump in TotalChares at ~$1000.
Next, I will plot contract types vs TotalCharges, with a hue of Churn, in boxplots. This can help me determine whether contract type appears to have an influence on churn.
plt.figure(figsize=(15,5))
sns.boxplot(data=df,x='Contract',y='TotalCharges',hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Contract Type")
plt.ylabel("Total Charges ($)")
Finally, I will create a correlation matrix for features with the churn variable
#use df.head as a refresher of the df structure and columns
df.head()
customerID | gender | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | … | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | No | 1 | No | No phone service | DSL | No | … | 29.85 | 29.85 | No |
1 | Male | No | 34 | Yes | No DSL | Yes | Yes | … | 56.95 | 1889.50 | No |
2 | Male | No | 2 | Yes | No DSL | Yes | Yes | … | 53.85 | 108.15 | Yes |
3 | Male | No | 45 | No | No phone service | DSL | Yes | … | 42.30 | 1840.75 | No |
4 | Female | No | 2 | Yes | No Fiber optic | No | Yes | … | 70.70 | 151.65 | Yes |
corr_df = pd.get_dummies(df[['gender', 'SeniorCitizen', 'Partner', 'Dependents','PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod','Churn']]).corr()
corr_df['Churn_Yes'].sort_values().iloc[1:-1]
|Column|Correlation| |—|—| |Contract_Two year |-0.301552| |StreamingMovies_No internet service |-0.227578| |StreamingTV_No internet service |-0.227578| |TechSupport_No internet service |-0.227578| |DeviceProtection_No internet service |-0.227578| |OnlineBackup_No internet service |-0.227578| |OnlineSecurity_No internet service |-0.227578| |InternetService_No |-0.227578| |PaperlessBilling_No |-0.191454| |Contract_One year |-0.178225| |OnlineSecurity_Yes |-0.171270| |TechSupport_Yes |-0.164716| |…|…| |SeniorCitizen | 0.150541| |Dependents_No | 0.163128| |PaperlessBilling_Yes | 0.191454| |DeviceProtection_No | 0.252056| |OnlineBackup_No | 0.267595| |PaymentMethod_Electronic check | 0.301455| |InternetService_Fiber optic | 0.307463| |TechSupport_No | 0.336877| |OnlineSecurity_No | 0.342235| |Contract_Month-to-month | 0.404565|
Plot the correlation of features with churn
plt.figure(figsize=(15,5))
plt.xticks(rotation=90)
sns.barplot(x=corr_df['Churn_Yes'].sort_values().iloc[1:-1].index, y=corr_df['Churn_Yes'].sort_values().iloc[1:-1].values)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Variables")
plt.ylabel("Correlation with Churn")
Based on the above, contract month-to-month appears to have the highest correlation to churn. Let’s conduct more analysis with the contract features below.
Churn Analysis
Now that I have explored the data, I can begin to analyze churn in the dataset.
Tenure and Contract Type Anaysis
First I will analyze tenure and contract type as they relate to churn, since churn is correlated highly with monthly contracts and we want customers to have higher tenure with the company.
To start, I will confirm the different contract types.
df['Contract'].unique()
# ['Month-to-month', 'One year', 'Two year']
Next I will create a historgram displaying the distribution of the tenure column, which is the amount of time a customer has been / was a customer.
plt.figure(figsize=(15,5))
sns.histplot(data=df,x='tenure',bins=25)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure Bins (years)")
plt.ylabel("Count")
There is a wide distribution of tenure in this dataset, with several apparent spikes around 3 months and 70 months.
Next, I will plot tenure partitioned by each contract type and customer churn target values.
plt.figure(figsize=(15,5),dpi=200)
sns.displot(data=df,x='tenure',bins=70,col='Contract',row='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
Based on the above, it would appear that customers with month-to-month contracts tend to have high churn early and then relatively small churn numbers thereafter.
For one and two year contracts, there does not appear to be a spike or pattern indicating how long it take these customers to churn on average.
Monthly Charge Analysis
In addition to tenure, another characteristic at our disposal is the monthly charge figure. To see whether monthly charges have an impact on churn, I make a scatter plot of Total Charges vs Monthly Charges, and color hue by Churn.
plt.figure(figsize=(15,5),dpi=200)
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Monthly Charges ($)")
plt.ylabel("Total Charges ($)")
As seen above, customers are shown to churn at many different MonthlyCharges values. However, there does appear to be more churn as MonthlyCharges get higher.
Creating Cohorts based on Tenure
Next, I will treat each tenure group as a cohort, and calculate the Churn rate per cohort.
no_churn = df.groupby(['Churn', 'tenure']).count().transpose()['No']
yes_churn = df.groupby(['Churn', 'tenure']).count().transpose()['Yes']
churn_rate = 100*yes_churn / (no_churn+yes_churn)
churn_rate.transpose()['gender']
# | tenure |
---|---|
1 | 61.990212 |
2 | 51.680672 |
3 | 47.000000 |
4 | 47.159091 |
5 | 48.120301 |
… | … |
68 | 9.000000 |
69 | 8.421053 |
70 | 9.243697 |
71 | 3.529412 |
72 | 1.657459 |
With the above data created, I will now plot churn rate per month.
plt.figure(figsize=(15,5))
churn_rate.iloc[0].plot()
plt.ylabel('Churn Rate')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure (years)")
As seen above, it appears that the generally, higher tenure correlates with lower churn rates.
Based on the tenure column values above, I can create a new column called Tenure Cohort that create 4 categories:
- 0-12 months
- 12-24 months
- 24-48 months
- over 48 months
def tenure_cohort(tenure):
if tenure<13:
return '0-12 months'
elif tenure<25:
return '12-24 months'
elif tenure<49:
return '24-48 months'
return 'over 48 months'
df['tenure_cohort'] = df['tenure'].apply(tenure_cohort)
With the new cohort column created, I can now create a scatter plot of total charges vs monthly costs, colored by tenure cohort
plt.figure(figsize=(10,5),dpi=200)
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='tenure_cohort', alpha=0.5)
sns.set(rc={'figure.facecolor':'white'})
plt.grid(False)
plt.xlabel("Monthly Charges ($)")
plt.ylabel("Monthly Charges ($)")
The chart above appears to show that customers with higher tenure tend to have higher TotalCharges and MonthlyCharges. The TotalCharges trend makes sense, since older customers have had more pay cycles and thus have higher cumulative charges on their accounts.
There could be several explanations for why newer customers tend to have lower rates, including that they may get low promotional rates to join the company and they haven’t had as much time to have their monthly rates increased.
Since it appears that the cohorts are in clear ‘bands’ above, I will create a count plot showing the churn count per cohort to better quantify this trend.
plt.figure(figsize=(15,5),dpi=200)
sns.countplot(data=df,x='tenure_cohort',hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure Cohort")
plt.ylabel("Count")
plt.figure(figsize=(15,5),dpi=200)
g = sns.catplot(x="tenure_cohort", hue="Churn", col="Contract",data=df, kind="count")
sns.set(rc={'figure.facecolor':'white'})
From the above two plots, I see that churn numbers are highest within the first 12 months of tenure.
Further, much of the churn both in the first 12 months and overall occurs in month-to-month contracts.
Predicting Customer Churn
After exploring and analyzing the provided data, I can now create a predictive model to help the telecom company identify likelihood of churn and perform an intervention / program to provent churn.
Prepare the data for Modelling
First I split the data into input and target feature sets
X=df.drop('Churn', axis=1)
y=df['Churn']
Next I get dummies for categorical values so that they can be included in machine learning model processes.
X = pd.get_dummies(X,drop_first=True)
I perform a train test split of the data so that data can be used to both train a model and validate it’s effectiveness.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Some models, like random forests, do not require feature scaling. Scaling will be performed for models that require it in the below.
Random Forest Modelling
Import accuracy metrics and model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, plot_confusion_matrix,classification_report
Perform grid search to find optimal params
rf_model = RandomForestClassifier()
param_grid = {'n_estimators':[50,100], 'max_depth': [2,4,6,8,10]}
grid = GridSearchCV(rf_model,param_grid)
grid.fit(X_train, y_train)
grid.best_params_
# max_depth: 8, estimators: 100
Use the best parameters to create a random forest model and assess its accuracy/metrics.
plt.figure(figsize=(15,5))
plot_confusion_matrix(grid,X_test,y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
accuracy_score(y_test,preds)
accuracy: 0.7933649289099526
print(classification_report(y_test,preds))
. | precision | recall | f1-score | support |
---|---|---|---|---|
No | 0.82 | 0.91 | 0.87 | 1549 |
Yes | 0.66 | 0.46 | 0.54 | 561 |
accuracy | … | … | 0.79 | 2110 |
macro avg | 0.74 | 0.69 | 0.70 | 2110 |
weighted avg | 0.78 | 0.79 | 0.78 | 2110 |
From the above, we see that a random forest model is ~79% accurate overall.
Logistic Regression
Now I will try implementing a logistic regression model on this data, to see whether we can get better results than the above.
The data is already split, however we need to scale our data so the model can run optimally, since regressors are more sensitive to data being placed at different scales.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
log_model = LogisticRegression()
penalty = ['l1','l2','elastic_net']
l1_ratio = np.linspace(0,10,20)
C = np.logspace(0,10,20)
param_grid = {'l1_ratio':l1_ratio, 'penalty':penalty, 'C':C}
grid_model = GridSearchCV(log_model, param_grid=param_grid)
grid_model.fit(scaled_X_train,y_train)
grid_model.best_params_
# {'C': 11.28837891684689, 'l1_ratio': 0.0, 'penalty': 'l2'}
y_pred = grid_model.predict(scaled_X_test)
accuracy_score(y_test,y_pred)
# accuracy: 0.7919431279620853
plt.figure(figsize=(15,5))
plot_confusion_matrix(grid_model,scaled_X_test,y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
print(classification_report(y_test,y_pred))
. | precision | recall | f1-score | support |
---|---|---|---|---|
No | 0.83 | 0.90 | 0.86 | 1549 |
Yes | 0.64 | 0.49 | 0.56 | 561 |
accuracy | … | … | 0.79 | 2110 |
macro avg | 0.74 | 0.70 | 0.71 | 2110 |
weighted avg | 0.78 | 0.79 | 0.78 | 2110 |
I see very similar accuracy scores and confusion matrix results for both random forest and logistic regression models.
KNN
Finally, I conduct a KNN model.
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
param_grid_knn = {'n_neighbors':[2,5,7,10]}
grid_knn = GridSearchCV(knn_clf,param_grid_knn)
grid_knn.fit(scaled_X_train,y_train)
grid_knn.best_params_
# {'n_neighbors': 7}
knn_preds = grid_knn.predict(scaled_X_test)
plt.figure(figsize=(15,5))
plot_confusion_matrix(grid_knn,scaled_X_test,y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
print(classification_report(y_test,knn_preds))
. | precision | recall | f1-score | support |
---|---|---|---|---|
No | 0.83 | 0.86 | 0.85 | 1549 |
Yes | 0.58 | 0.52 | 0.55 | 561 |
accuracy | … | … | 0.77 | 2110 |
macro avg | 0.70 | 0.69 | 0.70 | 2110 |
weighted avg | 0.76 | 0.77 | 0.77 | 2110 |
The KNN model performs slightly worse than our other two models.
Results
Summary of Model Performance
Model | Weighted Avg Precision | Weighted Avg Recall | Weighted Avg F1-score | Accuracy |
---|---|---|---|---|
Random Forest | 0.78 | 0.79 | 0.78 | 0.79 |
Logistic Regression | 0.78 | 0.79 | 0.78 | 0.79 |
KNN | 0.76 | 0.77 | 0.77 | 0.77 |
Conclusion
Based on the models run, customer churn can be predicted with ~79% accuracy via a random forest or logistic regression model.
From our EDA, it appears that contract type in particular can be important in predicting churn. Specifically, customers who are on a month to month plan are more likely to churn than other contract types, and especially those who have had plans for 0-12 months.
Next Steps
In the future, a company could use one of the above models to predict whether a customer is likely to churn. With this information, the company could then decide how (and whether) to intervene and prevent the customer from cancelling service.
The company could also evaluate the factors most correlated with churn and determine whether they can alter their strategy and reduce churn from particular components of the business.