Telecom Customer Churn Prediction

Supervised Learning Capstone Project

In this notebook, telecom customer data is analyzed to determine whether customer churn can be predicted. As shown below, both random forest and logistic regression modelling yielded similar results, with accuracies of ~79% on the test set data.

One key insight from the data was that customers with month-to-month contracts are more likely to churn than other customers. Within this subset of customers, the shorter a customer's tenure, the more likely they are to churn.

Note that the code for this project can also be found in the following GitHub repository: here

Please note that this project was done as part of the 2021 Python for Machine Learning & Data Science Masterclass on Udemy.

Summary of Results

Based on the models run, customer churn can be predicted with ~79% accuracy using either a random forest or a logistic regression model.

From the EDA, it appears that contract type in particular is important in predicting churn. Specifically, customers on a month-to-month plan are more likely to churn than customers on other contract types, especially those who have had their plans for 0-12 months.

In the future, a company could use models to predict whether a customer is likely to churn, enact an intervention strategy to prevent churn, and optimize business strategy to proactively minimize churn.

Import required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

EDA

Load Data

First I load in the relevant data.

df = pd.read_csv('./DATA/Telco-Customer-Churn.csv')

Data Structure And Information Exploration

After loading in the data I look at the data structure, check for null values to determine whether imputation/deletion is required, and view column descriptive statistics to get a high-level summary of the quantitative data.

df.head()

| | gender | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | … | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Female | No | 1 | No | No phone service | DSL | No | … | 29.85 | 29.85 | No |
| 1 | Male | No | 34 | Yes | No | DSL | Yes | … | 56.95 | 1889.50 | No |
| 2 | Male | No | 2 | Yes | No | DSL | Yes | … | 53.85 | 108.15 | Yes |
| 3 | Male | No | 45 | No | No phone service | DSL | Yes | … | 42.30 | 1840.75 | No |
| 4 | Female | No | 2 | Yes | No | Fiber optic | No | … | 70.70 | 151.65 | Yes |

(output truncated; customerID and several middle service columns are elided)
df.info()
# Column Non-Null Count Dtype
0 customerID 7032 non-null object
1 gender 7032 non-null object
2 SeniorCitizen 7032 non-null int64
3 Partner 7032 non-null object
4 Dependents 7032 non-null object
5 tenure 7032 non-null int64
6 PhoneService 7032 non-null object
7 MultipleLines 7032 non-null object
8 InternetService 7032 non-null object
9 OnlineSecurity 7032 non-null object
10 OnlineBackup 7032 non-null object
11 DeviceProtection 7032 non-null object
12 TechSupport 7032 non-null object
13 StreamingTV 7032 non-null object
14 StreamingMovies 7032 non-null object
15 Contract 7032 non-null object
16 PaperlessBilling 7032 non-null object
17 PaymentMethod 7032 non-null object
18 MonthlyCharges 7032 non-null float64
19 TotalCharges 7032 non-null float64
20 Churn 7032 non-null object

Based on the above, I see that there are no null values in the data and thus no imputation/deletion is needed.
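This can also be confirmed directly with a per-column count of missing values; a minimal sketch:

# All-zero output confirms no imputation or deletion is needed
df.isnull().sum()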

df.describe()

| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- | --- |
| count | 7032.000000 | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 0.162400 | 32.421786 | 64.798208 | 2283.300441 |
| std | 0.368844 | 24.545260 | 30.085974 | 2266.771362 |
| min | 0.000000 | 1.000000 | 18.250000 | 18.800000 |
| 25% | 0.000000 | 9.000000 | 35.587500 | 401.450000 |
| 50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 0.000000 | 55.000000 | 89.862500 | 3794.737500 |
| max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |

Visual Exploration

Now that I have a better sense of data structure and overall statistics, I can explore the data visually.

First I will drop the customerID column since this is unique for each row and not useful for classification or visualization.

df.drop('customerID', axis=1, inplace=True)

Next I will generate a countplot of customer churn to see whether the target data is imbalanced.

plt.figure(figsize=(15,5))
sns.countplot(data=df, x='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.ylabel("Count")
plt.xlabel("Churn Status")

Based on the above plot, I see that the data is imbalanced, with roughly 2.5 times as many 'No' values as 'Yes' values.
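To quantify the imbalance exactly, the class proportions can be computed directly; a quick sketch:

# Proportion of customers in each churn class (~0.73 'No' vs ~0.27 'Yes')
df['Churn'].value_counts(normalize=True)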

Next I will plot customer churn vs TotalCharges via a violin plot. This can help me understand how TotalCharges is distributed for each target value and whether there are any trends or target areas for further analysis.

plt.figure(figsize=(15,5))
sns.violinplot(data=df,y='TotalCharges',x='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Churn Status")
plt.ylabel("Total Charges ($)")

In the above, I see that there is a jump in TotalCharges at ~$1000.

Next, I will plot contract types vs TotalCharges, with a hue of Churn, in boxplots. This can help me determine whether contract type appears to have an influence on churn.

plt.figure(figsize=(15,5))
sns.boxplot(data=df,x='Contract',y='TotalCharges',hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Contract Type")
plt.ylabel("Total Charges ($)")

Finally, I will create a correlation matrix of the features with the churn variable.

# Use df.head() as a refresher of the df structure and columns
df.head()

| | gender | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | … | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Female | No | 1 | No | No phone service | DSL | No | … | 29.85 | 29.85 | No |
| 1 | Male | No | 34 | Yes | No | DSL | Yes | … | 56.95 | 1889.50 | No |
| 2 | Male | No | 2 | Yes | No | DSL | Yes | … | 53.85 | 108.15 | Yes |
| 3 | Male | No | 45 | No | No phone service | DSL | Yes | … | 42.30 | 1840.75 | No |
| 4 | Female | No | 2 | Yes | No | Fiber optic | No | … | 70.70 | 151.65 | Yes |

(output truncated; several middle service columns are elided)
# One-hot encode the selected categorical columns (numeric SeniorCitizen passes through)
# and compute pairwise correlations
corr_df = pd.get_dummies(df[['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
                             'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                             'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                             'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']]).corr()
# Correlations with Churn_Yes, sorted; iloc[1:-1] drops Churn_No and Churn_Yes themselves
corr_df['Churn_Yes'].sort_values().iloc[1:-1]

| Column | Correlation |
| --- | --- |
| Contract_Two year | -0.301552 |
| StreamingMovies_No internet service | -0.227578 |
| StreamingTV_No internet service | -0.227578 |
| TechSupport_No internet service | -0.227578 |
| DeviceProtection_No internet service | -0.227578 |
| OnlineBackup_No internet service | -0.227578 |
| OnlineSecurity_No internet service | -0.227578 |
| InternetService_No | -0.227578 |
| PaperlessBilling_No | -0.191454 |
| Contract_One year | -0.178225 |
| OnlineSecurity_Yes | -0.171270 |
| TechSupport_Yes | -0.164716 |
| … | … |
| SeniorCitizen | 0.150541 |
| Dependents_No | 0.163128 |
| PaperlessBilling_Yes | 0.191454 |
| DeviceProtection_No | 0.252056 |
| OnlineBackup_No | 0.267595 |
| PaymentMethod_Electronic check | 0.301455 |
| InternetService_Fiber optic | 0.307463 |
| TechSupport_No | 0.336877 |
| OnlineSecurity_No | 0.342235 |
| Contract_Month-to-month | 0.404565 |

Next, I plot the correlation of each feature with churn:

plt.figure(figsize=(15,5))
plt.xticks(rotation=90)
sns.barplot(x=corr_df['Churn_Yes'].sort_values().iloc[1:-1].index, y=corr_df['Churn_Yes'].sort_values().iloc[1:-1].values)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Variables")
plt.ylabel("Correlation with Churn")

Based on the above, Contract_Month-to-month has the highest positive correlation with churn. Let's conduct more analysis of the contract features below.

Churn Analysis

Now that I have explored the data, I can begin to analyze churn in the dataset.

Tenure and Contract Type Analysis

First I will analyze tenure and contract type as they relate to churn, since churn correlates strongly with month-to-month contracts and we want customers to stay with the company longer.

To start, I will confirm the different contract types.

df['Contract'].unique()
# ['Month-to-month', 'One year', 'Two year']

Next I will create a histogram displaying the distribution of the tenure column, which is the number of months a customer has been (or was) with the company.

plt.figure(figsize=(15,5))
sns.histplot(data=df,x='tenure',bins=25)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure Bins (years)")
plt.ylabel("Count")

There is a wide distribution of tenure in this dataset, with apparent spikes around 3 months and around 70 months.

Next, I will plot tenure partitioned by each contract type and customer churn target values.

# displot creates its own FacetGrid figure, so no plt.figure call is needed here
sns.displot(data=df, x='tenure', bins=70, col='Contract', row='Churn')
sns.set(rc={'figure.facecolor':'white'})

Based on the above, it would appear that customers with month-to-month contracts tend to have high churn early and then relatively small churn numbers thereafter.

For one- and two-year contracts, there does not appear to be a spike or pattern indicating how long it takes these customers to churn on average.

Monthly Charge Analysis

In addition to tenure, another characteristic at our disposal is the monthly charge figure. To see whether monthly charges have an impact on churn, I make a scatter plot of Total Charges vs Monthly Charges, and color hue by Churn.

plt.figure(figsize=(15,5),dpi=200)
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Monthly Charges ($)")
plt.ylabel("Total Charges ($)")

As seen above, customers are shown to churn at many different MonthlyCharges values. However, there does appear to be more churn as MonthlyCharges get higher.

Creating Cohorts based on Tenure

Next, I will treat each tenure group as a cohort, and calculate the Churn rate per cohort.

# Count customers per (Churn, tenure) combination, split into the 'No' and 'Yes' tables
no_churn = df.groupby(['Churn', 'tenure']).count().transpose()['No']
yes_churn = df.groupby(['Churn', 'tenure']).count().transpose()['Yes']
# Churn rate (%) at each tenure value
churn_rate = 100 * yes_churn / (no_churn + yes_churn)
# Every row of churn_rate holds the same rates, so display a single representative column
churn_rate.transpose()['gender']
# tenure
1     61.990212
2     51.680672
3     47.000000
4     47.159091
5     48.120301
...
68     9.000000
69     8.421053
70     9.243697
71     3.529412
72     1.657459
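As a sanity check, the same per-tenure churn rates can be computed more directly; a minimal sketch that should reproduce the values above:

# Share of churned customers ('Yes') at each tenure value, as a percentage
df.groupby('tenure')['Churn'].apply(lambda s: 100 * (s == 'Yes').mean())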

With the above data created, I will now plot churn rate per month.

plt.figure(figsize=(15,5))
churn_rate.iloc[0].plot()
plt.ylabel('Churn Rate')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure (years)")

As seen above, it appears that, generally, higher tenure correlates with lower churn rates.

Based on the tenure column values above, I can create a new column called tenure_cohort that places each customer into one of 4 categories:

  • 0-12 months
  • 12-24 months
  • 24-48 months
  • over 48 months

# Map a tenure value (in months) to its cohort label
def tenure_cohort(tenure):
    if tenure < 13:
        return '0-12 months'
    elif tenure < 25:
        return '12-24 months'
    elif tenure < 49:
        return '24-48 months'
    return 'over 48 months'

df['tenure_cohort'] = df['tenure'].apply(tenure_cohort)
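Equivalently, pandas' built-in binning could replace the helper function; a sketch of that variant (it yields an ordered categorical rather than plain strings):

# Right-inclusive bin edges: (0, 12], (12, 24], (24, 48], (48, inf)
# Equivalent to the apply() result above
cohorts_via_cut = pd.cut(
    df['tenure'],
    bins=[0, 12, 24, 48, np.inf],
    labels=['0-12 months', '12-24 months', '24-48 months', 'over 48 months'],
)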

With the new cohort column created, I can now create a scatter plot of total charges vs monthly charges, colored by tenure cohort.

plt.figure(figsize=(10,5),dpi=200)
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='tenure_cohort', alpha=0.5)
sns.set(rc={'figure.facecolor':'white'})
plt.grid(False)
plt.xlabel("Monthly Charges ($)")
plt.ylabel("Monthly Charges ($)")

The chart above appears to show that customers with higher tenure tend to have higher TotalCharges and MonthlyCharges. The TotalCharges trend makes sense, since longer-tenured customers have had more pay cycles and thus higher cumulative charges on their accounts.

There could be several explanations for why newer customers tend to have lower rates, including that they may get low promotional rates to join the company and they haven’t had as much time to have their monthly rates increased.

Since it appears that the cohorts are in clear ‘bands’ above, I will create a count plot showing the churn count per cohort to better quantify this trend.

plt.figure(figsize=(15,5),dpi=200)
sns.countplot(data=df,x='tenure_cohort',hue='Churn')
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})
plt.xlabel("Tenure Cohort")
plt.ylabel("Count")

# catplot also creates its own FacetGrid figure, so no plt.figure call is needed here
g = sns.catplot(x="tenure_cohort", hue="Churn", col="Contract", data=df, kind="count")
sns.set(rc={'figure.facecolor':'white'})

From the above two plots, I see that churn numbers are highest within the first 12 months of tenure.

Further, much of the churn both in the first 12 months and overall occurs in month-to-month contracts.

Predicting Customer Churn

After exploring and analyzing the provided data, I can now create a predictive model to help the telecom company identify customers likely to churn and intervene to prevent churn.

Prepare the data for Modelling

First I split the data into input features and the target.

X=df.drop('Churn', axis=1)
y=df['Churn']

Next I get dummies for the categorical columns so that they can be used by the machine learning models.

X = pd.get_dummies(X,drop_first=True)
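For illustration, drop_first=True keeps k-1 indicator columns for a k-level categorical feature, since the dropped level is implied when all remaining indicators are 0; a toy sketch:

# A 3-level category yields only the 'b' and 'c' dummies; 'a' is the implied baseline
pd.get_dummies(pd.Series(['a', 'b', 'c', 'a']), drop_first=True)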

I perform a train test split of the data so that one portion can be used to train a model and the other to validate its effectiveness.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
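Since the classes are imbalanced (~73% 'No'), one optional variant is to stratify the split so both sets keep the same churn ratio; a sketch of that alternative (not what was used for the results below):

# Hypothetical variant: stratify=y preserves the class ratio in both train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)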

Some models, like random forests, do not require feature scaling. Scaling will be performed for models that require it in the below.

Random Forest Modelling

Import the model, grid search utility, and accuracy metrics

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, plot_confusion_matrix,classification_report

Perform grid search to find optimal params

rf_model = RandomForestClassifier()
param_grid = {'n_estimators':[50,100], 'max_depth': [2,4,6,8,10]}
grid = GridSearchCV(rf_model,param_grid)
grid.fit(X_train, y_train)
grid.best_params_
# {'max_depth': 8, 'n_estimators': 100}
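As a quick cross-check against the EDA, the tuned forest's feature importances can be inspected; a minimal sketch (assuming the grid above has been fit; GridSearchCV refits the best model by default):

# Per-feature importances from the refit best estimator, largest first
importances = pd.Series(grid.best_estimator_.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10)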

Use the best parameters (refit automatically within the grid object) to generate predictions and assess accuracy and other metrics.

preds = grid.predict(X_test)

plt.figure(figsize=(15,5))
plot_confusion_matrix(grid, X_test, y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})

accuracy_score(y_test, preds)

# accuracy: 0.7933649289099526

print(classification_report(y_test,preds))
              precision    recall  f1-score   support

          No       0.82      0.91      0.87      1549
         Yes       0.66      0.46      0.54       561

    accuracy                           0.79      2110
   macro avg       0.74      0.69      0.70      2110
weighted avg       0.78      0.79      0.78      2110

From the above, we see that a random forest model is ~79% accurate overall.

Logistic Regression

Now I will try implementing a logistic regression model on this data, to see whether it can improve on the results above.

The data is already split, but it needs to be scaled so the model can run optimally, since logistic regression is sensitive to features being on different scales.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transform to the test data
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# saga is the only solver that supports all of the l1, l2, and elasticnet penalties
log_model = LogisticRegression(solver='saga', max_iter=5000)

penalty = ['l1', 'l2', 'elasticnet']  # note: scikit-learn spells this 'elasticnet'
l1_ratio = np.linspace(0, 1, 20)      # l1_ratio is only valid within [0, 1]
C = np.logspace(0, 10, 20)
param_grid = {'l1_ratio': l1_ratio, 'penalty': penalty, 'C': C}

grid_model = GridSearchCV(log_model, param_grid=param_grid)
grid_model.fit(scaled_X_train,y_train)

grid_model.best_params_
# {'C': 11.28837891684689, 'l1_ratio': 0.0, 'penalty': 'l2'}

y_pred = grid_model.predict(scaled_X_test)

accuracy_score(y_test,y_pred)
# accuracy: 0.7919431279620853

plt.figure(figsize=(15,5))
plot_confusion_matrix(grid_model,scaled_X_test,y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})

print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

          No       0.83      0.90      0.86      1549
         Yes       0.64      0.49      0.56       561

    accuracy                           0.79      2110
   macro avg       0.74      0.70      0.71      2110
weighted avg       0.78      0.79      0.78      2110

I see very similar accuracy scores and confusion matrix results for both random forest and logistic regression models.

KNN

Finally, I fit a K-nearest neighbors (KNN) model.

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
param_grid_knn = {'n_neighbors':[2,5,7,10]}
grid_knn = GridSearchCV(knn_clf,param_grid_knn)
grid_knn.fit(scaled_X_train,y_train)

grid_knn.best_params_
# {'n_neighbors': 7}
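Optionally, sensitivity to k could be examined over a wider range than the small grid above; a sketch of that extension (hypothetical, not part of the tuned model):

# Scan k values and record the test error rate for each fitted KNN model
test_error = []
k_values = range(1, 30)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(scaled_X_train, y_train)
    test_error.append(1 - knn.score(scaled_X_test, y_test))
plt.plot(k_values, test_error)
plt.xlabel('k')
plt.ylabel('Test Error Rate')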

knn_preds = grid_knn.predict(scaled_X_test)

plt.figure(figsize=(15,5))
plot_confusion_matrix(grid_knn,scaled_X_test,y_test)
plt.grid(False)
sns.set(rc={'figure.facecolor':'white'})

print(classification_report(y_test,knn_preds))
              precision    recall  f1-score   support

          No       0.83      0.86      0.85      1549
         Yes       0.58      0.52      0.55       561

    accuracy                           0.77      2110
   macro avg       0.70      0.69      0.70      2110
weighted avg       0.76      0.77      0.77      2110

The KNN model performs slightly worse than our other two models.

Results

Summary of Model Performance

| Model | Weighted Avg Precision | Weighted Avg Recall | Weighted Avg F1-score | Accuracy |
| --- | --- | --- | --- | --- |
| Random Forest | 0.78 | 0.79 | 0.78 | 0.79 |
| Logistic Regression | 0.78 | 0.79 | 0.78 | 0.79 |
| KNN | 0.76 | 0.77 | 0.77 | 0.77 |

Conclusion

Based on the models run, customer churn can be predicted with ~79% accuracy using either a random forest or a logistic regression model.

From the EDA, it appears that contract type in particular is important in predicting churn. Specifically, customers on a month-to-month plan are more likely to churn than customers on other contract types, especially those who have had their plans for 0-12 months.

Next Steps

In the future, a company could use one of the above models to predict whether a customer is likely to churn. With this information, the company could then decide how (and whether) to intervene and prevent the customer from cancelling service.
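For example, churn probabilities from one of the fitted models could rank customers for a retention campaign; a minimal sketch (hypothetical usage, assuming the grid_model logistic regression from above):

# Probability of the positive class for each test-set customer
# (classes_ is sorted alphabetically, so index 1 corresponds to 'Yes')
churn_probs = grid_model.predict_proba(scaled_X_test)[:, 1]
# Flag customers above a chosen risk threshold for outreach
high_risk = churn_probs > 0.5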

The company could also evaluate the factors most correlated with churn and determine whether they can alter their strategy and reduce churn from particular components of the business.