Customer Service Churn with KNN
- saman aboutorab
- Jan 3, 2024
- 2 min read
Updated: Jan 8, 2024
Embarking on a journey into the realm of customer service call data analysis, our project aims to construct a robust classification model using the churn_df dataset. Focused on predicting customer churn, we've carefully selected two key features, namely "account_length" and "customer_service_calls," as the pillars of our predictive framework. The target variable, aptly named "churn," will serve as the compass guiding our model's discernment between customer retention and attrition.

To initiate the model-building process, our first step involves transforming the selected features and the target variable into NumPy arrays. This conversion lays the foundation for leveraging the power of numerical computation and array manipulation in our subsequent analysis. The seamless integration of these arrays into the model creation pipeline is vital for ensuring compatibility with the algorithms we'll employ.
Central to our predictive journey is the utilization of the K-Nearest Neighbors (KNN) classifier, a versatile algorithm renowned for its simplicity and effectiveness in classification tasks. By creating an instance of the KNN classifier, we position ourselves to harness its ability to identify patterns and relationships within the given dataset.
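To make this intuition concrete, the short sketch below illustrates the core idea behind KNN from scratch: for a new observation, compute the distance to every training point, take the k closest, and let them vote on the label. The helper knn_predict_one and the toy arrays are purely illustrative and are not part of the project code.

import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels (0 = stays, 1 = churns)
    return np.bincount(y_train[nearest]).argmax()

# Toy data: [account_length, customer_service_calls]
X_toy = np.array([[100, 1], [120, 2], [90, 7], [80, 8]])
y_toy = np.array([0, 0, 1, 1])
# Two of the three nearest toy points churned, so the vote here is 1
print(knn_predict_one(X_toy, y_toy, np.array([85, 6])))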
Once armed with our KNN classifier, the subsequent stride involves fitting it to the dataset. This process entails training the model on the provided data, allowing it to learn the inherent patterns and associations between the features and the target variable. Through this immersive learning experience, the classifier becomes adept at making informed predictions on customer churn, paving the way for proactive measures in customer retention strategies.
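It is worth noting that KNN is a lazy learner: fitting mostly amounts to storing the training data (optionally indexed in a tree structure), and the real work happens at prediction time when neighbors are looked up. The self-contained sketch below, again on purely illustrative toy arrays rather than the project data, uses the classifier's kneighbors method to show which stored points drive a prediction.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Toy data: [account_length, customer_service_calls]
X_toy = np.array([[100, 1], [120, 2], [90, 7], [80, 8]])
y_toy = np.array([0, 0, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
# Distances to, and indices of, the three stored points closest to the query
distances, indices = knn_toy.kneighbors(np.array([[85, 6]]))
print(indices)
print(knn_toy.predict(np.array([[85, 6]])))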
As we navigate through the intricacies of this project, the amalgamation of data transformation, algorithmic selection, and model training underscores our commitment to developing a robust classification model capable of discerning the subtleties of customer behavior in the context of service call interactions.
Import Libraries
# Import Pandas
import pandas as pd
# Import Numpy
import numpy as np
# Import Matplotlib
import matplotlib.pyplot as plt
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
# Import the module
from sklearn.model_selection import train_test_split

Import dataset
# Read CSV dataset
churn_df = pd.read_csv('datasets/telecom_churn_clean.csv')
# Print dataset
print(churn_df.head(5))

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

k-Nearest Neighbors: Fit
# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(X, y)

# Create a sample of new observations, X_new
X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])
# Predict the labels for X_new
y_pred = knn.predict(X_new)

k-Nearest Neighbors: Predict
# Print the predictions for X_new
print("Predictions: {}".format(y_pred))

Predictions: [0 1 0]

Train/test split
X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
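Because we pass stratify=y, the churn rate should be roughly the same in the training and test sets. The quick check below is a minimal sketch that assumes the arrays created by the split above and simply prints the two class proportions so they can be compared.

# Churn rates in the training and test sets should match closely thanks to stratify=y
print(y_train.mean(), y_test.mean())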
knn = KNeighborsClassifier(n_neighbors=5)

Fit the model
# Fit the classifier to the training data
knn.fit(X_train, y_train)

Accuracy
# Print the accuracy
print(knn.score(X_test, y_test))

0.8545727136431784

Overfitting and underfitting
# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}
for neighbor in neighbors:
    # Set up a KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

[ 1  2  3  4  5  6  7  8  9 10 11 12]
{1: 1.0, 2: 0.8885971492873218, 3: 0.8994748687171793, 4: 0.8750937734433608, 5: 0.878469617404351, 6: 0.8660915228807202, 7: 0.8705926481620405, 8: 0.8615903975993998, 9: 0.86384096024006, 10: 0.858589647411853, 11: 0.8604651162790697, 12: 0.8574643660915229}
{1: 0.7856071964017991, 2: 0.8470764617691154, 3: 0.8320839580209896, 4: 0.856071964017991, 5: 0.8545727136431784, 6: 0.8590704647676162, 7: 0.8605697151424287, 8: 0.8620689655172413, 9: 0.863568215892054, 10: 0.8605697151424287, 11: 0.8605697151424287, 12: 0.8605697151424287}

Visualizing model complexity
# Add a title
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")
# Plot test accuracies
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
# Display the plot
plt.show()
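Before drawing conclusions from the plot, it can also help to read the best value of k directly from the accuracy dictionaries. The snippet below is a small follow-up sketch that assumes test_accuracies, built in the loop above, is still in scope.

# Find the number of neighbors with the highest test accuracy
best_k = max(test_accuracies, key=test_accuracies.get)
print("Best k:", best_k, "with test accuracy", test_accuracies[best_k])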
Conclusion
As can be seen in the chart above, training accuracy decreases as the number of neighbors grows, while test accuracy generally improves before levelling off. For the test set, accuracy peaks at 9 neighbors, suggesting this is the optimal value for our model.