Diabetes Prediction with KNN Classifier
- saman aboutorab
- Jan 3, 2024
- 1 min read
Updated: Jan 8, 2024
In healthcare and data-driven diagnostics, predicting the likelihood of diabetes has become an important goal. This post treats it as a binary classification problem: given patient features such as Body Mass Index (BMI) and age in years, predict whether or not an individual has diabetes. We compare two approaches, K-Nearest Neighbors (KNN) and logistic regression, each with its own strengths for this kind of problem. Identifying at-risk individuals early is a step toward more proactive and personalized healthcare interventions.
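
To build intuition for how KNN classifies, here is a tiny sketch with made-up [BMI, age] values (purely illustrative, not from the dataset used below): a new point is assigned the majority class among its k nearest training points.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: [BMI, age] pairs with made-up 0/1 labels
X_toy = np.array([[22, 25], [31, 45], [35, 52], [24, 30], [29, 60]])
y_toy = np.array([0, 1, 1, 0, 1])

# With k=3, the prediction is the majority label of the 3 nearest points
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)
print(knn_toy.predict([[30, 50]]))  # [1] -- all 3 nearest neighbors are labelled 1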

Import Libraries
# Import Pandas
import pandas as pd
# Import Numpy
import numpy as np
# Import Matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Import confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
# Import roc_curve
from sklearn.metrics import roc_curve
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

Dataset
diabetes_df = pd.read_csv('diabetes_clean.csv')
print(diabetes_df.head(5))
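
Before modelling, it is also worth checking the shape and class balance; a quick sketch, assuming the 0/1 target column 'diabetes' used throughout this post:

# Dataset dimensions and class balance
print(diabetes_df.shape)
print(diabetes_df['diabetes'].value_counts(normalize=True))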

Split data to train and test
X = diabetes_df.drop('diabetes', axis=1)
y = diabetes_df['diabetes'].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
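
Because stratify=y is passed, the positive-class proportion should be nearly identical across the full data, the training set, and the test set; a quick sanity check, assuming y is the 0/1 array created above:

# Fraction of diabetes cases in each split
print(y.mean(), y_train.mean(), y_test.mean())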

KNN model
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the model to the training data
knn.fit(X_train, y_train)
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
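
The choice of n_neighbors=6 is somewhat arbitrary. One common way to pick k is to compare train and test accuracy over a range of values and look for where the test curve peaks; a sketch reusing the split above:

# Sweep k and record train/test accuracy for each value
neighbors = np.arange(1, 13)
train_accuracies, test_accuracies = [], []
for k in neighbors:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    train_accuracies.append(knn_k.score(X_train, y_train))
    test_accuracies.append(knn_k.score(X_test, y_test))

plt.plot(neighbors, train_accuracies, label='Train accuracy')
plt.plot(neighbors, test_accuracies, label='Test accuracy')
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.legend()
plt.show()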

Confusion Matrix
# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
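
As a reminder of what these numbers mean: for a binary problem, confusion_matrix returns [[TN, FP], [FN, TP]], and precision and recall follow directly from those counts; a sketch of the arithmetic:

# Recompute precision and recall by hand from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # of predicted positives, the fraction that are correct
recall = tp / (tp + fn)     # of actual positives, the fraction that are found
print(precision, recall)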

Logistic Regression model
# Instantiate the model
logreg = LogisticRegression()
# Fit the model
logreg.fit(X_train, y_train)
# Predict probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[:10])

[0.62560803 0.10510601 0.2681658 0.29003726 0.00409551 0.1892188 0.46169468 0.92877036 0.10115643 0.78211143]
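
Each value is the model's estimated probability that the corresponding individual has diabetes; calling .predict() is equivalent to thresholding these probabilities at 0.5, as this small sketch shows:

# Label predictions are the probabilities thresholded at 0.5
y_pred_logreg = (y_pred_probs >= 0.5).astype(int)
print(y_pred_logreg[:10])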

ROC Curve
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
# Plot the chance line (a model guessing at random)
plt.plot([0, 1], [0, 1], 'k--')
# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
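
Each point on the ROC curve corresponds to one probability threshold: lowering the threshold raises the true positive rate at the cost of more false positives. A sketch inspecting a few of the values returned by roc_curve:

# Print every 20th (threshold, FPR, TPR) triple along the curve
for t, f, r in list(zip(thresholds, fpr, tpr))[::20]:
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={r:.2f}")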

Score
# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))
# Predict test-set labels with the logistic regression model
y_pred = logreg.predict(X_test)
# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))
# Calculate the classification report
print(classification_report(y_test, y_pred))

Conclusion
The logistic regression model performs better than the KNN model across all of the metrics calculated here. A ROC AUC score of 0.8002 means the model is 60% better than a chance model at correctly predicting labels.
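
A single train/test split can be noisy, so as a sanity check on this conclusion the two models can also be compared with cross-validated AUC on the training data; a sketch reusing the models defined above:

from sklearn.model_selection import cross_val_score

# Mean cross-validated ROC AUC for each model
for name, model in [('KNN', knn), ('Logistic Regression', logreg)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(name, scores.mean())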

GridSearchCV
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}
# Lasso model
lasso = Lasso()
# KFold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)
# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_)) Tuned lasso paramaters: {'alpha': 0.05264105263157895} Tuned lasso score: 0.2651011761660329 |
Conclusion
Unfortunately, the best model only achieves an R-squared score of roughly 0.27, highlighting that finding the optimal hyperparameters does not guarantee a high-performing model!