
Diabetes Prediction with KNN Classifier

  • Writer: saman aboutorab
  • Jan 3, 2024
  • 1 min read

Updated: Jan 8, 2024

In healthcare and data-driven diagnostics, predicting the likelihood of diabetes has become an important goal. Here the task is framed as a binary classification problem: given patient features such as Body Mass Index (BMI) and age in years, decide whether an individual is likely to have diabetes. Two methods are applied and compared, K-Nearest Neighbors (KNN) and logistic regression, each bringing its own strengths to this health-related classification problem. Combining statistical and machine learning techniques in this way is a step toward identifying at-risk individuals earlier, paving the way for more proactive and personalized healthcare interventions.





Import Libraries

# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, Lasso

# Model selection utilities
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

Dataset

diabetes_df = pd.read_csv('diabetes_clean.csv')
print(diabetes_df.head(5))
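
Before modelling, it helps to check the dataset's size and how balanced the binary target is. A quick sketch (assuming the target column is named diabetes, as it is used below):

# Dataset dimensions and class balance of the binary target
print(diabetes_df.shape)
print(diabetes_df['diabetes'].value_counts(normalize=True))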

Split data to train and test

X = diabetes_df.drop('diabetes', axis=1)
y = diabetes_df['diabetes'].values

# Split into training and test sets, stratifying so both sets keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

KNN model

# Instantiate a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
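
The choice of n_neighbors=6 is somewhat arbitrary. As a rough sketch (not part of the original workflow), one way to sanity-check it is to compare test accuracy across a range of k values:

# Compare test accuracy for several values of n_neighbors
test_accuracies = {}
for k in range(1, 13):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    test_accuracies[k] = knn_k.score(X_test, y_test)
print(test_accuracies)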

Confusion Matrix

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
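
To make the report easier to interpret, the headline metrics can also be recomputed by hand from the four confusion-matrix counts (a small illustrative sketch using the variables already defined):

# Unpack the 2x2 confusion matrix: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found
print(accuracy, precision, recall)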

Logistic Regression model

# Instantiate the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# Predict probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])

[0.62560803 0.10510601 0.2681658 0.29003726 0.00409551 0.1892188 0.46169468 0.92877036 0.10115643 0.78211143]


ROC Curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--')

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()

Score

# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the classification report
print(classification_report(y_test, y_pred))
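
Note that the confusion matrix and classification report above are still based on the KNN predictions (y_pred). To compare the two models on the same metrics, class predictions are also needed from the logistic regression; a minimal sketch:

# Class predictions from logistic regression (default 0.5 probability threshold)
y_pred_logreg = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))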

Conclusion


The logistic regression performs better than the KNN model across all of the metrics calculated. A ROC AUC score of 0.8002 means this model is 60% better than a chance model at correctly predicting labels.
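
The "60% better than chance" figure comes from comparing the AUC against the 0.5 achieved by random guessing. As a quick check:

# A chance model has ROC AUC = 0.5; measure the relative improvement over it
auc = roc_auc_score(y_test, y_pred_probs)
print((auc - 0.5) / 0.5)  # roughly 0.60 when the AUC is about 0.80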

GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

#Lasso model
lasso = Lasso()

#Kfold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Tuned lasso parameters: {'alpha': 0.05264105263157895}
Tuned lasso score: 0.2651011761660329
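
The score printed above is the best cross-validated R-squared on the training folds. The 0.33 figure quoted in the conclusion below presumably comes from evaluating the tuned model on the held-out test set, which would look like this:

# R-squared of the best Lasso model on the test set
print(lasso_cv.score(X_test, y_test))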

Conclusion


Unfortunately, the best model only has an R-squared score of 0.33, highlighting that using the optimal hyperparameters does not guarantee a high-performing model! (Here, Lasso is a linear regression model being fit to a binary 0/1 target, so a modest R-squared is to be expected.)

