Diabetes Prediction with KNN Classifier
- saman aboutorab
- Jan 3, 2024
- 1 min read
Updated: Jan 8, 2024
In healthcare and data-driven diagnostics, predicting the likelihood of diabetes has become an important goal. This post treats it as a binary classification problem: given patient features such as Body Mass Index (BMI) and age in years, predict whether or not an individual has diabetes. We compare two approaches, K-Nearest Neighbors (KNN) and logistic regression, each with its own strengths for this kind of problem. Identifying at-risk individuals early is a step toward more proactive and personalized healthcare interventions.
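
To build intuition for how KNN classifies, here is a tiny sketch with made-up [BMI, age] values (purely illustrative, not from the dataset used below): a new point is assigned the majority class among its k nearest training points.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: [BMI, age] pairs with made-up 0/1 labels
X_toy = np.array([[22, 25], [31, 45], [35, 52], [24, 30], [29, 60]])
y_toy = np.array([0, 1, 1, 0, 1])

# With k=3, the prediction is the majority label of the 3 nearest points
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)
print(knn_toy.predict([[30, 50]]))  # [1] -- all 3 nearest neighbors are labelled 1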

Import Libraries
# Import Pandas
import pandas as pd
# Import Numpy
import numpy as np
# Import Matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Import confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
# Import roc_curve
from sklearn.metrics import roc_curve
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

Dataset
diabetes_df = pd.read_csv('diabetes_clean.csv')
print(diabetes_df.head(5))
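
Before modelling, it is also worth checking the shape and class balance; a quick sketch, assuming the 0/1 target column 'diabetes' used throughout this post:

# Dataset dimensions and class balance
print(diabetes_df.shape)
print(diabetes_df['diabetes'].value_counts(normalize=True))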

Split data to train and test
X = diabetes_df.drop('diabetes', axis=1)
y = diabetes_df['diabetes'].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
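
Because stratify=y is passed, the positive-class proportion should be nearly identical across the full data, the training set, and the test set; a quick sanity check, assuming y is the 0/1 array created above:

# Fraction of diabetes cases in each split
print(y.mean(), y_train.mean(), y_test.mean())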

KNN model
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the model to the training data
knn.fit(X_train, y_train)
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
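
The choice of n_neighbors=6 is somewhat arbitrary. One common way to pick k is to compare train and test accuracy over a range of values and look for where the test curve peaks; a sketch reusing the split above:

# Sweep k and record train/test accuracy for each value
neighbors = np.arange(1, 13)
train_accuracies, test_accuracies = [], []
for k in neighbors:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    train_accuracies.append(knn_k.score(X_train, y_train))
    test_accuracies.append(knn_k.score(X_test, y_test))

plt.plot(neighbors, train_accuracies, label='Train accuracy')
plt.plot(neighbors, test_accuracies, label='Test accuracy')
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.legend()
plt.show()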

Confusion Matrix
# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
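
As a reminder of what these numbers mean: for a binary problem, confusion_matrix returns [[TN, FP], [FN, TP]], and precision and recall follow directly from those counts; a sketch of the arithmetic:

# Recompute precision and recall by hand from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # of predicted positives, the fraction that are correct
recall = tp / (tp + fn)     # of actual positives, the fraction that are found
print(precision, recall)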

Logistic Regression model
# Instantiate the model
logreg = LogisticRegression()
# Fit the model
logreg.fit(X_train, y_train)
# Predict probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[:10])

[0.62560803 0.10510601 0.2681658 0.29003726 0.00409551 0.1892188 0.46169468 0.92877036 0.10115643 0.78211143]
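
Each value is the model's estimated probability that the corresponding individual has diabetes; calling .predict() is equivalent to thresholding these probabilities at 0.5, as this small sketch shows:

# Label predictions are the probabilities thresholded at 0.5
y_pred_logreg = (y_pred_probs >= 0.5).astype(int)
print(y_pred_logreg[:10])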

ROC Curve
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
# Plot the chance line (a model guessing at random)
plt.plot([0, 1], [0, 1], 'k--')
# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
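
Each point on the ROC curve corresponds to one probability threshold: lowering the threshold raises the true positive rate at the cost of more false positives. A sketch inspecting a few of the values returned by roc_curve:

# Print every 20th (threshold, FPR, TPR) triple along the curve
for t, f, r in list(zip(thresholds, fpr, tpr))[::20]:
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={r:.2f}")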

Score
# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))
# Predict test-set labels with the logistic regression model
y_pred = logreg.predict(X_test)
# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))
# Calculate the classification report
print(classification_report(y_test, y_pred))

Conclusion
The logistic regression model performs better than the KNN model across all of the metrics calculated here. A ROC AUC score of 0.8002 means the model is 60% better than a chance model at correctly predicting labels.
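
A single train/test split can be noisy, so as a sanity check on this conclusion the two models can also be compared with cross-validated AUC on the training data; a sketch reusing the models defined above:

from sklearn.model_selection import cross_val_score

# Mean cross-validated ROC AUC for each model
for name, model in [('KNN', knn), ('Logistic Regression', logreg)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(name, scores.mean())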

GridSearchCV
# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}
# Lasso model
lasso = Lasso()
# KFold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)
# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_)) Tuned lasso paramaters: {'alpha': 0.05264105263157895} Tuned lasso score: 0.2651011761660329 |
Conclusion
Unfortunately, the best model only achieves an R-squared score of roughly 0.27, highlighting that finding the optimal hyperparameters does not guarantee a high-performing model!