
Liver Disease Prediction with AdaBoost and KNN Classifier

  • Writer: saman aboutorab
  • Jan 3, 2024
  • 1 min read

Updated: Jan 8, 2024

In this project we work with the Indian Liver Patient Dataset from the UCI Machine Learning Repository. The goal is to predict whether a patient has liver disease from the health-related features in the dataset. We start by instantiating three classifiers, logistic regression, k-Nearest Neighbors (KNN), and a classification tree, and comparing their test accuracy. We then combine and improve them with ensemble methods: a voting classifier, bagging, and AdaBoost, followed by hyperparameter tuning with GridSearchCV. Beyond the machine learning mechanics, the project addresses a real health concern and demonstrates a practical application of these techniques in healthcare analytics.





Libraries

import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

Dataset

# Load the preprocessed Indian Liver Patient dataset
df_patients = pd.read_csv('indian_liver_patient_preprocessed.csv')
print(df_patients.head())

# Drop the leftover index column written by an earlier to_csv call
df_patients = df_patients.drop(['Unnamed: 0'], axis=1)
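
Before splitting the data, it helps to confirm the shape and class balance. A minimal sanity check, assuming the preprocessed file uses Liver_disease as the binary target (as in the split below):

# Quick sanity checks (assumes the Liver_disease column is the 1/0 target)
print(df_patients.shape)
print(df_patients['Liver_disease'].value_counts(normalize=True))
print('Missing values:', df_patients.isna().sum().sum())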

Train/Test split

# Features and target (use a Series for y so sklearn doesn't warn about a column vector)
X = df_patients.drop(['Liver_disease'], axis=1)
y = df_patients['Liver_disease']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

KNN Classifier models

SEED = 1
# Instantiate lr
lr = LogisticRegression(random_state = SEED)

# Instantiate KNN
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# List of classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbors', knn), ('Classification Tree', dt)]
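
The value n_neighbors=27 is taken as given here. For illustration only (not part of the original run), one way such a value could be chosen is a small cross-validated sweep over k:

# Illustrative sketch: cross-validate a few candidate k values for KNN
from sklearn.model_selection import cross_val_score

for k in [5, 11, 19, 27, 35]:
    scores = cross_val_score(KNN(n_neighbors=k), X_train, y_train, cv=5, scoring='accuracy')
    print('k={:2d}: mean CV accuracy {:.3f}'.format(k, scores.mean()))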

Evaluate Classifiers

# Iterate over the classifiers, fitting and evaluating each
for clf_name, clf in classifiers:

  # Fit to the training data
  clf.fit(X_train, y_train)

  # Predict test data
  y_pred = clf.predict(X_test)

  # Calculate accuracy
  accuracy = accuracy_score(y_test, y_pred)

  # Evaluate clf's accuracy on the test set
  print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.690
K Nearest Neighbors : 0.698
Classification Tree : 0.672

Voting classifiers

# Instantiate a VotingClassifier
vc = VotingClassifier(estimators = classifiers)

# Fit to training data
vc.fit(X_train, y_train)

# predict
y_pred = vc.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.681
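
VotingClassifier defaults to hard (majority) voting. Since all three estimators expose predict_proba, a soft-voting variant could also be tried; the sketch below is illustrative and not part of the original results:

# Illustrative sketch: average predicted probabilities instead of majority votes
vc_soft = VotingClassifier(estimators=classifiers, voting='soft')
vc_soft.fit(X_train, y_train)
print('Soft Voting Classifier: {:.3f}'.format(accuracy_score(y_test, vc_soft.predict(X_test))))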

Bagging classifier

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc
# Note: in scikit-learn >= 1.2 the base_estimator argument is renamed to estimator
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, oob_score=True, random_state=1)

Fit the model

# Fit bc on training data
bc.fit(X_train, y_train)

# Predict bc on test data
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# OOB accuracy: each sample is scored by the trees that did not see it in their bootstrap
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))

Test set accuracy: 0.681, OOB accuracy: 0.737

AdaBoost classifier

# Instantiate ada (base_estimator is renamed estimator in scikit-learn >= 1.2)
ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1)

# Fit to the training data
ada.fit(X_train, y_train)

# Predict the test set probabilities of the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]

Evaluate Ada

# Evaluate ada classifier
ada_roc_auc_score = roc_auc_score(y_test, y_pred_proba)

print('ROC AUC score: {:.3f}'.format(ada_roc_auc_score))

ROC AUC score: 0.657
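
The AUC summarizes the whole ROC curve in a single number. To look at the curve itself, a minimal sketch using matplotlib (not part of the original post) would be:

# Illustrative sketch: plot the ROC curve behind the AUC above
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label='AdaBoost (AUC = {:.3f})'.format(ada_roc_auc_score))
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()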


GridSearchCV: Decision Tree hyperparameters

# Define params
params_dt = {'max_depth':[2, 3, 4], 'min_samples_leaf':[0.12, 0.14, 0.16, 0.18]}

# Instantiate grid_dt, scoring by ROC AUC with 5-fold cross-validation
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)

# Fit the grid search to the training data
grid_dt.fit(X_train, y_train)

GridSearch result

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.696
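
It is also worth inspecting which hyperparameters won and their mean cross-validated score, both available on the fitted GridSearchCV object:

# Inspect the winning hyperparameters and their cross-validated ROC AUC
print('Best hyperparameters:', grid_dt.best_params_)
print('Best CV ROC AUC: {:.3f}'.format(grid_dt.best_score_))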



