Breast Cancer Prediction with Decision Tree Classification

This project applies Decision Tree classification to the Wisconsin Breast Cancer Dataset from the UCI Machine Learning Repository. The goal is a model that predicts whether a tumor is malignant or benign. The analysis centers on two features: the mean radius of the tumor (radius_mean) and the mean number of concave points (concave points_mean). Both describe the shape and contour of a tumor, characteristics that are indicative of potential malignancy. A decision tree learns threshold-based splits on its input features to separate the two classes. Beyond showing machine learning applied to medical diagnostics, the project illustrates how feature selection affects model performance on a specific task such as identifying malignant breast tumors.

Libraries

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score

Dataset

df_cancer = pd.read_csv('wisconsin_breast_cancer.csv')

print(df_cancer.head())

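# Encode the target: malignant (M) as 1, benign (B) as 0
mapping = {'M': 1, 'B': 0}
df_cancer['diagnosis'] = df_cancer['diagnosis'].map(mapping)

# Count missing values per column
df_cancer.isna().sum()

# Drop the 'Unnamed: 32' column (in this CSV it typically holds only missing values)
df_cancer = df_cancer.drop(['Unnamed: 32'], axis=1)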

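# Features (X) and target (y)
# Note: X keeps every column except the target, so the model sees more than
# radius_mean and concave points_mean (including an identifier column, if present).
X = df_cancer.drop(['diagnosis'], axis=1)
# Select the target as a 1-D Series, the shape scikit-learn expects for y
y = df_cancer['diagnosis']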
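
The introduction names radius_mean and concave points_mean as the features of interest. A minimal sketch of restricting the feature matrix to those two columns is shown here on a tiny made-up frame. The names df_demo, feature_cols, X_two and y_two, and all of the values, are hypothetical and are not rows from the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for the example.
df_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "M", "B"],
    "radius_mean": [18.0, 12.5, 20.1, 11.4],
    "texture_mean": [10.4, 17.8, 21.3, 15.2],
    "concave points_mean": [0.15, 0.03, 0.12, 0.02],
})

# Same label encoding as in the notebook: malignant -> 1, benign -> 0.
df_demo["diagnosis"] = df_demo["diagnosis"].map({"M": 1, "B": 0})

# Keep only the two features named in the introduction.
feature_cols = ["radius_mean", "concave points_mean"]
X_two = df_demo[feature_cols]
y_two = df_demo["diagnosis"]

print(X_two.shape)
```

Selecting columns by name, rather than dropping the ones that are not wanted, keeps identifier columns such as id out of the model.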

Train/Test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
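
The split above holds out 20% of the rows for testing. One optional variation, not used in this notebook, is to pass stratify so that both subsets keep the same class proportions. The sketch below uses synthetic arrays (X_demo and y_demo are hypothetical) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 30 positives and 70 negatives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the 30/70 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

print(len(y_tr), len(y_te))            # subset sizes
print(int(y_tr.sum()), int(y_te.sum()))  # positives in each subset
```

With an imbalanced target, a stratified split helps ensure that the test accuracy is not distorted by an unlucky draw of classes.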

Train the tree model

# Instantiate a DecisionTreeClassifier with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=123)

# Fit to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)

Evaluate classification

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

Test set accuracy: 0.90
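
Accuracy is the fraction of test samples whose predicted label matches the true label. The sketch below recomputes it by hand and also counts the four outcome types, since a single accuracy figure does not show how many malignant cases were missed. The helper names accuracy and confusion_counts and the label lists are illustrative assumptions, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the malignant class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up labels: 1 = malignant, 0 = benign.
y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))          # 0.8
print(confusion_counts(y_true_demo, y_pred_demo))  # (3, 1, 1, 5)
```

In a diagnostic setting the false negatives (malignant tumors predicted as benign) are usually the costliest errors, so they are worth inspecting alongside accuracy.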

Entropy criterion

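# Instantiate a model that uses the entropy criterion
# (max_depth and random_state differ from the first model, so the two accuracies are not a like-for-like comparison)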
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# Fit the model to the training set
dt_entropy.fit(X_train, y_train)
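
The criterion parameter selects the impurity measure the tree uses to score candidate splits. scikit-learn's default is Gini impurity, and criterion='entropy' switches to Shannon entropy. A minimal sketch of both measures for a single node follows. The function names gini and entropy and the class counts are illustrative assumptions.

```python
import math


def gini(counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# A hypothetical node holding 3 malignant and 7 benign samples.
node = [3, 7]
print(round(gini(node), 4))     # 0.42
print(round(entropy(node), 4))  # 0.8813
```

Both measures are zero for a pure node and largest for an even class mix, which is why the two criteria often produce similar trees.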

Predict and score

# Predict test set labels with the entropy model
y_pred = dt_entropy.predict(X_test)

# Evaluate accuracy
accuracy_entropy = accuracy_score(y_test, y_pred)

print(f'Accuracy achieved by using entropy: {accuracy_entropy:.3f}')

Accuracy achieved by using entropy: 0.860

Reference:

https://www.datacamp.com/