Car Fuel Consumption Prediction with a Decision Tree

In this exercise, we train a regression tree to predict the fuel efficiency, in miles per gallon (mpg), of the cars in the auto-mpg dataset. The dataset has six features describing each car. A regression tree is a decision tree adapted to regression tasks: it recursively splits the data on feature thresholds and predicts a numeric value in each leaf, which lets it capture non-linear relationships between the features and mpg. We fit the tree on a training split, then evaluate it with the test-set root mean squared error (RMSE) and with 10-fold cross-validation.
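To illustrate how a regression tree learns from data, here is a minimal sketch of how a single split could be chosen: try every threshold on one feature and keep the one that minimizes the weighted mean squared error of the two resulting groups. The toy `weight` and `mpg` values and the helper names `mse` and `best_split` are illustrative assumptions; they are not taken from auto.csv or from scikit-learn.

```python
# Minimal sketch of one regression-tree split on a single feature.
# Toy data and helper names are hypothetical, for illustration only.

def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Return (threshold, weighted_mse) for the best rule 'x <= threshold'."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy values: heavier cars tend to have lower mpg.
weight = [1800, 2000, 2200, 3500, 3800, 4200]
mpg = [34.0, 32.0, 30.0, 18.0, 16.0, 14.0]

threshold, score = best_split(weight, mpg)
print(threshold, round(score, 3))
```

A full tree repeats this search over all features and then recursively inside each group until a stopping rule, such as a maximum depth or a minimum leaf size, is reached.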

Libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error as MSE

from sklearn.model_selection import cross_val_score

Dataset

df_car = pd.read_csv('auto.csv')

print(df_car.head())

# Create dummy variables for the origin column
origin_dummy = pd.get_dummies(df_car['origin'])
df_car = pd.concat([df_car, origin_dummy], axis=1)
df_car = df_car.drop(['origin'], axis=1)

print(df_car.head())
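As a small illustration of the dummy-variable step, the sketch below applies the same three calls to a toy frame. The column values ('US', 'Asia', 'Europe') and the mpg numbers are assumptions made up for the example; the labels in auto.csv may differ.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "mpg": [18.0, 31.0, 26.0, 15.0],
    "origin": ["US", "Asia", "Europe", "US"],
})

# One indicator column per distinct origin label
origin_dummy = pd.get_dummies(toy["origin"])

# Append the indicators and drop the original categorical column
encoded = pd.concat([toy, origin_dummy], axis=1).drop(["origin"], axis=1)

print(encoded)
```

Each row ends up with exactly one active indicator. The new columns take their names from the category labels, so passing `prefix='origin'` to `pd.get_dummies` may be worth considering if a label could clash with an existing column name.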

Train/Test split

# Features: every column except the target
X = df_car.drop(['mpg'], axis=1)
# Target as a 1-D Series
y = df_car['mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

Train and fit model

# Instantiate model
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=3)

# Fit to training data
dt.fit(X_train, y_train)
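A note on `min_samples_leaf=0.26`: when this parameter is a float, scikit-learn treats it as a fraction and requires at least `ceil(min_samples_leaf * n_samples)` samples in every leaf. The sketch below works through that arithmetic; the training-set size of 310 is a hypothetical number chosen for illustration, and `min_leaf_size` is a helper written for this example, not a scikit-learn function.

```python
import math

def min_leaf_size(fraction, n_samples):
    """Smallest leaf size allowed when min_samples_leaf is given as a fraction."""
    return math.ceil(fraction * n_samples)

n_train = 310  # hypothetical training-set size, not read from auto.csv
leaf_size = min_leaf_size(0.26, n_train)

# Every leaf holds at least leaf_size samples, which caps the number of leaves.
max_leaves = n_train // leaf_size

print(leaf_size, max_leaves)
```

Under this assumption the tree can have at most three leaves, so the leaf-size constraint, rather than `max_depth=4`, is what limits the model's complexity.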

Evaluation

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse
mse_dt = MSE(y_test, y_pred)

# Compute rmse
rmse_dt = mse_dt ** (1/2)

# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))

Test set RMSE of dt: 4.56
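For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (mpg here). A minimal hand computation, using made-up values rather than the model's actual predictions, looks like this:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values for illustration only
y_true_toy = [30.0, 20.0, 15.0, 25.0]
y_pred_toy = [28.0, 23.0, 15.0, 21.0]

# Errors are 2, -3, 0, 4; squared: 4, 9, 0, 16; mean: 7.25
print("Toy RMSE: {:.2f}".format(rmse(y_true_toy, y_pred_toy)))
```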

Evaluate 10-fold CV error

# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)

# Compute the 10-fold CV RMSE
RMSE_CV = MSE_CV_scores.mean() ** (1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 4.90
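A common follow-up is to compare the CV RMSE with the RMSE on the training set itself. Roughly, a CV error well above the training error suggests overfitting (high variance), while two errors that are similar but both large suggest underfitting (high bias). The sketch below shows the comparison on synthetic data, since it does not assume access to auto.csv; the data-generating formula is an arbitrary assumption, and the hyperparameters simply mirror those used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car data (an assumption, not auto.csv)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 40 - 2.5 * X[:, 0] + rng.normal(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=3)

# 10-fold CV error, estimated on the training split only
cv_mse = -cross_val_score(
    dt, X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)
rmse_cv = cv_mse.mean() ** 0.5

# Training error of the model fitted on the full training split
dt.fit(X_train, y_train)
rmse_train = mean_squared_error(y_train, dt.predict(X_train)) ** 0.5

print("Train RMSE: {:.2f}".format(rmse_train))
print("CV RMSE: {:.2f}".format(rmse_cv))
```

Applying the same comparison to the car model only requires replacing the synthetic `X` and `y` with the features and target built earlier.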

Reference:

https://app.datacamp.com/