Grain Clustering with K-Means

Title: Unveiling Patterns in Agricultural Data: A Journey into Grain Clustering Using K-Means

Introduction:

Machine learning is increasingly used in agricultural science to find patterns in large datasets. One application is grain analysis: K-Means clustering groups grains by their physical measurements alone, without needing variety labels. The resulting groups can help farmers, researchers, and other agricultural stakeholders categorize grains more consistently.

This project applies the K-Means algorithm to a set of measurements taken on individual grains. Because the method is unsupervised, it can suggest a classification without hand-labeled examples, which makes the assessment of grain characteristics faster and more repeatable. Grains are central to global food systems, so a clearer picture of their diversity can inform cultivation practices, crop yields, and food security.

The data consist of numeric measurements for each grain, seven per grain in the file used here, along with a label for the grain type. K-Means, a popular clustering technique, groups the grains by similarity in these measurements: each grain is assigned to the nearest cluster center, and each center is the mean of the grains assigned to it. The result is a set of clusters, each containing grains with similar characteristics.
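To make the grouping step concrete, here is a minimal sketch of the assign-and-update iteration that K-Means performs, written with NumPy only. The function name `kmeans_sketch` and the toy points are illustrative assumptions, not part of the project code. The notebook itself relies on scikit-learn's `KMeans`, which adds better initialization and convergence checks.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Plain K-Means iterations: assign to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random (fancy indexing returns a copy)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (leave it in place if empty)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (an assumption for illustration): two well-separated groups of three points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters, which is why the later sections compare clusters with grain types through a cross-tabulation.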

The value of the approach is that it replaces manual inspection with an objective, repeatable procedure. Patterns that are hard to see by eye can show up as clusters, and that supports precision agriculture by giving farmers data-driven input for their decisions.

The sections below load the data, choose the number of clusters from an inertia plot, compare the K-Means clusters with the known grain types, and then apply hierarchical clustering to a subset of the grains.

Import Libraries

# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Import pandas
import pandas as pd

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.cluster.hierarchy import fcluster

# Import TSNE
from sklearn.manifold import TSNE

Dataset

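# Note: read_csv treats the first row of the file as a header by default.
# If seeds.csv has no header row, pass header=None so the first grain is not lost.
seeds_df = pd.read_csv('seeds.csv')

# Seven measurements per grain, followed by the grain type
column_names = ['measure1', 'measure2', 'measure3', 'measure4', 'measure5', 'measure6', 'measure7', 'type']

seeds_df.columns = column_names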

X_seeds_df = seeds_df.drop('type', axis=1)
samples = X_seeds_df.to_numpy()

grain_type = seeds_df['type']

seeds_df2 = pd.read_csv('seeds-width-vs-length.csv', header=None)

samples2 = seeds_df2.to_numpy()

Find the best number of clusters

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    # (n_init and random_state are fixed so the plot is reproducible)
    model = KMeans(n_clusters=k, n_init=10, random_state=42)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
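The imports include `make_pipeline` and `StandardScaler`, which the cells above do not use. K-Means relies on Euclidean distance, so a feature measured on a larger scale can dominate the clustering. One plausible use of those imports is to standardize the features before clustering. The sketch below shows the pattern on synthetic data; `demo_samples` and its scale factors are assumptions standing in for `samples`, and whether scaling helps on the grain data is something to check rather than assume.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `samples`: 3 groups of 30 points, 7 features
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 7)) * 5
demo_samples = np.vstack([c + rng.normal(size=(30, 7)) for c in centers])

# Put the features on very different scales, as raw measurements often are
demo_samples = demo_samples * np.array([1, 10, 100, 1, 1, 0.1, 1])

# Standardize each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
demo_labels = pipeline.fit_predict(demo_samples)
print(demo_labels[:10])
```

To try this on the grain data, the same `pipeline` object could be fitted to `samples` in place of the bare `KMeans` model.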

Evaluating the grain clustering

# Create a KMeans model with 3 clusters: model
# (n_init and random_state are fixed so the labels are reproducible)
model = KMeans(n_clusters=3, n_init=10, random_state=42)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and grain types as columns: df
df = pd.DataFrame({'labels': labels, 'type': grain_type})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['type'])

# Display ct
print(ct)
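A cross-tabulation like `ct` can be summarized in two numbers per cluster: which type dominates it, and how many grains fall outside that dominant type. The sketch below does this on a small hand-made table. The `demo` values are invented for illustration and are not results from the grain data.

```python
import pandas as pd

# Invented cluster labels and types, only to show the calculation
demo = pd.DataFrame({
    'labels': [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    'type':   [1, 1, 2, 2, 2, 2, 3, 3, 3, 1],
})
demo_ct = pd.crosstab(demo['labels'], demo['type'])

# The most common type in each cluster (one entry per row of the crosstab)
majority_type = demo_ct.idxmax(axis=1)

# Purity: share of samples that belong to the majority type of their cluster
purity = demo_ct.max(axis=1).sum() / demo_ct.to_numpy().sum()

print(majority_type)
print(purity)
```

A purity close to 1 means each cluster is dominated by a single type. It says nothing about whether the number of clusters is right, so it complements the inertia plot rather than replacing it.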

Hierarchical clustering of the grain data

# 14 grains of each variety, in this order
varieties = ['Kama wheat'] * 14 + ['Rosa wheat'] * 14 + ['Canadian wheat'] * 14

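# NOTE: this assumes the first 42 rows of samples are the 42 grains described by
# varieties, in the same order. If seeds.csv is sorted by type, the first 42 rows
# would all belong to one type and the dendrogram labels would not match the data.
samples_variety = samples[:42]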

# Calculate the linkage: mergings
mergings = linkage(samples_variety, method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()

# Use fcluster to extract labels by cutting the dendrogram at height 6: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)
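The hierarchical clustering above only makes sense if each row of `samples_variety` really belongs to the variety listed at the same position in `varieties`. One way to guarantee that is to draw the subset and its labels from the same DataFrame rows. The sketch below shows the idea on synthetic data. `demo_df`, the measurement columns, and the `type_names` mapping from type code to variety name are all assumptions to verify against the actual file.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in for seeds_df: 3 types, 10 grains each, 3 measurements
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(30, 3)), columns=['m1', 'm2', 'm3'])
demo_df['type'] = np.repeat([1, 2, 3], 10)

# Hypothetical mapping from type code to variety name (check it against your data)
type_names = {1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'}

# Draw the same number of grains from every type (4 here, 14 in the notebook)
subset = demo_df.groupby('type').sample(n=4, random_state=0)

# Labels and measurements now come from the same rows, in the same order
demo_varieties = subset['type'].map(type_names).tolist()
demo_subset_samples = subset.drop('type', axis=1).to_numpy()

# The linkage, dendrogram and fcluster steps then proceed as in the cells above
demo_mergings = linkage(demo_subset_samples, method='complete')
print(demo_varieties)
```

Sampling per type also keeps the three varieties equally represented, which makes the dendrogram and the final cross-tabulation easier to read.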