ML - Step 4: Exploratory Data Analysis

Exploratory Data Analysis (EDA) helps us understand the dataset before building Machine Learning models. We use it to discover patterns, detect outliers, and visualize relationships between features.

1️⃣ Load Dataset

We’ll use Pandas to load and inspect the dataset:

import pandas as pd

# Load dataset (CSV file)
df = pd.read_csv("data.csv")

# View first rows
print(df.head())

# Get summary statistics (numeric columns only, by default)
print(df.describe())

# Check for missing values
print(df.isnull().sum())
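
The checks above can be folded into a single per-column overview. The following is a minimal sketch, not part of the original walkthrough: the helper name summarize_columns, the Department column, and the sample values are all hypothetical stand-ins for whatever data.csv actually contains.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical stand-in for data.csv (values invented for illustration)
sample = pd.DataFrame({
    "Age": [25, 32, None, 41, 29],
    "Salary": [40000, 52000, 61000, None, 48000],
    "Department": ["HR", "IT", "IT", None, "HR"],
})

summary = summarize_columns(sample)
print(summary)
```

Columns with a high missing_pct or an unexpected dtype are usually the first ones worth a closer look before plotting.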

2️⃣ Data Visualization with Matplotlib

We can visualize distributions and relationships with Matplotlib:

import matplotlib.pyplot as plt

# Histogram of Age (missing values dropped explicitly)
plt.hist(df["Age"].dropna(), bins=10, color="skyblue", edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
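
A histogram shows the shape of a distribution, but outliers are often easier to confirm numerically. One common rule of thumb flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below assumes that rule; the helper name iqr_outliers and the sample ages are hypothetical, and other thresholds or methods may suit your data better.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Hypothetical ages; 95 is far from the rest of the values
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 95], name="Age")

outliers = iqr_outliers(ages)
print(outliers)
```

With a real dataset you could call iqr_outliers(df["Age"]) and inspect the flagged rows before deciding whether they are errors or genuine extreme values.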

3️⃣ Data Visualization with Seaborn

Seaborn builds on Matplotlib and provides high-level functions for statistical plots such as heatmaps and scatter plots:

import seaborn as sns

# Correlation heatmap (numeric columns only; text columns would raise an error in pandas 2.0+)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Scatter plot
sns.scatterplot(x="Age", y="Salary", data=df)
plt.show()
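
When a dataset has many features, a heatmap gets hard to read, and a ranked list of the strongest pairwise correlations can help. This is a rough sketch under a few assumptions: the helper name top_correlations, the Score and Department columns, and the sample values are hypothetical, and it uses Pearson correlation (the pandas default).

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return up to n feature pairs ranked by absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and
    # self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Hypothetical data (values invented for illustration)
sample = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [40000, 50000, 60000, 70000, 80000],
    "Score": [3, 1, 4, 1, 5],
    "Department": ["HR", "IT", "IT", "HR", "IT"],  # non-numeric, ignored
})

ranked = top_correlations(sample)
print(ranked)
```

A strong correlation between two features can hint at redundancy, which is worth keeping in mind for feature selection.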

✅ Summary

In this step, you learned how to:

  • Load and inspect datasets using Pandas
  • Plot distributions and relationships with Matplotlib
  • Use Seaborn for heatmaps and scatterplots

EDA helps you understand the data deeply and guides decisions for feature selection and model choice in the next steps.