ML - Step 4: Exploratory Data Analysis
Exploratory Data Analysis (EDA) helps us understand a dataset before we build machine learning models on it. We use it to discover patterns, spot missing values and outliers, and visualize relationships between features.
1️⃣ Load Dataset
We’ll use Pandas to load and inspect the dataset:
import pandas as pd
# Load dataset (CSV file)
df = pd.read_csv("data.csv")
# View first rows
print(df.head())
# Get summary statistics
print(df.describe())
# Check for missing values
print(df.isnull().sum())
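The raw missing-value counts are easier to act on next to each column's data type and the share of rows affected. Below is a minimal sketch of such a report. The helper name `missing_report` and the toy DataFrame are illustrative assumptions standing in for `data.csv`; they are not part of the original dataset.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for each column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
    })
    # Put the columns with the most missing values first
    return report.sort_values("missing", ascending=False)


# Hypothetical toy data standing in for data.csv
demo = pd.DataFrame({
    "Age": [25, 32, None, 41],
    "Salary": [50000, 64000, 58000, None],
    "City": ["Paris", None, None, "Rome"],
})

report = missing_report(demo)
print(report)

# Fully duplicated rows are another common data-quality issue
print("Duplicate rows:", demo.duplicated().sum())
```

On a real dataset you would call `missing_report(df)` right after loading the file, then decide per column whether to drop or fill the gaps.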
2️⃣ Data Visualization with Matplotlib
We can visualize the distribution of a single feature with Matplotlib, for example with a histogram:
import matplotlib.pyplot as plt
# Histogram of Age (missing values dropped explicitly before plotting)
plt.hist(df["Age"].dropna(), bins=10, color="skyblue", edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
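A histogram hints at outliers visually, but it can help to flag them numerically as well. The sketch below uses the common 1.5 × IQR rule. This rule is one convention among several, and the helper name `iqr_outliers` and the sample ages are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Hypothetical ages; with real data you would pass df["Age"]
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 95], name="Age")
outliers = iqr_outliers(ages)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the data, so treat the result as a list of rows to inspect rather than rows to delete.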
3️⃣ Data Visualization with Seaborn
Seaborn builds on Matplotlib and provides high-level functions for statistical plots such as heatmaps and scatter plots:
import seaborn as sns
# Correlation heatmap (numeric columns only; text columns would raise an error in pandas 2.x)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
# Scatter plot
sns.scatterplot(x="Age", y="Salary", data=df)
plt.show()
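With many features, a heatmap gets crowded, and a ranked list of feature pairs can be easier to read. The following sketch ranks pairs by absolute Pearson correlation. The helper name `rank_correlations` and the toy columns (`Score`, `City`) are hypothetical and only serve the example.

```python
from itertools import combinations

import pandas as pd


def rank_correlations(df: pd.DataFrame) -> pd.Series:
    """Rank pairs of numeric columns by absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Visit each unordered pair once, which also skips the diagonal
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy data; the text column is ignored by select_dtypes
demo = pd.DataFrame({
    "Age": [25, 32, 41, 29, 50],
    "Salary": [40000, 52000, 70000, 48000, 88000],
    "Score": [3, 1, 4, 1, 5],
    "City": ["Paris", "Rome", "Oslo", "Rome", "Paris"],
})

ranked = rank_correlations(demo)
print(ranked)
```

Correlation only captures linear association, so a scatter plot of the top-ranked pairs is still worth a look before drawing conclusions.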
✅ Summary
In this step, you learned how to:
- Load and inspect datasets using Pandas
- Plot feature distributions with Matplotlib histograms
- Use Seaborn for heatmaps and scatterplots