ML - Step 3: Data Preprocessing (Cleaning, Handling Missing Data, Encoding)

Raw datasets often contain missing values, duplicates, or categorical data that must be converted into numerical form for machine learning models. In this step, we’ll learn how to clean and preprocess data properly.
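Duplicates are the simplest of these problems to deal with. Here is a minimal sketch using pandas; the `df_dupes` dataset and its values are made up for illustration and are not part of the examples that follow:

```python
import pandas as pd

# Illustrative data: the third row repeats the first row exactly
df_dupes = pd.DataFrame({
    "Age": [25, 30, 25, 40],
    "Salary": [50000, 60000, 50000, 80000],
})

# Count rows that are exact copies of an earlier row
print(df_dupes.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
df_clean = df_dupes.drop_duplicates().reset_index(drop=True)
print(df_clean)
```

By default `drop_duplicates()` compares all columns; pass `subset=[...]` if only some columns should define a duplicate.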

1️⃣ Handling Missing Data

We can fill missing values with the mean, median, or mode using scikit-learn's SimpleImputer (strategy="mean", "median", or "most_frequent"):

import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset
data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 60000, 80000]
}
df = pd.DataFrame(data)

# Fill missing values with mean
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

print(df)
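The mean is sensitive to outliers, so the median or mode is sometimes a better fit. The sketch below shows those two strategies on a similar dataset; the `Dependents` column and the name `df_missing` are assumptions added only to give the mode something to work with:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df_missing = pd.DataFrame({
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 60000, 80000],
    "Dependents": [0, 2, None, 0],  # hypothetical column for the mode example
})

# Inspect how many values are missing in each column
print(df_missing.isna().sum())

# Median: less affected by extreme values than the mean
median_imputer = SimpleImputer(strategy="median")
df_missing[["Age", "Salary"]] = median_imputer.fit_transform(
    df_missing[["Age", "Salary"]]
)

# Mode: fill with the most frequent value in the column
mode_imputer = SimpleImputer(strategy="most_frequent")
df_missing["Dependents"] = mode_imputer.fit_transform(
    df_missing[["Dependents"]]
).ravel()

print(df_missing)
```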

2️⃣ Encoding Categorical Data

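Convert text categories into numbers with Label Encoding (one integer per category) or One-Hot Encoding (one binary column per category). Label encoding implies an order among the categories, so one-hot encoding is usually the safer choice for unordered features such as department names. Note that scikit-learn's LabelEncoder is designed for target labels rather than input features; it is used on a feature here only to show the idea: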

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Example dataset
df = pd.DataFrame({
    "Department": ["HR", "Finance", "IT", "Finance"]
})

# Label Encoding
le = LabelEncoder()
df["Dept_Label"] = le.fit_transform(df["Department"])

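# One-Hot Encoding: one binary column per department.
# remainder="passthrough" keeps the other columns (here, Dept_Label) unchanged.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Department"])],
    remainder="passthrough"
)
df_encoded = ct.fit_transform(df)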

print(df)
print(df_encoded)

3️⃣ Feature Scaling (Optional)

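Scale features so that columns with large ranges (such as Salary) do not dominate columns with small ranges (such as Age). This matters for distance-based and gradient-based algorithms like KNN, SVM, or Neural Networks; tree-based models generally do not need it. StandardScaler rescales each column to mean 0 and standard deviation 1: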

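from sklearn.preprocessing import StandardScaler

# df now holds only the Department data, so rebuild the imputed numeric columns
df_num = pd.DataFrame({"Age": [25, 30, 31.67, 40],
                       "Salary": [50000, 63333.33, 60000, 80000]})

scaler = StandardScaler()
scaled = scaler.fit_transform(df_num[["Age", "Salary"]])
print(scaled)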
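When the data is later split into training and test sets, a common practice is to fit the scaler on the training rows only and reuse it for the test rows, so that test data does not influence the scaling. A rough sketch, with a made-up array `X` standing in for real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative features: columns are Age and Salary
X = np.array([
    [25, 50000], [30, 55000], [35, 60000],
    [40, 80000], [45, 90000], [50, 100000],
], dtype=float)

# Hold out 2 rows for testing
X_train, X_test = train_test_split(X, test_size=2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```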

✅ Summary

In this step, you learned how to:

  • Clean datasets and handle missing values
  • Encode categorical data
  • Apply feature scaling when required

These preprocessing techniques are critical for building reliable ML models.
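To tie the three steps together, here is one possible way to combine imputation, encoding, and scaling in a single scikit-learn object. This is a sketch, not the only way to do it; the combined dataset `df_all` and the step names ("num", "cat", "impute", "scale") are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative dataset mixing the numeric and categorical columns used above
df_all = pd.DataFrame({
    "Age": [25, 30, np.nan, 40],
    "Salary": [50000, np.nan, 60000, 80000],
    "Department": ["HR", "Finance", "IT", "Finance"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_ready = preprocess.fit_transform(df_all)
print(X_ready.shape)  # 2 scaled numeric columns + 3 one-hot columns
print(X_ready)
```

Keeping the steps in one object means the same transformations can be applied to new data with `preprocess.transform(...)`.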