ML - Step 3: Data Preprocessing (Cleaning, Handling Missing Data, Encoding)
Raw datasets often contain missing values, duplicates, or categorical data that must be converted into numerical form for machine learning models. In this step, we’ll learn how to clean and preprocess data properly.
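Before fixing anything, it helps to see how many missing values and duplicate rows a dataset actually has. The snippet below is a minimal sketch using pandas; the `raw` DataFrame and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical raw data: one fully duplicated row and one missing value
raw = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ben", "Cy"],
    "Age": [25, 30, 30, None],
})

# Count missing values per column
print(raw.isna().sum())

# Count rows that are exact copies of an earlier row
print(raw.duplicated().sum())

# Drop the duplicate rows and renumber the index
deduped = raw.drop_duplicates().reset_index(drop=True)
print(deduped)
```

`drop_duplicates()` only removes rows where every column matches; near-duplicates (for example, different spellings of the same name) need separate handling.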
1️⃣ Handling Missing Data
We can fill missing values with the mean, median, or mode of each column using scikit-learn's SimpleImputer (strategy="mean", "median", or "most_frequent"):
import pandas as pd
from sklearn.impute import SimpleImputer
# Example dataset
data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 60000, 80000]
}
df = pd.DataFrame(data)
# Fill missing values with mean
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])
print(df)
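The other two strategies work the same way. As a rough sketch (the dataset here is invented for illustration, including the outlier age of 90), the median is less affected by extreme values than the mean, and "most_frequent" can also fill text columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data: Age has an outlier (90), Department is a text column.
# dtype=object is set explicitly so the missing entry stays a plain NaN.
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 90],
    "Department": pd.Series(["HR", np.nan, "IT", "HR"], dtype=object),
})

# Median of 25, 30, 90 is 30, while the mean would be pulled up by the outlier
median_imputer = SimpleImputer(strategy="median")
df[["Age"]] = median_imputer.fit_transform(df[["Age"]])

# most_frequent fills with the mode, here "HR"
mode_imputer = SimpleImputer(strategy="most_frequent")
df[["Department"]] = mode_imputer.fit_transform(df[["Department"]])

print(df)
```

Note that SimpleImputer looks for `np.nan` by default, so missing text entries should be stored as NaN rather than as empty strings.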
2️⃣ Encoding Categorical Data
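Convert text categories into numbers with Label Encoding (one integer per category) or One-Hot Encoding (one 0/1 column per category). Integer labels imply an order that most categories do not have, so One-Hot Encoding is usually the safer choice for nominal features: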
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Example dataset
df = pd.DataFrame({
    "Department": ["HR", "Finance", "IT", "Finance"]
})
# Label Encoding: one integer per category, assigned in sorted order
# (Finance=0, HR=1, IT=2)
le = LabelEncoder()
df["Dept_Label"] = le.fit_transform(df["Department"])
# One-Hot Encoding: one 0/1 column per category;
# remainder="passthrough" keeps the other columns (here Dept_Label) unchanged
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Department"])],
    remainder="passthrough"
)
df_encoded = ct.fit_transform(df)
print(df)
print(df_encoded)
3️⃣ Feature Scaling (Optional)
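Scale features so that columns with large ranges (such as Salary) do not dominate columns with small ranges (such as Age) in algorithms that rely on distances or gradient-based optimization, like KNN, SVM, or Neural Networks: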
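from sklearn.preprocessing import StandardScaler
# Section 2 reassigned df, so build a small numeric frame (illustrative values, no missing data)
num_df = pd.DataFrame({"Age": [25, 30, 35, 40], "Salary": [50000, 60000, 70000, 80000]})
scaler = StandardScaler()
# Standardize each column to zero mean and unit variance
scaled = scaler.fit_transform(num_df[["Age", "Salary"]])
print(scaled)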
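To see what StandardScaler does under the hood, the sketch below recomputes the same result by hand: subtract each column's mean and divide by its standard deviation. The Age and Salary values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Age and Salary columns
X = np.array([
    [25.0, 50000.0],
    [30.0, 60000.0],
    [35.0, 70000.0],
    [40.0, 80000.0],
])

scaled = StandardScaler().fit_transform(X)

# Manual standardization: z = (x - mean) / std, computed per column.
# NumPy's default std (ddof=0) matches what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

When working with separate training and test sets, the usual practice is to call `fit` on the training data only and then reuse the same scaler with `transform` on the test data.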
✅ Summary
In this step, you learned how to:
- Clean datasets and handle missing values
- Encode categorical data
- Apply feature scaling when required
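As a recap, the three steps can be combined into a single preprocessing object. This is one possible arrangement, not the only one: it assumes the Age, Salary, and Department columns from the earlier examples live in the same table, and the step names ("impute", "scale", "num", "cat") are arbitrary labels chosen here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The example columns from the earlier sections, merged into one table
raw = pd.DataFrame({
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 60000, 80000],
    "Department": ["HR", "Finance", "IT", "Finance"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocess.fit_transform(raw)
print(features.shape)  # 2 scaled numeric columns + 3 one-hot columns
print(features)
```

Keeping the steps in one object means the same imputation values, scaling parameters, and category list learned from the training data are reapplied consistently to any new data passed to `preprocess.transform`.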