Experimental Design & Evaluation
How to properly judge model performance.
Building a model is only half the battle. To trust its predictions in the real world, we need to evaluate it rigorously. This involves splitting data correctly to avoid leakage and choosing metrics that align with the business goal.
1. Data Splitting Strategies
We never train and test on the same data.
- Training Set: Used to fit the model parameters.
- Validation Set: Used to tune hyperparameters during development.
- Test Set: Used ONLY once at the end to estimate final performance.
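A common recipe is to call train_test_split twice: first carve off the test set, then split the remainder into train and validation. The sketch below is a minimal illustration; the 60/20/20 ratio and the synthetic data are assumptions for the example, not prescribed above.
# Sketch: Train / Validation / Test split (illustrative 60/20/20 ratio)
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)
# First split off the test set (20% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remaining 80% into train and validation (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Result: 60 train, 20 validation, 20 test samples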
Random vs. Stratified Sampling
Random Sampling: Assigns examples to train and test purely at random; class proportions can drift between the splits, especially with small or imbalanced datasets.
Stratified Sampling: Ensures that the proportion of classes (e.g., Spam vs Not Spam) remains the same in train and test sets. Crucial for imbalanced datasets.
Why is this used in ML?
Proper splitting prevents Data Leakage and ensures the model is evaluated on unseen data. Stratification is essential for Imbalanced Classification problems (e.g., Fraud Detection) to ensure the test set represents the minority class.
Code Implementation
# Scikit-Learn: Stratified Split
from sklearn.model_selection import train_test_split
import numpy as np
# Imbalanced Data: 90% Class 0, 10% Class 1
y = np.array([0]*90 + [1]*10)
X = np.random.rand(100, 2)
# Stratify=y ensures the 9:1 ratio is preserved
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Verify: the 20-sample test set should contain 2 positives (10% of 20)
print("Test Set Class 1 Count:", int((y_test == 1).sum()))  # 2
2. Classification Metrics
Confusion Matrix
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Metrics
- Accuracy: $(TP+TN) / Total$. Good only for balanced classes.
- Precision: $TP / (TP + FP)$. "Of all predicted positives, how many were real?" (Crucial for Spam detection).
- Recall (Sensitivity): $TP / (TP + FN)$. "Of all real positives, how many did we find?" (Crucial for Cancer detection).
- F1 Score: $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$. Harmonic mean, good balance.
"True Positive Rate equals True Positives divided by Real Positives. False Positive Rate equals False Positives divided by Real Negatives."
Why is this used in ML?
Accuracy is often misleading. We choose metrics based on the cost of errors:
- High Precision needed: When False Positives are costly (e.g., Spam Filters blocking real emails).
- High Recall needed: When False Negatives are dangerous (e.g., Medical Diagnosis missing a disease).
ROC & AUC
ROC Curve: Plots TPR vs FPR at different thresholds.
AUC (Area Under the Curve): Aggregate measure of performance across all thresholds; 1.0 is perfect, 0.5 is random guessing.
Code Implementation
# Scikit-Learn: Classification Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
# True vs Predicted
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# roc_auc_score accepts hard labels, but is usually given probability scores (see the sketch below)
auc = roc_auc_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
# Accuracy: 0.8
# F1 Score: 0.83
# AUC Score: 0.79
# Confusion Matrix: [[3, 1], [1, 5]]
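In practice the ROC curve and AUC are computed from predicted probabilities rather than hard labels. The sketch below assumes a LogisticRegression fitted on synthetic data purely for illustration.
# Sketch: ROC/AUC from predicted probabilities (synthetic data, assumed LogisticRegression)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]        # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_te, proba)  # TPR/FPR at each threshold
auc = roc_auc_score(y_te, proba)               # area under that curve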
3. Regression Metrics
Used when predicting continuous values.
- MAE (Mean Absolute Error): $\frac{1}{n}\sum|y - \hat{y}|$. More robust to outliers than RMSE; treats all errors linearly.
- RMSE (Root Mean Squared Error): $\sqrt{\frac{1}{n}\sum(y - \hat{y})^2}$. Penalizes large errors heavily.
- MAPE (Mean Absolute Percentage Error): $\frac{100\%}{n}\sum|\frac{y - \hat{y}}{y}|$. Easy to interpret but undefined if y=0.
Why is this used in ML?
We choose RMSE when we want to punish large errors severely (e.g., predicting house prices where being off by $100k is much worse than $10k). We use MAE when we want a robust metric that treats all errors linearly.
Code Implementation
# Scikit-Learn: Regression Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import numpy as np
y_true_reg = [100, 150, 200, 250]
y_pred_reg = [110, 140, 205, 260] # Some errors
mae = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
mape = mean_absolute_percentage_error(y_true_reg, y_pred_reg) * 100
# MAE: 8.75
# RMSE: 9.01
# MAPE: 5.79%
# Manual RMSE Implementation
manual_rmse = np.sqrt(np.mean((np.array(y_true_reg) - np.array(y_pred_reg))**2))
# Manual RMSE: 9.01
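To see why RMSE punishes large errors more than MAE, the sketch below adds one large prediction error to a small made-up dataset and compares the two metrics; the numbers are assumptions for illustration only.
# Sketch: effect of a single large error on MAE vs RMSE (illustrative numbers)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 205, 260, 500])            # last prediction is off by 200
mae  = mean_absolute_error(y_true, y_pred)              # (10+10+5+10+200)/5 = 47.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # ~89.8, dominated by the single large error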