Statistical Learning Concepts
The theoretical foundation of Machine Learning.
Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is the theoretical framework that underpins how we train, evaluate, and select machine learning models.
1. Bias-Variance Tradeoff
The fundamental tradeoff in supervised learning. It describes the difficulty of simultaneously minimizing the two sources of error that prevent models from generalizing:
- Bias (Underfitting): Error due to overly rigid assumptions about the model (e.g., assuming linearity) when the real relationship is complex. High-bias models miss relevant relations between features and the target.
- Variance (Overfitting): Error due to sensitivity to small fluctuations in the training set. High-variance models fit the random noise in the training data instead of the underlying signal.
"Total Error equals Bias squared plus Variance plus Irreducible Error."
2. Cross-Validation
A resampling procedure used to evaluate machine learning models on a limited data sample. K-Fold Cross-Validation splits the data into $k$ subsets (folds). The model is trained on $k-1$ folds and tested on the remaining fold. This is repeated $k$ times, so each fold serves once as the test set, and the $k$ scores are averaged.
Code Implementation
# Scikit-Learn: K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
# Synthetic Data
X = np.random.rand(50, 1)
y = 2 * X.squeeze() + 1 + np.random.randn(50) * 0.2
model = LinearRegression()
# 5-Fold Cross-Validation
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mean_mse = -1 * np.mean(scores)  # scores are negative MSE, so negate to recover MSE
# Average MSE: 0.0358 (varies from run to run because the data are randomly generated)
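The same evaluation can be written with an explicit KFold loop, which makes the train-on-$k-1$-folds / test-on-one-fold cycle described above visible (a sketch reusing X, y, and model from the block above; with an unshuffled 5-fold split it should reproduce the cross_val_score result):
# Manual K-Fold loop equivalent to cross_val_score above
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])    # train on k-1 folds
    preds = model.predict(X[test_idx])       # test on the held-out fold
    fold_mse.append(np.mean((y[test_idx] - preds) ** 2))
mean_mse_manual = np.mean(fold_mse)          # average across the k folds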
3. Bootstrapping
A powerful statistical method that resamples the dataset with replacement to create many simulated samples. It allows us to estimate the uncertainty (standard error, confidence intervals) of almost any statistic (mean, median, model coefficients).
Code Implementation
# Scikit-Learn: Bootstrapping
import numpy as np
from sklearn.utils import resample
# Original Sample
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
means = []
# Generate 1000 Bootstrap Samples (resampling with replacement)
for i in range(1000):
    boot = resample(data, replace=True, n_samples=len(data))
    means.append(np.mean(boot))
# 95% Confidence Interval from the 2.5th and 97.5th percentiles of the bootstrap means
lower_ci = np.percentile(means, 2.5)
upper_ci = np.percentile(means, 97.5)
# 95% Confidence Interval for Mean: [3.8, 7.3]
4. Likelihood & Loss Functions
How do we measure "goodness of fit"?
- Likelihood ($L$): The probability of observing the data we actually saw, given a specific set of model parameters. We want to Maximize this (Maximum Likelihood Estimation, MLE).
- Loss Function ($J$): A penalty for bad predictions (e.g., Squared Error, Cross-Entropy). We want to Minimize this.
Mathematically, minimizing the **Mean Squared Error** (MSE) is equivalent to maximizing the **Likelihood** when the errors are assumed to be Gaussian (normally distributed).
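A quick numerical check of this equivalence (a sketch reusing the synthetic X, y from the cross-validation example, a no-intercept model $y \approx w x$, and an assumed fixed noise scale of 0.5): the slope that minimizes MSE is the same slope that maximizes the Gaussian log-likelihood.
# MSE minimization vs. Gaussian likelihood maximization (assumed setup)
sigma = 0.5                              # assumed fixed noise scale for the illustration
candidates = np.linspace(0.0, 4.0, 81)   # candidate slopes w
mse = [np.mean((y - w * X.squeeze()) ** 2) for w in candidates]
loglik = [(-len(y) / 2) * np.log(2 * np.pi * sigma ** 2)
          - np.sum((y - w * X.squeeze()) ** 2) / (2 * sigma ** 2) for w in candidates]
best_by_mse = candidates[np.argmin(mse)]
best_by_likelihood = candidates[np.argmax(loglik)]
# best_by_mse == best_by_likelihood: both criteria select the same slope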
5. Gradient-Based Optimization
How do models "learn"? They iteratively adjust their parameters to minimize the Loss Function. Gradient Descent computes the gradient (slope) of the loss with respect to the parameters and takes a step in the opposite (downhill) direction.
"New theta equals old theta minus learning rate times the gradient of Loss w.r.t theta."
Code Implementation
# Gradient Descent Step (Simplified)
# Function to minimize: J(w) = w^2 (Parabola)
# Gradient: dJ/dw = 2w
current_w = 4.0
learning_rate = 0.1
# Take one step
gradient = 2 * current_w
next_w = current_w - (learning_rate * gradient)
# Old w: 4.0 -> New w: 3.2
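Repeating that step in a loop (a minimal sketch with the same parabola and learning rate) shows the iterates shrinking geometrically toward the minimum at $w = 0$:
# Full Gradient Descent loop for J(w) = w^2
w = 4.0
learning_rate = 0.1
for step in range(25):
    gradient = 2 * w                      # dJ/dw
    w = w - learning_rate * gradient      # move against the gradient
# Each update multiplies w by (1 - 2 * learning_rate) = 0.8, so w -> 0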
6. Information Criteria (AIC, BIC)
Metrics for model selection that penalize model complexity to discourage overfitting. Lower is better.
- AIC (Akaike Information Criterion): $2k - 2\ln(\hat{L})$
- BIC (Bayesian Information Criterion): $k \ln(n) - 2\ln(\hat{L})$
Where $k$ is the number of parameters and $n$ is the sample size. BIC penalizes complexity ($k$) more harshly than AIC once $\ln(n) > 2$, so its penalty grows with the sample size.
Code Implementation
# Calculating AIC manually
# Fit Linear Regression -> Get SSE -> Estimate Log-Likelihood
n = 50
k = 2  # Intercept + Slope
model.fit(X, y)  # cross_val_score does not fit the original estimator, so fit it here
sse = np.sum((y - model.predict(X))**2)  # Sum of Squared Errors
# Maximized Log-Likelihood for Normal (Gaussian) errors
log_likelihood = -n/2 * (1 + np.log(2 * np.pi) + np.log(sse/n))
aic = 2*k - 2*log_likelihood
# AIC Score: -24.79
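BIC follows from the same fitted model by swapping in the $k \ln(n)$ complexity penalty (a short sketch reusing n, k, and log_likelihood from the block above):
# Calculating BIC from the same fit
bic = k * np.log(n) - 2 * log_likelihood
# With n = 50, ln(n) ≈ 3.91 > 2, so BIC's complexity penalty exceeds AIC's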