Statistical Inference

Drawing conclusions about populations from samples.

1. Sampling Methods

The process of selecting a subset of individuals from a statistical population to estimate characteristics of the whole population.

Common Methods

  • Simple Random Sampling: Every member has an equal chance of being selected.
  • Stratified Sampling: Population is divided into subgroups (strata) and samples are taken from each.

Why is this used in ML?

Sampling is fundamental to Train/Test Splitting. We use random sampling (or stratified sampling when classes are imbalanced) so that the training and test sets represent the real-world data distribution, reducing sampling bias in our models; a stratified split sketch follows the code below.

Code Implementation


# Numpy: Simple Random Sampling
import numpy as np

population = np.arange(1000)
sample = np.random.choice(population, size=50, replace=False)
mean_val = np.mean(sample)
# Result: 467.46
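
The numpy snippet above covers simple random sampling only. Stratified sampling is most often applied at the train/test split; a minimal sketch using scikit-learn's train_test_split, with the feature matrix and the 90/10 class imbalance below chosen purely for illustration:

# Scikit-learn: Stratified train/test split
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1000 rows, 3 features, 90% class 0 / 10% class 1
X = np.random.rand(1000, 3)
y = np.array([0] * 900 + [1] * 100)

# stratify=y preserves the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)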
    

2. Law of Large Numbers (LLN)

As the size of a sample increases, the sample mean gets closer to the expected value (population mean).

$$ \lim_{n \to \infty} \bar{X}_n = \mu $$

"The limit of the sample mean X-bar as n approaches infinity equals the population mean mu."

Why is this used in ML?

This justifies why we need large datasets. The training loss is an average over samples, so by the LLN it converges to the expected (true) loss as the dataset grows, and the parameters that minimize it get closer to the "true" optimal parameters that generalize well.

Code Implementation


# Simulation: Coin Flips (True Mean = 0.5)
import numpy as np

# Small N (n=10)
small_mean = np.mean(np.random.binomial(n=1, p=0.5, size=10))
# Result: 0.7 (High Variance)

# Large N (n=10000)
large_mean = np.mean(np.random.binomial(n=1, p=0.5, size=10000))
# Result: 0.504 (Converges to 0.5)
    

3. Central Limit Theorem (CLT)

The sampling distribution of the sample mean approaches a Normal Distribution as the sample size gets larger, regardless of the population's distribution (provided the population has finite variance).

Why is this used in ML?

CLT allows us to make statistical inferences (compute confidence intervals, p-values) about model performance metrics (like accuracy), assuming they are normally distributed, even if the underlying data isn't.

Code Implementation


# Sampling from Uniform Distribution [0, 100] (Not Normal)
import numpy as np

# Taking 1000 samples of size 30 and averaging them
sample_means = [np.mean(np.random.uniform(0, 100, 30)) for _ in range(1000)]

clt_mean = np.mean(sample_means)
clt_std = np.std(sample_means) 
# Result Mean (approx 50): 49.98
# Result Std (approx 5.2): 5.33
    

4. Estimators: Bias, Variance, MLE

An Estimator is a rule for calculating an estimate of a given quantity (parameter) based on observed data.

  • Bias: Difference between the estimator's expected value and the true value ($E[\hat{\theta}] - \theta$); see the numeric sketch after this list.
  • Variance: How much the estimate fluctuates for different samples.
  • MLE (Maximum Likelihood Estimation): A method to find parameters that maximize the probability of obtaining the observed data.
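
A concrete way to see bias is numpy's two sample-variance estimators; a minimal sketch, with the sample size and distribution below chosen purely for illustration:

# Numpy: biased vs. unbiased variance estimators
import numpy as np

rng = np.random.default_rng(0)
# 20,000 small samples of size 5 from N(0, 2); the true variance is 4.0
samples = rng.normal(loc=0, scale=2, size=(20000, 5))

biased = np.mean(np.var(samples, axis=1, ddof=0))    # divides by n
unbiased = np.mean(np.var(samples, axis=1, ddof=1))  # divides by n - 1
# biased lands near (n-1)/n * 4.0 = 3.2; unbiased lands near 4.0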

Why is this used in ML?

Bias-Variance Tradeoff is central to supervised learning. MLE is the foundation for training many models, including **Logistic Regression** and **Neural Networks** (where minimizing Cross-Entropy Loss is equivalent to maximizing likelihood).
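
To make the likelihood/cross-entropy link concrete: for binary labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$, the negative log-likelihood of the data is

$$ -\log L(\theta) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

which is exactly the (summed) cross-entropy loss, so minimizing one maximizes the other.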

Code Implementation


# Scipy: Maximum Likelihood Estimation (MLE)
import numpy as np
from scipy import stats

# Assumed for illustration: sample_data is a 1-D array of observations
sample_data = np.random.normal(loc=5, scale=2, size=100)

# Fitting a Normal Distribution (norm.fit returns the MLEs of mu and sigma)
mu_mle, std_mle = stats.norm.fit(sample_data)

# Estimated Mu: 4.76
# Estimated Std: 1.87

5. Confidence Intervals

A range of values, computed from the sample, constructed so that it contains the true parameter value with a specified probability (the confidence level, e.g., 95%) across repeated samples.

$$ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} $$

"x-bar plus or minus z times (sigma divided by the square root of n)."

Why is this used in ML?

Reporting a single accuracy number (e.g., "90%") is often misleading. Providing a confidence interval (e.g., "90% ± 2%") gives insight into the reliability and stability of the model.
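
For example, the "90% ± 2%" style of report can be obtained from the normal-approximation interval for a proportion (justified by the CLT); a minimal sketch, with the test-set size and accuracy below chosen purely for illustration:

# Normal-approximation 95% CI for a model's test accuracy
import numpy as np

n_test = 1000        # illustrative test-set size
accuracy = 0.90      # illustrative point estimate (fraction correct)

z = 1.96             # z value for 95% confidence
margin = z * np.sqrt(accuracy * (1 - accuracy) / n_test)
ci_low, ci_high = accuracy - margin, accuracy + margin
# Roughly 0.90 +/- 0.019 for these numbers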

Code Implementation


# Scipy: 95% Confidence Interval for the Mean
import numpy as np
from scipy import stats

# Assumed for illustration: sample_data is a 1-D array of observations
sample_data = np.random.normal(loc=5, scale=2, size=100)

mean = np.mean(sample_data)
sem = stats.sem(sample_data)   # standard error of the mean
ci_low, ci_high = stats.norm.interval(0.95, loc=mean, scale=sem)

# 95% CI: [4.39, 5.13]

6. Hypothesis Testing

A procedure to decide whether to reject a null hypothesis ($H_0$) in favor of an alternative hypothesis ($H_1$).

Common Tests

  • Z-test: For large samples or known variance.
  • t-test: For small samples with unknown variance.
  • Chi-square test: For categorical data (e.g., tests of independence or goodness of fit).

Why is this used in ML?

Used for A/B Testing to determine if a new model version is statistically significantly better than the old one. Also used in **Feature Selection** (Chi-square test) to select relevant features.

Code Implementation


# A. One-Sample t-test (Checking if sample mean differs from 5.0)
import numpy as np
from scipy import stats

# Assumed for illustration: sample_data is a 1-D array of observations
sample_data = np.random.normal(loc=5, scale=2, size=100)

t_stat, p_val = stats.ttest_1samp(sample_data, popmean=5.0)
# t-stat: -1.28 | p-value: 0.2025

# B. Chi-square Goodness of Fit (Dice Roll Fairness)
observed = [12, 8, 11, 9, 13, 7]
expected = [10, 10, 10, 10, 10, 10]
chi2_stat, p_val = stats.chisquare(f_obs=observed, f_exp=expected)
# chi2-stat: 2.8 | p-value: 0.7308
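
For the A/B-testing use described above, comparing two model versions reduces to a two-proportion z-test; a minimal sketch using statsmodels, with the counts below chosen purely for illustration:

# C. Two-proportion z-test (A/B test: do models A and B differ in accuracy?)
from statsmodels.stats.proportion import proportions_ztest

correct = [900, 925]    # correct predictions by model A and model B (illustrative)
n_obs = [1000, 1000]    # test examples evaluated per model (illustrative)

z_stat, p_val = proportions_ztest(count=correct, nobs=n_obs)
# If p_val < 0.05, the accuracy difference is statistically significant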
    

7. p-values & Significance

The p-value is the probability of obtaining test results at least as extreme as the results observed, assuming the null hypothesis is true.

  • $p < 0.05$: Reject $H_0$ (Statistically Significant), using the conventional 0.05 threshold.
  • $p \ge 0.05$: Fail to reject $H_0$ (Not Significant at that threshold).

Why is this used in ML?

p-values help us avoid p-hacking and overfitting. In regression analysis (e.g., OLS results in `statsmodels`), we look at the p-values of coefficients to determine if a feature has a significant impact on the target variable.
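
A minimal sketch of reading coefficient p-values from an OLS fit in statsmodels; the synthetic data below is purely illustrative (x2 is deliberately pure noise):

# Statsmodels: OLS coefficient p-values
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)                 # informative feature
x2 = rng.normal(size=200)                 # noise feature, unrelated to y
y = 3.0 * x1 + rng.normal(size=200)       # target depends only on x1

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

print(model.pvalues)   # p-value for x1 is tiny; for x2, typically > 0.05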
