Bayesian Statistics

Probabilistic reasoning and updating beliefs with data.

Bayesian statistics offers a framework for reasoning about uncertainty. Unlike frequentist statistics, which treats parameters as fixed but unknown constants, Bayesian methods treat parameters as random variables with their own probability distributions. This approach is widely used in Machine Learning, especially in probabilistic models and whenever incorporating prior knowledge is essential.

1. Bayes' Theorem

The fundamental rule for updating the probability of a hypothesis ($H$) given some observed evidence ($E$).

$$ P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)} $$

"The probability of Hypothesis given Evidence equals the probability of Evidence given Hypothesis times the probability of Hypothesis, divided by the probability of Evidence."

Why is this used in ML?

It allows us to update our model's beliefs (parameters) as we see more training data. It is the core of **Naive Bayes** classifiers and **Bayesian Neural Networks**.

Code Implementation


# Simple Bayes Update Example
def bayes_update(prior, likelihood, evidence):
    """Posterior = (Likelihood * Prior) / Evidence."""
    return (likelihood * prior) / evidence

# Medical Test Example
# P(Disease) = 0.01          (Prior)
# P(Pos|Disease) = 0.99      (Sensitivity / Likelihood)
# P(Pos|NoDisease) = 0.05    (False positive rate)
# P(Pos) = P(Pos|Disease)P(Disease) + P(Pos|NoDisease)P(NoDisease)
#        = (0.99 * 0.01) + (0.05 * 0.99) = 0.0594 (approx 0.059)

prior = 0.01
likelihood = 0.99
evidence = 0.0594

posterior = bayes_update(prior, likelihood, evidence)
print(posterior)  # Posterior Probability: ~0.1667

2. Priors, Likelihoods, Posteriors

  • Prior $P(H)$: Our initial belief about the hypothesis before seeing data.
  • Likelihood $P(E|H)$: How probable the evidence is, assuming the hypothesis is true.
  • Posterior $P(H|E)$: Our updated belief after considering the evidence.
  • Evidence $P(E)$: The total probability of the evidence (normalizing constant).
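
To make these four quantities concrete, here is a minimal sketch of a conjugate Beta-Bernoulli update; the coin-flip setting and the specific prior parameters are illustrative assumptions, not taken from the text above.

# Conjugate Beta-Bernoulli update for a coin's bias
from scipy.stats import beta

# Prior: Beta(2, 2), a mild belief that the coin is roughly fair
prior_a, prior_b = 2, 2

# Evidence: 8 heads and 2 tails observed
heads, tails = 8, 2

# Likelihood is Bernoulli, so the posterior is again a Beta distribution
post_a, post_b = prior_a + heads, prior_b + tails

print(beta.mean(prior_a, prior_b))  # Prior mean:     0.5
print(beta.mean(post_a, post_b))    # Posterior mean: ~0.714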

3. MAP vs MLE

Two common ways to estimate parameters ($\theta$) of a model:

  • Maximum Likelihood Estimation (MLE): Maximize $P(Data|\theta)$.
    "What parameters make the observed data most probable?"
    Used in: Standard Linear Regression, Neural Networks (minimizing MSE/Cross-Entropy).
  • Maximum A Posteriori (MAP): Maximize $P(\theta|Data) \propto P(Data|\theta) \cdot P(\theta)$.
    "What are the most probable parameters given the data AND our prior beliefs?"
    Used in: Regularized Regression (L2 Regularization is equivalent to MAP with a Gaussian Prior).
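
As a minimal sketch of the difference, assuming a small synthetic regression problem (the noise scale and the prior strength `lam` are arbitrary choices), the MAP solution under a zero-mean Gaussian prior on the weights is exactly ridge regression:

# MLE vs MAP for linear regression weights
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=20)

# MLE: ordinary least squares, maximizes P(Data|w) under Gaussian noise
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP: a zero-mean Gaussian prior on w adds an L2 penalty (ridge regression);
# lam = noise_variance / prior_variance controls how strong the prior is
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_mle)  # close to true_w
print(w_map)  # shrunk slightly toward zero by the prior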

4. Naive Bayes Classifier

A simple but powerful family of classifiers based on applying Bayes' theorem with a strong (naive) assumption: that the features are conditionally independent of each other given the class.

$$ P(y|x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y) $$

Why is this used in ML?

Despite assuming features are conditionally independent (which is rarely true in practice), it works surprisingly well for **Text Classification** (Spam Detection, Sentiment Analysis) and is very fast to train.

Code Implementation


# Scikit-Learn: Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Training Data (Height, Weight) -> Class (0: Child, 1: Adult)
X = np.array([
    [100, 20], [110, 22], [120, 25], # Children
    [160, 60], [170, 70], [180, 80]  # Adults
])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# Predict for a new person (Height 165, Weight 65)
prediction = clf.predict([[165, 65]])[0]
predicted_label = "Adult" if prediction == 1 else "Child"
print(predicted_label)  # Prediction: Adult
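
Because Naive Bayes is a probabilistic classifier, the posterior class probabilities from Bayes' theorem are also available directly; a small follow-up using the same fitted `clf` from the snippet above:

# Posterior probabilities P(Child|x) and P(Adult|x) for the same query point
probs = clf.predict_proba([[165, 65]])[0]
# probs[1] close to 1.0 -> the model is confident the person is an Adult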
    

5. Bayesian Networks

A probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).

Why is this used in ML?

They are used for **Inference** and **Causal Reasoning**. For example, in medical diagnosis, nodes could represent diseases and symptoms, and directed edges represent the probabilistic (often causal) dependencies between them. Bayesian networks handle missing data well and can explain "why" a decision was made.
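
As a minimal sketch (plain Python, with made-up probabilities for a tiny two-node Disease -> Symptom network), exact inference amounts to multiplying the conditional probability tables along the graph and normalizing, which is just Bayes' theorem again:

# Hypothetical two-node network: Disease -> Symptom (illustrative numbers)
p_disease = {True: 0.01, False: 0.99}        # P(Disease)
p_symptom_given = {True: 0.90, False: 0.08}  # P(Symptom=present | Disease)

# Query: P(Disease=True | Symptom=present), by enumerating the joint
joint = {d: p_disease[d] * p_symptom_given[d] for d in (True, False)}
evidence = sum(joint.values())               # P(Symptom=present)
posterior = joint[True] / evidence
print(round(posterior, 3))  # ~0.102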

6. References & Further Reading