Mathematical Foundations for Machine Learning and Computer Vision

As I prepare for deeper research in computer vision and autonomous driving perception, I have come to realize that strong research ideas must be built on a solid mathematical foundation.

Modern machine learning systems rely heavily on linear algebra, probability theory, and optimization. These mathematical tools provide the language used to describe neural networks, attention mechanisms, probabilistic models, and learning algorithms.

This note summarizes the mathematical foundations that frequently appear in machine learning and computer vision.


1. Mathematical Foundations

Mathematics forms the language of modern machine learning and computer vision.
In particular, linear algebra, probability, and optimization are the three core pillars.


1.1 Linear Algebra

Linear algebra describes how high‑dimensional data and neural network parameters behave.
Most deep learning models can be interpreted as a sequence of linear transformations followed by nonlinear functions.


Vector Spaces

A vector space (V) over the field ( \mathbb{R} ) is a set of vectors equipped with two operations:

  • vector addition
  • scalar multiplication

that satisfy the vector space axioms: closure, associativity, commutativity, distributivity, and the existence of identity and inverse elements.

Important concepts include:

Span

Given vectors (v_1, v_2, …, v_k), their span is

\[\text{span}(v_1,...,v_k) = \left\{ \sum_{i=1}^{k} a_i v_i \mid a_i \in \mathbb{R} \right\}\]

It represents all linear combinations of those vectors.

Basis

A basis is a set of vectors that:

  1. are linearly independent
  2. span the entire vector space

The number of vectors in a basis defines the dimension of the space.
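Linear independence and dimension can be checked numerically via matrix rank. A minimal NumPy sketch (the vectors here are illustrative, not from the text):

```python
import numpy as np

# Columns of M are candidate basis vectors in R^3.
M = np.column_stack([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])

# Rank = number of linearly independent columns = dimension of the span.
rank = np.linalg.matrix_rank(M)
print(rank)  # 2: the third column equals the sum of the first two,
             # so the vectors are dependent and span only a plane.
```

Since the rank (2) is smaller than the number of vectors (3), this set is not a basis of ( \mathbb{R}^3 ).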


Linear Transformations

A linear transformation is a mapping

\[T: \mathbb{R}^n \rightarrow \mathbb{R}^m\]

such that

\[T(ax + by) = aT(x) + bT(y)\]

for all scalars (a, b) and vectors (x, y).

Every linear transformation can be represented by a matrix:

\[y = Ax\]

This interpretation is central to neural networks because each layer performs a matrix multiplication.
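As a small illustration of (y = Ax) as a layer, here is a hand-picked matrix applied to an input, followed by a ReLU nonlinearity (the numbers are arbitrary):

```python
import numpy as np

# A "layer": a linear map R^3 -> R^2 given by matrix A, as in y = Ax,
# followed by an elementwise nonlinearity (ReLU).
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

y = A @ x               # linear transformation
h = np.maximum(y, 0.0)  # nonlinear activation

print(y)  # [-1.   6.5]
print(h)  # [ 0.   6.5]
```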


Orthogonality and Projection

Two vectors are orthogonal if

\[x^T y = 0\]

The projection of vector (x) onto vector (v) is

\[\text{proj}_v(x) = \frac{x^T v}{v^T v} v\]

Projection plays an important role in least squares problems, where we find the best approximation of data in a lower‑dimensional subspace.

The least squares solution for

\[Ax = b\]

is

\[x = (A^T A)^{-1} A^T b\]

when (A^T A) is invertible.
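The normal-equations formula can be verified against NumPy's built-in least-squares solver on random data (a sketch, assuming a tall random matrix so that (A^T A) is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))   # tall matrix: 20 equations, 3 unknowns
b = rng.normal(size=20)

# Normal-equations solution x = (A^T A)^{-1} A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solver for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```

A useful sanity check: the residual (b - Ax) is orthogonal to the column space of (A), which is exactly the projection interpretation above.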


Eigenvalues and Eigenvectors

For a square matrix (A), a vector (v) is an eigenvector if

\[Av = \lambda v\]

where ( \lambda ) is the eigenvalue.

Eigen decomposition allows matrices to be written as

\[A = Q \Lambda Q^{-1}\]

where

  • (Q) contains eigenvectors
  • (\Lambda) is a diagonal matrix of eigenvalues.

Eigenvectors describe the principal directions of a transformation, and the eigenvalues give the scaling factors along those directions.

The Rayleigh quotient of a vector (x) with respect to (A) is

\[R(x) = \frac{x^T A x}{x^T x}\]

For a symmetric matrix, its maximum and minimum over nonzero (x) equal the largest and smallest eigenvalues, which makes it useful for analyzing spectral properties.
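These facts can be checked on a small symmetric matrix (chosen here only for illustration):

```python
import numpy as np

# Symmetric matrix: real eigenvalues, orthogonal eigenvectors.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, Q = np.linalg.eigh(A)       # eigenvalues in ascending order: [1., 3.]

# Reconstruct A = Q Lambda Q^{-1}; for orthogonal Q, Q^{-1} = Q^T.
A_rec = Q @ np.diag(lam) @ Q.T

# The Rayleigh quotient at an eigenvector recovers its eigenvalue.
v = Q[:, 1]
R = (v @ A @ v) / (v @ v)

print(lam)  # [1. 3.]
print(R)    # 3.0
```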


Singular Value Decomposition (SVD)

Any matrix (A \in \mathbb{R}^{m \times n}) can be decomposed as

\[A = U \Sigma V^T\]

where

  • (U) and (V) are orthogonal matrices
  • (\Sigma) contains singular values.

SVD is fundamental for low‑rank approximation:

\[A_k = U_k \Sigma_k V_k^T\]

The Eckart–Young theorem states that this provides the optimal rank‑(k) approximation under the Frobenius norm.

This concept appears frequently in:

  • dimensionality reduction
  • compression
  • efficient attention mechanisms.
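The Eckart–Young statement can be verified numerically: the Frobenius error of the rank-(k) truncation equals the norm of the discarded singular values. A sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 truncation A_k = U_k Sigma_k V_k^T.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error equals the tail singular values.
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```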

Matrix Norms

Common matrix norms include:

Frobenius norm

\[||A||_F = \sqrt{\sum_{i,j} a_{ij}^2}\]

Spectral norm

\[||A||_2 = \sigma_{max}(A)\]

where ( \sigma_{max} ) is the largest singular value.

These norms help measure matrix magnitude and stability.
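Both norms are available directly in NumPy; a small example (matrix chosen for easy hand-checking):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

fro = np.linalg.norm(A, "fro")   # sqrt(9 + 16 + 25) = sqrt(50)
spec = np.linalg.norm(A, 2)      # largest singular value, sqrt(45)

# The spectral norm never exceeds the Frobenius norm.
print(fro, spec, spec <= fro)
```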


1.2 Probability and Statistics

Probability theory provides the framework for reasoning about uncertainty in machine learning models.


Random Variables

A random variable (X) assigns a numerical value to the outcome of a stochastic process.

Three key functions describe its distribution:

  • PMF for discrete variables
  • PDF for continuous variables
  • CDF
\[F_X(x) = P(X \le x)\]

Expectation and Variance

The expectation of (X) is

\[\mathbb{E}[X] = \sum_x x P(X=x)\]

or

\[\mathbb{E}[X] = \int x p(x) dx\]

for continuous variables.

Variance measures spread:

\[Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\]

Covariance between two variables is

\[Cov(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]\]
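These definitions can be checked empirically against sample estimates (the distribution parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # E[X] = 2, Var(X) = 9
y = x + rng.normal(size=100_000)                  # correlated with x

print(x.mean())            # close to 2
print(x.var())             # close to 9
print(np.cov(x, y)[0, 1])  # Cov(X, X+N) = Var(X), so close to 9
```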

Multivariate Gaussian

A multivariate Gaussian distribution is defined as

\[p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)\]

where

  • ( \mu ) is the mean vector
  • ( \Sigma ) is the covariance matrix.
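The density formula translates directly into code; a minimal sketch (the helper name `gaussian_pdf` is mine, not a library function):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, written directly from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.zeros(2)
Sigma = np.eye(2)
# At the mean of a standard 2-D Gaussian the density is 1 / (2*pi).
print(gaussian_pdf(mu, mu, Sigma))  # ~0.15915
```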

Bayes’ Rule

Bayes’ theorem connects prior knowledge with observations:

\[P(\theta | x) = \frac{P(x|\theta)P(\theta)}{P(x)}\]

where

  • (P(\theta)) is the prior
  • (P(x|\theta)) is the likelihood
  • (P(\theta|x)) is the posterior.

Maximum Likelihood Estimation

Given data (D = {x_1,…,x_n}), the likelihood is

\[L(\theta) = \prod_{i=1}^{n} p(x_i | \theta)\]

MLE chooses

\[\theta^* = \arg\max_\theta L(\theta)\]

which is usually solved via the log‑likelihood.
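As a concrete instance: for a Gaussian with known variance, maximizing the log-likelihood in the mean recovers the sample mean. A brute-force grid search makes this visible (the grid and data parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=10_000)

# For unit-variance Gaussian data, the log-likelihood in the mean theta
# is (up to constants) -0.5 * sum_i (x_i - theta)^2.
thetas = np.linspace(3.0, 7.0, 2001)
loglik = np.array([-0.5 * np.sum((data - t) ** 2) for t in thetas])
theta_hat = thetas[np.argmax(loglik)]

print(theta_hat, data.mean())  # both close to 5.0
```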


KL Divergence

KL divergence measures the difference between two distributions:

\[D_{KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]

It appears frequently in deep learning loss functions.
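A small sketch of the discrete formula, which also shows that KL divergence is not symmetric (the distributions are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(P || Q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl(p, p))            # 0.0: a distribution has zero divergence from itself
print(kl(p, q), kl(q, p))  # nonzero, and the two directions differ
```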


1.3 Optimization

Training deep learning models requires solving large‑scale optimization problems.

The objective function typically has the form

\[\min_\theta L(\theta)\]

where (L(\theta)) is the loss function.


Gradient Descent

Gradient descent updates parameters as

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

where ( \eta ) is the learning rate.
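The update rule can be sketched on a one-dimensional quadratic loss, where the minimizer is known in closed form (the loss and step count are chosen only for illustration):

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimizer is theta* = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * grad

print(theta)  # ~3.0
```

Each step multiplies the error (theta - 3) by (1 - 2\eta = 0.8), so the iterates converge geometrically.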


Stochastic Gradient Descent

Instead of using the entire dataset,

SGD uses mini‑batches:

\[\theta_{t+1} = \theta_t - \eta \nabla L_i(\theta_t)\]

where (L_i) is the loss evaluated on a randomly sampled mini-batch. This improves computational efficiency, and the gradient noise can also help escape saddle points.


Adam Optimizer

Adam maintains exponential moving averages of the gradient and the squared gradient:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\]

\[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]

After bias correction,

\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\]

parameters are updated as

\[\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\]
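A minimal Adam sketch on the same kind of quadratic toy loss, with bias-corrected moment estimates and the default hyperparameters from the original Adam paper (the loss itself is illustrative):

```python
import numpy as np

# Toy loss L(theta) = (theta - 3)^2, minimized at theta* = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # approaches 3.0
```

Note that early updates have magnitude roughly (\eta), since (\hat{m}_t / \sqrt{\hat{v}_t}) is close to the sign of the gradient.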

Second‑Order Methods

Second‑order methods analyze curvature using the Hessian matrix

\[H = \nabla^2 L(\theta)\]

These methods provide information about

  • curvature
  • saddle points
  • optimization stability.

Numerical Stability

One important technique is the log‑sum‑exp trick:

\[\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i-m}\]

where

\[m = \max_i x_i\]

This prevents overflow in exponential computations.
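The trick is easy to implement and easy to break without it; a sketch (the helper name `logsumexp` mirrors the one in SciPy, but this is a hand-rolled version):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) via the max-shift trick."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

print(logsumexp(x))  # ~1002.4076, computed without overflow
# The naive version np.log(np.sum(np.exp(x))) overflows: exp(1000) is inf.
```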


1.4 Matrix Calculus (Deep Learning Derivatives)

Deep learning training relies on computing gradients of complex functions with respect to millions of parameters.
Matrix calculus provides the mathematical tools for understanding backpropagation.


Gradient of Scalar with Respect to Vector

Let a scalar function be

\[f(x) : \mathbb{R}^n \rightarrow \mathbb{R}\]

The gradient is defined as

\[\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]

The gradient indicates the direction of steepest ascent of the function.


Jacobian Matrix

For a vector function

\[f(x) : \mathbb{R}^n \rightarrow \mathbb{R}^m\]

the Jacobian matrix is

\[J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\]

The Jacobian describes how small changes in the input affect the output vector.
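A finite-difference approximation makes this definition concrete, and is also a standard way to test hand-derived gradients (the function and helper below are illustrative):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central finite-difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

# f : R^2 -> R^2, f(x) = (x1^2, x1*x2); its Jacobian is [[2*x1, 0], [x2, x1]].
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
J = numerical_jacobian(f, np.array([2.0, 3.0]))
print(J)  # approximately [[4. 0.], [3. 2.]]
```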


Chain Rule in Neural Networks

Neural networks are compositions of functions:

\[y = f(g(h(x)))\]

Using the chain rule,

\[\frac{dy}{dx} = \frac{dy}{df} \frac{df}{dg} \frac{dg}{dh} \frac{dh}{dx}\]

Backpropagation efficiently computes these derivatives layer by layer.
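The layer-by-layer computation can be traced by hand on a tiny composition (the functions are chosen only so the derivative is easy to verify):

```python
import numpy as np

# Composition y = f(g(h(x))) with h(x) = 2x, g(u) = u^2, f(v) = sin(v).
x = 0.5
h = 2 * x           # forward pass, innermost function first
g = h ** 2
y = np.sin(g)

dy_dg = np.cos(g)   # backward pass, outermost derivative first
dy_dh = dy_dg * 2 * h
dy_dx = dy_dh * 2   # chain rule: cos(g) * 2h * 2 = 4*cos(1) at x = 0.5

# Cross-check against a finite-difference derivative.
eps = 1e-6
fd = (np.sin((2 * (x + eps)) ** 2) - np.sin((2 * (x - eps)) ** 2)) / (2 * eps)
print(dy_dx, fd)  # both ~ 4*cos(1) = 2.1612
```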


1.5 Information Theory in Machine Learning

Information theory helps quantify uncertainty, information content, and distribution differences.


Entropy

The entropy of a distribution (P) is

\[H(P) = -\sum_x P(x)\log P(x)\]

Entropy measures the uncertainty of a random variable.


Cross Entropy

Cross entropy between two distributions (P) and (Q) is

\[H(P,Q) = -\sum_x P(x)\log Q(x)\]

This is the most commonly used loss in classification models.


Relationship Between Cross‑Entropy and KL Divergence

Cross entropy can be decomposed as

\[H(P,Q) = H(P) + D_{KL}(P||Q)\]

Since (H(P)) is constant with respect to the model parameters, minimizing cross‑entropy is equivalent to minimizing KL divergence.
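The decomposition can be checked numerically on any pair of distributions (the ones below are made up):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

H_p = -np.sum(p * np.log(p))   # entropy H(P)
H_pq = -np.sum(p * np.log(q))  # cross entropy H(P, Q)
kl = np.sum(p * np.log(p / q)) # D_KL(P || Q)

print(np.isclose(H_pq, H_p + kl))  # True: H(P,Q) = H(P) + D_KL(P||Q)
```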


Mutual Information

Mutual information measures the dependency between variables:

\[I(X;Y) = H(X) - H(X|Y)\]

It is often used to analyze representation learning and contrastive learning objectives.


1.6 Optimization Geometry

Understanding the geometry of the loss landscape helps explain why deep networks can be optimized effectively.


Loss Landscape

The training objective

\[L(\theta)\]

defines a high‑dimensional surface over the parameter space.

Key geometric structures include:

  • local minima
  • saddle points
  • flat regions.

Hessian and Curvature

The Hessian matrix

\[H = \nabla^2 L(\theta)\]

describes local curvature.

Eigenvalues of the Hessian indicate the local shape of the loss surface:

  • all eigenvalues positive → local minimum
  • any eigenvalue negative → saddle direction.

Sharp vs Flat Minima

A sharp minimum corresponds to large curvature:

\[\lambda_{max}(H) \text{ is large}\]

while flat minima correspond to smaller curvature.

Empirical studies suggest flat minima often lead to better generalization.


Compute–Generalization Tradeoff

Modern large models are often trained using large‑batch SGD and adaptive optimizers.
Understanding how optimization interacts with generalization remains an active research topic in machine learning.


Closing Remarks

Linear algebra, probability, and optimization together form the mathematical backbone of machine learning.

Understanding these tools is essential not only for reading research papers, but also for designing new algorithms in computer vision and autonomous systems.



