Mathematical Foundations for Machine Learning and Computer Vision

As I prepare for deeper research in computer vision and autonomous driving perception, I have come to realize that strong research ideas must be built on a solid mathematical foundation.

Modern machine learning systems rely heavily on linear algebra, probability theory, and optimization. These mathematical tools provide the language used to describe neural networks, attention mechanisms, probabilistic models, and learning algorithms.

This note summarizes the mathematical foundations that frequently appear in machine learning and computer vision.


1. Mathematical Foundations

Mathematics forms the language of modern machine learning and computer vision.
In particular, linear algebra, probability, and optimization are the three core pillars.


1.1 Linear Algebra

Linear algebra describes how high‑dimensional data and neural network parameters behave.
Most deep learning models can be interpreted as a sequence of linear transformations followed by nonlinear functions.


Vector Spaces

A vector space (V) over the field ( \mathbb{R} ) is a set of vectors equipped with two operations:

  • vector addition
  • scalar multiplication

that satisfy the vector space axioms: closure, associativity, commutativity, distributivity, and the existence of identity and inverse elements.

Important concepts include:

Span

Given vectors (v_1, v_2, …, v_k), their span is

\[\text{span}(v_1,...,v_k) = \left\{ \sum_{i=1}^{k} a_i v_i \mid a_i \in \mathbb{R} \right\}\]

It represents all linear combinations of those vectors.

Basis

A basis is a set of vectors that:

  1. are linearly independent
  2. span the entire vector space

The number of vectors in a basis defines the dimension of the space.
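Linear independence and dimension can be checked numerically via matrix rank. A minimal NumPy sketch (the vectors here are illustrative, not from the text):

```python
import numpy as np

# Columns of M are candidate basis vectors in R^3.
M = np.column_stack([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])

# Rank = number of linearly independent columns = dimension of the span.
rank = np.linalg.matrix_rank(M)
print(rank)  # 2: the third column equals the sum of the first two,
             # so the vectors are dependent and span only a plane.
```

Since the rank (2) is smaller than the number of vectors (3), this set is not a basis of ( \mathbb{R}^3 ).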


Linear Transformations

A linear transformation is a mapping

\[T: \mathbb{R}^n \rightarrow \mathbb{R}^m\]

such that

\[T(ax + by) = aT(x) + bT(y)\]

for all scalars (a, b) and vectors (x, y).

Every linear transformation can be represented by a matrix:

\[y = Ax\]

This interpretation is central to neural networks because each layer performs a matrix multiplication.
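As a small illustration of (y = Ax) as a layer, here is a hand-picked matrix applied to an input, followed by a ReLU nonlinearity (the numbers are arbitrary):

```python
import numpy as np

# A "layer": a linear map R^3 -> R^2 given by matrix A, as in y = Ax,
# followed by an elementwise nonlinearity (ReLU).
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

y = A @ x               # linear transformation
h = np.maximum(y, 0.0)  # nonlinear activation

print(y)  # [-1.   6.5]
print(h)  # [ 0.   6.5]
```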


Orthogonality and Projection

Two vectors are orthogonal if

\[x^T y = 0\]

The projection of vector (x) onto vector (v) is

\[\text{proj}_v(x) = \frac{x^T v}{v^T v} v\]

Projection plays an important role in least squares problems, where we find the best approximation of data in a lower‑dimensional subspace.

The least squares solution for

\[Ax = b\]

is

\[x = (A^T A)^{-1} A^T b\]

when (A^T A) is invertible.
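The normal-equations formula can be verified against NumPy's built-in least-squares solver on random data (a sketch, assuming a tall random matrix so that (A^T A) is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))   # tall matrix: 20 equations, 3 unknowns
b = rng.normal(size=20)

# Normal-equations solution x = (A^T A)^{-1} A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solver for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True
```

A useful sanity check: the residual (b - Ax) is orthogonal to the column space of (A), which is exactly the projection interpretation above.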


Eigenvalues and Eigenvectors

For a square matrix (A), a vector (v) is an eigenvector if

\[Av = \lambda v\]

where ( \lambda ) is the eigenvalue.

Eigen decomposition allows matrices to be written as

\[A = Q \Lambda Q^{-1}\]

where

  • (Q) contains eigenvectors
  • (\Lambda) is a diagonal matrix of eigenvalues.

Eigenvectors describe the principal directions of a transformation, and the eigenvalues give the scaling factors along those directions.

The Rayleigh quotient of a vector (x) with respect to (A) is

\[R(x) = \frac{x^T A x}{x^T x}\]

For a symmetric matrix, its maximum and minimum over nonzero (x) equal the largest and smallest eigenvalues, which makes it useful for analyzing spectral properties.
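These facts can be checked on a small symmetric matrix (chosen here only for illustration):

```python
import numpy as np

# Symmetric matrix: real eigenvalues, orthogonal eigenvectors.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, Q = np.linalg.eigh(A)       # eigenvalues in ascending order: [1., 3.]

# Reconstruct A = Q Lambda Q^{-1}; for orthogonal Q, Q^{-1} = Q^T.
A_rec = Q @ np.diag(lam) @ Q.T

# The Rayleigh quotient at an eigenvector recovers its eigenvalue.
v = Q[:, 1]
R = (v @ A @ v) / (v @ v)

print(lam)  # [1. 3.]
print(R)    # 3.0
```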


Singular Value Decomposition (SVD)

Any matrix (A \in \mathbb{R}^{m \times n}) can be decomposed as

\[A = U \Sigma V^T\]

where

  • (U) and (V) are orthogonal matrices
  • (\Sigma) contains singular values.

SVD is fundamental for low‑rank approximation:

\[A_k = U_k \Sigma_k V_k^T\]

The Eckart–Young theorem states that this provides the optimal rank‑(k) approximation under the Frobenius norm.

This concept appears frequently in:

  • dimensionality reduction
  • compression
  • efficient attention mechanisms.
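The Eckart–Young statement can be verified numerically: the Frobenius error of the rank-(k) truncation equals the norm of the discarded singular values. A sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 truncation A_k = U_k Sigma_k V_k^T.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error equals the tail singular values.
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```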

Matrix Norms

Common matrix norms include:

Frobenius norm

\[||A||_F = \sqrt{\sum_{i,j} a_{ij}^2}\]

Spectral norm

\[||A||_2 = \sigma_{max}(A)\]

where ( \sigma_{max} ) is the largest singular value.

These norms help measure matrix magnitude and stability.
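Both norms are available directly in NumPy; a small example (matrix chosen for easy hand-checking):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

fro = np.linalg.norm(A, "fro")   # sqrt(9 + 16 + 25) = sqrt(50)
spec = np.linalg.norm(A, 2)      # largest singular value, sqrt(45)

# The spectral norm never exceeds the Frobenius norm.
print(fro, spec, spec <= fro)
```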


1.2 Probability and Statistics

Probability theory provides the framework for reasoning about uncertainty in machine learning models.


Random Variables

A random variable (X) assigns a numerical value to the outcome of a stochastic process.

Three key functions describe its distribution:

  • PMF for discrete variables
  • PDF for continuous variables
  • CDF
\[F_X(x) = P(X \le x)\]

Expectation and Variance

The expectation of (X) is

\[\mathbb{E}[X] = \sum_x x P(X=x)\]

or

\[\mathbb{E}[X] = \int x p(x) dx\]

for continuous variables.

Variance measures spread:

\[Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\]

Covariance between two variables is

\[Cov(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]\]
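These definitions can be checked empirically against sample estimates (the distribution parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # E[X] = 2, Var(X) = 9
y = x + rng.normal(size=100_000)                  # correlated with x

print(x.mean())            # close to 2
print(x.var())             # close to 9
print(np.cov(x, y)[0, 1])  # Cov(X, X+N) = Var(X), so close to 9
```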

Multivariate Gaussian

A multivariate Gaussian distribution is defined as

\[p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)\]

where

  • ( \mu ) is the mean vector
  • ( \Sigma ) is the covariance matrix.
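The density formula translates directly into code; a minimal sketch (the helper name `gaussian_pdf` is mine, not a library function):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, written directly from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.zeros(2)
Sigma = np.eye(2)
# At the mean of a standard 2-D Gaussian the density is 1 / (2*pi).
print(gaussian_pdf(mu, mu, Sigma))  # ~0.15915
```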

Bayes’ Rule

Bayes’ theorem connects prior knowledge with observations:

\[P(\theta | x) = \frac{P(x|\theta)P(\theta)}{P(x)}\]

where

  • (P(\theta)) is the prior
  • (P(x|\theta)) is the likelihood
  • (P(\theta|x)) is the posterior.

Maximum Likelihood Estimation

Given data (D = {x_1,…,x_n}), the likelihood is

\[L(\theta) = \prod_{i=1}^{n} p(x_i | \theta)\]

MLE chooses

\[\theta^* = \arg\max_\theta L(\theta)\]

which is usually solved via the log‑likelihood.
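As a concrete instance: for a Gaussian with known variance, maximizing the log-likelihood in the mean recovers the sample mean. A brute-force grid search makes this visible (the grid and data parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=10_000)

# For unit-variance Gaussian data, the log-likelihood in the mean theta
# is (up to constants) -0.5 * sum_i (x_i - theta)^2.
thetas = np.linspace(3.0, 7.0, 2001)
loglik = np.array([-0.5 * np.sum((data - t) ** 2) for t in thetas])
theta_hat = thetas[np.argmax(loglik)]

print(theta_hat, data.mean())  # both close to 5.0
```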


KL Divergence

KL divergence measures the difference between two distributions:

\[D_{KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]

It appears frequently in deep learning loss functions.
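A small sketch of the discrete formula, which also shows that KL divergence is not symmetric (the distributions are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(P || Q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl(p, p))            # 0.0: a distribution has zero divergence from itself
print(kl(p, q), kl(q, p))  # nonzero, and the two directions differ
```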


1.3 Optimization

Training deep learning models requires solving large‑scale optimization problems.

The objective function typically has the form

\[\min_\theta L(\theta)\]

where (L(\theta)) is the loss function.


Gradient Descent

Gradient descent updates parameters as

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

where ( \eta ) is the learning rate.
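The update rule can be sketched on a one-dimensional quadratic loss, where the minimizer is known in closed form (the loss and step count are chosen only for illustration):

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimizer is theta* = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * grad

print(theta)  # ~3.0
```

Each step multiplies the error (theta - 3) by (1 - 2\eta = 0.8), so the iterates converge geometrically.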


Stochastic Gradient Descent

Instead of using the entire dataset,

SGD uses mini‑batches:

\[\theta_{t+1} = \theta_t - \eta \nabla L_i(\theta_t)\]

where (L_i) is the loss evaluated on a randomly sampled mini-batch. This improves computational efficiency, and the gradient noise can also help escape saddle points.


Adam Optimizer

Adam maintains exponential moving averages of the gradient and the squared gradient:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\]

\[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]

After bias correction,

\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\]

parameters are updated as

\[\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\]
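A minimal Adam sketch on the same kind of quadratic toy loss, with bias-corrected moment estimates and the default hyperparameters from the original Adam paper (the loss itself is illustrative):

```python
import numpy as np

# Toy loss L(theta) = (theta - 3)^2, minimized at theta* = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # approaches 3.0
```

Note that early updates have magnitude roughly (\eta), since (\hat{m}_t / \sqrt{\hat{v}_t}) is close to the sign of the gradient.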

Second‑Order Methods

Second‑order methods analyze curvature using the Hessian matrix

\[H = \nabla^2 L(\theta)\]

These methods provide information about

  • curvature
  • saddle points
  • optimization stability.

Numerical Stability

One important technique is the log‑sum‑exp trick:

\[\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i-m}\]

where

\[m = \max_i x_i\]

This prevents overflow in exponential computations.
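The trick is easy to implement and easy to break without it; a sketch (the helper name `logsumexp` mirrors the one in SciPy, but this is a hand-rolled version):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) via the max-shift trick."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

print(logsumexp(x))  # ~1002.4076, computed without overflow
# The naive version np.log(np.sum(np.exp(x))) overflows: exp(1000) is inf.
```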


1.4 Matrix Calculus (Deep Learning Derivatives)

Deep learning training relies on computing gradients of complex functions with respect to millions of parameters.
Matrix calculus provides the mathematical tools for understanding backpropagation.


Gradient of Scalar with Respect to Vector

Let a scalar function be

\[f(x) : \mathbb{R}^n \rightarrow \mathbb{R}\]

The gradient is defined as

\[\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]

The gradient indicates the direction of steepest ascent of the function.


Jacobian Matrix

For a vector function

\[f(x) : \mathbb{R}^n \rightarrow \mathbb{R}^m\]

the Jacobian matrix is

\[J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\]

The Jacobian describes how small changes in the input affect the output vector.
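A finite-difference approximation makes this definition concrete, and is also a standard way to test hand-derived gradients (the function and helper below are illustrative):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central finite-difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

# f : R^2 -> R^2, f(x) = (x1^2, x1*x2); its Jacobian is [[2*x1, 0], [x2, x1]].
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
J = numerical_jacobian(f, np.array([2.0, 3.0]))
print(J)  # approximately [[4. 0.], [3. 2.]]
```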


Chain Rule in Neural Networks

Neural networks are compositions of functions:

\[y = f(g(h(x)))\]

Using the chain rule,

\[\frac{dy}{dx} = \frac{dy}{df} \frac{df}{dg} \frac{dg}{dh} \frac{dh}{dx}\]

Backpropagation efficiently computes these derivatives layer by layer.
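The layer-by-layer computation can be traced by hand on a tiny composition (the functions are chosen only so the derivative is easy to verify):

```python
import numpy as np

# Composition y = f(g(h(x))) with h(x) = 2x, g(u) = u^2, f(v) = sin(v).
x = 0.5
h = 2 * x           # forward pass, innermost function first
g = h ** 2
y = np.sin(g)

dy_dg = np.cos(g)   # backward pass, outermost derivative first
dy_dh = dy_dg * 2 * h
dy_dx = dy_dh * 2   # chain rule: cos(g) * 2h * 2 = 4*cos(1) at x = 0.5

# Cross-check against a finite-difference derivative.
eps = 1e-6
fd = (np.sin((2 * (x + eps)) ** 2) - np.sin((2 * (x - eps)) ** 2)) / (2 * eps)
print(dy_dx, fd)  # both ~ 4*cos(1) = 2.1612
```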


1.5 Information Theory in Machine Learning

Information theory helps quantify uncertainty, information content, and distribution differences.


Entropy

The entropy of a distribution (P) is

\[H(P) = -\sum_x P(x)\log P(x)\]

Entropy measures the uncertainty of a random variable.


Cross Entropy

Cross entropy between two distributions (P) and (Q) is

\[H(P,Q) = -\sum_x P(x)\log Q(x)\]

This is the most commonly used loss in classification models.


Relationship Between Cross‑Entropy and KL Divergence

Cross entropy can be decomposed as

\[H(P,Q) = H(P) + D_{KL}(P||Q)\]

Since (H(P)) is constant with respect to the model parameters, minimizing cross‑entropy is equivalent to minimizing KL divergence.
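The decomposition can be checked numerically on any pair of distributions (the ones below are made up):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

H_p = -np.sum(p * np.log(p))   # entropy H(P)
H_pq = -np.sum(p * np.log(q))  # cross entropy H(P, Q)
kl = np.sum(p * np.log(p / q)) # D_KL(P || Q)

print(np.isclose(H_pq, H_p + kl))  # True: H(P,Q) = H(P) + D_KL(P||Q)
```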


Mutual Information

Mutual information measures the dependency between variables:

\[I(X;Y) = H(X) - H(X|Y)\]

It is often used to analyze representation learning and contrastive learning objectives.


1.6 Optimization Geometry

Understanding the geometry of the loss landscape helps explain why deep networks can be optimized effectively.


Loss Landscape

The training objective

\[L(\theta)\]

defines a high‑dimensional surface over the parameter space.

Key geometric structures include:

  • local minima
  • saddle points
  • flat regions.

Hessian and Curvature

The Hessian matrix

\[H = \nabla^2 L(\theta)\]

describes local curvature.

Eigenvalues of the Hessian indicate the local shape of the loss surface:

  • all eigenvalues positive → local minimum
  • any eigenvalue negative → saddle direction.

Sharp vs Flat Minima

A sharp minimum corresponds to large curvature:

\[\lambda_{max}(H) \text{ is large}\]

while flat minima correspond to smaller curvature.

Empirical studies suggest flat minima often lead to better generalization.


Compute–Generalization Tradeoff

Modern large models are often trained using large‑batch SGD and adaptive optimizers.
Understanding how optimization interacts with generalization remains an active research topic in machine learning.


Closing Remarks

Linear algebra, probability, and optimization together form the mathematical backbone of machine learning.

Understanding these tools is essential not only for reading research papers, but also for designing new algorithms in computer vision and autonomous systems.



