Machine Learning Theory
As I prepare for PhD-level research in computer vision and autonomous driving, I have come to realize that building high-performing models is not sufficient. What ultimately matters is a deeper understanding of why models generalize, when they fail, and what principles govern their behavior.
Machine learning theory provides the conceptual and mathematical foundation for reasoning about generalization, representation quality, uncertainty, and robustness under distribution shift. These questions are particularly critical for autonomous driving systems, where deployment environments are dynamic, noisy, and often mismatched with training data.
This note summarizes three core areas of machine learning theory that I aim to understand in depth:
- statistical learning theory
- representation learning
- generative modeling
2. Machine Learning Theory
PhD-level research requires more than the ability to train models—it demands a principled understanding of why they work, how they represent information, and how they behave under uncertainty and distribution shift.
These questions are closely tied to my research interests in autonomous driving perception, collaborative occupancy prediction, and robust visual understanding in dynamic environments.
2.1 Statistical Learning Theory
Statistical learning theory studies the relationship between training performance and test performance.
At its core lies a fundamental question:
Why does a model that fits finite training data sometimes generalize well to unseen data?
This area provides the theoretical framework for analyzing overfitting, regularization, model capacity, and robustness under distribution shift.
Empirical Risk vs Expected Risk
In supervised learning, a model is trained by minimizing the empirical risk, which is the average loss on the training set:
\[\hat{R}(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),y_i)\]
However, what we actually care about is the expected risk over the true data distribution:
\[R(f)=\mathbb{E}_{(x,y)\sim \mathcal{D}}[\ell(f(x),y)]\]
A model can achieve very low empirical risk while still having poor expected risk if it memorizes the training data.
The gap between these two quantities characterizes generalization.
For real-world perception systems, this distinction is critical: a model may perform well on a benchmark but fail in real traffic scenes with different lighting, weather, or sensor configurations.
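The gap between the two risks can be made concrete with a minimal numpy sketch (the sine target, noise level, and polynomial degree are all illustrative assumptions): a high-degree polynomial drives the empirical risk on a small training set far below its Monte Carlo estimate of the expected risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Small noisy training set; a large held-out sample stands in for the
# true data distribution D.
x_train = rng.uniform(0, 1, 20)
y_train = true_fn(x_train) + rng.normal(0, 0.3, 20)
x_test = rng.uniform(0, 1, 100_000)
y_test = true_fn(x_test) + rng.normal(0, 0.3, 100_000)

# A high-degree polynomial can drive training loss very low by fitting noise.
f = np.poly1d(np.polyfit(x_train, y_train, deg=12))

def risk(f, x, y):
    """Average squared loss of f on the sample (x, y)."""
    return float(np.mean((f(x) - y) ** 2))

emp_risk = risk(f, x_train, y_train)   # \hat{R}(f): empirical risk
exp_risk = risk(f, x_test, y_test)     # Monte Carlo estimate of R(f)
gap = exp_risk - emp_risk              # the generalization gap
```

On this toy problem the gap is strictly positive: the model "performs well on the benchmark" it was fit on while its risk under the distribution is much higher.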
Bias–Variance Tradeoff
A classic theoretical idea is the bias–variance tradeoff.
- Bias reflects error caused by overly restrictive assumptions in the model
- Variance reflects sensitivity to fluctuations in the training data
A model with high bias may underfit, while a model with high variance may overfit. The challenge is to find a balance that yields good generalization.
This idea remains useful even in deep learning, although modern neural networks often behave in more complicated ways than classical theory predicts; heavily overparameterized networks, for example, can exhibit "double descent," where test error improves again beyond the interpolation threshold. Nevertheless, the bias–variance perspective remains a useful conceptual tool for reasoning about model complexity and data efficiency.
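The tradeoff can be measured directly by refitting a model on many resampled training sets. The sketch below (sine target, noise level, and polynomial degrees are all illustrative assumptions) estimates bias² from the averaged prediction and variance from the spread across refits:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    return np.sin(2 * np.pi * x)

grid = np.linspace(0.2, 0.8, 31)  # points where the fitted models are evaluated

def bias_variance(degree, n_trials=200, n_train=30, noise=0.2):
    """Refit a polynomial of the given degree on many resampled training
    sets, then decompose its error on the grid into bias^2 and variance."""
    preds = np.empty((n_trials, grid.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise, n_train)
        preds[t] = np.poly1d(np.polyfit(x, y, degree))(grid)
    bias_sq = float(np.mean((preds.mean(axis=0) - true_fn(grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

b_low, v_low = bias_variance(degree=1)    # rigid model: underfits
b_high, v_high = bias_variance(degree=9)  # flexible model: fit varies per sample
```

The rigid linear model has the larger bias², while the flexible polynomial has the larger variance, matching the tradeoff described above.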
Overfitting and Regularization
Overfitting occurs when a model captures noise or accidental patterns in the training data instead of learning robust structure. Regularization techniques aim to reduce this problem.
Common forms of regularization include:
- weight decay
- dropout
- early stopping
- data augmentation
- label smoothing
- architectural constraints
From a theoretical perspective, regularization reduces effective model capacity or encourages smoother solutions. In practice, regularization is indispensable for building perception models that remain reliable outside a narrow training distribution.
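Weight decay is the easiest of these to see in closed form. The sketch below (an overparameterized linear regression with made-up dimensions) solves the L2-penalized least-squares problem directly and shows that a stronger penalty shrinks the weight norm, i.e. lowers effective capacity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Overparameterized setting: 50 features but only 20 samples, so an
# (almost) unregularized fit is free to use very large weights.
X = rng.normal(size=(20, 50))
w_true = np.zeros(50)
w_true[:3] = 1.0
y = X @ w_true + rng.normal(0, 0.1, 20)

def ridge(X, y, lam):
    """L2-regularized least squares: argmin_w ||Xw - y||^2 + lam * ||w||^2,
    solved via the normal equations (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_weak = ridge(X, y, lam=1e-3)    # almost no weight decay
w_strong = ridge(X, y, lam=10.0)  # strong weight decay
# The penalty shrinks the solution: ||w_strong|| < ||w_weak||.
```

The ridge solution's norm is monotonically decreasing in the penalty strength, which is the sense in which weight decay "reduces effective model capacity."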
VC Dimension
The VC dimension provides a conceptual way to measure the expressive power of a hypothesis class. Roughly speaking, it quantifies how complex a model family is by asking how many points it can shatter, that is, label in every possible way.
Although VC dimension is most useful in simplified settings and is rarely computed directly for modern deep networks, it remains important because it introduces the broader idea that:
generalization depends not only on fitting data, but also on the capacity of the function class being learned.
In practice, the value of VC theory is largely conceptual. It builds the foundation for later notions of model complexity and generalization bounds.
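Shattering can be checked by brute force for a tiny hypothesis class. The sketch below (one-dimensional threshold classifiers, chosen purely for illustration) verifies that this class shatters any two points but no set of three, so its VC dimension is 2:

```python
import numpy as np

def can_shatter(points):
    """Brute-force shattering check for 1-D threshold classifiers,
    H = {x -> [x >= t]} together with {x -> [x <= t]}:
    can H realize every possible labeling of the given points?"""
    points = np.sort(np.asarray(points, dtype=float))
    mids = (points[:-1] + points[1:]) / 2
    # One threshold per "gap" between points, plus one on each side.
    thresholds = np.concatenate([[points[0] - 1.0], mids, [points[-1] + 1.0]])
    realizable = set()
    for t in thresholds:
        realizable.add(tuple((points >= t).astype(int)))
        realizable.add(tuple((points <= t).astype(int)))
    return len(realizable) == 2 ** len(points)

# Thresholds shatter any 2 points, but the labeling (1, 0, 1) is
# unreachable for 3 points, so the VC dimension of this class is 2.
```

For example, `can_shatter([0.0, 1.0])` succeeds while `can_shatter([0.0, 1.0, 2.0])` fails, because no single threshold can label the middle point differently from both neighbors.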
Rademacher Complexity and Model Capacity
A more flexible way to think about capacity is Rademacher complexity, which measures how well a function class can fit random noise. If a model class can align well with random signs, it has high capacity and may generalize poorly unless controlled by additional structure.
Compared with VC dimension, Rademacher complexity is often better suited to modern statistical learning analysis because it can depend on the data distribution and not only on the abstract hypothesis class.
This matters in machine learning research because modern models are extremely expressive, yet many still generalize well in practice. Understanding capacity through more refined tools helps bridge the gap between theory and deep learning behavior.
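For norm-bounded linear predictors the empirical Rademacher complexity has a closed form (the supremum over the unit ball of an inner product is a vector norm), which makes it easy to estimate by Monte Carlo. The sketch below uses made-up Gaussian data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def rademacher_linear(X, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> <w, x> : ||w|| <= 1} on the sample X.

    For this class, sup_w (1/n) sum_i sigma_i <w, x_i>
    equals ||sum_i sigma_i x_i|| / n.
    """
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # random sign vectors
    return float(np.mean(np.linalg.norm(sigma @ X, axis=1)) / n)

X = rng.normal(size=(50, 10))
small_sample = rademacher_linear(X)                 # n = 50
large_sample = rademacher_linear(np.vstack([X] * 8))  # 8x more samples
# The complexity shrinks roughly like 1/sqrt(n) as the sample grows,
# which is exactly how capacity-based generalization bounds tighten.
```

Because the estimate depends on the actual sample `X`, it illustrates the data-dependence that distinguishes Rademacher complexity from VC dimension.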
Generalization Bounds
Generalization bounds aim to upper-bound the difference between training error and test error. They typically depend on:
- number of training samples
- model capacity
- confidence level
- properties of the hypothesis class
A key intuition is that larger datasets and lower effective capacity lead to tighter guarantees.
Although many classical bounds are too loose to directly predict the success of large neural networks, they still provide valuable structure for thinking about why some models are more robust than others. In research, the exact numerical bound is often less important than the qualitative lesson: good generalization requires controlling complexity relative to available data.
Distribution Shift
One of the most important topics for real-world machine learning is distribution shift, where training and deployment data do not follow the same distribution. This is especially relevant in autonomous driving because the test environment may differ due to city layout, weather, sensor calibration, traffic density, or rare events.
Covariate Shift
Under covariate shift, the input distribution changes:
\[P_{train}(x) \neq P_{test}(x)\]
while the conditional label mechanism is assumed to remain similar:
\[P(y\mid x) \text{ is approximately unchanged}\]
An example would be training mostly on daytime scenes and deploying in snowy or nighttime conditions. The visual appearance changes significantly, even if the semantic meaning of objects remains similar.
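When the density ratio between test and train inputs is known, the training loss can be reweighted to estimate test-time risk. The toy example below (Gaussian input distributions, a shared labeling rule, and a deliberately crude predictor, all assumed for illustration) applies importance weights $w(x) = p_{test}(x)/p_{train}(x)$:

```python
import numpy as np

rng = np.random.default_rng(4)

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out to stay dependency-free."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Covariate shift: train inputs ~ N(0, 1), test inputs ~ N(1, 1);
# the labeling mechanism y = x^2 (i.e. P(y|x)) is shared.
x_train = rng.normal(0.0, 1.0, 50_000)
x_test = rng.normal(1.0, 1.0, 50_000)
f = lambda x: 1.0                       # a crude constant predictor
loss = lambda x: (f(x) - x ** 2) ** 2   # squared error against y = x^2

naive = np.mean(loss(x_train))          # empirical risk under P_train
# Importance weights w(x) = p_test(x) / p_train(x) reweight training points.
w = gauss_pdf(x_train, 1.0, 1.0) / gauss_pdf(x_train, 0.0, 1.0)
reweighted = np.mean(w * loss(x_train))
target = np.mean(loss(x_test))          # ground truth: risk under P_test
```

The naive training-set estimate is badly biased under the shift, while the reweighted estimate lands much closer to the true test risk; in practice, of course, the density ratio itself must be estimated.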
Label Shift
Under label shift, the class prior changes:
\[P_{train}(y) \neq P_{test}(y)\]
while the class-conditional input distribution is assumed more stable. For example, the proportion of pedestrians, cyclists, or large trucks may differ across environments. This can affect calibration and prediction behavior.
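Under this assumption a classifier's posteriors can be corrected analytically: by Bayes' rule with $p(x\mid y)$ fixed, $p_{test}(y\mid x) \propto p_{train}(y\mid x)\, p_{test}(y)/p_{train}(y)$. A minimal sketch, with hypothetical priors chosen for illustration:

```python
import numpy as np

def prior_correct(probs, train_prior, test_prior):
    """Re-weight predicted posteriors for label shift:
    p_test(y|x) is proportional to p_train(y|x) * p_test(y) / p_train(y),
    assuming the class-conditional p(x|y) is unchanged."""
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(test_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    adjusted = probs * w
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# Hypothetical scenario: the second class (say, pedestrians) was rare at
# training time but is common at deployment. A 50/50 prediction is pulled
# strongly toward the now-frequent class.
p = prior_correct([0.5, 0.5], train_prior=[0.9, 0.1], test_prior=[0.5, 0.5])
# p[1] > 0.5 after correction.
```

This also shows why uncorrected models can be miscalibrated under label shift: the same softmax output encodes different posteriors under different priors.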
Out-of-Distribution Detection
A deployed model should also identify inputs that lie outside the training distribution. This is the problem of out-of-distribution (OOD) detection.
OOD detection is crucial for safety-critical systems because model confidence can be misleading. A neural network may assign high confidence to unfamiliar or abnormal scenes even when it lacks the knowledge required for correct prediction.
For autonomous driving, these questions are not abstract theory. They directly affect whether a perception model can remain reliable under environmental change, rare events, or sensor anomalies.
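A common starting point for OOD detection is the maximum softmax probability baseline: flag inputs whose top softmax score is low. The sketch below uses hypothetical logits, and it is worth stressing (as noted above) that this baseline can itself be fooled by confidently wrong predictions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Maximum softmax probability: a simple OOD-detection baseline.
    Low scores indicate the model is spreading mass over many classes."""
    return softmax(logits).max(axis=-1)

# Hypothetical logits: a confident in-distribution prediction versus a
# flat, uncertain output on an unfamiliar scene.
in_dist = np.array([8.0, 0.5, 0.3])
ood = np.array([1.1, 1.0, 0.9])
flagged = msp_score(ood) < 0.5   # a simple threshold rule
```

In a deployed perception stack such a score would gate downstream decisions, e.g. deferring to a fallback behavior when the detector fires.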
Research Perspective. In practice, modern deep networks operate in regimes where classical generalization theory is incomplete. Bridging this gap—especially under distribution shift—remains an open problem. For autonomous driving, a key direction is to design models and training objectives that are inherently robust to domain shift, rather than relying solely on post-hoc adaptation.
2.2 Representation Learning
Modern deep learning owes much of its success to the ability to learn effective representations.
A representation is not just a feature vector; it is a structured encoding of information that makes downstream tasks easier.
Representation learning addresses questions such as:
- What makes a feature useful?
- What kinds of invariance should a model learn?
- How can supervision be reduced while still learning semantic structure?
- Why do some architectures produce better representations than others?
These questions are central to computer vision, perception, and multimodal learning.
Invariance and Equivariance in Feature Representations
A good representation often needs to ignore irrelevant variation while preserving important structure.
- Invariance means the representation stays stable under certain transformations
- Equivariance means the representation changes in a predictable way under transformations
For example:
- image classification often benefits from translation invariance
- pose estimation or dense perception may require equivariance to spatial structure
- multi-view perception may require geometry-aware consistency instead of full invariance
This distinction is very important in autonomous driving. A model should not be overly sensitive to irrelevant pixel-level changes, but it must remain sensitive to spatial layout, object motion, and geometric relationships.
Contrastive Learning and InfoNCE Loss
A major development in modern representation learning is contrastive learning. Its core idea is to bring similar samples closer in feature space while pushing dissimilar samples apart.
A common objective is the InfoNCE loss, which encourages positive pairs to have high similarity relative to negatives. This framework has been highly influential in self-supervised learning.
Conceptually, contrastive learning tells us that useful features can emerge even without manual labels, as long as the learning objective encourages semantic consistency across views or augmentations.
For visual perception, this idea helps explain why models can learn strong structure from large unlabeled datasets. It also connects naturally to cross-view consistency, temporal correspondence, and multimodal alignment.
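The InfoNCE objective itself is short enough to write out: it is a cross-entropy over similarity scores in which each anchor's matching view is the "correct class" among in-batch negatives. A minimal numpy sketch (random features and a temperature of 0.1 are illustrative choices):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE on L2-normalized features: for each anchor i, the positive
    pair (i, i) must out-score the in-batch negatives (i, j != i)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # cosine similarities, sharpened
    # Row-wise cross-entropy with the diagonal as the target "label".
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(5)
z = rng.normal(size=(8, 16))
# Two "views" that are nearly identical (strong augmentation-consistency)
# versus positives that are unrelated to their anchors.
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))
random_pairs = info_nce(z, rng.normal(size=(8, 16)))
```

The loss is near zero when views are consistent and close to $\log N$ when positives carry no information about their anchors, which is exactly the sense in which InfoNCE rewards semantic consistency across views.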
Self-Supervised Learning
Self-supervised learning has become a major paradigm for learning representations from raw data. Instead of using human-provided labels, the model constructs supervision signals from the data itself.
Important examples include:
- MAE (Masked Autoencoders), which learn by reconstructing masked inputs
- DINO, which learns through self-distillation and feature consistency
- contrastive frameworks such as SimCLR and MoCo
- predictive learning methods that infer missing views, patches, or future states
The theoretical importance of self-supervised learning is that it suggests semantic structure can be discovered through reconstruction, consistency, or prediction objectives.
For perception research, this matters because large-scale labeled 3D or autonomous driving data is expensive, while unlabeled sensor data is abundant. Understanding why self-supervised objectives produce transferable features is therefore highly valuable.
Information Bottleneck Theory
The Information Bottleneck perspective asks how a representation can retain task-relevant information while discarding irrelevant details.
In abstract form, one seeks a representation $Z$ that:
- preserves information about the target $Y$
- compresses unnecessary information from the input $X$
This view is conceptually appealing because it links representation quality to information selection. A good model should not simply memorize everything; it should preserve what matters for decision making.
Although the exact role of Information Bottleneck theory in deep learning remains debated, it provides a useful language for discussing compression, abstraction, and task-aware representation formation.
For my own research interests, this idea is especially relevant when thinking about token sparsification, communication-aware feature selection, and memory compression.
Inductive Biases in CNNs and Transformers
Different architectures succeed not only because of parameter count, but also because of their inductive biases.
CNNs encode strong priors such as:
- locality
- translation equivariance
- hierarchical spatial composition
Transformers, in contrast, are more flexible and rely on attention-based interaction, but often need larger data or stronger training strategies because they impose weaker built-in priors.
Understanding inductive bias helps explain why:
- CNNs can be data-efficient on vision tasks
- Transformers can model long-range interactions and flexible token relationships
- hybrid models may sometimes work better than purely convolutional or purely attention-based designs
This is particularly important for perception tasks, where geometric structure, locality, and multi-scale reasoning all matter. Theoretical understanding of inductive bias helps explain why some architectures transfer better, scale better, or behave more robustly under real-world complexity.
Research Perspective. A central question is what constitutes a good representation for downstream reasoning in dynamic 3D environments. In my research, this connects to token sparsification, communication-aware feature selection, and spatio-temporal memory, where representations must be both compact and task-relevant under bandwidth constraints.
2.3 Generative Modeling
Generative modeling aims to learn the underlying data distribution, rather than only mapping inputs to labels. This area has become increasingly central in modern machine learning because it supports synthesis, reconstruction, uncertainty estimation, and future prediction.
For computer vision and autonomous driving, generative methods are becoming highly relevant for:
- scene completion
- future world prediction
- uncertainty-aware perception
- generative occupancy prediction
- simulation and scenario generation
Variational Inference and ELBO
Many probabilistic generative models involve latent variables that are difficult to infer exactly. Variational inference approximates these intractable posterior distributions using a simpler family of distributions.
A central object is the Evidence Lower Bound (ELBO), which provides a tractable objective for learning.
Optimizing the ELBO balances two goals:
- reconstructing or explaining the observed data well
- keeping the approximate posterior close to a prior distribution
This is one of the most important ideas in modern probabilistic deep learning because it turns difficult Bayesian inference into an optimization problem.
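The bound itself can be stated compactly. For an observation $x$, latent variable $z$, and approximate posterior $q(z\mid x)$:
\[\log p(x) \;\ge\; \mathbb{E}_{q(z\mid x)}\big[\log p(x\mid z)\big] \;-\; \mathrm{KL}\big(q(z\mid x)\,\|\,p(z)\big)\]
The first term is the reconstruction objective and the second is the regularizer pulling the approximate posterior toward the prior, matching the two goals listed above; maximizing the right-hand side tightens a lower bound on the intractable log-evidence $\log p(x)$.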
Variational Autoencoders (VAEs)
A Variational Autoencoder combines neural networks with latent-variable probabilistic modeling. The encoder predicts a distribution over latent variables, and the decoder reconstructs data from those latent variables.
VAEs are important because they provide a principled framework for:
- representation learning
- uncertainty-aware generation
- latent-space reasoning
- probabilistic reconstruction
Although VAEs may generate blurrier samples than some newer models, their theoretical clarity makes them foundational. For research, they are valuable because they connect latent representations, probabilistic inference, and optimization in a unified framework.
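Two pieces of the VAE make this framework concrete: the reparameterization trick, which keeps sampling differentiable, and the closed-form KL term for a diagonal Gaussian posterior. A minimal sketch (the two-dimensional latent values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so the sample is a differentiable function of (mu, log_var)."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian posterior
    q = N(mu, diag(exp(log_var)))."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# When the encoder outputs the prior itself, the KL term vanishes...
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ...and it grows as the posterior moves away from the prior.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
z = reparameterize(np.zeros(2), np.zeros(2))
```

In a full VAE these pieces sit inside the ELBO: the decoder's reconstruction loss plus this KL term is exactly the objective described above.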
Diffusion Models and Score Matching
Diffusion models have become one of the most influential classes of generative models in recent years. Their core idea is to gradually add noise to data and then learn to reverse this process.
This connects closely to score matching, where the model learns the gradient of the log-density with respect to the data. In practice, diffusion models have shown remarkable performance in image generation, video generation, and increasingly in world modeling tasks.
From a theoretical perspective, diffusion models are interesting because they connect stochastic processes, denoising, likelihood-related objectives, and geometric structure in data space.
For autonomous driving, diffusion-based approaches are becoming relevant for trajectory generation, scene prediction, and generative occupancy forecasting.
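The forward (noising) half of a DDPM-style diffusion model has a convenient closed form: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. The sketch below uses an assumed linear noise schedule and a toy constant signal to show the data being destroyed step by step; the learned reverse (denoising) model is what an actual diffusion model trains:

```python
import numpy as np

rng = np.random.default_rng(7)

# DDPM-style linear noise schedule (the endpoints are assumed values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    """Closed-form forward process:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(10_000)        # a toy "image": constant unit signal
early = q_sample(x0, 10)    # early step: mostly signal
late = q_sample(x0, T - 1)  # final step: essentially pure noise
```

By the last step the sample is statistically indistinguishable from a standard Gaussian, which is why generation can start from pure noise and run the learned reverse process.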
Energy-Based Models (EBMs)
Energy-based models define an energy function over inputs, where lower energy corresponds to more plausible configurations. Instead of directly normalizing a probability distribution, the model learns a scalar landscape over possible states.
EBMs are theoretically attractive because they are flexible and expressive. They can represent complex dependencies without requiring an explicit normalized density in simple form. At the same time, their training and sampling procedures can be challenging.
Energy-based thinking is also useful beyond classical EBMs. It encourages a general view of learning as shaping a compatibility landscape over structured outputs. This perspective can be relevant for scene reasoning and structured prediction.
Bayesian Deep Learning for Uncertainty Estimation
A major limitation of standard deep networks is that their confidence estimates are often poorly calibrated. Bayesian deep learning aims to better represent uncertainty by reasoning over distributions of model parameters or predictions.
This matters greatly for safety-critical systems. In autonomous driving, it is not enough to predict occupancy or object states; the system should also know when it is uncertain.
Important motivations for uncertainty estimation include:
- detecting ambiguous scenes
- identifying unreliable predictions
- improving risk-aware planning
- supporting robust human trust in model outputs
From a theoretical perspective, Bayesian methods provide a principled way to separate data uncertainty from model uncertainty, even though practical approximations are often necessary.
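One practical approximation to this separation is an ensemble: members agree where data constrains them and disagree where it does not, so their spread acts as a model-uncertainty signal. A toy sketch (bootstrap-resampled polynomial fits stand in for independently trained networks; all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy "deep ensemble": several models trained on bootstrap resamples;
# disagreement between members approximates model (epistemic) uncertainty.
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 30)

models = []
for _ in range(10):
    idx = rng.integers(0, 30, 30)  # bootstrap resample of the training set
    models.append(np.poly1d(np.polyfit(x_train[idx], y_train[idx], 5)))

def predict(x):
    """Ensemble mean and spread; the spread is the uncertainty signal."""
    preds = np.stack([m(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

_, std_in = predict(np.array([0.0]))   # inside the training range
_, std_out = predict(np.array([3.0]))  # far outside it
```

The ensemble's disagreement explodes off the training distribution, which is precisely the behavior a risk-aware planner would want to consume.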
Research Perspective. Generative modeling offers a promising path toward world modeling and uncertainty-aware perception. A key challenge is integrating generative objectives with structured prediction (e.g., occupancy grids) while maintaining efficiency and controllability in real-time systems.
Closing Remarks
Machine learning theory transforms model development from empirical engineering into principled research. Rather than memorizing formal definitions, the goal is to build a working understanding of:
- why models generalize
- how useful representations emerge
- how uncertainty should be modeled
- why some architectures work better than others
- how systems behave under distribution shift in the real world
These questions are especially meaningful for autonomous driving, where perception systems must operate in open, changing, and safety-critical environments.
As I continue preparing for PhD research, I view statistical learning theory, representation learning, and generative modeling as complementary foundations. Together, they provide the tools to reason about generalization, representation, and uncertainty—three pillars that are essential for building reliable perception systems in real-world environments.