As I prepare for PhD-level research in computer vision, autonomous driving perception, and embodied perception, I want to build a systematic foundation in machine learning.

Machine learning is the bridge between mathematical theory and modern perception systems. It provides the concepts needed to understand why models learn, how they generalize, how uncertainty is modeled, how representations are formed, and why deep models succeed or fail under distribution shift.

This note is my long-term study record for machine learning foundations. It is organized around several major sources and themes:

Andrew Ng’s machine learning course;
Pattern Recognition and Machine Learning by Christopher Bishop;
statistical learning theory;
probabilistic modeling;
representation learning;
connections to computer vision, autonomous driving, and embodied perception.

Roadmap

This note is organized into the following chapters:

1. Core Machine Learning: supervised learning, empirical risk minimization, regression, classification, bias–variance tradeoff, regularization, and model selection.
2. Andrew Ng Machine Learning Course: a practical entry point covering regression, classification, neural networks, SVMs, clustering, anomaly detection, recommender systems, and applied ML workflows.
3. Pattern Recognition and Machine Learning: a probabilistic view of machine learning based on Bishop’s PRML, including Bayesian decision theory, linear models, graphical models, mixture models, EM, variational inference, and kernel methods.
4. Statistical Learning Theory: generalization, VC dimension, Rademacher complexity, uniform convergence, empirical risk, expected risk, and distribution shift.
5. Representation Learning: feature learning, invariance, equivariance, contrastive learning, self-supervised learning, information bottleneck, and token-based representations.
6. Machine Learning for Computer Vision and Autonomous Driving: how machine learning principles appear in dense prediction, semantic occupancy prediction, collaborative perception, uncertainty estimation, and world models.
7. Personal Study Plan: a three-layer plan covering practical, probabilistic, and theoretical machine learning.

1. Core Machine Learning

Machine learning studies how algorithms improve their behavior from data. A learning system usually contains three key elements:

a model class that defines possible functions;
an objective function that defines what good performance means;
an optimization algorithm that searches for good parameters.

In supervised learning, we are given a training dataset:

\[\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N},\]

where \(x_i\) is an input and \(y_i\) is the corresponding label. The goal is to learn a function:

\[f_\theta: \mathcal{X} \rightarrow \mathcal{Y}\]

that predicts \(y\) from \(x\) and generalizes to unseen data.

1.1 Empirical Risk Minimization

A central idea in machine learning is Empirical Risk Minimization, or ERM.

Given a loss function \(\ell(f_\theta(x), y)\), the empirical risk is:

\[\hat{R}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\ell(f_\theta(x_i),y_i).\]

The learning problem is written as:

\[\theta^* = \arg\min_\theta \hat{R}(\theta).\]

This means we choose parameters that minimize the average training loss.

However, minimizing training loss alone is not enough. The real goal is to minimize the expected risk:

\[R(\theta)=\mathbb{E}_{(x,y)\sim P_{data}}[\ell(f_\theta(x),y)].\]

The difference between empirical risk and expected risk is the core issue behind generalization.

In computer vision, this distinction is crucial. A model may perform well on a training set but fail under new weather, new cities, new camera settings, or new traffic patterns.
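To make the gap between training loss and loss under new conditions concrete, here is a minimal sketch of the empirical risk for a fixed predictor. The squared loss, the `identity_model` predictor, and both toy datasets are my own illustrative assumptions, not data from any benchmark.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]


def squared_loss(prediction: float, target: float) -> float:
    """Per-example loss l(f(x), y) = (f(x) - y)^2."""
    return (prediction - target) ** 2


def empirical_risk(
    f: Callable[[float], float],
    data: Sequence[Example],
    loss: Callable[[float, float], float] = squared_loss,
) -> float:
    """Average loss of f over a finite dataset (the empirical risk)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def identity_model(x: float) -> float:
    """Hypothetical fixed predictor f(x) = x."""
    return x


# Toy data (made up): the second set mimics a shifted deployment condition.
train_set = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
shifted_set = [(10.0, 12.0), (11.0, 13.5)]

print(empirical_risk(identity_model, train_set))    # small average loss
print(empirical_risk(identity_model, shifted_set))  # much larger average loss
```

The same predictor has a low empirical risk on the first set and a high one on the second, which is the kind of mismatch the expected risk is meant to capture.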

1.2 Regression

Regression predicts continuous values.

In linear regression, the model is:

\[\hat{y}=w^T x+b.\]

The common objective is mean squared error:

\[\mathcal{L}(w,b)=\frac{1}{N}\sum_{i=1}^{N}(w^T x_i+b-y_i)^2.\]

Linear regression is simple, but it introduces several important ideas:

model parameters;
loss functions;
least-squares optimization;
overfitting;
regularization;
closed-form solutions;
gradient-based learning.

The regularized version, ridge regression, adds an \(L_2\) penalty:

\[\mathcal{L}(w,b)=\frac{1}{N}\sum_{i=1}^{N}(w^T x_i+b-y_i)^2 + \lambda \|w\|_2^2.\]

This penalty discourages excessively large weights, which reduces variance and often improves generalization.
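As a worked step, the ridge objective has a closed-form minimizer. The sketch below assumes the bias \(b\) is dropped (for example because the data are centered), stacks the inputs as rows of a matrix \(X \in \mathbb{R}^{N\times d}\) and the targets into a vector \(y\); these symbols are my own notation. Setting the gradient to zero gives:

\[\nabla_w \mathcal{L} = \frac{2}{N}X^T(Xw-y) + 2\lambda w = 0 \quad\Longrightarrow\quad w^* = (X^T X + N\lambda I)^{-1}X^T y.\]

For \(\lambda > 0\) the matrix \(X^T X + N\lambda I\) is always invertible, which is one reason ridge regression is better conditioned than ordinary least squares.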

1.3 Classification

Classification predicts discrete labels.

For binary classification, logistic regression models the probability of class 1 as:

\[P(y=1|x)=\sigma(w^T x+b),\]

where:

\[\sigma(z)=\frac{1}{1+e^{-z}}.\]

The cross-entropy loss is:

\[\mathcal{L}(w,b)= -\frac{1}{N}\sum_{i=1}^{N} \left[y_i\log \hat{y}_i+(1-y_i)\log(1-\hat{y}_i)\right].\]

For multi-class classification, we use softmax:

\[P(y=k|x)=\frac{\exp(z_k)}{\sum_{j}\exp(z_j)}.\]

Classification is the basis for many computer vision tasks, including image classification, object detection classification heads, semantic segmentation, and semantic occupancy prediction.
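The softmax and the corresponding cross-entropy loss can be sketched in a few lines. This is a minimal standard-library version; the function names are my own, and the max-subtraction is the usual trick to avoid overflow for large logits.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits z_k into probabilities exp(z_k) / sum_j exp(z_j)."""
    shift = max(logits)  # subtracting the max leaves the result unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the true class, computed via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy([2.0, 1.0, 0.1], label=0))
```

With two classes and logits \((z, 0)\), the softmax probability of the first class reduces to the sigmoid \(\sigma(z)\), so logistic regression is the binary special case.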

1.4 Bias–Variance Tradeoff

The bias–variance tradeoff explains how model complexity affects generalization.

The expected prediction error can be conceptually decomposed into:

\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}.\]

High bias means the model is too simple and underfits.
High variance means the model is too sensitive to the training data and overfits.
Noise is irreducible uncertainty in the data.
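For squared loss the decomposition can be made precise. Assume (my notation) that \(y=f(x)+\varepsilon\) with \(\mathbb{E}[\varepsilon]=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2\), and that \(\hat{f}_{\mathcal{D}}\) is the predictor trained on a random dataset \(\mathcal{D}\). Then at a fixed input \(x\):

\[\mathbb{E}_{\mathcal{D},\varepsilon}\left[(y-\hat{f}_{\mathcal{D}}(x))^2\right] = \left(\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]-f(x)\right)^2 + \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}_{\mathcal{D}}(x)-\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\right)^2\right] + \sigma^2.\]

The three terms are the squared bias, the variance, and the noise, in that order.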

In deep learning, this classical decomposition becomes more complicated, but the intuition remains important.

For autonomous driving, overfitting can be dangerous because the model may memorize benchmark-specific patterns but fail in rare or long-tail scenarios.

1.5 Regularization

Regularization adds constraints or penalties to reduce overfitting.

Common methods include:

\(L_1\) regularization;
\(L_2\) regularization;
weight decay;
dropout;
data augmentation;
early stopping;
label smoothing;
mixup and cutmix;
stochastic depth.

The regularized objective can be written as:

\[\theta^* = \arg\min_\theta \left[ \frac{1}{N}\sum_{i=1}^{N}\ell(f_\theta(x_i),y_i) + \lambda \Omega(\theta) \right].\]

Here, \(\Omega(\theta)\) is a regularization term and \(\lambda \ge 0\) controls its strength.

In vision models, data augmentation is especially important because it encourages invariance to lighting, viewpoint, scale, cropping, and small geometric changes.

1.6 Model Selection and Validation

A machine learning system requires choosing hyperparameters such as:

learning rate;
batch size;
model depth and width;
regularization strength;
number of training epochs;
data augmentation strength;
threshold values and decision rules.

Common validation strategies include:

train/validation/test split;
cross-validation;
early stopping based on validation performance;
ablation studies;
robustness evaluation under shifted conditions.

For research, model selection must be carefully reported. Otherwise, the final results may not be reproducible or trustworthy.
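As one concrete validation strategy, here is a minimal sketch of k-fold cross-validation index splitting. The function name `k_fold_indices` and the fixed seed are my own choices; real experiments would usually rely on an established library and also keep a separate held-out test set.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    num_examples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= num_examples:
        raise ValueError("k must satisfy 2 <= k <= num_examples")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits reproducible
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train_idx, val_idx in k_fold_indices(10, k=5):
    print(sorted(val_idx), len(train_idx))
```

Every example is used for validation exactly once, and reporting the seed and the number of folds is part of making the model selection reproducible.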

2. Andrew Ng Machine Learning Course

Andrew Ng’s machine learning course is a practical and intuitive entry point into machine learning. Its strength is that it explains key ideas with clear examples and emphasizes the full ML workflow, not just formulas.

The main value of this course is to build intuition about:

how learning algorithms are formulated;
how cost functions guide learning;
how gradient descent works;
how to diagnose bias and variance;
how to build practical ML systems.

2.1 Linear Regression and Gradient Descent

The course begins with linear regression and gradient descent.

For one training example, the prediction is:

\[h_\theta(x)=\theta^T x.\]

The cost function is:

\[J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2.\]

Gradient descent updates parameters by:

\[\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}.\]

This gives the first important intuition: learning is an iterative process of reducing prediction error by following the negative gradient direction.
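The update rule can be sketched for one-dimensional linear regression with an intercept. This is a minimal batch gradient descent under my own assumptions: toy data generated from \(y = 1 + 2x\), a hand-picked learning rate, and a fixed number of steps.

```python
from typing import Sequence, Tuple


def gradient_descent(
    xs: Sequence[float],
    ys: Sequence[float],
    alpha: float = 0.1,
    steps: int = 2000,
) -> Tuple[float, float]:
    """Minimize J = (1/2m) * sum (theta0 + theta1*x - y)^2 by batch updates."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Both gradients use the old parameters, so the update is simultaneous.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # noiseless toy data from y = 1 + 2x
print(gradient_descent(xs, ys))  # approaches (1.0, 2.0)
```

If the learning rate is too large the iterates diverge, and if it is too small convergence is slow, which matches the course's advice to plot the cost against the iteration count.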

2.2 Logistic Regression

Logistic regression introduces probabilistic classification.

The hypothesis is:

\[h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}.\]

The output can be interpreted as:

\[h_\theta(x)=P(y=1|x;\theta).\]

The cost function is cross entropy:

\[J(\theta)=-\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log h_\theta(x^{(i)})+ (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right].\]

This is important because many deep learning losses are extensions of logistic regression and cross-entropy classification.
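Differentiating the cross-entropy cost, using \(\sigma'(z)=\sigma(z)(1-\sigma(z))\), gives a gradient with the same form as in linear regression (a standard result, written here in the course's notation):

\[\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}.\]

The only difference from the linear case is that \(h_\theta\) is now the sigmoid of \(\theta^T x\).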

2.3 Neural Networks

The course introduces neural networks as nonlinear function approximators.

A neural network layer can be written as:

\[a^{(l)} = g(W^{(l)}a^{(l-1)} + b^{(l)}),\]

where \(g\) is a nonlinear activation.

The key ideas are:

multiple layers compose simple functions into complex functions;
hidden layers learn intermediate representations;
backpropagation computes gradients efficiently;
nonlinear activation functions allow neural networks to model complex decision boundaries.

This provides the conceptual foundation for deep learning.
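A tiny forward pass shows how composing layers produces a decision function that no single linear layer can represent. The weights below are hand-set by me to compute XOR with one ReLU hidden layer and a linear output layer; they are an illustration, not learned parameters.

```python
from typing import List, Sequence


def relu(values: Sequence[float]) -> List[float]:
    """Elementwise nonlinearity g(z) = max(0, z)."""
    return [max(0.0, v) for v in values]


def affine(
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    inputs: Sequence[float],
) -> List[float]:
    """Compute W a + b for one layer."""
    return [
        sum(w * a for w, a in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


# Hand-set (hypothetical) parameters: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]  # output = h1 - 2 * h2
B2 = [0.0]


def forward(x: Sequence[float]) -> float:
    """Two-layer network: ReLU hidden layer followed by a linear output."""
    hidden = relu(affine(W1, B1, x))
    return affine(W2, B2, hidden)[0]


for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, forward(x))
```

Without the ReLU, the two layers would collapse into a single affine map, which cannot separate the XOR pattern.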

2.4 Bias, Variance, and Learning Curves

One of the most useful parts of Andrew Ng’s course is the practical diagnosis of machine learning systems.

Learning curves compare training and validation errors as the training set size increases.

Typical patterns:

high training error and high validation error indicate high bias;
low training error and high validation error indicate high variance;
high variance usually benefits from more data or stronger regularization, while high bias usually calls for stronger features or a more expressive model; adding data alone rarely fixes high bias.

This diagnostic mindset is essential for research. When a model performs poorly, I should ask:

Is the model underfitting?
Is it overfitting?
Is the data insufficient?
Is the loss function appropriate?
Is the evaluation protocol correct?
Is there a distribution shift between training and validation?

2.5 Support Vector Machines

Support Vector Machines introduce the idea of maximizing margins.

For a linear classifier, the decision boundary is:

\[w^T x+b=0.\]

SVM seeks a boundary that maximizes the margin between classes.

The optimization objective can be written as:

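\[\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i,\]

subject to:

\[y_i(w^T x_i+b) \ge 1-\xi_i, \quad \xi_i \ge 0,\]

where \(y_i \in \{-1,+1\}\) and the slack variables \(\xi_i\) measure margin violations.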

The kernel trick allows nonlinear classification by implicitly mapping inputs into a high-dimensional feature space.

Although SVMs are less dominant in modern deep learning, the concepts of margin, kernels, and regularization remain important.
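The link to regularized loss minimization becomes explicit once the slack variables are eliminated. Assuming the usual non-negativity constraint \(\xi_i \ge 0\) and labels \(y_i \in \{-1,+1\}\), the optimal slack is \(\xi_i=\max(0,\,1-y_i(w^T x_i+b))\), so the soft-margin problem is equivalent to:

\[\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_i \max\left(0,\,1-y_i(w^T x_i+b)\right).\]

This is an \(L_2\)-regularized hinge loss, which fits the same template as the regularized objectives in Section 1.5.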

2.6 Unsupervised Learning

The course also covers unsupervised learning, including clustering and dimensionality reduction.

K-means Clustering

K-means partitions data into \(K\) clusters by minimizing within-cluster distance:

\[\min_{\{c_i\},\{\mu_k\}} \sum_{i=1}^{N}\|x_i-\mu_{c_i}\|^2.\]

It alternates between:

assigning each point to the nearest cluster center;
updating cluster centers by averaging assigned points.

This is useful for understanding prototype learning, feature grouping, and clustering-based representation analysis.
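The two alternating steps can be sketched as follows. This is a minimal version of Lloyd's algorithm under my own assumptions: points are tuples of floats, centers are initialized by sampling data points with a fixed seed, and empty clusters simply keep their previous center.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(
    points: Sequence[Point], k: int, iterations: int = 100, seed: int = 0
) -> Tuple[List[Point], List[int]]:
    """Alternate nearest-center assignment and center averaging."""
    rng = random.Random(seed)
    centers: List[Point] = rng.sample(list(points), k)
    assignments: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centers will not change
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments or []


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

Each step can only decrease the objective, so the procedure converges, but only to a local minimum that depends on the initialization.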

Principal Component Analysis

PCA reduces dimensionality by projecting data onto directions with maximum variance.

It is closely related to SVD and low-rank approximation.
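The relation to SVD can be written out in a few lines. Assume (my notation) that the rows of \(X \in \mathbb{R}^{N\times d}\) are centered samples with singular value decomposition \(X = U S V^T\). Then the sample covariance is:

\[C=\frac{1}{N}X^T X = V\,\frac{S^2}{N}\,V^T,\]

so the columns of \(V\) are the principal directions and \(s_k^2/N\) is the variance along the \(k\)-th direction. Keeping the first \(r\) columns gives the projection \(Z = X V_r = U_r S_r\), and \(U_r S_r V_r^T\) is the best rank-\(r\) approximation of \(X\) in the least-squares sense.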

2.7 Practical ML Workflow

Andrew Ng’s course emphasizes that machine learning is not just model training. It is a system-building process.

Important workflow ideas:

define a clear metric;
build a simple baseline first;
inspect errors manually;
use learning curves to diagnose problems;
perform ablation studies;
avoid data leakage;
improve data quality before overcomplicating the model.

This is directly relevant to my research. For collaborative occupancy prediction, I need to carefully check whether improvements come from the proposed method, stronger representation capacity, better training, or evaluation details.

3. Pattern Recognition and Machine Learning

Bishop’s Pattern Recognition and Machine Learning provides a more theoretical and probabilistic view of machine learning.

Unlike purely algorithmic introductions, PRML emphasizes:

probability distributions;
Bayesian decision theory;
generative and discriminative models;
latent variables;
graphical models;
approximate inference.

This perspective is important because perception systems often operate under uncertainty.

3.1 Bayesian Decision Theory

Bayesian decision theory describes optimal prediction under uncertainty.

Given input \(x\), the model estimates posterior probabilities:

\[P(C_k|x).\]

The optimal classification rule under 0–1 loss is:

\[\hat{C}=\arg\max_k P(C_k|x).\]

For more general losses, the decision should minimize expected risk:

\[R(a|x)=\sum_k L(a,C_k)P(C_k|x),\]

where \(a\) is an action and \(L(a,C_k)\) is the loss of taking action \(a\) when the true class is \(C_k\).

This is important for autonomous driving because different errors have different consequences. For example, missing an occupied region may be more dangerous than falsely predicting a free region as occupied.
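A small sketch shows how an asymmetric loss changes the decision. The loss values (a missed occupied cell costs ten times a false alarm) and the posterior are made-up numbers for illustration only.

```python
from typing import List, Sequence


def expected_risk(action_losses: Sequence[float], posterior: Sequence[float]) -> float:
    """R(a|x) = sum_k L(a, C_k) * P(C_k|x) for one action a."""
    return sum(loss * p for loss, p in zip(action_losses, posterior))


def bayes_decision(loss: Sequence[Sequence[float]], posterior: Sequence[float]) -> int:
    """Return the index of the action with the smallest expected risk."""
    risks: List[float] = [expected_risk(row, posterior) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)


# Classes and actions: 0 = free, 1 = occupied. loss[a][k] = L(a, C_k).
zero_one_loss = [[0.0, 1.0], [1.0, 0.0]]
asymmetric_loss = [
    [0.0, 10.0],  # predicting "free" when the cell is occupied is costly
    [1.0, 0.0],   # predicting "occupied" when the cell is free is cheap
]

posterior = [0.8, 0.2]  # hypothetical P(free|x), P(occupied|x)
print(bayes_decision(zero_one_loss, posterior))    # picks the most probable class
print(bayes_decision(asymmetric_loss, posterior))  # picks the safer action
```

Under 0–1 loss the rule picks "free", while under the asymmetric loss it picks "occupied" because the expected risks are \(10 \times 0.2 = 2.0\) versus \(1 \times 0.8 = 0.8\).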

3.2 Generative and Discriminative Models

A discriminative model directly models:

\[P(y|x).\]

A generative model models the joint distribution:

\[P(x,y)=P(x|y)P(y).\]

Examples:

logistic regression is discriminative;
naive Bayes is generative;
many deep classifiers are discriminative;
diffusion models and VAEs are generative.

Generative modeling is increasingly relevant to world models and future prediction because the model must generate possible future states rather than only classify current observations.

3.3 Maximum Likelihood Estimation

Given data \(\mathcal{D}=\{x_i\}_{i=1}^{N}\), assumed to be drawn independently, maximum likelihood estimation chooses parameters that maximize:

\[L(\theta)=\prod_{i=1}^{N}p(x_i|\theta).\]

It is often easier to maximize the log-likelihood:

\[\log L(\theta)=\sum_{i=1}^{N}\log p(x_i|\theta).\]

MLE connects probabilistic modeling to loss functions.

For example, minimizing mean squared error corresponds to maximum likelihood under Gaussian noise assumptions. Minimizing cross entropy corresponds to maximum likelihood for categorical labels.
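The Gaussian case can be checked in two lines. Assume a conditional model (my notation) \(y_i=f_\theta(x_i)+\varepsilon_i\) with \(\varepsilon_i\sim\mathcal{N}(0,\sigma^2)\) and fixed \(\sigma\). The negative log-likelihood of one example is:

\[-\log p(y_i|x_i,\theta)=\frac{1}{2\sigma^2}\left(y_i-f_\theta(x_i)\right)^2+\frac{1}{2}\log(2\pi\sigma^2).\]

Summing over \(i\) and dropping terms that do not depend on \(\theta\) leaves the sum of squared errors, so the maximum likelihood estimate and the least-squares estimate coincide.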

3.4 MAP Estimation and Bayesian Learning

Maximum A Posteriori estimation includes a prior:

\[\theta_{MAP}=\arg\max_\theta p(\theta|\mathcal{D}).\]

Using Bayes’ rule:

\[p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)p(\theta).\]

Taking negative log gives:

\[-\log p(\theta|\mathcal{D}) = -\log p(\mathcal{D}|\theta)-\log p(\theta)+\text{const}.\]

This shows that regularization can be interpreted as a prior over parameters.

For example, \(L_2\) regularization corresponds to a Gaussian prior on weights.
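To see the correspondence, assume an isotropic Gaussian prior \(p(\theta)=\mathcal{N}(0,\tau^2 I)\), where \(\tau\) is my notation for the prior standard deviation. Then:

\[-\log p(\theta)=\frac{1}{2\tau^2}\|\theta\|_2^2+\text{const},\]

so the MAP objective is the negative log-likelihood plus a squared-norm penalty. If the per-example loss is the negative log-likelihood and the objective is averaged over \(N\) examples, this matches the regularized form with \(\lambda = 1/(2N\tau^2)\): a tighter prior means stronger regularization.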

3.5 Mixture Models and EM Algorithm

Mixture models represent data as coming from multiple latent components.

A Gaussian Mixture Model is:

\[p(x)=\sum_{k=1}^{K}\pi_k \mathcal{N}(x|\mu_k,\Sigma_k),\]

where \(\pi_k\) are mixture weights satisfying \(\pi_k \ge 0\) and \(\sum_k \pi_k = 1\).

The Expectation-Maximization algorithm alternates between:

E-step: estimate latent assignment probabilities;
M-step: update model parameters using those assignments.
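For the Gaussian mixture, the two steps take the following standard form, where \(\gamma_{ik}\) (my notation for the responsibility) is the posterior probability that component \(k\) generated \(x_i\). E-step:

\[\gamma_{ik}=\frac{\pi_k\,\mathcal{N}(x_i|\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_i|\mu_j,\Sigma_j)}.\]

M-step, with \(N_k=\sum_{i=1}^{N}\gamma_{ik}\):

\[\mu_k=\frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}x_i,\qquad \Sigma_k=\frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T,\qquad \pi_k=\frac{N_k}{N}.\]

K-means can be read as the limiting case in which the responsibilities become hard 0/1 assignments.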

EM is important because it teaches a general strategy: when hidden variables make direct optimization difficult, alternate between estimating hidden structure and optimizing parameters.

This idea appears in clustering, latent variable models, and some self-training methods.

3.6 Kernel Methods

Kernel methods allow linear algorithms to operate in implicit high-dimensional feature spaces.

A kernel function computes:

\[K(x_i,x_j)=\phi(x_i)^T\phi(x_j),\]

without explicitly computing \(\phi(x)\).
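A quick numerical check makes the implicit feature map tangible. For two-dimensional inputs, the homogeneous degree-2 polynomial kernel \((x^T z)^2\) equals an inner product of explicit three-dimensional features; the function names and the test vectors below are my own.

```python
import math
from typing import Sequence, Tuple


def poly2_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """K(x, z) = (x^T z)^2 for 2-D inputs, computed without any feature map."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_features(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, z))                          # kernel evaluation
print(dot(poly2_features(x), poly2_features(z)))   # same value via phi
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel evaluation is the only practical route.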

Common kernels include:

linear kernel;
polynomial kernel;
radial basis function kernel.

Kernel methods are less dominant than deep learning today, but they provide important theoretical insights into similarity, feature spaces, and nonlinearity.

Attention can also be loosely viewed as a learned similarity-based aggregation mechanism, which makes kernel intuition useful for understanding modern models.

3.7 Graphical Models

Graphical models represent probabilistic dependencies using graphs.

They include:

Bayesian networks;
Markov random fields;
factor graphs;
conditional random fields.

The key idea is factorization:

\[p(x_1,\ldots,x_n)=\prod_i \psi_i(\mathcal{C}_i),\]

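where each factor \(\psi_i\) depends on a subset of variables \(\mathcal{C}_i\). For undirected models the product must additionally be divided by a normalization constant \(Z\).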

Graphical models are useful for understanding structured prediction, temporal modeling, SLAM, sensor fusion, and multi-agent reasoning.

3.8 Variational Inference

Bayesian inference often requires computing posterior distributions that are intractable:

\[p(z|x)=\frac{p(x,z)}{p(x)}.\]

Variational inference approximates the true posterior with a simpler distribution \(q(z)\).

It minimizes:

\[D_{KL}(q(z)\|p(z|x)).\]

This leads to the Evidence Lower Bound:

\[\log p(x) \ge \mathbb{E}_{q(z)}[\log p(x,z)-\log q(z)].\]
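The bound follows from an exact identity. Adding and subtracting \(\log p(z|x)\) inside the expectation gives:

\[\log p(x)=\mathbb{E}_{q(z)}[\log p(x,z)-\log q(z)]+D_{KL}(q(z)\|p(z|x)).\]

Since the KL term is non-negative, the first term is a lower bound on \(\log p(x)\), and because \(\log p(x)\) does not depend on \(q\), maximizing the bound over \(q\) is the same as minimizing the KL divergence to the true posterior.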

Variational inference is central to VAEs and many probabilistic deep learning models.

For occupancy world models, approximate inference may be useful for modeling uncertainty over future scene states.

4. Statistical Learning Theory

Statistical learning theory studies why and when learning algorithms generalize.

The central question is:

If a model performs well on finite training data, when can we trust it to perform well on unseen data?

This is especially important for autonomous driving, where real-world deployment involves long-tail cases and distribution shifts.

4.1 Expected Risk and Empirical Risk

The expected risk is:

\[R(f)=\mathbb{E}_{(x,y)\sim P}[\ell(f(x),y)].\]

The empirical risk is:

\[\hat{R}(f)=\frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i),y_i).\]

Generalization analysis studies the gap:

\[R(f)-\hat{R}(f).\]

A small training loss is meaningful only when this gap is controlled.

4.2 Generalization Gap

The generalization gap is:

\[\text{Gap}=R(f)-\hat{R}(f).\]

A model overfits when empirical risk is low but expected risk is high.

In deep learning, the classical theory does not fully explain why overparameterized neural networks generalize well, but the generalization gap remains a useful concept.

For research, I should always ask:

Does the method improve true generalization or only benchmark performance?
Does the model rely on dataset-specific shortcuts?
Does the improvement hold under different scenes, weather, or sensor configurations?
Is the comparison fair under the same training and evaluation settings?

4.3 VC Dimension

VC dimension measures the capacity of a hypothesis class.

A hypothesis class has VC dimension \(d\) if it can shatter some set of \(d\) points but cannot shatter any set of \(d+1\) points.

Intuitively:

higher VC dimension means higher expressive power;
higher expressive power may increase overfitting risk;
more data is needed to control generalization.

Although VC dimension is not directly used to analyze modern deep networks in practice, it gives an important conceptual link between capacity and generalization.

4.4 Uniform Convergence

Uniform convergence studies whether empirical risk uniformly approximates expected risk over a hypothesis class:

\[\sup_{f\in\mathcal{F}} |R(f)-\hat{R}(f)|.\]

If this quantity is small, then minimizing empirical risk also approximately minimizes expected risk.

This provides a theoretical justification for ERM.
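The argument is short enough to write out. Let \(\hat{f}\) minimize the empirical risk over \(\mathcal{F}\), let \(f^*\) minimize the expected risk over \(\mathcal{F}\), and let \(\epsilon=\sup_{f\in\mathcal{F}}|R(f)-\hat{R}(f)|\) (my notation). Then:

\[R(\hat{f})\le \hat{R}(\hat{f})+\epsilon \le \hat{R}(f^*)+\epsilon \le R(f^*)+2\epsilon.\]

The first and last steps use uniform convergence, and the middle step uses the fact that \(\hat{f}\) minimizes \(\hat{R}\).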

4.5 Rademacher Complexity

Rademacher complexity measures how well a function class can fit random noise.

For a function class \(\mathcal{F}\), empirical Rademacher complexity is:

\[\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i) \right],\]

where \(\sigma_i\) are independent random signs, each equal to \(+1\) or \(-1\) with probability \(1/2\).

If a function class can fit random labels easily, its complexity is high.

This gives a more refined way to think about generalization than simply counting parameters.
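For a finite function class, the definition can be estimated by Monte Carlo sampling of the random signs. The sketch below represents each function only by its outputs on the sample, which is my own simplification; the two toy classes are illustrative extremes.

```python
import itertools
import random
from typing import Sequence


def empirical_rademacher(
    function_outputs: Sequence[Sequence[float]],
    num_draws: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate E_sigma[ max_f (1/N) sum_i sigma_i f(x_i) ] by sampling sigma.

    function_outputs[j][i] holds f_j(x_i) for the j-th function in the class.
    """
    rng = random.Random(seed)
    n = len(function_outputs[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            sum(s * v for s, v in zip(sigma, outputs)) / n
            for outputs in function_outputs
        )
    return total / num_draws


# A single fixed function cannot correlate with random signs on average.
single_function = [[1.0, 1.0, 1.0]]
# All sign patterns on 3 points: some function always matches sigma exactly.
all_sign_patterns = [list(p) for p in itertools.product((-1.0, 1.0), repeat=3)]

print(empirical_rademacher(single_function))    # close to 0
print(empirical_rademacher(all_sign_patterns))  # equals 1
```

The second class shatters the three points, which is why it fits every random labeling perfectly and reaches the maximum value.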

4.6 Distribution Shift

Distribution shift occurs when training and test data come from different distributions:

\[P_{train}(x,y) \neq P_{test}(x,y).\]

Common types include:

covariate shift: \(P(x)\) changes while \(P(y|x)\) stays fixed;
label shift: \(P(y)\) changes while \(P(x|y)\) stays fixed;
concept shift: \(P(y|x)\) changes;
domain shift: environment, sensor, or style changes;
temporal shift: data changes over time.
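Covariate shift is the one case with a simple textbook correction. Assuming \(P(y|x)\) is unchanged and \(p_{train}(x)>0\) wherever \(p_{test}(x)>0\), the test risk can be rewritten as a reweighted training risk:

\[R_{test}(f)=\mathbb{E}_{(x,y)\sim P_{train}}\left[w(x)\,\ell(f(x),y)\right],\qquad w(x)=\frac{p_{test}(x)}{p_{train}(x)}.\]

In practice the density ratio \(w(x)\) has to be estimated, and it can have high variance when the two input distributions overlap poorly.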

For autonomous driving and embodied perception, distribution shift is unavoidable. Models must handle new cities, weather, lighting, sensors, traffic patterns, and rare events.

This is why robustness, uncertainty, domain adaptation, and out-of-distribution detection are critical.

5. Representation Learning

Representation learning studies how models transform raw inputs into useful features.

A good representation should preserve task-relevant information while discarding nuisance factors.

For perception systems, good representations should capture:

geometry;
semantics;
motion;
uncertainty;
object relations;
spatial structure;
temporal consistency.

5.1 Hand-Crafted Features vs Learned Features

Classical computer vision relied on hand-crafted features such as:

SIFT;
HOG;
ORB;
color histograms;
edge and corner descriptors.

Deep learning replaced many hand-crafted features with learned representations.

CNNs learn hierarchical visual features:

early layers capture edges and textures;
middle layers capture parts and shapes;
deeper layers capture semantic concepts.

Transformers learn token interactions and global context through attention.

5.2 Invariance and Equivariance

A representation is invariant to a transformation if it does not change when the input is transformed.

For example, a classifier should ideally recognize an object regardless of small translations or lighting changes.

A representation is equivariant if it transforms in a predictable way when the input transforms.

For example, in segmentation, if the image shifts, the segmentation mask should shift correspondingly.

Mathematically, for a transformation \(g\):

invariance means \(f(gx)=f(x)\);
equivariance means \(f(gx)=g'f(x)\), where \(g'\) is the corresponding transformation acting on the output space.
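Both properties can be checked numerically on a one-dimensional toy signal. In this sketch (my own construction) the transformation is a circular shift, circular convolution plays the role of an equivariant layer, and a global sum plays the role of an invariant readout.

```python
from typing import List, Sequence


def shift(signal: Sequence[float], k: int) -> List[float]:
    """Circularly shift a signal to the right by k positions."""
    k %= len(signal)
    values = list(signal)
    return values[-k:] + values[:-k]


def circular_conv(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Circular convolution: out[i] = sum_j kernel[j] * signal[(i - j) mod n]."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def global_sum(signal: Sequence[float]) -> float:
    """Pooling over all positions discards where things are."""
    return sum(signal)


signal = [1.0, 4.0, 2.0, 0.0, 3.0]
kernel = [1.0, -1.0, 2.0]

# Equivariance: shifting then convolving equals convolving then shifting.
print(circular_conv(shift(signal, 2), kernel) == shift(circular_conv(signal, kernel), 2))
# Invariance: the pooled output ignores the shift entirely.
print(global_sum(circular_conv(shift(signal, 2), kernel)) == global_sum(circular_conv(signal, kernel)))
```

This mirrors the usual CNN design in which convolutional layers are translation-equivariant and a final pooling step yields an approximately invariant classifier.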

In 3D perception, equivariance is especially important because coordinate transformations, ego-motion, and pose alignment are central problems.

5.3 Contrastive Learning

Contrastive learning trains representations by pulling positive pairs together and pushing negative pairs apart.

A common objective is InfoNCE:

\[\mathcal{L}_{InfoNCE} = -\log \frac{\exp(\mathrm{sim}(z_i,z_i^+)/\tau)} {\exp(\mathrm{sim}(z_i,z_i^+)/\tau)+\sum_j \exp(\mathrm{sim}(z_i,z_j^-)/\tau)}.\]

Here:

\(z_i\) is an anchor representation;
\(z_i^+\) is a positive sample;
\(z_j^-\) are negative samples;
\(\tau\) is a temperature parameter.
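The InfoNCE objective for a single anchor can be sketched directly from the formula. I assume cosine similarity for \(\mathrm{sim}\) and nonzero embedding vectors; the function names and the toy embeddings are my own.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """-log( exp(s+/tau) / (exp(s+/tau) + sum_j exp(s_j-/tau)) ) for one anchor."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    shift = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
aligned_loss = info_nce(anchor, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
confused_loss = info_nce(anchor, positive=[0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.0]])
print(aligned_loss, confused_loss)  # the aligned pair gives the smaller loss
```

The loss is just a softmax cross-entropy in which the positive pair plays the role of the correct class, which is why a lower temperature sharpens the contrast between positives and negatives.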

Contrastive learning is useful for self-supervised learning, multi-view representation learning, and cross-modal alignment.

5.4 Self-Supervised Learning

Self-supervised learning uses automatically generated supervision instead of manual labels.

Common paradigms include:

contrastive learning;
masked image modeling;
masked autoencoding;
image-text alignment;
temporal prediction;
future frame prediction;
reconstruction-based learning.

Examples include:

SimCLR;
MoCo;
BYOL;
DINO;
MAE;
CLIP.

For autonomous driving, self-supervised learning is attractive because large amounts of driving data are available but dense labels are expensive.

5.5 Information Bottleneck

The information bottleneck principle suggests that a good representation should keep information about the target while discarding irrelevant input details.

The objective can be written conceptually as:

\[\min_{p(z|x)} I(X;Z) - \beta I(Z;Y),\]

where:

\(X\) is the input;
\(Z\) is the representation;
\(Y\) is the target;
\(I(\cdot;\cdot)\) denotes mutual information.

This idea is useful for thinking about compression, generalization, and communication-efficient perception.

In collaborative perception, agents should communicate representations that preserve task-relevant information while removing redundancy.

5.6 Token-Based Representations

Modern Transformers represent data as tokens.

In vision, an image can be divided into patch tokens. In 3D perception, a scene can be represented as:

image tokens;
BEV tokens;
voxel tokens;
object tokens;
memory tokens;
communication tokens.

Token-based representations are flexible because the same token set can support:

perception;
temporal memory;
communication;
fusion;
prediction.

For my research, this is especially important. Collaborative occupancy prediction can be formulated as token-level perception, token-level communication, and token-level fusion.

6. Machine Learning for Computer Vision and Autonomous Driving

Machine learning concepts become concrete in computer vision and autonomous driving tasks.

6.1 Dense Prediction

Dense prediction tasks produce structured outputs over pixels, voxels, or BEV grids.

Examples:

semantic segmentation;
depth estimation;
optical flow;
semantic occupancy prediction;
BEV segmentation;
3D scene completion.

Unlike image classification, dense prediction requires preserving spatial structure.

Important ML challenges include:

class imbalance;
spatial correlation;
structured loss functions;
long-tail categories;
boundary accuracy;
uncertainty in occluded regions.

6.2 Semantic Occupancy Prediction

Semantic occupancy prediction estimates both geometry and semantics in 3D space.

The model predicts a voxel grid:

\[O \in \{0,1,\ldots,K\}^{X\times Y\times Z}.\]

Machine learning issues in this task include:

extreme imbalance between empty and occupied voxels;
ambiguity from monocular or multi-view images;
occlusion and missing observations;
label noise in 3D annotations;
evaluation with IoU and mIoU;
temporal consistency.
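Since IoU and mIoU are the headline metrics, here is a minimal sketch of how they can be computed on flattened voxel labels. The label values and the convention of skipping classes absent from both prediction and ground truth are my own assumptions; benchmarks differ on details such as whether the empty class is included.

```python
from typing import List, Optional, Sequence


def per_class_iou(
    prediction: Sequence[int], target: Sequence[int], num_classes: int
) -> List[Optional[float]]:
    """IoU_c = |pred == c and target == c| / |pred == c or target == c|."""
    ious: List[Optional[float]] = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(prediction, target) if p == c and t == c)
        union = sum(1 for p, t in zip(prediction, target) if p == c or t == c)
        ious.append(intersection / union if union else None)  # None: class absent
    return ious


def mean_iou(ious: Sequence[Optional[float]]) -> float:
    """Average IoU over the classes that actually appear."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


# Toy flattened voxel grid (made up): 0 = empty, 1 and 2 = semantic classes.
target = [0, 0, 0, 0, 1, 1, 2, 2]
prediction = [0, 0, 0, 1, 1, 1, 2, 0]

ious = per_class_iou(prediction, target, num_classes=3)
print(ious, mean_iou(ious))
```

Because empty voxels dominate real grids, a model can score well on the empty class alone, which is one reason per-class IoU is worth reporting next to the mean.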

This task connects supervised learning, dense prediction, 3D representation learning, uncertainty estimation, and structured prediction.

6.3 Collaborative Perception

Collaborative perception extends perception from one agent to multiple agents.

From a machine learning perspective, this introduces several new questions:

What should be communicated?
Which agents should communicate?
How should messages be fused?
How can communication cost be constrained?
How does the model handle noisy poses or delayed messages?

This turns perception into a distributed learning and inference problem.

Communication-efficient collaborative perception can be viewed through the lens of representation learning:

The communicated message should be a compact representation that preserves task-relevant information for the receiver.

6.4 Uncertainty Estimation

Autonomous systems must know when they are uncertain.

Common uncertainty types:

aleatoric uncertainty: uncertainty from inherent noise;
epistemic uncertainty: uncertainty from limited knowledge;
distributional uncertainty: uncertainty from domain shift.

Common methods:

confidence calibration;
entropy of prediction distribution;
Monte Carlo dropout;
ensembles;
evidential learning;
Bayesian neural networks.
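Two of the simplest tools in this list, predictive entropy and ensemble averaging, can be sketched together. The member predictions below are made-up numbers, and using the entropy of the averaged prediction as an uncertainty score is one common heuristic rather than a complete treatment.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """H(p) = -sum_k p_k log p_k, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_mean(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class probabilities predicted by several models."""
    n = len(member_probs)
    return [sum(column) / n for column in zip(*member_probs)]


# Hypothetical predictions for one occluded voxel from two ensemble members.
members = [[0.9, 0.1], [0.1, 0.9]]
averaged = ensemble_mean(members)

print([predictive_entropy(m) for m in members])  # each member is fairly confident
print(predictive_entropy(averaged))              # disagreement raises the entropy
```

Each member alone looks confident, but their disagreement pushes the averaged prediction toward the uniform distribution, which is the kind of signal a planner could use to treat the region conservatively.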

In occupancy prediction, uncertainty is especially important for occluded regions and safety-critical planning.

6.5 World Models and Future Prediction

World models require learning how the environment evolves over time.

Machine learning questions include:

how to learn compact state representations;
how to predict future states;
how to model uncertainty over futures;
how to learn from video or sequential observations;
how to combine perception and prediction.

For occupancy world models, the goal is not just current-frame reconstruction, but future occupancy forecasting.

This connects supervised learning, sequence modeling, representation learning, probabilistic prediction, and decision-making.

7. Personal Study Plan

My current plan is to study machine learning in three layers.

7.1 Practical ML Layer

Main source:

Andrew Ng’s machine learning course.

Goal:

build practical intuition;
understand basic algorithms;
learn ML debugging workflows;
connect cost functions with optimization.

7.2 Probabilistic ML Layer

Main source:

Bishop’s Pattern Recognition and Machine Learning.

Goal:

understand probabilistic modeling;
study Bayesian decision theory;
learn latent variable models;
connect ML losses with likelihoods;
understand approximate inference.

7.3 Theoretical ML Layer

Main topics:

statistical learning theory;
generalization;
distribution shift;
representation learning;
information theory;
uncertainty estimation.

Goal:

understand why models generalize;
learn how to evaluate robustness;
build research-level intuition for ML systems.

Closing Remarks

Machine learning is not only a toolbox of algorithms. It is a way to think about data, models, uncertainty, generalization, and representation.

For my PhD preparation, the most important goal is not to memorize every algorithm, but to understand the principles behind learning systems:

how objectives are defined;
how models are optimized;
how representations are learned;
how generalization is evaluated;
how uncertainty is handled;
how ML ideas connect to real perception systems.

This foundation will support my research in computer vision, autonomous driving perception, collaborative occupancy prediction, and occupancy world models.
