As I prepare for PhD applications in computer vision, autonomous driving, and embodied perception, I have come to realize that strong research ideas must rest on a broad and systematic knowledge foundation.

My current research focuses on 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models. These topics connect many areas: mathematics, machine learning, deep learning, computer vision, computer graphics, robotics, reinforcement learning, autonomous driving, and AI agents.

This post is a personal roadmap for building that foundation. It is meant not as a fixed curriculum but as a structured knowledge base that helps me connect theory, algorithms, systems, and research problems.

1. Mathematical Foundations

Mathematics is the language behind machine learning, computer vision, graphics, robotics, and autonomous driving. My goal is not only to learn formulas, but also to understand how mathematical tools explain model behavior, optimization stability, geometric reasoning, and uncertainty.

1.1 Matrix Theory and Linear Algebra

A deeper understanding of matrices is essential for attention mechanisms, optimization, geometry, and 3D vision. I plan to study linear algebra alongside a more advanced matrix theory course, using materials such as the Northwestern Polytechnical University matrix theory course.

Key topics:

Vector spaces, basis, dimension, rank, null space, and subspaces
Linear transformations and change of basis
Orthogonality, projections, least squares, and Gram–Schmidt
Eigenvalues, eigenvectors, diagonalization, and spectral decomposition
Singular Value Decomposition (SVD) and low-rank approximation
Positive definite and positive semidefinite matrices
Matrix norms, condition numbers, and numerical stability
Block matrices, Schur complement, and matrix inequalities
Matrix calculus for deep learning

These ideas appear everywhere in my research, especially in attention, token merging, low-rank structure, BEV representations, and 3D geometric transformations.
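
To make the link between SVD and low-rank structure concrete, here is a minimal sketch assuming NumPy. The matrix `A` is a made-up stand-in for a feature map (rank 2 plus small noise), not data from any real model. It checks the Eckart–Young property: the Frobenius error of the rank-k truncation equals the energy in the discarded singular values.

```python
import numpy as np

def low_rank_approx(A: np.ndarray, k: int) -> np.ndarray:
    """Return the best rank-k approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# Hypothetical feature matrix: exactly rank 2, plus a little noise.
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
A = A + 1e-3 * rng.normal(size=(8, 6))

A2 = low_rank_approx(A, k=2)
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2)         # Frobenius norm of the residual
tail = np.sqrt(np.sum(s[2:] ** 2))   # energy in the discarded singular values
print(err, tail)                     # the two values agree up to rounding
```

The same truncation idea is one way to reason about how much information a compressed representation can retain.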

1.2 Numerical Analysis

Numerical analysis is important for understanding why algorithms are stable, efficient, or fragile in practice. It is especially useful for optimization, geometry, simulation, and scientific computing.

Key topics:

Floating-point representation and numerical errors
Stability, convergence, and conditioning
Solving linear systems and least-squares problems
Iterative methods such as Jacobi, Gauss–Seidel, and conjugate gradient
Numerical optimization and line search
Interpolation and approximation
Numerical differentiation and integration
ODE solvers and basic simulation methods

For machine learning and perception systems, numerical analysis helps explain training instability, gradient explosion, ill-conditioned optimization, and precision-related issues in embedded or GPU deployment.
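
A small experiment helps me see what "ill-conditioned" means in practice. The sketch below, assuming NumPy, solves a linear system with the Hilbert matrix (a standard textbook example, chosen here purely for illustration) whose exact solution is the all-ones vector, and reports the condition number alongside the error.

```python
import numpy as np

def hilbert(n: int) -> np.ndarray:
    """Hilbert matrix H[i, j] = 1 / (i + j + 1) with zero-based indices."""
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1)

def solve_ones(n: int):
    """Solve H x = b where the exact solution is the all-ones vector."""
    H = hilbert(n)
    x_true = np.ones(n)
    x = np.linalg.solve(H, H @ x_true)
    # Return the 2-norm condition number and the worst-case entry error.
    return np.linalg.cond(H), np.max(np.abs(x - x_true))

cond4, err4 = solve_ones(4)
cond10, err10 = solve_ones(10)
print(f"n=4:  cond={cond4:.2e}, max error={err4:.2e}")
print(f"n=10: cond={cond10:.2e}, max error={err10:.2e}")
```

The condition number grows by many orders of magnitude between the two sizes, which is the kind of sensitivity that reduced-precision deployment tends to expose.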

1.3 Probability and Statistics

Probability is fundamental for perception, uncertainty estimation, sensor fusion, occupancy prediction, and world modeling.

Important concepts:

Random variables, PMF, PDF, CDF, expectation, variance, and covariance
Common distributions: Gaussian, Bernoulli, Categorical, Poisson, Dirichlet
Multivariate Gaussian distributions and covariance structure
Conditional probability and Bayes’ rule
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
KL divergence, JS divergence, entropy, and cross entropy
Hypothesis testing and confidence intervals
Monte Carlo estimation and importance sampling
Uncertainty estimation, calibration, and reliability diagrams

These concepts are directly related to semantic occupancy prediction, probabilistic scene representation, uncertainty-aware perception, and future occupancy forecasting.
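
One identity I want to keep at my fingertips is that cross entropy decomposes into entropy plus KL divergence, H(p, q) = H(p) + KL(p || q). The sketch below checks it numerically, assuming NumPy and strictly positive probabilities. The vectors `p` and `q` are made-up class distributions, for example a label distribution and a prediction for a single voxel.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats; assumes p > 0 everywhere."""
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats; assumes q > 0 everywhere."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes p > 0 and q > 0 everywhere."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])  # hypothetical target distribution
q = np.array([0.5, 0.3, 0.2])  # hypothetical predicted distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)                                   # equal up to rounding
print(kl_divergence(p, q), kl_divergence(q, p))   # KL is not symmetric
```

Because H(p) does not depend on the model, minimizing cross entropy with respect to q is the same as minimizing KL(p || q).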

1.4 Optimization

Training deep neural networks is fundamentally an optimization problem. Understanding optimization also helps me reason about convergence, generalization, and the behavior of large models.

Core topics:

Gradient computation and backpropagation
SGD, Momentum, Nesterov acceleration
Adam, AdamW, and adaptive optimization
Learning-rate schedules: warmup, cosine decay, and step decay
Regularization: weight decay, dropout, stochastic depth
Constrained optimization, Lagrangian methods, and KKT conditions
Hessian, curvature, saddle points, and local minima
Sharp vs. flat minima and generalization
Numerical stability tricks such as log-sum-exp

These principles are essential for designing stable training pipelines for 3D perception, transformers, occupancy prediction, and multi-agent systems.
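
The log-sum-exp trick from the list above is short enough to write out. This is a minimal sketch using only the Python standard library. The logits are made-up values chosen to be large enough that a naive `math.exp` would overflow.

```python
import math

def logsumexp(xs):
    """Compute log(sum(exp(x))) stably by factoring out the maximum."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    """Softmax computed through logsumexp, so no exponent exceeds zero."""
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

logits = [1000.0, 1001.0, 999.0]  # math.exp(1000.0) alone raises OverflowError
probs = softmax(logits)
print(probs, sum(probs))
```

Subtracting the maximum leaves the result mathematically unchanged, because the shift cancels between numerator and denominator.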

2. Machine Learning Foundations

Before studying advanced vision systems, I need a strong foundation in classical and modern machine learning.

2.1 Core Machine Learning

Main references I plan to use include:

Andrew Ng’s Machine Learning course
Pattern Recognition and Machine Learning by Christopher Bishop
Related materials on statistical learning, probabilistic modeling, and representation learning

Key topics:

Supervised learning and empirical risk minimization
Bias–variance tradeoff
Linear regression and logistic regression
Support vector machines and kernel methods
Decision trees, random forests, and boosting
Clustering, Gaussian mixture models, and the EM algorithm
Probabilistic graphical models
Bayesian learning and approximate inference
Model selection, regularization, and cross-validation

This foundation helps connect classical learning theory with modern deep learning systems.
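
As a compact example that ties linear regression to regularization, here is a sketch of ridge regression in closed form, w = (XᵀX + λI)⁻¹Xᵀy, assuming NumPy. The data are synthetic and noiseless, and `w_true` is a made-up weight vector used only to check the result.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights
y = X @ w_true                       # noiseless targets

w0 = ridge_fit(X, y, lam=0.0)    # ordinary least squares
w10 = ridge_fit(X, y, lam=10.0)  # regularized: weights shrink toward zero
print(w0, w10)
```

With λ = 0 the fit recovers the true weights on this noiseless data. A larger λ trades some bias for a smaller weight norm, which is the bias–variance tradeoff in its simplest form.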

2.2 Statistical Learning Theory

Topics I aim to master:

Empirical risk vs. expected risk
Generalization and overfitting
VC dimension and model capacity
Rademacher complexity
Generalization bounds
Distribution shift, domain shift, and out-of-distribution detection
Robustness and uncertainty under changing environments

These topics are especially relevant for autonomous driving, where models must generalize across scenes, weather, domains, sensors, and traffic conditions.

2.3 Representation Learning

Modern deep learning depends heavily on learning useful representations.

Key concepts:

Invariance and equivariance
Contrastive learning and InfoNCE
Self-supervised learning methods such as MAE and DINO
Information Bottleneck theory
Inductive biases in CNNs and Transformers
Multi-modal representation learning
Token-based representation learning

These ideas are closely connected to my work on token-based 3D scene representations and collaborative perception.
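
To ground the contrastive learning item, the following is a minimal NumPy sketch of an InfoNCE-style loss over a batch of paired embeddings. The convention that row i of `z1` and row i of `z2` form the positive pair, the temperature of 0.1, and the random embeddings are all assumptions for illustration.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss where (z1[i], z2[i]) are positives and other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives lie on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
loss_random = info_nce(z, rng.normal(size=z.shape))
print(loss_aligned, loss_random)
```

Nearly identical views give a loss close to zero, while unrelated embeddings give a loss on the order of log N or higher.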

3. Deep Learning

Deep learning is the core technical foundation for modern computer vision and autonomous driving perception.

3.1 Neural Network Fundamentals

Topics to revisit:

Multilayer perceptrons and activation functions
CNNs and convolutional inductive bias
Normalization methods: BatchNorm, LayerNorm, RMSNorm
Residual connections and deep network training
Dropout, stochastic depth, and regularization
Loss functions for classification, segmentation, and dense prediction
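
To keep the normalization variants straight, here is a small NumPy sketch contrasting LayerNorm and RMSNorm on the last axis. It deliberately omits the learnable gain and bias that real implementations include, and the input statistics are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Subtract the per-row mean, then divide by the per-row standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Divide by the per-row root mean square; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 64))  # hypothetical activations

ln = layer_norm(x)
rn = rms_norm(x)
print(ln.mean(axis=-1), ln.std(axis=-1))       # about 0 and about 1
print(rn.mean(axis=-1), np.sqrt((rn ** 2).mean(axis=-1)))  # nonzero mean, RMS about 1
```

The difference is that RMSNorm only rescales, so an offset in the activations survives normalization.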

3.2 Transformers

Transformers are central to current computer vision, 3D perception, and multi-agent perception.

Key topics:

Scaled dot-product attention

\[\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Multi-head attention
Positional encodings: absolute, relative, and RoPE
Encoder, decoder, and cross-attention structures
Vision Transformers and patch embeddings
Efficient attention, FlashAttention, and sparse attention
Token pruning, token selection, and token merging

These concepts are directly related to my research on tokenized BEV representations, spatio-temporal memory, and communication-aware token merging.
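
The attention formula above translates almost line by line into NumPy. This is a minimal single-head sketch without masking, batching, or learned projections, and the tensor sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (num_queries, num_keys)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))  # 5 keys
V = rng.normal(size=(5, 2))  # 5 values of dimension 2

out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a convex combination of the value rows, which is why dropping or merging tokens changes the output only through the attention weights they would have received.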

3.3 Deep Learning for Computer Vision

Main references include:

Stanford CS231n: Deep Learning for Computer Vision
Andrew Ng’s Deep Learning Specialization
Modern papers on CNNs, Transformers, detection, segmentation, 3D vision, and autonomous driving perception

Important topics:

CNN architectures and feature pyramids
Object detection and segmentation
Metric learning and contrastive learning
Multi-task learning
Dense prediction and structured output learning
Multi-view and multi-modal perception

4. Computer Vision

Computer vision provides the core research foundation for perception systems.

4.1 Introduction to Computer Vision

I plan to systematically study introductory computer vision materials, including courses such as Cornell University’s Introduction to Computer Vision.

Core topics:

Image formation and camera models
Filtering, edges, corners, and feature descriptors
Homography and image alignment
Optical flow and motion estimation
Object recognition
Image segmentation
Tracking and video understanding

These topics provide the classical foundation behind many modern deep learning-based systems.

4.2 Multi-View Geometry

For autonomous driving and 3D perception, geometry is essential.

Important concepts:

Pinhole camera model
Intrinsic and extrinsic parameters
Coordinate transformations and SE(3)
Epipolar geometry and fundamental matrix
Triangulation and bundle adjustment
Perspective-n-Point pose estimation
Depth estimation and stereo geometry

These ideas are important for multi-view 3D perception, BEV transformation, occupancy prediction, and pose-aware collaborative fusion.
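
The pinhole model with intrinsics and extrinsics can be sketched in a few lines of NumPy. I assume here that the extrinsics `(R, t)` map world coordinates into the camera frame. Datasets differ on this convention, and the intrinsic values below are made up.

```python
import numpy as np

def project(points_world, K, R, t):
    """Project (N, 3) world points to pixels; returns (N, 2) pixels and (N,) depths."""
    pts_cam = points_world @ R.T + t   # world -> camera frame
    depth = pts_cam[:, 2]
    uvw = pts_cam @ K.T                # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective division
    return uv, depth

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

points = np.array([[0.0, 0.0, 10.0],
                   [1.0, 0.0, 10.0]])
uv, depth = project(points, K, R, t)
print(uv)     # first point lands on the principal point
print(depth)  # callers should mask points with depth <= 0
```

The returned depth matters because points behind the camera still produce finite pixel coordinates after the division.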

4.3 3D Scene Representations

Common representations include:

Point clouds and PointNet-style architectures
Voxels and sparse convolution
Bird’s-Eye-View representations
Meshes and surface representations
Implicit fields, occupancy fields, and signed distance functions
Gaussian splatting and neural scene representations

This part connects computer vision with computer graphics and 3D world modeling.
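
To see how an implicit field relates to a voxel grid, the sketch below samples the signed distance function of a sphere on a coarse grid and thresholds it into binary occupancy. It assumes NumPy, and the grid resolution and sphere radius are arbitrary toy values.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Coarse 5 x 5 x 5 grid over the cube [-1, 1]^3 (spacing 0.5).
axis = np.linspace(-1.0, 1.0, 5)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)  # (5, 5, 5, 3)

sdf = sphere_sdf(grid, center=np.zeros(3), radius=0.6)
occupancy = sdf <= 0.0   # a voxel center is occupied if it lies inside the sphere
print(occupancy.sum())   # number of occupied voxel centers
```

The signed distance field keeps sub-voxel surface information that the binary grid discards, which is one reason implicit representations are attractive for reconstruction.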

5. Computer Graphics

Computer graphics is increasingly important for computer vision, autonomous driving, simulation, embodied AI, and world modeling. It provides the tools for understanding 3D geometry, rendering, simulation, and scene representation.

5.1 Geometry and Rendering

Key topics:

3D transformations and homogeneous coordinates
Mesh representation and surface parameterization
Rasterization and z-buffering
Ray tracing and path tracing
Shading models and lighting
Texture mapping and material representation
Differentiable rendering

These topics help me understand the bridge between 3D vision, neural rendering, simulation, and occupancy-based world modeling.
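
Z-buffering is simple enough to sketch for point primitives: each pixel keeps the smallest depth seen so far. The version below assumes NumPy and uses an explicit loop for clarity rather than speed. The pixel coordinates and depths are made-up inputs, such as the output of a camera projection.

```python
import numpy as np

def zbuffer_points(uv, depth, height, width):
    """Render (N, 2) pixel coordinates with (N,) depths into a depth map; inf means empty."""
    zbuf = np.full((height, width), np.inf)
    px = np.floor(uv).astype(int)
    for (u, v), z in zip(px, depth):
        inside = 0 <= u < width and 0 <= v < height
        # Keep the nearest surface in front of the camera.
        if inside and z > 0 and z < zbuf[v, u]:
            zbuf[v, u] = z
    return zbuf

uv = np.array([[1.2, 0.7], [1.8, 0.3], [3.0, 2.0], [9.0, 9.0]])
depth = np.array([5.0, 3.0, 2.0, 1.0])
zbuf = zbuffer_points(uv, depth, height=3, width=4)
print(zbuf)
```

The first two points fall into the same pixel, and only the nearer one survives. The last point is outside the image and is discarded.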

5.2 Neural Graphics and 3D Reconstruction

Important topics:

Neural Radiance Fields (NeRF)
3D Gaussian Splatting
Differentiable rendering
Neural implicit surfaces
Scene reconstruction and view synthesis
Simulation-to-real transfer

These methods are relevant to 3D scene understanding, autonomous driving simulation, and embodied perception.
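
The core of NeRF-style volume rendering is a discrete quadrature along each ray: sample i gets opacity α_i = 1 − exp(−σ_i δ_i) and weight w_i = T_i α_i, where T_i is the product of (1 − α_j) over earlier samples. The sketch below assumes NumPy. The densities, spacings, and colors are made-up values for a single ray.

```python
import numpy as np

def render_weights(sigma, delta):
    """Per-sample compositing weights along one ray from densities and step sizes."""
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance before each sample: product of (1 - alpha) over earlier samples.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return trans * alpha

sigma = np.array([0.0, 0.5, 50.0, 1.0])  # hypothetical densities; third sample is a surface
delta = np.full(4, 0.25)                 # uniform spacing between samples
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])

weights = render_weights(sigma, delta)
rgb = weights @ colors                   # composited color for this ray
print(weights, rgb)
```

The weights sum to one minus the transmittance left after the last sample, so a ray that hits a dense surface puts almost all of its weight there.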

6. Reinforcement Learning and Decision Making

Although my main focus is perception, reinforcement learning is important for understanding agents, planning, and embodied intelligence.

Key topics:

Markov Decision Processes
Dynamic programming
Monte Carlo methods
Temporal-difference learning
Q-learning and policy gradients
Actor–critic methods
Model-based reinforcement learning
Offline reinforcement learning
Multi-agent reinforcement learning

These topics help connect perception with downstream decision-making and control.
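
As a concrete anchor for temporal-difference learning, here is the tabular Q-learning update on a toy problem, in plain Python. The three-state chain and the deterministic sweeps over every state-action pair (in place of exploration) are simplifications I made up so the result can be checked by hand.

```python
def step(state, action):
    """Hypothetical chain: action 1 moves right, action 0 stays; state 2 is terminal."""
    next_state = state + 1 if action == 1 else state
    done = next_state == 2
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(sweeps=300, alpha=0.5, gamma=0.9):
    """Apply the Q-learning update to every (state, action) pair repeatedly."""
    Q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(sweeps):
        for s in (0, 1):
            for a in (0, 1):
                s_next, r, done = step(s, a)
                # Bootstrapped target: reward plus discounted best next value.
                target = r if done else r + gamma * max(Q[s_next])
                Q[s][a] += alpha * (target - Q[s][a])
    return Q

Q = q_learning()
print(Q)
```

The values converge to the Bellman fixed point: moving right from state 1 is worth 1, and everything one step further from the goal is discounted by γ = 0.9.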

7. AI Agents and Embodied Intelligence

AI agents and embodied intelligence are becoming increasingly important for future intelligent systems. For my research direction, I mainly focus on the perception and world-modeling layer of embodied agents.

7.1 AI Agents

Key topics:

Agent architectures: perception, memory, planning, and action
Tool use and environment interaction
Memory mechanisms and retrieval
Planning and reasoning
Multi-agent cooperation
Evaluation of agent behavior

7.2 Embodied Intelligence

Important topics:

Embodied perception
Visual navigation
Object interaction and manipulation
Scene memory and spatial reasoning
Simulators and embodied benchmarks
World models for embodied agents

This area connects naturally to my interest in occupancy world models, temporal reasoning, and predictive 3D scene understanding.

8. Autonomous Driving Perception

My main research direction lies in autonomous driving perception and 3D scene understanding.

8.1 BEV Representation and Sensor Fusion

Key paradigms:

Early, middle, and late fusion
Camera-only BEV perception
LiDAR-camera fusion
Lift-splat style view transformation
Cross-attention based 3D lifting
Temporal alignment and ego-motion compensation
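
Ego-motion compensation, the last item above, amounts to re-expressing points from a previous ego frame in the current one. The sketch below does this in 2D (BEV) with NumPy. It assumes poses are given as ego-to-world SE(2) matrices with x forward and y left, and the poses and points are toy values.

```python
import numpy as np

def se2_matrix(x, y, yaw):
    """Homogeneous 3 x 3 matrix for an ego-to-world pose in the ground plane."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s, c, y],
                     [0.0, 0.0, 1.0]])

def warp_to_current(points_prev, pose_prev, pose_cur):
    """Map (N, 2) points from the previous ego frame into the current ego frame."""
    T = np.linalg.inv(pose_cur) @ pose_prev   # previous ego -> world -> current ego
    homo = np.hstack([points_prev, np.ones((len(points_prev), 1))])
    return (homo @ T.T)[:, :2]

# The ego vehicle drives 2 m forward; a static object 5 m ahead is now 3 m ahead.
pose_prev = se2_matrix(0.0, 0.0, 0.0)
pose_cur = se2_matrix(2.0, 0.0, 0.0)
warped = warp_to_current(np.array([[5.0, 0.0]]), pose_prev, pose_cur)
print(warped)
```

For feature maps rather than points, the same transform would define the sampling grid used to resample the previous BEV features.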

8.2 Occupancy Prediction

Important topics:

Binary occupancy and semantic occupancy
Voxel grid representation
Visibility reasoning and occlusion handling
Class imbalance in dense prediction
Occupancy forecasting
Occupancy world models

This is one of my core research interests.
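
The simplest form of the voxel grid representation is a binary occupancy grid built from a point cloud. Here is a minimal NumPy sketch. The range format `[x_min, y_min, z_min, x_max, y_max, z_max]` and the toy points are assumptions for illustration, not a specific dataset's convention.

```python
import numpy as np

def voxelize(points, pc_range, voxel_size):
    """Convert (N, 3) points into a boolean occupancy grid over the given range."""
    mins = np.asarray(pc_range[:3], dtype=float)
    maxs = np.asarray(pc_range[3:], dtype=float)
    dims = np.round((maxs - mins) / voxel_size).astype(int)
    idx = np.floor((points - mins) / voxel_size).astype(int)
    # Drop points that fall outside the grid.
    valid = np.all((idx >= 0) & (idx < dims), axis=1)
    grid = np.zeros(dims, dtype=bool)
    grid[tuple(idx[valid].T)] = True
    return grid

points = np.array([[0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.2],    # same voxel as the first point
                   [1.5, 0.5, 0.5],
                   [10.0, 0.0, 0.0]])  # outside the range, ignored
grid = voxelize(points, pc_range=[0, 0, 0, 2, 2, 1], voxel_size=0.5)
print(grid.shape, grid.sum())
```

Even in this toy grid most voxels are empty, which hints at the class imbalance that dense occupancy losses have to handle.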

8.3 Temporal Modeling and World Models

Dynamic environments require temporal reasoning.

Topics include:

Temporal memory and feature aggregation
Scene flow and motion modeling
Future occupancy prediction
4D occupancy forecasting
World models for autonomous driving
Online vs. offline perception constraints

9. Collaborative Perception and Communication

Collaborative perception introduces multi-agent reasoning and communication constraints.

9.1 Collaboration Paradigms

Important research questions:

What information should agents communicate?
When should they communicate?
Which agents should they communicate with?
How should received information be aligned and fused?

Common fusion methods include feature concatenation, attention-based fusion, graph-based aggregation, and token-level fusion.

9.2 Communication Efficiency

Bandwidth constraints require efficient message design.

Key ideas:

Feature compression
Token selection and token merging
Quantization and pruning
Task-aware communication
Adaptive communication budgets
Rate–distortion style trade-offs

These ideas directly connect to my current research on communication-efficient collaborative occupancy prediction.
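
As a baseline for thinking about communication budgets, the sketch below keeps the top-k tokens by a per-token score and counts the resulting message size. This is a generic illustration assuming NumPy, not the method from my research. The scores, the token dimension, and the 2-bytes-per-value (float16) assumption are all made up.

```python
import numpy as np

def select_tokens(tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    k = min(budget, len(tokens))
    keep = np.sort(np.argsort(-scores)[:k])
    return tokens[keep], keep

def message_bytes(num_tokens, dim, bytes_per_value=2):
    """Payload size when each feature value takes `bytes_per_value` bytes."""
    return num_tokens * dim * bytes_per_value

tokens = np.arange(4 * 8, dtype=float).reshape(4, 8)  # 4 hypothetical tokens of dim 8
scores = np.array([0.1, 0.9, 0.5, 0.7])               # hypothetical importance scores

sent, keep = select_tokens(tokens, scores, budget=2)
print(keep, message_bytes(len(sent), tokens.shape[1]), message_bytes(len(tokens), tokens.shape[1]))
```

Halving the number of tokens halves the payload, so the interesting question is how much task accuracy the dropped tokens would have contributed.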

10. Suggested Learning Resources

This roadmap will be supported by several courses and books:

Matrix Theory: Northwestern Polytechnical University matrix theory materials
Numerical Analysis: standard numerical analysis textbooks and course notes
Machine Learning: Andrew Ng’s Machine Learning course
Pattern Recognition: Christopher Bishop’s Pattern Recognition and Machine Learning
Deep Learning: Andrew Ng’s Deep Learning Specialization
Computer Vision: Cornell University’s Introduction to Computer Vision
Deep Vision: Stanford CS231n: Deep Learning for Computer Vision
Reinforcement Learning: Sutton and Barto’s Reinforcement Learning: An Introduction
Computer Graphics: introductory graphics courses and neural rendering papers
Autonomous Driving: papers on BEV perception, occupancy prediction, collaborative perception, and world models
AI Agents and Embodied AI: recent papers and surveys on agents, embodied perception, and world modeling

Closing Thoughts

This roadmap represents the knowledge foundation I want to build for PhD research.

For each topic, my goal is to be able to:

define the concept clearly;
explain why it matters;
connect it to my research;
implement representative algorithms;
read and critique related papers.

Research is not just about knowing isolated methods. It is about connecting ideas across fields. This knowledge base is my attempt to build those connections systematically, from mathematics and machine learning to computer vision, graphics, autonomous driving, and embodied AI.

Building My PhD Knowledge Base for Computer Vision
