Building My PhD Knowledge Base for Computer Vision
As I prepare for PhD applications in computer vision, autonomous driving, and embodied perception, I have come to realize that strong research ideas must rest on a broad and systematic knowledge foundation.
My current research focuses on 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models. These topics connect many areas: mathematics, machine learning, deep learning, computer vision, computer graphics, robotics, reinforcement learning, autonomous driving, and AI agents.
This post is a personal roadmap for building that foundation. It is meant not as a fixed curriculum but as a structured knowledge base that helps me connect theory, algorithms, systems, and research problems.
1. Mathematical Foundations
Mathematics is the language behind machine learning, computer vision, graphics, robotics, and autonomous driving. My goal is not only to learn formulas, but also to understand how mathematical tools explain model behavior, optimization stability, geometric reasoning, and uncertainty.
1.1 Matrix Theory and Linear Algebra
A deeper understanding of matrices is essential for attention mechanisms, optimization, geometry, and 3D vision. I plan to study linear algebra alongside a more advanced matrix theory course, using materials such as the matrix theory course from Northwestern Polytechnical University.
Key topics:
- Vector spaces, basis, dimension, rank, null space, and subspaces
- Linear transformations and change of basis
- Orthogonality, projections, least squares, and Gram–Schmidt
- Eigenvalues, eigenvectors, diagonalization, and spectral decomposition
- Singular Value Decomposition (SVD) and low-rank approximation
- Positive definite and positive semidefinite matrices
- Matrix norms, condition numbers, and numerical stability
- Block matrices, Schur complement, and matrix inequalities
- Matrix calculus for deep learning
These ideas appear everywhere in my research, especially in attention, token merging, low-rank structure, BEV representations, and 3D geometric transformations.
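To make the SVD and low-rank approximation topics concrete, here is a minimal NumPy sketch. The helper name `low_rank_approx` and the synthetic matrix are illustrative assumptions, not part of any particular course. It checks the Eckart–Young property that the Frobenius error of the rank-k truncation equals the energy of the discarded singular values.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k truncated SVD of A (illustrative helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# Synthetic data (assumed): a rank-3 signal plus small noise.
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))
A += 0.01 * rng.standard_normal(A.shape)

A2 = low_rank_approx(A, 2)
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2)          # Frobenius norm of the residual
tail = np.sqrt(np.sum(s[2:] ** 2))    # energy in the discarded singular values
print(err, tail)                      # the two values should agree
```

The same truncation idea is one way to think about compressing feature maps or token sets, although real systems usually learn the low-rank structure rather than computing an explicit SVD.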
1.2 Numerical Analysis
Numerical analysis is important for understanding why algorithms are stable, efficient, or fragile in practice. It is especially useful for optimization, geometry, simulation, and scientific computing.
Key topics:
- Floating-point representation and numerical errors
- Stability, convergence, and conditioning
- Solving linear systems and least-squares problems
- Iterative methods such as Jacobi, Gauss–Seidel, and conjugate gradient
- Numerical optimization and line search
- Interpolation and approximation
- Numerical differentiation and integration
- ODE solvers and basic simulation methods
For machine learning and perception systems, numerical analysis helps explain training instability, gradient explosion, ill-conditioned optimization, and precision-related issues in embedded or GPU deployment.
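As a small illustration of conditioning, the sketch below solves a linear system with a Hilbert matrix, a classic ill-conditioned example. The function names are hypothetical; the point is only that the solution error grows with the condition number even though the solver itself is unchanged.

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix with entries 1 / (i + j - 1), 1-indexed."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)

def solve_error(n):
    """Return (condition number, max abs error) when solving H x = b."""
    H = hilbert(n)
    x_true = np.ones(n)
    b = H @ x_true
    x = np.linalg.solve(H, b)
    return np.linalg.cond(H), np.max(np.abs(x - x_true))

for n in (3, 6, 10):
    cond, err = solve_error(n)
    print(f"n={n:2d}  cond={cond:.2e}  max error={err:.2e}")
```

A useful rule of thumb is that a condition number around 10^k can cost roughly k decimal digits of accuracy, which is why the same system may behave very differently in float32 and float64.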
1.3 Probability and Statistics
Probability is fundamental for perception, uncertainty estimation, sensor fusion, occupancy prediction, and world modeling.
Important concepts:
- Random variables, PMF, PDF, CDF, expectation, variance, and covariance
- Common distributions: Gaussian, Bernoulli, Categorical, Poisson, Dirichlet
- Multivariate Gaussian distributions and covariance structure
- Conditional probability and Bayes’ rule
- Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
- KL divergence, JS divergence, entropy, and cross entropy
- Hypothesis testing and confidence intervals
- Monte Carlo estimation and importance sampling
- Uncertainty estimation, calibration, and reliability diagrams
These concepts are directly related to semantic occupancy prediction, probabilistic scene representation, uncertainty-aware perception, and future occupancy forecasting.
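The relationship between entropy, cross entropy, and KL divergence can be checked numerically. The sketch below assumes discrete distributions with strictly positive probabilities (so no special handling of zeros) and uses made-up example values for `p` and `q`.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes p and q are strictly positive."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # e.g. a "true" class distribution (assumed)
q = np.array([0.5, 0.3, 0.2])   # e.g. a model prediction (assumed)

# Identity: H(p, q) = H(p) + KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```

This identity is why minimizing cross entropy against fixed labels is equivalent to minimizing the KL divergence from the label distribution to the prediction.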
1.4 Optimization
Training deep neural networks is fundamentally an optimization problem. Understanding optimization also helps me reason about convergence, generalization, and the behavior of large models.
Core topics:
- Gradient computation and backpropagation
- SGD, Momentum, Nesterov acceleration
- Adam, AdamW, and adaptive optimization
- Learning-rate schedules: warmup, cosine decay, and step decay
- Regularization: weight decay, dropout, stochastic depth
- Constrained optimization, Lagrangian methods, and KKT conditions
- Hessian, curvature, saddle points, and local minima
- Sharp vs. flat minima and generalization
- Numerical stability tricks such as log-sum-exp
These principles are essential for designing stable training pipelines for 3D perception, transformers, occupancy prediction, and multi-agent systems.
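The log-sum-exp trick mentioned in the list is easy to demonstrate. This is a minimal NumPy sketch with illustrative function names; it subtracts the maximum before exponentiating so that large logits do not overflow.

```python
import numpy as np

def naive_logsumexp(x):
    """Direct formula; overflows to inf for large inputs."""
    with np.errstate(over="ignore"):
        return np.log(np.sum(np.exp(x)))

def logsumexp(x):
    """Stable version: shift by the max so every exponent is <= 0."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    """Softmax computed through the stable log-sum-exp."""
    return np.exp(x - logsumexp(x))

logits = np.array([1000.0, 1000.0])
print(naive_logsumexp(logits))  # inf, because exp(1000) overflows
print(logsumexp(logits))        # 1000 + log(2)
print(softmax(logits))          # [0.5, 0.5]
```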
2. Machine Learning Foundations
Before studying advanced vision systems, I need a strong foundation in classical and modern machine learning.
2.1 Core Machine Learning
Main references I plan to use include:
- Andrew Ng’s Machine Learning course
- Pattern Recognition and Machine Learning by Christopher Bishop
- Related materials on statistical learning, probabilistic modeling, and representation learning
Key topics:
- Supervised learning and empirical risk minimization
- Bias–variance tradeoff
- Linear regression and logistic regression
- Support vector machines and kernel methods
- Decision trees, random forests, and boosting
- Clustering, Gaussian mixture models, and EM algorithm
- Probabilistic graphical models
- Bayesian learning and approximate inference
- Model selection, regularization, and cross-validation
This foundation helps connect classical learning theory with modern deep learning systems.
2.2 Statistical Learning Theory
Topics I aim to master:
- Empirical risk vs. expected risk
- Generalization and overfitting
- VC dimension and model capacity
- Rademacher complexity
- Generalization bounds
- Distribution shift, domain shift, and out-of-distribution detection
- Robustness and uncertainty under changing environments
These topics are especially relevant for autonomous driving, where models must generalize across scenes, weather, domains, sensors, and traffic conditions.
2.3 Representation Learning
Modern deep learning depends heavily on learning useful representations.
Key concepts:
- Invariance and equivariance
- Contrastive learning and InfoNCE
- Self-supervised learning such as MAE and DINO
- Information Bottleneck theory
- Inductive biases in CNNs and Transformers
- Multi-modal representation learning
- Token-based representation learning
These ideas are closely connected to my work on token-based 3D scene representations and collaborative perception.
3. Deep Learning
Deep learning is the core technical foundation for modern computer vision and autonomous driving perception.
3.1 Neural Network Fundamentals
Topics to revisit:
- Multilayer perceptrons and activation functions
- CNNs and convolutional inductive bias
- Normalization methods: BatchNorm, LayerNorm, RMSNorm
- Residual connections and deep network training
- Dropout, stochastic depth, and regularization
- Loss functions for classification, segmentation, and dense prediction
3.2 Transformers
Transformers are central to current computer vision, 3D perception, and multi-agent perception.
Key topics:
- Scaled dot-product attention
- Multi-head attention
- Positional encodings: absolute, relative, and RoPE
- Encoder, decoder, and cross-attention structures
- Vision Transformers and patch embeddings
- Efficient attention, FlashAttention, and sparse attention
- Token pruning, token selection, and token merging
These concepts are directly related to my research on tokenized BEV representations, spatio-temporal memory, and communication-aware token merging.
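For reference, scaled dot-product attention can be written in a few lines of NumPy. This is a single-head sketch without masking or learned projections, and the shapes used below are arbitrary assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    # Subtract the row-wise max for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query tokens, dimension 8
K = rng.standard_normal((6, 8))   # 6 key tokens
V = rng.standard_normal((6, 5))   # 6 value tokens, dimension 5

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))  # (4, 5) and rows summing to 1
```

Cross-attention uses the same function with queries from one token set (for example BEV queries) and keys and values from another (for example image features).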
3.3 Deep Learning for Computer Vision
Main references include:
- Stanford CS231n: Deep Learning for Computer Vision
- Andrew Ng’s Deep Learning Specialization
- Modern papers on CNNs, Transformers, detection, segmentation, 3D vision, and autonomous driving perception
Important topics:
- CNN architectures and feature pyramids
- Object detection and segmentation
- Metric learning and contrastive learning
- Multi-task learning
- Dense prediction and structured output learning
- Multi-view and multi-modal perception
4. Computer Vision
Computer vision provides the core research foundation for perception systems.
4.1 Introduction to Computer Vision
I plan to systematically study introductory computer vision materials, including courses such as Cornell University’s Introduction to Computer Vision.
Core topics:
- Image formation and camera models
- Filtering, edges, corners, and feature descriptors
- Homography and image alignment
- Optical flow and motion estimation
- Object recognition
- Image segmentation
- Tracking and video understanding
These topics provide the classical foundation behind many modern deep learning-based systems.
4.2 Multi-View Geometry
For autonomous driving and 3D perception, geometry is essential.
Important concepts:
- Pinhole camera model
- Intrinsic and extrinsic parameters
- Coordinate transformations and SE(3)
- Epipolar geometry and fundamental matrix
- Triangulation and bundle adjustment
- Perspective-n-Point pose estimation
- Depth estimation and stereo geometry
These ideas are important for multi-view 3D perception, BEV transformation, occupancy prediction, and pose-aware collaborative fusion.
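The pinhole model with intrinsic and extrinsic parameters reduces to two matrix products and a perspective division. The sketch below uses made-up intrinsics (focal length 500 px, principal point at (320, 240)) and an identity pose; `project_points` is an illustrative name.

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Project Nx3 world points to pixels with x_cam = R X + t, then K.

    Returns (Nx2 pixel coordinates, N depths along the camera z-axis).
    Points with non-positive depth are not filtered in this sketch.
    """
    X_cam = X_world @ R.T + t          # world -> camera frame (extrinsics)
    uvw = X_cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3], X_cam[:, 2]

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])  # assumed intrinsics
R = np.eye(3)                          # assumed: camera aligned with world
t = np.zeros(3)

X = np.array([[0.0, 0.0, 5.0],
              [1.0, 0.0, 5.0]])
pixels, depth = project_points(X, K, R, t)
print(pixels)  # the first point lands on the principal point
```

Lifting image features to 3D or BEV runs this mapping in the other direction, which is why depth (or a distribution over depth) is needed to undo the perspective division.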
4.3 3D Scene Representations
Common representations include:
- Point clouds and PointNet-style architectures
- Voxels and sparse convolution
- Bird’s-Eye-View representations
- Meshes and surface representations
- Implicit fields, occupancy fields, and signed distance functions
- Gaussian splatting and neural scene representations
This part connects computer vision with computer graphics and 3D world modeling.
5. Computer Graphics
Computer graphics is increasingly important for computer vision, autonomous driving, simulation, embodied AI, and world modeling. It provides the tools for understanding 3D geometry, rendering, simulation, and scene representation.
5.1 Geometry and Rendering
Key topics:
- 3D transformations and homogeneous coordinates
- Mesh representation and surface parameterization
- Rasterization and z-buffering
- Ray tracing and path tracing
- Shading models and lighting
- Texture mapping and material representation
- Differentiable rendering
These topics help me understand the bridge between 3D vision, neural rendering, simulation, and occupancy-based world modeling.
5.2 Neural Graphics and 3D Reconstruction
Important topics:
- Neural Radiance Fields (NeRF)
- 3D Gaussian Splatting
- Differentiable rendering
- Neural implicit surfaces
- Scene reconstruction and view synthesis
- Simulation-to-real transfer
These methods are relevant to 3D scene understanding, autonomous driving simulation, and embodied perception.
6. Reinforcement Learning and Decision Making
Although my main focus is perception, reinforcement learning is important for understanding agents, planning, and embodied intelligence.
Key topics:
- Markov Decision Processes
- Dynamic programming
- Monte Carlo methods
- Temporal-difference learning
- Q-learning and policy gradients
- Actor–critic methods
- Model-based reinforcement learning
- Offline reinforcement learning
- Multi-agent reinforcement learning
These topics help connect perception with downstream decision-making and control.
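As one concrete anchor for the list above, the tabular Q-learning update is a single temporal-difference step. The sketch below uses a toy two-state, two-action table and arbitrary values for the learning rate `alpha` and discount `gamma`.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, done=False):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                           # 2 states x 2 actions (toy table)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])                                 # 0.5 after the first update
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])                                 # 0.75 after the second update
```

The update is off-policy because the target uses the greedy action in the next state, regardless of which action the agent actually takes there.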
7. AI Agents and Embodied Intelligence
AI agents and embodied intelligence are becoming increasingly important for future intelligent systems. For my research direction, I mainly focus on the perception and world-modeling layer of embodied agents.
7.1 AI Agents
Key topics:
- Agent architectures: perception, memory, planning, and action
- Tool use and environment interaction
- Memory mechanisms and retrieval
- Planning and reasoning
- Multi-agent cooperation
- Evaluation of agent behavior
7.2 Embodied Intelligence
Important topics:
- Embodied perception
- Visual navigation
- Object interaction and manipulation
- Scene memory and spatial reasoning
- Simulators and embodied benchmarks
- World models for embodied agents
This area connects naturally to my interest in occupancy world models, temporal reasoning, and predictive 3D scene understanding.
8. Autonomous Driving Perception
My main research direction lies in autonomous driving perception and 3D scene understanding.
8.1 BEV Representation and Sensor Fusion
Key paradigms:
- Early, middle, and late fusion
- Camera-only BEV perception
- LiDAR-camera fusion
- Lift-splat style view transformation
- Cross-attention based 3D lifting
- Temporal alignment and ego-motion compensation
8.2 Occupancy Prediction
Important topics:
- Binary occupancy and semantic occupancy
- Voxel grid representation
- Visibility reasoning and occlusion handling
- Class imbalance in dense prediction
- Occupancy forecasting
- Occupancy world models
This is one of my core research interests.
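A binary occupancy grid can be built from a point cloud by quantizing coordinates into voxel indices. This is a simplified sketch with hypothetical parameter names; it ignores semantics, visibility, and free-space reasoning, which real occupancy labels need to handle.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Mark voxels that contain at least one point.

    points: Nx3 array; grid_min: lower corner of the grid (3,);
    voxel_size: edge length of a cubic voxel; grid_shape: (X, Y, Z).
    """
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    # Discard points that fall outside the grid bounds.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[inside].T)] = True
    return occ

points = np.array([[0.1, 0.1, 0.1],
                   [0.9, 0.1, 0.1],
                   [5.0, 5.0, 5.0]])   # the last point lies outside the grid
occ = voxelize(points, grid_min=[0.0, 0.0, 0.0], voxel_size=0.5,
               grid_shape=(2, 2, 2))
print(occ.sum())                       # 2 occupied voxels
```

Even in this toy grid most voxels are empty, which hints at the class imbalance issue listed above.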
8.3 Temporal Modeling and World Models
Dynamic environments require temporal reasoning.
Topics include:
- Temporal memory and feature aggregation
- Scene flow and motion modeling
- Future occupancy prediction
- 4D occupancy forecasting
- World models for autonomous driving
- Online vs. offline perception constraints
9. Collaborative Perception and Communication
Collaborative perception introduces multi-agent reasoning and communication constraints.
9.1 Collaboration Paradigms
Important research questions:
- What information should agents communicate?
- When should they communicate?
- Which agents should they communicate with?
- How should received information be aligned and fused?
Common fusion methods include feature concatenation, attention-based fusion, graph-based aggregation, and token-level fusion.
9.2 Communication Efficiency
Bandwidth constraints require efficient message design.
Key ideas:
- Feature compression
- Token selection and token merging
- Quantization and pruning
- Task-aware communication
- Adaptive communication budgets
- Rate–distortion style trade-offs
These ideas directly connect to my current research on communication-efficient collaborative occupancy prediction.
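The simplest form of token selection under a communication budget is to keep the highest-scoring tokens. The sketch below is a generic top-k baseline, not the method used in my research; the importance scores are assumed to come from some upstream module and are hard-coded here for illustration.

```python
import numpy as np

def select_tokens(tokens, scores, budget):
    """Keep at most `budget` tokens with the highest scores.

    Returns the kept tokens and their original indices (in ascending
    order, so the receiver can restore spatial positions).
    """
    k = min(budget, len(tokens))
    keep = np.argsort(scores)[::-1][:k]   # indices of the k largest scores
    keep = np.sort(keep)
    return tokens[keep], keep

tokens = np.arange(12, dtype=float).reshape(6, 2)    # 6 tokens, dimension 2
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])    # hypothetical importance
sent, kept_idx = select_tokens(tokens, scores, budget=3)
print(kept_idx)                                      # indices 1, 3, 5
print(sent.nbytes / tokens.nbytes)                   # fraction of payload sent
```

Token merging differs from selection in that discarded tokens are folded into the kept ones instead of being dropped, trading a little extra computation for less information loss at the same budget.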
10. Suggested Learning Resources
This roadmap will be supported by several courses and books:
- Matrix Theory: Northwestern Polytechnical University matrix theory materials
- Numerical Analysis: standard numerical analysis textbooks and course notes
- Machine Learning: Andrew Ng’s Machine Learning course
- Pattern Recognition: Christopher Bishop’s Pattern Recognition and Machine Learning
- Deep Learning: Andrew Ng’s Deep Learning Specialization
- Computer Vision: Cornell University’s Introduction to Computer Vision
- Deep Vision: Stanford CS231n: Deep Learning for Computer Vision
- Reinforcement Learning: Sutton and Barto’s Reinforcement Learning: An Introduction
- Computer Graphics: introductory graphics courses and neural rendering papers
- Autonomous Driving: papers on BEV perception, occupancy prediction, collaborative perception, and world models
- AI Agents and Embodied AI: recent papers and surveys on agents, embodied perception, and world modeling
Closing Thoughts
This roadmap represents the knowledge foundation I want to build for PhD research.
For each topic, my goal is to be able to:
- define the concept clearly;
- explain why it matters;
- connect it to my research;
- implement representative algorithms;
- read and critique related papers.
Research is not just about knowing isolated methods. It is about connecting ideas across fields. This knowledge base is my attempt to build those connections systematically, from mathematics and machine learning to computer vision, graphics, autonomous driving, and embodied AI.