Building My PhD Knowledge Base for Computer Vision

As I prepare for PhD applications in computer vision, autonomous driving, and embodied perception, I have come to realize that strong research ideas must rest on a broad and systematic knowledge foundation.

My current research focuses on 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models. These topics connect many areas: mathematics, machine learning, deep learning, computer vision, computer graphics, robotics, reinforcement learning, autonomous driving, and AI agents.

This post is a personal roadmap for building that foundation. It is not meant to be a fixed curriculum, but a structured knowledge base that helps me connect theory, algorithms, systems, and research problems.


1. Mathematical Foundations

Mathematics is the language behind machine learning, computer vision, graphics, robotics, and autonomous driving. My goal is not only to learn formulas, but also to understand how mathematical tools explain model behavior, optimization stability, geometric reasoning, and uncertainty.

1.1 Matrix Theory and Linear Algebra

A deeper understanding of matrices is essential for attention mechanisms, optimization, geometry, and 3D vision. I plan to study linear algebra together with a more advanced matrix theory course, using materials such as the matrix theory course from Northwestern Polytechnical University.

Key topics:

  • Vector spaces, basis, dimension, rank, null space, and subspaces
  • Linear transformations and change of basis
  • Orthogonality, projections, least squares, and Gram–Schmidt
  • Eigenvalues, eigenvectors, diagonalization, and spectral decomposition
  • Singular Value Decomposition (SVD) and low-rank approximation
  • Positive definite and positive semidefinite matrices
  • Matrix norms, condition numbers, and numerical stability
  • Block matrices, Schur complement, and matrix inequalities
  • Matrix calculus for deep learning

These ideas appear everywhere in my research, especially in attention, token merging, low-rank structure, BEV representations, and 3D geometric transformations.
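
As a small illustration of the SVD and low-rank approximation item above, here is a minimal NumPy sketch. It truncates the SVD of a matrix and checks the Eckart-Young property that the spectral-norm error of the best rank-k approximation equals the first discarded singular value. The matrix, its shape, and the rank `k` are arbitrary choices for illustration, and `low_rank_approx` is a hypothetical helper name.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k truncation of the SVD of A (optimal by the Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then multiply by the first k right singular vectors.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # arbitrary example matrix
A_k, s = low_rank_approx(A, k=2)

# The spectral-norm error equals the first discarded singular value s[2].
err = np.linalg.norm(A - A_k, ord=2)
print(f"error = {err:.6f}, s[2] = {s[2]:.6f}")
```

The same truncation idea is one way to think about low-rank structure in attention maps or feature tensors, although real models rarely use an explicit SVD at inference time.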


1.2 Numerical Analysis

Numerical analysis is important for understanding why algorithms are stable, efficient, or fragile in practice. It is especially useful for optimization, geometry, simulation, and scientific computing.

Key topics:

  • Floating-point representation and numerical errors
  • Stability, convergence, and conditioning
  • Solving linear systems and least-squares problems
  • Iterative methods such as Jacobi, Gauss–Seidel, and conjugate gradient
  • Numerical optimization and line search
  • Interpolation and approximation
  • Numerical differentiation and integration
  • ODE solvers and basic simulation methods

For machine learning and perception systems, numerical analysis helps explain training instability, gradient explosion, ill-conditioned optimization, and precision-related issues in embedded or GPU deployment.
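
To make conditioning concrete, the sketch below solves a linear system with a Hilbert matrix, a standard example of an ill-conditioned problem. The size `n = 10` and the all-ones solution are my own choices for illustration. The point is that the residual can be tiny while the recovered solution is still noticeably wrong, because the condition number amplifies rounding errors.

```python
import numpy as np

n = 10
i = np.arange(1, n + 1)
H = 1.0 / (i[:, None] + i[None, :] - 1)   # Hilbert matrix: H[i, j] = 1 / (i + j - 1)

x_true = np.ones(n)
b = H @ x_true
x_hat = np.linalg.solve(H, b)

cond = np.linalg.cond(H)                   # 2-norm condition number
residual = np.linalg.norm(H @ x_hat - b)   # small: the solver is backward stable
forward_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

print(f"cond(H) = {cond:.2e}")
print(f"residual = {residual:.2e}, relative error in x = {forward_err:.2e}")
```

A small residual is therefore not evidence of an accurate solution when the problem itself is ill-conditioned.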


1.3 Probability and Statistics

Probability is fundamental for perception, uncertainty estimation, sensor fusion, occupancy prediction, and world modeling.

Important concepts:

  • Random variables, PMF, PDF, CDF, expectation, variance, and covariance
  • Common distributions: Gaussian, Bernoulli, Categorical, Poisson, Dirichlet
  • Multivariate Gaussian distributions and covariance structure
  • Conditional probability and Bayes’ rule
  • Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
  • KL divergence, JS divergence, entropy, and cross entropy
  • Hypothesis testing and confidence intervals
  • Monte Carlo estimation and importance sampling
  • Uncertainty estimation, calibration, and reliability diagrams

These concepts are directly related to semantic occupancy prediction, probabilistic scene representation, uncertainty-aware perception, and future occupancy forecasting.
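
The relationship between entropy, cross entropy, and KL divergence listed above can be checked numerically. This is a minimal sketch for discrete distributions with strictly positive probabilities (an assumption that avoids handling `log(0)`); the two distributions `p` and `q` are made-up examples.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, in nats."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # e.g. a "true" class distribution
q = np.array([0.5, 0.3, 0.2])   # e.g. a model prediction

# Identity: H(p, q) = H(p) + KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```

Because H(p) does not depend on the model, minimizing cross entropy with respect to q is equivalent to minimizing KL(p || q), which is why cross-entropy losses appear in classification and semantic occupancy prediction.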


1.4 Optimization

Training deep neural networks is fundamentally an optimization problem. Understanding optimization also helps me reason about convergence, generalization, and the behavior of large models.

Core topics:

  • Gradient computation and backpropagation
  • SGD, Momentum, Nesterov acceleration
  • Adam, AdamW, and adaptive optimization
  • Learning-rate schedules: warmup, cosine decay, and step decay
  • Regularization: weight decay, dropout, stochastic depth
  • Constrained optimization, Lagrangian methods, and KKT conditions
  • Hessian, curvature, saddle points, and local minima
  • Sharp vs. flat minima and generalization
  • Numerical stability tricks such as log-sum-exp

These principles are essential for designing stable training pipelines for 3D perception, transformers, occupancy prediction, and multi-agent systems.
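
The log-sum-exp trick mentioned in the list is a good example of a numerical stability fix. The sketch below compares a naive implementation with the shifted version; the input values are chosen by me to force overflow in double precision.

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): subtract the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows to inf, so the result is inf.
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))

stable = logsumexp(x)
print(naive, stable)   # inf versus a finite value close to 1002.41
```

The shift works because log(sum(exp(x))) = m + log(sum(exp(x - m))) for any constant m, and choosing m as the maximum keeps every exponent at or below zero.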


2. Machine Learning Foundations

Before studying advanced vision systems, I need a strong foundation in classical and modern machine learning.

2.1 Core Machine Learning

Main references I plan to use include:

  • Andrew Ng’s Machine Learning course
  • Pattern Recognition and Machine Learning by Christopher Bishop
  • Related materials on statistical learning, probabilistic modeling, and representation learning

Key topics:

  • Supervised learning and empirical risk minimization
  • Bias–variance tradeoff
  • Linear regression and logistic regression
  • Support vector machines and kernel methods
  • Decision trees, random forests, and boosting
  • Clustering, Gaussian mixture models, and EM algorithm
  • Probabilistic graphical models
  • Bayesian learning and approximate inference
  • Model selection, regularization, and cross-validation

This foundation helps connect classical learning theory with modern deep learning systems.


2.2 Statistical Learning Theory

Topics I aim to master:

  • Empirical risk vs. expected risk
  • Generalization and overfitting
  • VC dimension and model capacity
  • Rademacher complexity
  • Generalization bounds
  • Distribution shift, domain shift, and out-of-distribution detection
  • Robustness and uncertainty under changing environments

These topics are especially relevant for autonomous driving, where models must generalize across scenes, weather, domains, sensors, and traffic conditions.


2.3 Representation Learning

Modern deep learning depends heavily on learning useful representations.

Key concepts:

  • Invariance and equivariance
  • Contrastive learning and InfoNCE
  • Self-supervised learning such as MAE and DINO
  • Information Bottleneck theory
  • Inductive biases in CNNs and Transformers
  • Multi-modal representation learning
  • Token-based representation learning

These ideas are closely connected to my work on token-based 3D scene representations and collaborative perception.
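
As a sketch of the InfoNCE objective listed above, the following NumPy code computes the loss for a batch of paired embeddings, where row i of each matrix forms a positive pair and all other rows act as negatives. The temperature value, batch size, and embedding dimension are assumptions for illustration, and this is a simplified one-directional variant rather than any specific paper's implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE: row i of z_a should match row i of z_b."""
    # L2-normalize so that the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
z_aug = z + 0.01 * rng.standard_normal((8, 16))   # stand-in for a second "view"

loss_aligned = info_nce(z, z_aug)
loss_shuffled = info_nce(z, np.roll(z_aug, 1, axis=0))  # positives deliberately misaligned
print(loss_aligned, loss_shuffled)
```

The aligned pairing gives a much lower loss than the shuffled one, which is the behavior a contrastive objective is designed to reward.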


3. Deep Learning

Deep learning is the core technical foundation for modern computer vision and autonomous driving perception.

3.1 Neural Network Fundamentals

Topics to revisit:

  • Multilayer perceptrons and activation functions
  • CNNs and convolutional inductive bias
  • Normalization methods: BatchNorm, LayerNorm, RMSNorm
  • Residual connections and deep network training
  • Dropout, stochastic depth, and regularization
  • Loss functions for classification, segmentation, and dense prediction

3.2 Transformers

Transformers are central to current computer vision, 3D perception, and multi-agent perception.

Key topics:

  • Scaled dot-product attention
\[\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
  • Multi-head attention
  • Positional encodings: absolute, relative, and RoPE
  • Encoder, decoder, and cross-attention structures
  • Vision Transformers and patch embeddings
  • Efficient attention, FlashAttention, and sparse attention
  • Token pruning, token selection, and token merging

These concepts are directly related to my research on tokenized BEV representations, spatio-temporal memory, and communication-aware token merging.
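
The scaled dot-product attention formula above translates almost line by line into NumPy. This is a minimal single-head sketch without masking, dropout, or learned projections; the shapes (3 queries, 5 keys, `d_k = 8`, value dimension 16) are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))    # 3 query tokens
K = rng.standard_normal((5, 8))    # 5 key tokens
V = rng.standard_normal((5, 16))   # 5 value tokens
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)    # (3, 16) (3, 5)
```

Each output token is a convex combination of the value tokens, which is also why averaging-style operations such as token merging interact naturally with attention.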


3.3 Deep Learning for Computer Vision

Main references include:

  • Stanford CS231n: Deep Learning for Computer Vision
  • Andrew Ng’s Deep Learning Specialization
  • Modern papers on CNNs, Transformers, detection, segmentation, 3D vision, and autonomous driving perception

Important topics:

  • CNN architectures and feature pyramids
  • Object detection and segmentation
  • Metric learning and contrastive learning
  • Multi-task learning
  • Dense prediction and structured output learning
  • Multi-view and multi-modal perception

4. Computer Vision

Computer vision provides the core research foundation for perception systems.

4.1 Introduction to Computer Vision

I plan to systematically study introductory computer vision materials, including courses such as Cornell University’s Introduction to Computer Vision.

Core topics:

  • Image formation and camera models
  • Filtering, edges, corners, and feature descriptors
  • Homography and image alignment
  • Optical flow and motion estimation
  • Object recognition
  • Image segmentation
  • Tracking and video understanding

These topics provide the classical foundation behind many modern deep learning-based systems.


4.2 Multi-View Geometry

For autonomous driving and 3D perception, geometry is essential.

Important concepts:

  • Pinhole camera model
  • Intrinsic and extrinsic parameters
  • Coordinate transformations and SE(3)
  • Epipolar geometry and fundamental matrix
  • Triangulation and bundle adjustment
  • Perspective-n-Point pose estimation
  • Depth estimation and stereo geometry

These ideas are important for multi-view 3D perception, BEV transformation, occupancy prediction, and pose-aware collaborative fusion.
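
The pinhole model with intrinsic and extrinsic parameters can be sketched in a few lines. The intrinsics below (focal length 1000 px, principal point at (640, 360)) and the identity extrinsics are made-up values, and `project_points` is a hypothetical helper; lens distortion is ignored.

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Project N x 3 world points to pixels using x ~ K (R X + t)."""
    X_cam = X_world @ R.T + t            # world frame -> camera frame
    uvw = X_cam @ K.T                    # apply intrinsics
    pixels = uvw[:, :2] / uvw[:, 2:3]    # perspective division
    return pixels, X_cam[:, 2]           # pixel coordinates and depth

K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)       # camera axes aligned with the world axes
t = np.zeros(3)     # camera at the world origin

X = np.array([[0.0,  0.0, 10.0],    # on the optical axis
              [1.0, -0.5, 10.0]])   # 1 m right, 0.5 m up (y points down), 10 m ahead
pixels, depth = project_points(X, K, R, t)
print(pixels)   # [[640. 360.] [740. 310.]]
```

Lift-splat style view transformation essentially runs this mapping in reverse: each pixel is lifted along its ray using a predicted depth distribution and then placed into a BEV or voxel grid.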


4.3 3D Scene Representations

Common representations include:

  • Point clouds and PointNet-style architectures
  • Voxels and sparse convolution
  • Bird’s-Eye-View representations
  • Meshes and surface representations
  • Implicit fields, occupancy fields, and signed distance functions
  • Gaussian splatting and neural scene representations

This part connects computer vision with computer graphics and 3D world modeling.


5. Computer Graphics

Computer graphics is increasingly important for computer vision, autonomous driving, simulation, embodied AI, and world modeling. It provides the tools for understanding 3D geometry, rendering, simulation, and scene representation.

5.1 Geometry and Rendering

Key topics:

  • 3D transformations and homogeneous coordinates
  • Mesh representation and surface parameterization
  • Rasterization and z-buffering
  • Ray tracing and path tracing
  • Shading models and lighting
  • Texture mapping and material representation
  • Differentiable rendering

These topics help me understand the bridge between 3D vision, neural rendering, simulation, and occupancy-based world modeling.


5.2 Neural Graphics and 3D Reconstruction

Important topics:

  • Neural Radiance Fields (NeRF)
  • 3D Gaussian Splatting
  • Differentiable rendering
  • Neural implicit surfaces
  • Scene reconstruction and view synthesis
  • Simulation-to-real transfer

These methods are relevant to 3D scene understanding, autonomous driving simulation, and embodied perception.


6. Reinforcement Learning and Decision Making

Although my main focus is perception, reinforcement learning is important for understanding agents, planning, and embodied intelligence.

Key topics:

  • Markov Decision Processes
  • Dynamic programming
  • Monte Carlo methods
  • Temporal-difference learning
  • Q-learning and policy gradients
  • Actor–critic methods
  • Model-based reinforcement learning
  • Offline reinforcement learning
  • Multi-agent reinforcement learning

These topics help connect perception with downstream decision-making and control.
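
As a tiny worked example of Markov Decision Processes and dynamic programming, the sketch below runs value iteration on a four-state deterministic chain. The environment (states 0 to 3, a goal at state 3, reward 1 for reaching it, discount 0.9) is a toy setup of my own, not a standard benchmark.

```python
import numpy as np

n_states, gamma = 4, 0.9
terminal = 3

def step(s, a):
    """Deterministic chain: a=0 moves left, a=1 moves right; reward 1 on reaching the goal."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    reward = 1.0 if s_next == terminal else 0.0
    return s_next, reward

V = np.zeros(n_states)
for _ in range(100):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        if s == terminal:
            continue   # the terminal state keeps value 0
        # Bellman optimality backup: V(s) = max_a [ r(s, a) + gamma * V(s') ]
        V_new[s] = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in (0, 1)))
    V = V_new

print(V)   # [0.81 0.9  1.   0.  ]
```

The values decay by a factor of gamma per step away from the goal, which is the simplest picture of how discounting shapes an optimal policy.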


7. AI Agents and Embodied Intelligence

AI agents and embodied intelligence are becoming increasingly important for future intelligent systems. For my research direction, I mainly focus on the perception and world-modeling layer of embodied agents.

7.1 AI Agents

Key topics:

  • Agent architectures: perception, memory, planning, and action
  • Tool use and environment interaction
  • Memory mechanisms and retrieval
  • Planning and reasoning
  • Multi-agent cooperation
  • Evaluation of agent behavior

7.2 Embodied Intelligence

Important topics:

  • Embodied perception
  • Visual navigation
  • Object interaction and manipulation
  • Scene memory and spatial reasoning
  • Simulators and embodied benchmarks
  • World models for embodied agents

This area connects naturally to my interest in occupancy world models, temporal reasoning, and predictive 3D scene understanding.


8. Autonomous Driving Perception

My main research direction lies in autonomous driving perception and 3D scene understanding.

8.1 BEV Representation and Sensor Fusion

Key paradigms:

  • Early, middle, and late fusion
  • Camera-only BEV perception
  • LiDAR-camera fusion
  • Lift-splat style view transformation
  • Cross-attention based 3D lifting
  • Temporal alignment and ego-motion compensation

8.2 Occupancy Prediction

Important topics:

  • Binary occupancy and semantic occupancy
  • Voxel grid representation
  • Visibility reasoning and occlusion handling
  • Class imbalance in dense prediction
  • Occupancy forecasting
  • Occupancy world models

This is one of my core research interests.
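
To ground the voxel grid representation, here is a minimal sketch that converts a point cloud into a binary occupancy grid. The grid origin, voxel size, grid shape, and the four example points are all assumptions for illustration; real pipelines also handle semantics, visibility, and much larger grids.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Mark every voxel that contains at least one point as occupied."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[in_bounds].T)] = True   # index with (x_idx, y_idx, z_idx) arrays
    return occ

points = np.array([[0.2, 0.2, 0.1],
                   [0.3, 0.1, 0.4],   # falls in the same voxel as the first point
                   [1.6, 0.7, 0.2],
                   [9.0, 9.0, 9.0]])  # outside the grid, so it is dropped
occ = voxelize(points, grid_min=np.zeros(3), voxel_size=0.5, grid_shape=(4, 4, 2))
print(occ.sum(), occ.size)   # 2 occupied voxels out of 32
```

Even in this toy grid most voxels are empty, which hints at the class imbalance issue listed above.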


8.3 Temporal Modeling and World Models

Dynamic environments require temporal reasoning.

Topics include:

  • Temporal memory and feature aggregation
  • Scene flow and motion modeling
  • Future occupancy prediction
  • 4D occupancy forecasting
  • World models for autonomous driving
  • Online vs. offline perception constraints

9. Collaborative Perception and Communication

Collaborative perception introduces multi-agent reasoning and communication constraints.

9.1 Collaboration Paradigms

Important research questions:

  • What information should agents communicate?
  • When should they communicate?
  • Which agents should they communicate with?
  • How should received information be aligned and fused?

Common fusion methods include feature concatenation, attention-based fusion, graph-based aggregation, and token-level fusion.


9.2 Communication Efficiency

Bandwidth constraints require efficient message design.

Key ideas:

  • Feature compression
  • Token selection and token merging
  • Quantization and pruning
  • Task-aware communication
  • Adaptive communication budgets
  • Rate–distortion style trade-offs

These ideas directly connect to my current research on communication-efficient collaborative occupancy prediction.
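
Two of the ideas above, token selection and quantization, can be combined in a short sketch. It keeps the top-k tokens by an importance score and quantizes them to 8 bits with a single min/max range per message. The scores here are random placeholders, the function names are hypothetical, and the byte count ignores the small overhead of sending indices and the quantization range.

```python
import numpy as np

def select_and_quantize(tokens, scores, k):
    """Keep the k highest-scoring tokens and quantize them to uint8."""
    keep = np.sort(np.argsort(scores)[-k:])   # indices of the top-k tokens
    selected = tokens[keep]
    lo, hi = selected.min(), selected.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((selected - lo) / scale).astype(np.uint8)
    return keep, q, lo, scale

def dequantize(q, lo, scale):
    """Receiver side: map uint8 codes back to approximate float values."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 64)).astype(np.float32)   # 100 tokens, 64 channels
scores = rng.random(100)                                      # placeholder importance scores

keep, q, lo, scale = select_and_quantize(tokens, scores, k=10)
recon = dequantize(q, lo, scale)

ratio = tokens.nbytes / q.nbytes   # 25600 bytes / 640 bytes = 40.0
max_err = np.max(np.abs(recon - tokens[keep]))
print(ratio, max_err)
```

The trade-off is visible directly: a 40x smaller payload in exchange for dropping 90 tokens and a per-element error of at most about half a quantization step on the tokens that are kept.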


10. Suggested Learning Resources

This roadmap will be supported by several courses and books:

  • Matrix Theory: Northwestern Polytechnical University matrix theory materials
  • Numerical Analysis: standard numerical analysis textbooks and course notes
  • Machine Learning: Andrew Ng’s Machine Learning course
  • Pattern Recognition: Christopher Bishop’s Pattern Recognition and Machine Learning
  • Deep Learning: Andrew Ng’s Deep Learning Specialization
  • Computer Vision: Cornell University’s Introduction to Computer Vision
  • Deep Vision: Stanford CS231n (Deep Learning for Computer Vision)
  • Reinforcement Learning: Sutton and Barto’s Reinforcement Learning: An Introduction
  • Computer Graphics: introductory graphics courses and neural rendering papers
  • Autonomous Driving: papers on BEV perception, occupancy prediction, collaborative perception, and world models
  • AI Agents and Embodied AI: recent papers and surveys on agents, embodied perception, and world modeling

Closing Thoughts

This roadmap represents the knowledge foundation I want to build for PhD research.

For each topic, my goal is to be able to:

  1. define the concept clearly;
  2. explain why it matters;
  3. connect it to my research;
  4. implement representative algorithms;
  5. read and critique related papers.

Research is not just about knowing isolated methods. It is about connecting ideas across fields. This knowledge base is my attempt to build those connections systematically, from mathematics and machine learning to computer vision, graphics, autonomous driving, and embodied AI.



