Computer Vision Foundations
Computer vision is one of the core foundations of my PhD preparation. My current research interests, including 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models, all depend on a deep understanding of visual representation, geometry, and scene understanding.
This note is my long-term study record for computer vision. It covers both classical computer vision and modern deep learning-based vision, with a focus on the concepts most relevant to autonomous driving and embodied perception.
Main references include:
- Cornell University’s Introduction to Computer Vision;
- classical multi-view geometry materials;
- modern 3D vision and depth estimation papers;
- research on BEV perception, occupancy prediction, and autonomous driving scene understanding.
Roadmap
This note is organized into the following chapters:
- Introduction to Computer Vision: image formation, filtering, edges, corners, feature descriptors, optical flow, segmentation, recognition, and tracking.
- Image Formation and Camera Models: pinhole camera model, intrinsic and extrinsic parameters, projection, distortion, and coordinate systems.
- Classical Visual Features: edges, corners, SIFT, HOG, ORB, image matching, homography, and feature-based recognition.
- Multi-View Geometry: epipolar geometry, fundamental matrix, essential matrix, triangulation, stereo vision, PnP, and bundle adjustment.
- Depth Estimation: monocular depth, stereo depth, multi-view depth, self-supervised depth estimation, and depth-aware 3D perception.
- 3D Scene Representations: point clouds, voxels, BEV, meshes, implicit fields, occupancy fields, signed distance functions, and Gaussian representations.
- Vision for Autonomous Driving and Embodied Perception: BEV perception, semantic occupancy prediction, collaborative perception, temporal modeling, and world models.
1. Introduction to Computer Vision
Computer vision studies how machines perceive, interpret, and reason about visual information. The input is usually an image, video, or 3D observation, while the output may be a label, bounding box, segmentation mask, depth map, 3D reconstruction, occupancy grid, or future scene prediction.
A typical vision pipeline can be summarized as:
Image acquisition → feature extraction → geometric reasoning → semantic understanding → task-level prediction
In modern deep learning systems, many of these steps are learned end-to-end. However, the classical concepts remain important because they explain the geometric and physical structure behind visual data.
1.1 What Computer Vision Tries to Solve
Computer vision problems can be grouped into several levels.
Low-Level Vision
Low-level vision focuses on local image processing and signal-level operations:
- denoising;
- filtering;
- edge detection;
- corner detection;
- texture analysis;
- optical flow.
These methods help extract basic visual structures from images.
Mid-Level Vision
Mid-level vision focuses on grouping and geometric structure:
- image segmentation;
- feature matching;
- motion estimation;
- stereo matching;
- depth estimation;
- structure from motion.
This level connects raw image evidence with geometric interpretation.
High-Level Vision
High-level vision focuses on semantic understanding:
- image classification;
- object detection;
- semantic segmentation;
- instance segmentation;
- scene understanding;
- visual reasoning.
Modern autonomous driving perception combines all three levels: low-level image cues, geometric reasoning, and high-level semantic prediction.
1.2 Cornell Introduction to Computer Vision
Cornell’s Introduction to Computer Vision provides a strong foundation in classical and modern vision. The important topics I want to study include:
- image formation and camera projection;
- filtering and convolution;
- edge and corner detection;
- local feature descriptors;
- image alignment and homography;
- optical flow and motion estimation;
- object recognition;
- image segmentation;
- stereo vision and 3D reconstruction;
- deep learning for recognition.
The value of this course is that it builds intuition from first principles. Instead of directly starting from neural networks, it explains how images are formed, how geometry constrains visual observations, and how classical algorithms solve vision problems.
For my research, this foundation is useful because 3D perception and occupancy prediction are not only deep learning problems. They also depend heavily on camera geometry, view transformation, depth reasoning, and spatial consistency.
2. Image Formation and Camera Models
Computer vision begins with image formation. A camera maps the 3D world onto a 2D image plane. Understanding this mapping is essential for 3D perception.
2.1 Pinhole Camera Model
The pinhole camera model describes perspective projection from 3D points to 2D image coordinates.
A 3D point in camera coordinates is:
\[P_c = [X, Y, Z]^T.\]
The normalized image coordinates are:
\[x = \frac{X}{Z}, \quad y = \frac{Y}{Z}.\]
After applying camera intrinsics, the pixel coordinates are:
\[u = f_x \frac{X}{Z} + c_x,\]
\[v = f_y \frac{Y}{Z} + c_y.\]
In homogeneous coordinates, this can be written as:
\[s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix},\]
where the scale factor is \(s = Z\) and \(K\) is the camera intrinsic matrix:
\[K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.\]
This projection model is fundamental for depth estimation, 3D reconstruction, and multi-view perception.
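To make the projection equations concrete, here is a minimal NumPy sketch. The intrinsic values and the two test points are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    points_cam = np.asarray(points_cam, dtype=float)
    uvw = points_cam @ K.T            # each row is s * [u, v, 1], with s = Z
    return uvw[:, :2] / uvw[:, 2:3]   # perspective division by depth

# Hypothetical intrinsics, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
points = np.array([[1.0, 0.5, 10.0],   # off-axis point at 10 m depth
                   [0.0, 0.0, 5.0]])   # point on the optical axis
pixels = project_points(points, K)
```

The point on the optical axis lands on the principal point \((c_x, c_y)\), while the first point is offset by \(f_x X / Z\) and \(f_y Y / Z\) pixels.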
2.2 Intrinsic and Extrinsic Parameters
Camera parameters are divided into intrinsic and extrinsic parameters.
Intrinsic Parameters
Intrinsic parameters describe the internal camera properties:
- focal length \((f_x, f_y)\);
- principal point \((c_x, c_y)\);
- pixel aspect ratio;
- lens distortion.
They define how 3D camera coordinates are projected into image pixels.
Extrinsic Parameters
Extrinsic parameters describe the camera pose relative to the world or ego vehicle coordinate frame:
\[P_c = R P_w + t,\]
where:
- \(R\) is the rotation matrix;
- \(t\) is the translation vector;
- \(P_w\) is a 3D point in world coordinates;
- \(P_c\) is the same point in camera coordinates.
The full projection is:
\[s p = K [R|t] \tilde{P}_w,\]
where \(\tilde{P}_w = [P_w^T, 1]^T\) is the world point in homogeneous coordinates and \(p = [u, v, 1]^T\) is the homogeneous pixel.
In autonomous driving, accurate camera intrinsics and extrinsics are crucial for multi-view fusion and BEV transformation.
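The world-to-pixel chain can be sketched in a few lines of NumPy. This is an illustrative sketch, not a calibrated setup: the intrinsics, the 90-degree rotation about the optical axis, and the translation are all hypothetical values. The function also returns a mask for points in front of the camera, since points with non-positive depth have no physically meaningful projection.

```python
import numpy as np

def world_to_pixel(points_world, K, R, t):
    """Map (N, 3) world points to (N, 2) pixels and flag points in front of the camera."""
    points_cam = points_world @ R.T + t   # P_c = R P_w + t, applied row-wise
    in_front = points_cam[:, 2] > 0       # only these have a valid projection
    uvw = points_cam @ K.T
    pixels = uvw[:, :2] / uvw[:, 2:3]
    return pixels, in_front

# Hypothetical calibration values, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],           # 90-degree rotation about the z axis
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 2.0])
points_world = np.array([[1.0, 0.0, 8.0],    # ends up at depth 10 in the camera frame
                         [0.0, 0.0, -5.0]])  # ends up behind the camera
pixels, in_front = world_to_pixel(points_world, K, R, t)
```

In a multi-camera rig, the same function would be called once per camera with that camera's own \(K\), \(R\), and \(t\).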
2.3 Coordinate Systems
Autonomous driving systems usually involve multiple coordinate systems:
- image coordinate system;
- camera coordinate system;
- ego vehicle coordinate system;
- LiDAR coordinate system;
- world coordinate system;
- BEV grid coordinate system.
A key challenge is to transform features and points consistently across these coordinate systems.
For collaborative perception, there is an additional transformation between different agents:
\[P^{ego} = T_{j \rightarrow ego} P^j.\]
This is why pose alignment is a central component in multi-agent perception.
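As a sketch of the agent-to-ego transformation, the snippet below composes two 4x4 rigid transforms. It assumes each agent knows its pose in a shared world frame and that rotation is yaw-only; the poses, `make_transform`, and `transform_points` are hypothetical names and values introduced for illustration.

```python
import numpy as np

def make_transform(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a 3D translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def transform_points(T, points):
    """Apply a 4x4 transform to (N, 3) points."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    return (homog @ T.T)[:, :3]

# Hypothetical poses of the ego vehicle and agent j in a shared world frame.
T_world_from_ego = make_transform(0.0, [10.0, 0.0, 0.0])
T_world_from_j = make_transform(np.pi / 2, [10.0, 5.0, 0.0])

# T_{j -> ego}: go from agent j to the world, then from the world to ego.
T_ego_from_j = np.linalg.inv(T_world_from_ego) @ T_world_from_j

points_j = np.array([[2.0, 0.0, 0.0]])   # 2 m ahead of agent j
points_ego = transform_points(T_ego_from_j, points_j)
```

Here the point 2 m ahead of agent j appears 7 m to the left of the ego vehicle. Any error in either pose propagates directly into the fused result, which is the practical reason pose alignment matters.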
2.4 Lens Distortion
Real cameras are not perfect pinhole cameras. Lens distortion affects image coordinates, especially near image boundaries.
Common distortion types include:
- radial distortion;
- tangential distortion;
- fisheye distortion.
Camera calibration estimates distortion parameters so that images can be corrected before geometric reasoning.
For autonomous driving, calibration quality strongly affects projection, depth estimation, and BEV perception.
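To see what distortion parameters do, here is a small sketch of one common radial-tangential (Brown-Conrady style) parameterization applied to normalized image coordinates. The coefficient names `k1`, `k2`, `p1`, `p2` and the example values are assumptions for illustration; fisheye lenses use a different model that is not covered here.

```python
import numpy as np

def distort_normalized(xy, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to (N, 2) normalized points."""
    x, y = xy[:, 0], xy[:, 1]
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x**2)
    y_d = y * radial + p1 * (r2 + 2.0 * y**2) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=1)

# Hypothetical coefficients: mild barrel distortion, no tangential term.
xy = np.array([[0.0, 0.0],    # image center
               [0.5, 0.0]])   # far from the center
xy_distorted = distort_normalized(xy, k1=-0.2, k2=0.0, p1=0.0, p2=0.0)
```

The center is unchanged while the off-center point is pulled inward, which matches the observation that distortion is strongest near image boundaries. Undistortion inverts this mapping, usually iteratively.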
3. Classical Visual Features
Before deep learning, computer vision relied heavily on hand-crafted features. Although modern systems use learned features, classical features remain useful for understanding visual structure and geometry.
3.1 Image Filtering
Filtering applies a kernel to an image to extract or modify information.
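A 2D convolution operation is:
\[I'(i,j)=\sum_m\sum_n K(m,n)I(i-m,j-n).\]
Replacing the minus signs with plus signs gives cross-correlation, which slides the kernel without flipping it. The two coincide for symmetric kernels, and deep learning "convolution" layers typically compute cross-correlation.
Common filters include: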
- Gaussian filter for smoothing;
- Sobel filter for gradients;
- Laplacian filter for second-order edges;
- bilateral filter for edge-preserving smoothing.
Filtering is the foundation of convolutional neural networks, where kernels are learned from data rather than manually designed.
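The sliding-kernel operation can be written directly in NumPy. This is a deliberately simple loop-based sketch with "valid" output size (no padding), not an efficient implementation; the function names are my own.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without flipping it (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d_valid(image, kernel):
    """True convolution: flip the kernel in both axes, then correlate."""
    return correlate2d_valid(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
box = np.full((3, 3), 1.0 / 9.0)        # 3x3 mean filter (symmetric)
smoothed = convolve2d_valid(image, box)  # each output is a 3x3 neighborhood mean
```

For a symmetric kernel such as the box or Gaussian filter, the flip has no effect, so both functions give the same result.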
3.2 Edge Detection
Edges correspond to large intensity changes. They often indicate object boundaries, lane markings, or structural contours.
The image gradient is:
\[\nabla I = \begin{bmatrix} \frac{\partial I}{\partial x} \\ \frac{\partial I}{\partial y} \end{bmatrix}.\]
The gradient magnitude is:
\[\|\nabla I\| = \sqrt{I_x^2 + I_y^2}.\]
Common edge detectors include:
- Sobel operator;
- Laplacian of Gaussian;
- Canny edge detector.
In autonomous driving, classical edge detection was a common building block of early lane detection systems.
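As an illustration of the gradient magnitude, the sketch below computes Sobel gradients on interior pixels using array slices. The synthetic step-edge image is an assumption made so the result is easy to verify.

```python
import numpy as np

def sobel_gradients(image):
    """Return (I_x, I_y) on the interior pixels of a 2D image using Sobel kernels."""
    I = np.asarray(image, dtype=float)
    # Horizontal derivative: right column minus left column, weighted 1-2-1 vertically.
    Ix = ((I[:-2, 2:] + 2 * I[1:-1, 2:] + I[2:, 2:])
          - (I[:-2, :-2] + 2 * I[1:-1, :-2] + I[2:, :-2]))
    # Vertical derivative: bottom row minus top row, weighted 1-2-1 horizontally.
    Iy = ((I[2:, :-2] + 2 * I[2:, 1:-1] + I[2:, 2:])
          - (I[:-2, :-2] + 2 * I[:-2, 1:-1] + I[:-2, 2:]))
    return Ix, Iy

image = np.zeros((5, 6))
image[:, 3:] = 1.0                      # vertical step edge between columns 2 and 3
Ix, Iy = sobel_gradients(image)
magnitude = np.sqrt(Ix**2 + Iy**2)      # large only next to the step
```

A full detector such as Canny adds smoothing, non-maximum suppression, and hysteresis thresholding on top of this gradient step.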
3.3 Corner Detection
Corners are points with strong gradients in multiple directions. They are useful for matching and tracking.
The Harris corner detector uses the second-moment matrix:
\[M = \sum_{(u,v) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}.\]
The corner response is:
\[R = \det(M) - k(\mathrm{tr}(M))^2.\]
A large positive response indicates a corner.
Corners are important in visual odometry, SLAM, and image alignment.
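The Harris response can be sketched with NumPy alone. This version makes several simplifying assumptions: central-difference gradients from `np.gradient`, an unweighted 3x3 window instead of a Gaussian window, and `k = 0.05` as a typical empirical constant.

```python
import numpy as np

def box_sum3(a):
    """Sum over each 3x3 window (zero padding keeps the input shape)."""
    p = np.pad(a, 1)
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def harris_response(image, k=0.05):
    """Compute R = det(M) - k * tr(M)^2 at every pixel."""
    Iy, Ix = np.gradient(np.asarray(image, dtype=float))  # axis 0 is y, axis 1 is x
    Sxx, Syy, Sxy = box_sum3(Ix * Ix), box_sum3(Iy * Iy), box_sum3(Ix * Iy)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    return det - k * trace**2

image = np.zeros((12, 12))
image[6:, 6:] = 1.0                  # bright square whose corner is near (6, 6)
response = harris_response(image)
```

On this synthetic image the response is positive at the corner, negative along the straight edge, and zero in flat regions, which is the pattern a detector thresholds on.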
3.4 Local Feature Descriptors
Local descriptors describe image patches around keypoints.
Common descriptors include:
- SIFT;
- SURF;
- ORB;
- BRIEF;
- HOG.
A good descriptor should be robust to:
- scale changes;
- rotation;
- lighting variation;
- small viewpoint changes.
Feature descriptors support image matching, structure from motion, visual localization, and classical 3D reconstruction.
3.5 Homography and Image Alignment
A homography maps points between two views of the same plane:
\[s p' = H p,\]
where \(H\) is a \(3 \times 3\) matrix defined up to scale, and \(p\), \(p'\) are homogeneous pixel coordinates.
Homography is useful for:
- image stitching;
- planar object tracking;
- perspective correction;
- bird’s-eye-view transformation for planar roads.
In autonomous driving, simple lane perception systems sometimes use inverse perspective mapping to transform road images into a top-down view.
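Applying a homography is a matrix product followed by a division by the third coordinate. The sketch below uses a hypothetical \(H\) with a small perspective term; estimating \(H\) from correspondences is a separate step that is not shown.

```python
import numpy as np

def apply_homography(H, points):
    """Map (N, 2) pixel points through a 3x3 homography."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]   # divide out the scale s

# Hypothetical homography: a translation plus a small perspective term.
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 20.0],
              [0.0, 0.001, 1.0]])
points = np.array([[0.0, 0.0],
                   [100.0, 100.0]])
warped = apply_homography(H, points)
```

Because \(H\) is defined only up to scale, multiplying it by any non-zero constant gives the same warped points. Inverse perspective mapping uses the same operation with an \(H\) derived from the camera pose relative to the road plane.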
4. Multi-View Geometry
Multi-view geometry studies the geometric relationships between multiple camera views. It is one of the most important foundations for 3D vision.
For autonomous driving, multi-view geometry explains how multiple cameras can jointly recover 3D structure and how features can be lifted from image space to BEV or voxel space.
4.1 Epipolar Geometry
When a 3D point is observed by two cameras, its projections are constrained by epipolar geometry.
Given corresponding points \(x\) and \(x'\) in two images, the fundamental matrix \(F\) satisfies:
\[x'^T F x = 0.\]
This means that the corresponding point \(x'\) must lie on the epipolar line:
\[l' = F x.\]
Epipolar geometry reduces the search space for correspondence from a 2D image region to a 1D line.
4.2 Fundamental Matrix and Essential Matrix
The fundamental matrix \(F\) describes the epipolar constraint in pixel coordinates.
The essential matrix \(E\) describes the same constraint in normalized camera coordinates:
\[E = K'^T F K.\]
If the relative rotation \(R\) and translation \(t\) between two cameras are known, the essential matrix can be written as:
\[E = [t]_\times R.\]
Here, \([t]_\times\) is the skew-symmetric matrix of the translation vector.
These matrices are fundamental to stereo vision, structure from motion, and camera pose estimation.
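The epipolar constraint can be checked numerically. The sketch below builds \(E\) from a hypothetical relative pose, derives \(F\) by inverting the relation between \(E\) and \(F\) (assuming both cameras share the same \(K\)), and evaluates the constraint for one synthetic 3D point.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical relative pose: camera-2 coordinates are X2 = R @ X1 + t.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.5, 0.0, 0.1])
E = skew(t) @ R

# One synthetic point seen by both cameras, in normalized coordinates.
X1 = np.array([0.3, -0.2, 4.0])
X2 = R @ X1 + t
x1, x2 = X1 / X1[2], X2 / X2[2]
residual_normalized = x2 @ E @ x1          # x'^T E x

# Same check in pixel coordinates, with hypothetical shared intrinsics K.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv                     # from E = K'^T F K with K' = K
residual_pixels = (K @ x2) @ F @ (K @ x1)   # x'^T F x
```

Both residuals are zero up to floating-point error. With real, noisy matches the residual is non-zero, and robust estimators use its size to reject outlier correspondences.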
4.3 Triangulation
Triangulation estimates a 3D point from its projections in multiple views.
Given camera projection matrices \(P_1\) and \(P_2\), and corresponding image points \(x_1\) and \(x_2\), triangulation solves for the 3D point \(X\) such that:
\[x_1 \sim P_1 X,\]
\[x_2 \sim P_2 X.\]
Because of noise, the rays may not intersect exactly, so triangulation is often formulated as a least-squares problem.
Triangulation is a core operation in 3D reconstruction.
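One standard way to pose triangulation as a least-squares problem is the linear (DLT) method: each view contributes two linear equations in the homogeneous point, and the solution is the right singular vector with the smallest singular value. The sketch below uses hypothetical cameras and a noise-free point, so the recovered point matches the ground truth.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one point from two 3x4 projection matrices."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # homogeneous solution minimizing ||A X||
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: identical intrinsics, second camera shifted 0.5 m along x.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 5.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

The DLT solution minimizes an algebraic error rather than the reprojection error, so in practice it is often used to initialize a nonlinear refinement.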
4.4 Stereo Vision
Stereo vision estimates depth from two rectified images.
For rectified stereo, depth is related to disparity by:
\[Z = \frac{fB}{d},\]
where:
- \(Z\) is depth;
- \(f\) is the focal length in pixels;
- \(B\) is the baseline;
- \(d\) is the disparity in pixels.
The formula shows that depth is inversely proportional to disparity: a larger disparity means the point is closer to the camera.
Stereo vision is important because it provides a direct geometric way to estimate depth, unlike monocular depth estimation, which must infer depth from learned priors.
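A short sketch of the disparity-to-depth conversion, with hypothetical focal length and baseline. Zero or negative disparities are treated as invalid and mapped to infinity, which is one possible convention.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (meters) using Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0                    # non-positive disparity has no finite depth
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical rig: 1000 px focal length, 0.5 m baseline.
disparity = np.array([50.0, 10.0, 0.0])
depth = disparity_to_depth(disparity, focal_px=1000.0, baseline_m=0.5)
```

With these values, a one-pixel disparity error at 50 px changes the depth by about 0.2 m, while the same error at 10 px changes it by about 5 m. Depth uncertainty therefore grows quickly with distance.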
4.5 Perspective-n-Point
The PnP problem estimates camera pose from 3D-2D correspondences.
Given 3D points \(X_i\) and their 2D projections \(x_i\), PnP solves for the camera pose \((R,t)\):
\[s_i x_i = K(RX_i+t).\]
PnP is widely used in:
- visual localization;
- augmented reality;
- robotics;
- SLAM;
- camera pose estimation.
4.6 Bundle Adjustment
Bundle adjustment jointly optimizes camera poses and 3D points by minimizing reprojection error:
\[\min_{\{R_i,t_i\},\{X_j\}} \sum_{i,j} \left\| x_{ij} - \pi(R_i X_j + t_i) \right\|^2.\]
Here, \(\pi(\cdot)\) is the camera projection function.
Bundle adjustment is one of the most important optimization problems in 3D vision. It connects multi-view geometry, numerical optimization, and sparse linear algebra.
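The building block of this objective is the reprojection residual for a single camera. The sketch below evaluates it for a hypothetical pose and three synthetic points; it only computes the cost and does not run an optimizer. A real solver would stack such residuals over all cameras and points and pass them to a sparse nonlinear least-squares routine.

```python
import numpy as np

def project(K, R, t, points_3d):
    """pi(R X + t): project (N, 3) world points to (N, 2) pixels."""
    points_cam = points_3d @ R.T + t
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_residuals(K, R, t, points_3d, observations):
    """Flattened residual vector x_ij - pi(R_i X_j + t_i) for one camera."""
    return (observations - project(K, R, t, points_3d)).ravel()

# Hypothetical intrinsics, pose, and points, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R_true = np.eye(3)
t_true = np.array([0.1, -0.2, 0.0])
points_3d = np.array([[0.0, 0.0, 5.0],
                      [1.0, 0.5, 6.0],
                      [-1.0, 0.2, 4.0]])
observations = project(K, R_true, t_true, points_3d)   # noise-free measurements

cost_true = np.sum(reprojection_residuals(K, R_true, t_true, points_3d, observations) ** 2)
t_wrong = t_true + np.array([0.05, 0.0, 0.0])           # perturbed pose
cost_wrong = np.sum(reprojection_residuals(K, R_true, t_wrong, points_3d, observations) ** 2)
```

The cost is zero at the true pose and grows when the pose is perturbed. The same residual, with the 3D points held fixed, is what a PnP refinement minimizes.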
5. Depth Estimation
Depth estimation predicts the distance from the camera to each point in the scene. It is a key problem for 3D perception.
Depth estimation can be grouped into:
- stereo depth estimation;
- monocular depth estimation;
- multi-view depth estimation;
- self-supervised depth estimation;
- depth completion.
5.1 Monocular Depth Estimation
Monocular depth estimation predicts depth from a single image:
\[D = f_\theta(I).\]
This problem is ill-posed because many 3D scenes can produce the same 2D image.
Therefore, monocular depth relies heavily on learned priors, such as:
- object size;
- perspective cues;
- texture gradients;
- semantic context;
- scene layout.
Deep learning has greatly improved monocular depth estimation, but generalization remains challenging under domain shift.
5.2 Stereo Depth Estimation
Stereo depth uses two images with known camera baseline. The key is to estimate disparity.
The pipeline is:
- rectify stereo images;
- compute matching cost;
- aggregate cost volume;
- estimate disparity;
- convert disparity to depth.
Modern deep stereo methods learn cost volumes and regularization networks.
Stereo depth has stronger geometric constraints than monocular depth, but it requires calibrated stereo cameras and can struggle with textureless regions, reflective surfaces, and occlusions.
5.3 Multi-View Depth Estimation
Multi-view depth estimation uses more than two images.
It is common in:
- structure from motion;
- multi-view stereo;
- neural rendering;
- autonomous driving with multiple cameras.
The key challenge is to aggregate evidence from multiple views while handling occlusion, visibility, and calibration errors.
In autonomous driving, multi-camera perception often avoids explicit dense depth estimation, but depth reasoning is still implicitly or explicitly required for lifting image features into 3D space.
5.4 Self-Supervised Depth Estimation
Self-supervised depth estimation learns depth without ground-truth depth labels, usually by reconstructing one view from another.
A typical photometric loss is:
\[\mathcal{L}_{photo} = \| I_t - \hat{I}_t \|.\]
The model predicts depth and camera motion, then uses differentiable warping to reconstruct the target image.
This approach is attractive because driving videos are abundant, but it has limitations:
- moving objects violate static-scene assumptions;
- illumination changes affect photometric consistency;
- occlusion causes invalid matches;
- scale ambiguity exists in monocular settings.
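The geometric core of the warping step is computing, for every target pixel, where it lands in the source image given the predicted depth and relative pose. The NumPy sketch below shows only this coordinate computation under a static-scene assumption. It is not differentiable as written, and the bilinear sampling that produces the reconstructed image is omitted. The intrinsics, depth, and pose are hypothetical.

```python
import numpy as np

def reproject_pixels(depth, K, R, t):
    """For each target pixel, return its (u, v) location in the source image.

    depth: (H, W) target depth map; (R, t) maps target-camera to source-camera coordinates.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T         # back-project pixels to normalized rays
    points = rays * depth.reshape(-1, 1)    # scale rays by depth to get 3D points
    points_src = points @ R.T + t           # move points into the source camera frame
    uvw = points_src @ K.T
    return (uvw[:, :2] / uvw[:, 2:3]).reshape(H, W, 2)

# Hypothetical example: constant 10 m depth, source camera shifted 0.1 m along x.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 10.0)
coords = reproject_pixels(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this example every pixel shifts by \(f_x \cdot 0.1 / 10 = 1\) pixel horizontally. A moving object breaks the single rigid transform assumed here, which is exactly the first limitation in the list above.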
5.5 Depth in Occupancy Prediction
Depth is important for occupancy prediction because occupancy prediction requires reasoning about where image evidence lies in 3D space.
Common strategies include:
- explicit depth prediction before feature lifting;
- depth distribution over frustum rays;
- learnable 3D queries attending to image features;
- BEV transformation with geometry-aware attention;
- implicit depth reasoning through Transformers.
For my research, depth estimation is closely related to the idea of placing 3D queries or tokens in likely non-empty regions, which may improve efficiency and reduce redundant computation.
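The "depth distribution over frustum rays" strategy can be sketched as an outer product between a per-pixel depth distribution and the per-pixel image feature. This is a simplified illustration of the idea, not a reproduction of any specific method; the tensor shapes and random inputs are assumptions.

```python
import numpy as np

def lift_features(features, depth_logits):
    """Spread image features along camera rays according to a depth distribution.

    features: (C, H, W) image features; depth_logits: (D, H, W) scores over D depth bins.
    Returns frustum features of shape (D, C, H, W).
    """
    logits = depth_logits - depth_logits.max(axis=0, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)                        # sums to 1 over D
    return np.einsum("dhw,chw->dchw", probs, features)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 6))        # C=8 channels on a 4x6 feature map
depth_logits = rng.normal(size=(16, 4, 6))   # D=16 depth bins per pixel
frustum = lift_features(features, depth_logits)
```

Each pixel's feature is distributed over its depth bins in proportion to the predicted probability, so summing over the depth axis recovers the original feature. A confident depth distribution concentrates the feature in a few bins, which connects to the idea of placing tokens only in likely non-empty regions.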
6. 3D Scene Representations
3D vision depends heavily on how the scene is represented. Different representations have different trade-offs in accuracy, memory, computation, and geometric structure.
6.1 Point Clouds
Point clouds represent scenes as unordered sets of 3D points:
\[P = \{p_i \in \mathbb{R}^3\}_{i=1}^{N}.\]
Each point may include attributes such as color, intensity, semantic label, or feature vector.
Advantages:
- sparse and memory-efficient;
- directly produced by LiDAR;
- preserves geometric structure.
Limitations:
- irregular structure;
- difficult for standard CNNs;
- varying point density;
- limited surface or volume information.
PointNet and PointNet++ are foundational methods for point cloud learning.
6.2 Voxels
Voxel representations divide 3D space into a regular grid:
\[V \in \mathbb{R}^{X \times Y \times Z \times C}.\]
Advantages:
- regular structure;
- compatible with 3D convolution;
- naturally represents occupancy.
Limitations:
- high memory cost;
- resolution trade-off;
- many empty voxels.
Semantic occupancy prediction usually uses voxel grids because they represent both occupied and free space.
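To connect point clouds and voxels, here is a minimal sketch that converts a point cloud into a binary occupancy grid. The grid origin, voxel size, grid shape, and points are hypothetical. Note that this marks only voxels containing points; distinguishing free space from unobserved space needs additional reasoning, such as ray casting, which is not shown.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Mark voxels that contain at least one point. Returns a boolean (X, Y, Z) grid."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occupancy = np.zeros(grid_shape, dtype=bool)
    occupancy[tuple(idx[inside].T)] = True   # points outside the grid are dropped
    return occupancy

# Hypothetical 4 x 4 x 2 grid with 1 m voxels starting at the origin.
points = np.array([[0.5, 0.5, 0.5],
                   [3.2, 1.1, 1.9],
                   [5.0, 0.0, 0.0]])         # outside the grid
occupancy = voxelize(points, grid_min=np.zeros(3), voxel_size=1.0, grid_shape=(4, 4, 2))
bev_occupancy = occupancy.any(axis=2)        # collapse height for a top-down view
```

Only two of the 32 voxels are occupied here, which illustrates the "many empty voxels" limitation. Collapsing the height axis gives a simple top-down occupancy map.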
6.3 Bird’s-Eye-View Representation
BEV represents the scene from a top-down view.
A BEV feature map can be written as:
\[B \in \mathbb{R}^{H \times W \times C}.\]
BEV is widely used in autonomous driving because it aligns with the ground plane and planning space.
Advantages:
- efficient 2D representation;
- suitable for driving scenes;
- easy to fuse across cameras and agents;
- natural for map-like reasoning.
Limitations:
- vertical information may be compressed;
- height reasoning needs additional design;
- projection from images to BEV is challenging.
My current research heavily uses BEV tokens as a compact representation for perception, memory, communication, and fusion.
6.4 Meshes
Meshes represent surfaces using vertices, edges, and faces.
Advantages:
- efficient surface representation;
- widely used in graphics;
- suitable for shape modeling.
Limitations:
- hard to represent topology changes;
- less convenient for dense volumetric prediction;
- difficult to generate directly from images in complex scenes.
Meshes are important in computer graphics and 3D reconstruction, but occupancy prediction usually prefers voxel or implicit representations.
6.5 Implicit Fields
Implicit representations describe a 3D scene as a continuous function:
\[f_\theta(x,y,z) \rightarrow s.\]
The output \(s\) may represent:
- occupancy probability;
- signed distance;
- density;
- color;
- semantic label.
Examples include:
- Occupancy Networks;
- Signed Distance Functions;
- Neural Radiance Fields.
Implicit fields provide continuous scene representations, but they can be expensive to query densely.
6.6 Occupancy Fields
An occupancy field predicts whether each 3D location is occupied:
\[f_\theta(p) \rightarrow P(\text{occupied} \mid p).\]
Semantic occupancy extends this to class labels:
\[f_\theta(p) \rightarrow \{0,1,\ldots,K\}.\]
Occupancy representations are important because they describe not only visible surfaces, but also free space and occluded regions.
This makes them suitable for autonomous driving, where planning requires knowing which regions are free, occupied, or uncertain.
6.7 Gaussian Representations
3D Gaussian representations model scenes using a set of Gaussian primitives.
Each Gaussian may contain:
- position;
- covariance;
- opacity;
- color;
- semantic feature.
Gaussian representations have become popular due to 3D Gaussian Splatting, which enables efficient rendering and scene representation.
For autonomous driving and collaborative perception, Gaussian-based representations may provide a compact alternative to dense voxel grids or BEV features.
7. Vision for Autonomous Driving and Embodied Perception
Computer vision becomes especially challenging in autonomous driving and embodied perception because models must operate in dynamic, open-world, and safety-critical environments.
7.1 Autonomous Driving Perception
Autonomous driving perception includes:
- object detection;
- lane detection;
- semantic segmentation;
- depth estimation;
- BEV perception;
- occupancy prediction;
- motion prediction;
- collaborative perception.
The system must understand not only what objects are present, but also where they are in 3D space and how the scene may evolve.
7.2 BEV Perception
BEV perception transforms image or sensor features into a top-down representation.
Common approaches include:
- geometry-based projection;
- depth-distribution based lifting;
- Transformer-based cross-attention;
- temporal BEV fusion;
- multi-sensor BEV fusion.
BEV is effective because it provides a unified coordinate frame for perception, prediction, and planning.
7.3 Semantic Occupancy Prediction
Semantic occupancy prediction is a dense 3D scene understanding task.
It predicts both geometry and semantics in a voxel grid:
\[O \in \{0,1,\ldots,K\}^{X \times Y \times Z}.\]
Compared with detection, occupancy prediction provides a more complete representation of the environment.
Challenges include:
- occlusion;
- class imbalance;
- memory cost;
- 3D feature lifting;
- temporal consistency;
- evaluation across semantic classes.
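On the evaluation side, a common summary metric for semantic voxel labels is the intersection-over-union per class, averaged over classes (mIoU). The sketch below is a generic version under my own assumptions: classes absent from both prediction and ground truth are skipped, and no visibility mask is applied. Benchmark-specific protocols may differ.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU for each class label; NaN where a class appears in neither array."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious[c] = intersection / union
    return ious

# Tiny hypothetical label grids (flattened); real grids would be (X, Y, Z).
pred = np.array([0, 1, 1, 2])
target = np.array([0, 1, 2, 2])
ious = per_class_iou(pred, target, num_classes=3)
miou = np.nanmean(ious)
```

Because every class counts equally in the mean, rare classes with few voxels can dominate the score, which is one reason class imbalance appears in the list of challenges.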
7.4 Collaborative Perception
Collaborative perception allows multiple agents to exchange information.
This helps overcome:
- occlusion;
- limited field of view;
- long-range perception uncertainty;
- single-agent sensor failure.
However, communication bandwidth is limited. Therefore, collaborative perception must decide what information to transmit and how to fuse it.
This motivates my current research on token-based collaborative occupancy prediction and communication-efficient multi-agent perception.
7.5 Embodied Perception
Embodied perception studies visual perception for agents that act in the world.
Unlike static image understanding, embodied perception requires:
- spatial memory;
- active exploration;
- object interaction;
- temporal reasoning;
- perception-action coupling.
For my research, I focus mainly on the perception and world-modeling layer of embodied agents, especially 3D scene understanding and future occupancy prediction.
8. Personal Study Plan
My current computer vision study plan has three layers.
8.1 Classical Vision Layer
Main source:
- Cornell Introduction to Computer Vision.
Goal:
- understand image formation;
- learn classical features;
- study camera models;
- build geometric intuition.
8.2 3D Vision Layer
Main topics:
- multi-view geometry;
- stereo vision;
- depth estimation;
- structure from motion;
- 3D scene representation;
- camera pose estimation.
Goal:
- understand how 2D images become 3D structure;
- connect geometry with deep learning;
- prepare for research in 3D perception and occupancy prediction.
8.3 Autonomous Driving Vision Layer
Main topics:
- BEV perception;
- semantic occupancy prediction;
- collaborative perception;
- temporal modeling;
- occupancy world models;
- embodied perception.
Goal:
- connect computer vision foundations to my research direction;
- understand how perception systems are built for real autonomous agents;
- develop research ideas for efficient and predictive 3D scene understanding.
Closing Remarks
Computer vision is not only about recognizing objects in images. It is about understanding visual information as geometry, semantics, motion, and structure.
For my PhD preparation, the most important goal is to connect classical vision principles with modern deep learning systems:
- image formation explains visual input;
- camera geometry explains projection and 3D structure;
- multi-view geometry explains cross-view consistency;
- depth estimation explains spatial reasoning;
- 3D representations explain how scenes are encoded;
- autonomous driving perception explains how these ideas are used in real-world intelligent systems.
This foundation will support my research in 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models.