Computer Graphics Foundations
Computer graphics is becoming increasingly important for computer vision, autonomous driving, embodied perception, and 3D world modeling. While computer vision studies how to infer 3D structure and semantics from images, computer graphics studies how to represent, transform, simulate, and render 3D worlds.
For my PhD preparation, I want to study computer graphics not only as a separate field, but also as a foundation for 3D scene representation, neural rendering, simulation, differentiable perception, and occupancy world models.
This note is my long-term study record for computer graphics foundations. It focuses on the topics most relevant to my research direction: geometry, rendering, NeRF, 3D Gaussian Splatting, differentiable rendering, and their connections to autonomous driving and embodied perception.
Roadmap
This note is organized into the following chapters:
- Geometry for Computer Graphics: points, vectors, coordinate systems, transformations, homogeneous coordinates, meshes, surfaces, and scene representation.
- Rendering Pipeline: rasterization, z-buffering, shading, texture mapping, lighting, visibility, and camera models.
- Ray Tracing and Physically Based Rendering: rays, intersections, reflection, refraction, global illumination, path tracing, and rendering equations.
- Differentiable Rendering: differentiable rasterization, differentiable ray tracing, inverse rendering, gradient-based optimization, and vision–graphics integration.
- Neural Radiance Fields: volume rendering, radiance fields, positional encoding, view synthesis, neural scene representation, and dynamic NeRFs.
- 3D Gaussian Splatting: Gaussian primitives, anisotropic covariance, alpha compositing, differentiable splatting, real-time rendering, and scene optimization.
- Connections to Computer Vision and Autonomous Driving: 3D reconstruction, occupancy representation, neural rendering for simulation, embodied perception, and world models.
1. Geometry for Computer Graphics
Geometry is the foundation of computer graphics. A graphics system must represent objects, transform them between coordinate systems, and project them onto an image plane.
In computer vision, geometry is often used to recover 3D structure from images. In computer graphics, geometry is used to construct and render 3D structure into images. These two directions are dual to each other:
- Computer vision: images → 3D understanding
- Computer graphics: 3D scene → images
This duality is important for modern research because many methods combine both directions, especially in neural rendering, differentiable rendering, and 3D reconstruction.
1.1 Points, Vectors, and Coordinate Systems
A 3D point can be represented as:
\[p = [x,y,z]^T \in \mathbb{R}^3.\]
A vector represents direction and magnitude, while a point represents position. In practice, graphics systems often use homogeneous coordinates:
\[\tilde{p} = [x,y,z,1]^T.\]
Homogeneous coordinates allow translation, rotation, and scaling to be represented as a single matrix multiplication. Perspective projection also fits this form, followed by a division by the last coordinate.
A 3D scene usually contains multiple coordinate systems:
- object coordinate system;
- world coordinate system;
- camera coordinate system;
- image coordinate system;
- screen coordinate system.
Understanding these coordinate systems is essential for both rendering and 3D perception.
1.2 Geometric Transformations
Common transformations include translation, rotation, scaling, and shearing.
A rigid transformation from object (local) coordinates to world coordinates can be written as:
\[p_w = R p_o + t,\]
where:
- \(R\) is a rotation matrix;
- \(t\) is a translation vector;
- \(p_o\) is a point in object coordinates;
- \(p_w\) is the transformed point in world coordinates.
Using homogeneous coordinates:
\[\tilde{p}_w = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tilde{p}_o.\]
This matrix form is important for rendering pipelines, robotics, autonomous driving, and collaborative perception, where objects and sensors must be aligned across different coordinate frames.
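As a rough illustration of the homogeneous form above, the following sketch builds the 4x4 matrix from a rotation and a translation and applies it to a point. It uses plain Python lists to stay dependency-free; the helper names (`rigid_transform`, `apply_transform`, `rot_z`) are made up for this note and are not from any library.

```python
import math

def rigid_transform(R, t):
    """Build the 4x4 homogeneous matrix [[R, t], [0, 1]] from a 3x3 R and a 3-vector t."""
    return [
        [R[0][0], R[0][1], R[0][2], t[0]],
        [R[1][0], R[1][1], R[1][2], t[1]],
        [R[2][0], R[2][1], R[2][2], t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

def apply_transform(T, p):
    """Lift a 3D point to [x, y, z, 1], multiply by T, and drop the last coordinate."""
    ph = [p[0], p[1], p[2], 1.0]
    out = [sum(T[i][k] * ph[k] for k in range(4)) for i in range(4)]
    return out[:3]

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Object frame rotated 90 degrees about z and shifted by +1 along world x.
T_wo = rigid_transform(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
p_w = apply_transform(T_wo, [1.0, 0.0, 0.0])  # approximately [1, 1, 0]
```

The point (1, 0, 0) is first rotated to (0, 1, 0) and then translated, which matches \(p_w = R p_o + t\).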
1.3 Mesh Representation
A mesh represents a surface using vertices, edges, and faces.
A triangular mesh is usually written as:
\[\mathcal{M}=(V,F),\]
where:
- \(V=\{v_i\}\) is the set of vertices;
- \(F=\{f_j\}\) is the set of triangular faces.
Meshes are widely used because they are efficient and compatible with rasterization.
Advantages:
- compact surface representation;
- efficient rendering;
- explicit geometry;
- easy to texture and shade.
Limitations:
- hard to represent topology changes;
- difficult to model volumetric uncertainty;
- less convenient for occluded or unknown regions;
- not ideal for free-space reasoning.
For autonomous driving and occupancy prediction, meshes are useful for surface reconstruction, but voxel and occupancy representations are often better for representing free, occupied, and unknown space.
1.4 Surfaces and Implicit Geometry
Instead of explicitly storing faces, a surface can also be represented implicitly.
An implicit surface is defined as the zero level set of a function:
\[f(x,y,z)=0.\]
A signed distance function, or SDF, is a common implicit representation:
\[f(p)=d,\]
where \(d\) is the signed distance from point \(p\) to the nearest surface.
- \(f(p)>0\): outside the object;
- \(f(p)<0\): inside the object;
- \(f(p)=0\): on the surface.
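A sphere gives the simplest concrete SDF: the distance to the center minus the radius. The sketch below follows the sign convention listed above; `sphere_sdf` is an illustrative name, not a library function.

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere: positive outside, negative inside."""
    return math.dist(p, center) - radius

outside = sphere_sdf((2.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)     # 1.0
inside = sphere_sdf((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)      # -1.0
on_surface = sphere_sdf((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)  # 0.0
```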
Implicit representations are important in modern neural graphics because neural networks can learn continuous functions over 3D space.
1.5 Scene Graphs
A complex 3D scene is often organized as a scene graph.
A scene graph represents objects and transformations in a hierarchical structure. For example:
World
├── Vehicle
│ ├── Wheel
│ └── Camera
└── Road
Each node has a local transformation relative to its parent.
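A minimal sketch of how the World, Vehicle, Camera chain above could be evaluated: each node stores a local 4x4 transform, and its world transform is the parent's world transform times its own local one. The `Node` class, the translation-only transforms, and the numeric offsets are assumptions chosen for illustration.

```python
def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous 4x4 translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

class Node:
    """Scene graph node holding a transform relative to its parent."""

    def __init__(self, name, local, parent=None):
        self.name = name
        self.local = local
        self.parent = parent

    def world(self):
        """Compose local transforms from the root down to this node."""
        if self.parent is None:
            return self.local
        return matmul4(self.parent.world(), self.local)

world = Node("World", translation(0.0, 0.0, 0.0))
vehicle = Node("Vehicle", translation(10.0, 0.0, 0.0), parent=world)
camera = Node("Camera", translation(1.5, 0.0, 1.2), parent=vehicle)

# Camera origin in world coordinates: the last column of its world transform.
cam_origin = [row[3] for row in camera.world()[:3]]  # approximately [11.5, 0.0, 1.2]
```

Moving the vehicle node then moves the camera with it, which is the main practical benefit of the hierarchy.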
Scene graphs are useful for:
- hierarchical modeling;
- animation;
- robotics simulation;
- autonomous driving simulation;
- embodied environments.
For embodied AI, scene graphs can also represent semantic relationships between objects and spaces.
2. Rendering Pipeline
Rendering converts a 3D scene into a 2D image. The classical real-time rendering pipeline includes:
Geometry processing → View transformation → Projection → Rasterization → Shading → Image output
This pipeline is used in graphics engines, games, simulators, and many visualization systems.
2.1 Camera Projection
The camera maps 3D points to 2D image coordinates.
A point in camera coordinates is:
\[P_c=[X,Y,Z]^T.\]
The normalized image coordinates are:
\[x=\frac{X}{Z}, \quad y=\frac{Y}{Z}.\]
After applying intrinsics:
\[u=f_x\frac{X}{Z}+c_x,\]
\[v=f_y\frac{Y}{Z}+c_y,\]
where \(f_x\) and \(f_y\) are the focal lengths in pixels and \((c_x, c_y)\) is the principal point.
This is the same pinhole camera model used in computer vision. In graphics, this projection is used to render 3D scenes. In vision, the inverse problem is to infer 3D structure from rendered or captured images.
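The projection equations translate almost directly into code. This is a small sketch with made-up intrinsics; the function name `project` and the check for points behind the camera are my own additions.

```python
def project(P_c, fx, fy, cx, cy):
    """Project a camera-frame point [X, Y, Z] to pixel coordinates (u, v)."""
    X, Y, Z = P_c
    if Z <= 0:
        # Points at or behind the camera plane have no valid pinhole projection.
        raise ValueError("point is not in front of the camera")
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# Hypothetical intrinsics for a 1280x720 image.
u, v = project((1.0, 0.5, 2.0), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
# u = 1000 * 0.5 + 640 = 1140.0, v = 1000 * 0.25 + 360 = 610.0
```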
2.2 Rasterization
Rasterization converts geometric primitives, usually triangles, into pixels.
The main idea is:
- project vertices onto the image plane;
- determine which pixels are covered by each triangle;
- interpolate vertex attributes inside each triangle;
- shade visible pixels.
Rasterization is efficient and is the foundation of real-time rendering.
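The coverage and interpolation steps above can be sketched with edge functions and barycentric weights. This is a deliberately naive version that tests every pixel center of a small image; real rasterizers restrict the loop to a bounding box and run on the GPU. All names are illustrative.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p); the sign says which side of ab holds p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, attrs, width, height):
    """Return {(x, y): interpolated attribute} for pixel centers inside the triangle.

    v0, v1, v2 are already-projected 2D vertices; attrs holds one scalar per vertex.
    """
    area = edge(v0, v1, v2)
    covered = {}
    if area == 0:
        return covered  # degenerate triangle covers nothing
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            # Barycentric weights; dividing by the signed area handles both windings.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                covered[(x, y)] = w0 * attrs[0] + w1 * attrs[1] + w2 * attrs[2]
    return covered

# Interpolate the x coordinate itself as the vertex attribute.
pixels = rasterize((0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (0.0, 4.0, 0.0), 4, 4)
```

Because the interpolated attribute here is the vertex x coordinate, every covered pixel should receive its own center x, which is an easy sanity check for the barycentric weights.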
It is widely used in:
- graphics engines;
- game rendering;
- AR/VR;
- robotics simulation;
- autonomous driving simulators.
2.3 Z-Buffering and Visibility
When multiple surfaces project to the same pixel, the renderer must decide which surface is visible.
The z-buffer stores the depth value of the closest surface for each pixel.
For each candidate fragment:
- if its depth is smaller than the stored depth, it is visible and replaces the stored depth and color;
- otherwise, it is occluded.
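The depth test above amounts to a per-pixel minimum. A minimal sketch, assuming fragments arrive as `(x, y, depth, color)` tuples in arbitrary order (this tuple layout is an assumption for the example):

```python
import math

def zbuffer_resolve(fragments, width, height):
    """Keep, for each pixel, the color of the fragment with the smallest depth."""
    depth = [[math.inf] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:  # closer than anything seen so far at this pixel
            depth[y][x] = z
            color[y][x] = c
    return color, depth

fragments = [(1, 0, 5.0, "far"), (1, 0, 2.0, "near"), (1, 0, 3.0, "mid")]
color, depth = zbuffer_resolve(fragments, width=2, height=1)
# Pixel (1, 0) ends up "near" at depth 2.0; pixel (0, 0) stays empty.
```

The result does not depend on the order in which fragments are submitted, which is why z-buffering needs no sorting for opaque surfaces.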
Visibility is also a key problem in computer vision. In occupancy prediction, the model must reason about visible, occluded, free, occupied, and unknown regions.
2.4 Shading
Shading computes the color of a surface point based on material, lighting, and viewing direction.
A simple diffuse shading model is Lambertian shading:
\[I = k_d \max(0, n^T l),\]
where:
- \(k_d\) is diffuse reflectance;
- \(n\) is the unit surface normal;
- \(l\) is the unit direction toward the light.
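A small sketch of the Lambertian term, assuming unit light intensity as in the formula above. The inputs are normalized inside the function so callers do not need to pass unit vectors; `lambert` and `normalize` are illustrative names.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def lambert(k_d, n, l):
    """Diffuse intensity k_d * max(0, n . l) for normal n and light direction l."""
    n, l = normalize(n), normalize(l)
    cos_theta = sum(a * b for a, b in zip(n, l))
    return k_d * max(0.0, cos_theta)

head_on = lambert(0.8, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))   # 0.8
behind = lambert(0.8, (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))   # 0.0, light is below the surface
tilted = lambert(0.8, (0.0, 0.0, 1.0),
                 (math.sin(math.pi / 3), 0.0, math.cos(math.pi / 3)))  # about 0.4 at 60 degrees
```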
More complex shading models include specular reflection, physically based materials, and global illumination.
Shading is important because visual appearance depends not only on geometry, but also on light and material.
2.5 Texture Mapping
Texture mapping assigns image patterns to 3D surfaces.
A mesh vertex or surface point is assigned texture coordinates:
\[(u,v).\]
The renderer samples a texture image at those coordinates to determine color.
Texture mapping is useful for realistic rendering, simulation, and synthetic data generation.
In vision, texture can provide useful cues, but it may also introduce domain gaps between synthetic and real data.
3. Ray Tracing and Physically Based Rendering
Rasterization is efficient, but it approximates light transport. Ray tracing simulates the path of light more explicitly.
3.1 Ray Representation
A ray is represented as:
\[r(t)=o+td,\]
where:
- \(o\) is the ray origin;
- \(d\) is the ray direction;
- \(t\) is the ray parameter, which equals the distance along the ray when \(d\) has unit length.
Rendering can be done by shooting rays from the camera into the scene and finding intersections with objects.
3.2 Ray–Surface Intersection
For each ray, the renderer finds the closest surface intersection.
For example, for a sphere centered at \(c\) with radius \(R\), the intersection satisfies:
\[\|o+td-c\|^2 = R^2.\]
Solving this quadratic equation in \(t\) gives the possible intersection depths.
Ray intersections are fundamental in ray tracing, NeRF volume rendering, and differentiable rendering.
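Expanding the sphere equation gives a quadratic \(a t^2 + b t + c' = 0\) with \(a = d \cdot d\), \(b = 2\, d \cdot (o - c)\), and \(c' = \|o - c\|^2 - R^2\). The sketch below solves it and returns the nearest non-negative root. It assumes a non-zero direction; the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ray_sphere(o, d, center, radius):
    """Return the smallest t >= 0 where o + t*d hits the sphere, or None if it misses."""
    oc = [o[i] - center[i] for i in range(3)]
    a = dot(d, d)
    b = 2.0 * dot(d, oc)
    c = dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # the ray's line never touches the sphere
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0:
            return t  # nearest intersection in front of the origin
    return None  # both intersections lie behind the origin

t_hit = ray_sphere((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 5.0), 1.0)   # 4.0
t_miss = ray_sphere((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 3.0, 5.0), 1.0)  # None
t_inside = ray_sphere((0.0, 0.0, 5.0), (0.0, 0.0, 1.0), (0.0, 0.0, 5.0), 1.0)  # 1.0, exits the sphere
```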
3.3 Reflection and Refraction
Ray tracing can model effects such as:
- reflection;
- refraction;
- shadows;
- indirect illumination;
- transparency.
For perfect mirror reflection, the reflected direction is:
\[r = d - 2(d^T n)n,\]
where \(d\) is the incoming direction and \(n\) is the unit surface normal.
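The reflection formula is a one-liner in code. This sketch assumes the normal passed in already has unit length.

```python
def reflect(d, n):
    """Mirror direction d about a surface with unit normal n: r = d - 2 (d . n) n."""
    k = 2.0 * sum(a * b for a, b in zip(d, n))
    return tuple(d[i] - k * n[i] for i in range(3))

# A ray heading down and to the right bounces off a floor with normal +y.
r = reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0))  # (1.0, 1.0, 0.0)
```

Only the component along the normal flips sign; the tangential component is unchanged.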
These effects are important for realistic rendering, but they also make inverse vision problems harder because appearance is affected by complex light transport.
3.4 Rendering Equation
The rendering equation describes outgoing radiance:
\[L_o(x,\omega_o)=L_e(x,\omega_o)+ \int_{\Omega} f_r(x,\omega_i,\omega_o)L_i(x,\omega_i)(\omega_i \cdot n)d\omega_i,\]
where:
- \(L_o\) is outgoing radiance;
- \(L_e\) is emitted radiance;
- \(f_r\) is the BRDF;
- \(L_i\) is incoming radiance;
- \(\omega_i\) and \(\omega_o\) are incoming and outgoing directions;
- \(n\) is the surface normal at \(x\), and \(\Omega\) is the hemisphere of directions around it.
This equation is central to physically based rendering.
3.5 Path Tracing
Path tracing estimates the rendering equation using Monte Carlo sampling.
Instead of computing all light paths exactly, it samples possible paths and averages their contributions.
Path tracing can produce highly realistic images, but it is computationally expensive.
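To make the Monte Carlo idea concrete, here is a single-bounce sketch rather than a full path tracer. It estimates the reflection integral for a Lambertian surface (BRDF equal to albedo divided by pi) under constant incoming radiance, using uniform hemisphere sampling in a local frame where the normal is +z. For this setup the exact answer is simply the albedo, so the estimate can be checked. The function names and the sample count are arbitrary choices.

```python
import math
import random

def sample_hemisphere(rng):
    """Uniformly sample a unit direction on the hemisphere around +z (pdf = 1 / (2 pi))."""
    z = rng.random()                       # cos(theta) is uniform for uniform-area sampling
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def estimate_reflected_radiance(albedo, incoming, n_samples, seed=0):
    """Monte Carlo estimate of the integral of f_r * L_i * cos(theta) over the hemisphere."""
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)
    f_r = albedo / math.pi                 # Lambertian BRDF
    total = 0.0
    for _ in range(n_samples):
        w_i = sample_hemisphere(rng)
        cos_theta = w_i[2]                 # the normal is +z in this local frame
        total += f_r * incoming(w_i) * cos_theta / pdf
    return total / n_samples

# Constant white environment: the exact reflected radiance equals the albedo, 0.5.
L_o = estimate_reflected_radiance(0.5, lambda w: 1.0, n_samples=20000)
```

The estimate is noisy, and the noise shrinks only with the square root of the sample count, which is the main reason path tracing is expensive.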
For machine learning, path tracing is relevant because realistic synthetic data can support simulation, domain randomization, and training data generation.
4. Differentiable Rendering
Differentiable rendering makes the rendering process differentiable, so gradients can flow from image-space losses back to scene parameters.
This connects computer graphics and computer vision.
Classical rendering answers:
Given a scene, what image will be produced?
Differentiable rendering asks:
Given an image loss, how should the scene parameters change?
4.1 Inverse Rendering
Inverse rendering tries to recover scene properties from images, such as:
- geometry;
- material;
- lighting;
- camera pose;
- texture;
- motion.
A typical optimization objective is:
\[\min_\theta \|R(\theta)-I\|^2,\]
where:
- \(R(\theta)\) is the rendered image;
- \(I\) is the target image;
- \(\theta\) represents scene parameters.
If the renderer is differentiable, gradients can be computed as:
\[\frac{\partial \mathcal{L}}{\partial \theta}.\]
4.2 Differentiable Rasterization
Rasterization contains discrete visibility decisions, which are not naturally differentiable.
Differentiable rasterization approximates these operations using soft visibility or probabilistic formulations.
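A toy 1D example may help show why soft visibility matters. The "scene" is a single edge position, and the "image" is the coverage of the interval to the left of that edge at each pixel center. With hard coverage (0 or 1), a small change in the edge position does not change the image, so the gradient is zero almost everywhere. Replacing the step with a sigmoid gives a usable gradient, and plain gradient descent can then recover the edge from a target image. This is only a sketch of the idea, not any particular published soft rasterizer; the temperature `tau`, learning rate, and step count are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def render_soft(edge_pos, n_pixels, tau):
    """Soft coverage of the half-line left of edge_pos, sampled at pixel centers k + 0.5."""
    return [sigmoid((edge_pos - (k + 0.5)) / tau) for k in range(n_pixels)]

def loss_and_grad(edge_pos, target, tau):
    """Squared image error and its analytic derivative with respect to edge_pos."""
    image = render_soft(edge_pos, len(target), tau)
    loss = sum((s - t) ** 2 for s, t in zip(image, target))
    # Chain rule: d(sigmoid)/d(edge_pos) = s * (1 - s) / tau.
    grad = sum(2.0 * (s - t) * s * (1.0 - s) / tau for s, t in zip(image, target))
    return loss, grad

def fit_edge(target, edge_init, tau=1.0, lr=0.5, steps=500):
    """Recover the edge position by gradient descent on the image loss."""
    edge_pos = edge_init
    for _ in range(steps):
        _, grad = loss_and_grad(edge_pos, target, tau)
        edge_pos -= lr * grad
    return edge_pos

target = render_soft(5.2, n_pixels=10, tau=1.0)   # image rendered with the true edge at 5.2
recovered = fit_edge(target, edge_init=2.0)       # converges close to 5.2
```

The same pattern, an image loss differentiated through a softened renderer, carries over to optimizing mesh vertices or camera pose.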
This enables optimization of:
- mesh vertices;
- camera pose;
- texture;
- material;
- lighting.
Differentiable rasterization is useful for mesh reconstruction and model fitting.
4.3 Differentiable Volume Rendering
Volume rendering is easier to differentiate than hard surface rasterization because it blends contributions from many samples along each ray instead of making a discrete visibility decision.
This is one reason why NeRF-style methods became successful.
A ray accumulates color from many sampled points, and gradients can flow to density and color values along the ray.
This supports continuous scene optimization from images.
4.4 Why Differentiable Rendering Matters for Vision
Differentiable rendering is important because it provides a bridge between:
- 2D image supervision;
- 3D scene representation;
- geometric consistency;
- neural networks;
- optimization.
It enables models to learn 3D structure from image-level signals, which is useful when direct 3D supervision is expensive or unavailable.
For autonomous driving and embodied perception, differentiable rendering may support:
- self-supervised depth learning;
- 3D reconstruction;
- simulation-based learning;
- view synthesis;
- occupancy world modeling.
5. Neural Radiance Fields
Neural Radiance Fields, or NeRF, represent a 3D scene as a continuous neural function.
Given a 3D location \(x\) and viewing direction \(d\), NeRF predicts density and color:
\[F_\theta(x,d) \rightarrow (\sigma, c),\]
where:
- \(\sigma\) is volume density;
- \(c\) is RGB color.
5.1 Positional Encoding
Neural networks struggle to represent high-frequency details directly from raw coordinates. NeRF uses positional encoding:
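\[\gamma(x)= (\sin(2^0\pi x), \cos(2^0\pi x), \ldots, \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x)).\]
The encoding is applied separately to each input coordinate, and \(L\) sets the number of frequency bands. This allows the network to model fine spatial details.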
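The encoding above for a single scalar coordinate can be written in a few lines. In practice it would be applied to each of the three position coordinates (and to the direction) and the results concatenated; this sketch only covers the scalar case.

```python
import math

def positional_encoding(x, num_bands):
    """Map a scalar x to (sin(2^k pi x), cos(2^k pi x)) for k = 0 .. num_bands - 1."""
    features = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features

enc = positional_encoding(0.5, num_bands=4)  # 8 features for one coordinate
# The first pair is (sin(pi/2), cos(pi/2)), i.e. approximately (1, 0).
```

Higher bands oscillate faster, so nearby inputs that look almost identical as raw coordinates become easy to tell apart.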
The same idea is connected to positional encoding in Transformers and coordinate-based neural networks.
5.2 Volume Rendering
NeRF renders an image by sampling points along camera rays.
For a ray:
\[r(t)=o+td,\]
NeRF evaluates density and color at sampled points. The final pixel color is approximated by:
\[C(r)=\sum_i T_i \alpha_i c_i,\]
where:
\[\alpha_i=1-\exp(-\sigma_i \delta_i),\]
\(\delta_i\) is the distance between adjacent samples, and \(T_i\) is accumulated transmittance:
\[T_i=\prod_{j<i}(1-\alpha_j).\]
This differentiable rendering process allows NeRF to be trained from posed images.
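The compositing sum can be computed in one front-to-back pass by keeping a running transmittance. The sketch below takes densities, step sizes, and colors as plain lists (in NeRF these would come from the network); the sample values are made up.

```python
import math

def composite(sigmas, deltas, colors):
    """Front-to-back compositing: C = sum_i T_i * alpha_i * c_i.

    Returns the pixel color and the per-sample weights T_i * alpha_i.
    """
    T = 1.0                      # transmittance before the first sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, delta, c in zip(sigmas, deltas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)
        w = T * alpha
        weights.append(w)
        for ch in range(3):
            pixel[ch] += w * c[ch]
        T *= 1.0 - alpha         # light remaining after passing this sample
    return pixel, weights

# Empty space, then a nearly opaque green sample, then a blue sample hidden behind it.
pixel, weights = composite(
    sigmas=[0.0, 1000.0, 5.0],
    deltas=[0.1, 0.1, 0.1],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
)
# pixel is approximately [0, 1, 0]: the opaque sample blocks everything behind it.
```

Every operation here is smooth in the densities and colors, which is what lets gradients flow from a pixel loss back to the samples along the ray.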
5.3 Strengths and Limitations of NeRF
Strengths:
- continuous 3D representation;
- high-quality novel view synthesis;
- differentiable optimization;
- strong geometric consistency from multiple views.
Limitations:
- slow rendering in original NeRF;
- requires accurate camera poses;
- struggles with dynamic scenes;
- difficult to scale to large outdoor scenes;
- less directly suitable for real-time autonomous driving.
Many later works improve speed, dynamic modeling, scalability, and generalization.
5.4 Dynamic NeRFs
Dynamic NeRFs extend NeRF to time-varying scenes.
A dynamic radiance field can be written as:
\[F_\theta(x,d,t) \rightarrow (\sigma, c).\]
This adds time as an input and allows the model to represent motion.
Dynamic scene modeling is closely related to occupancy world models, because both aim to represent how the 3D world changes over time.
5.5 NeRF and Autonomous Driving
NeRF is useful for autonomous driving research in several ways:
- reconstructing driving scenes;
- generating novel views;
- simulation and data augmentation;
- self-supervised 3D learning;
- scene editing and digital twins.
However, autonomous driving scenes are large, dynamic, and safety-critical, so NeRF must be adapted carefully for scalability and real-time constraints.
6. 3D Gaussian Splatting
3D Gaussian Splatting represents a scene using a set of 3D Gaussian primitives and renders them efficiently through splatting.
Each Gaussian can be represented by:
\[G_i = (\mu_i, \Sigma_i, \alpha_i, c_i),\]
where:
- \(\mu_i\) is the 3D position;
- \(\Sigma_i\) is the covariance;
- \(\alpha_i\) is opacity;
- \(c_i\) is color or appearance feature.
6.1 Gaussian Primitives
Unlike point clouds, Gaussian primitives have spatial extent. The covariance matrix describes their shape and orientation.
An anisotropic Gaussian can stretch differently in different directions, making it more expressive than a simple point.
This helps represent surfaces with fewer primitives than dense point clouds.
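To see what anisotropy means numerically, the sketch below works in 2D for brevity. It builds a covariance from a rotation angle and two axis scales, \(\Sigma = R\,\mathrm{diag}(s_x^2, s_y^2)\,R^T\), and evaluates the unnormalized Gaussian falloff \(\exp(-\tfrac{1}{2}(p-\mu)^T \Sigma^{-1} (p-\mu))\). The rotation-plus-scale factorization is an assumption of this example (it is a common way to keep the covariance valid during optimization), and the function names are illustrative.

```python
import math

def covariance_2d(theta, sx, sy):
    """Covariance R diag(sx^2, sy^2) R^T for a Gaussian rotated by theta."""
    c, s = math.cos(theta), math.sin(theta)
    a = c * c * sx * sx + s * s * sy * sy
    b = c * s * (sx * sx - sy * sy)
    d = s * s * sx * sx + c * c * sy * sy
    return [[a, b], [b, d]]

def gaussian_falloff(p, mu, cov):
    """Unnormalized Gaussian weight exp(-0.5 * (p - mu)^T cov^-1 (p - mu))."""
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    det = a * d - b * b
    # Closed-form 2x2 inverse applied to the offset vector.
    mahalanobis_sq = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * mahalanobis_sq)

cov = covariance_2d(theta=0.0, sx=2.0, sy=0.5)          # stretched along x
along_x = gaussian_falloff((2.0, 0.0), (0.0, 0.0), cov)  # exp(-0.5), still significant
along_y = gaussian_falloff((0.0, 2.0), (0.0, 0.0), cov)  # exp(-8), almost zero
```

The same offset of 2 units gives very different weights depending on direction, which is how one elongated Gaussian can cover a thin surface patch.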
6.2 Splatting
Splatting projects 3D Gaussians onto the image plane and blends them to produce pixels.
The rendering process is efficient because it avoids expensive neural network queries along every ray.
This makes 3D Gaussian Splatting much faster than original NeRF-style rendering.
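The blending step can be sketched for a single pixel: sort the projected Gaussians by depth, then composite front to back exactly as in volume rendering, with each splat's effective alpha being its opacity times its 2D falloff at the pixel. For simplicity this example uses isotropic 2D footprints and a dictionary layout invented for this note; actual implementations use projected anisotropic covariances and tile-based GPU sorting.

```python
import math

def splat_pixel(pixel, splats):
    """Blend depth-sorted 2D Gaussian splats at one pixel, front to back."""
    T = 1.0                      # remaining transmittance
    out = [0.0, 0.0, 0.0]
    for s in sorted(splats, key=lambda s: s["depth"]):
        dx = pixel[0] - s["mean2d"][0]
        dy = pixel[1] - s["mean2d"][1]
        falloff = math.exp(-0.5 * (dx * dx + dy * dy) / (s["sigma"] ** 2))
        alpha = s["opacity"] * falloff
        for ch in range(3):
            out[ch] += T * alpha * s["color"][ch]
        T *= 1.0 - alpha
    return out

splats = [
    {"mean2d": (3.0, 3.0), "sigma": 1.0, "depth": 5.0, "opacity": 1.0, "color": (0.0, 0.0, 1.0)},
    {"mean2d": (3.0, 3.0), "sigma": 1.0, "depth": 2.0, "opacity": 0.5, "color": (1.0, 0.0, 0.0)},
]
blended = splat_pixel((3.0, 3.0), splats)
# The nearer half-transparent red splat contributes 0.5, the blue one behind it the remaining 0.5.
```

No network is evaluated per sample; each pixel only touches the handful of splats that overlap it.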
6.3 Optimization
3D Gaussian Splatting optimizes Gaussian parameters using image reconstruction loss.
Typical optimized parameters include:
- position;
- covariance;
- opacity;
- color;
- sometimes semantic or feature embeddings.
Because rendering is differentiable, gradients can update Gaussian parameters from image losses.
6.4 Strengths and Limitations
Strengths:
- real-time or near real-time rendering;
- explicit 3D primitives;
- high-quality view synthesis;
- efficient scene representation;
- easier editing than implicit neural fields.
Limitations:
- depends on initialization quality;
- may require many Gaussians for large scenes;
- handling dynamics is non-trivial;
- semantic and occupancy reasoning require additional design;
- memory can still be large for city-scale scenes.
6.5 Gaussian Representations for Perception
Gaussian representations are increasingly relevant to perception.
Potential connections include:
- representing 3D scenes compactly;
- fusing multi-view observations;
- modeling uncertainty with covariance;
- encoding semantic features in Gaussian primitives;
- supporting rendering-based supervision;
- enabling efficient scene reconstruction.
For collaborative perception, Gaussian primitives could provide a compact alternative to dense BEV or voxel features, especially when communication cost matters.
7. Connections to Computer Vision and Autonomous Driving
Computer graphics and computer vision are deeply connected. Graphics provides the forward model from 3D scene to image. Vision tries to solve the inverse problem from image to scene.
7.1 3D Reconstruction
3D reconstruction estimates scene geometry from images or sensors.
Graphics concepts help define what is being reconstructed:
- mesh surfaces;
- point clouds;
- implicit fields;
- Gaussian primitives;
- occupancy grids.
Differentiable rendering allows reconstruction to be supervised by image reconstruction losses.
7.2 Occupancy Representation
Occupancy prediction represents whether each 3D region is free, occupied, or semantically labeled.
Compared with surface-only graphics representations, occupancy is more suitable for autonomous driving because it describes:
- visible surfaces;
- free space;
- occluded regions;
- semantic categories;
- planning-relevant structure.
This makes occupancy representation a natural bridge between computer vision, robotics, and world modeling.
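A very small sketch of the free, occupied, and unknown distinction, assuming a single range measurement along a ray that has been discretized into equal cells. Cells before the hit are observed free, the cell containing the hit is occupied, and cells behind it are unobserved. The labels, the function name, and the treatment of a ray with no return are all assumptions for illustration.

```python
FREE, OCCUPIED, UNKNOWN = "free", "occupied", "unknown"

def label_ray_cells(num_cells, cell_size, hit_range):
    """Label cells along one ray given the measured range to the first surface.

    hit_range is None when the ray returned no hit within the grid; in that
    case this sketch treats every cell along the ray as observed free space.
    """
    if hit_range is None:
        return [FREE] * num_cells
    hit_index = int(hit_range // cell_size)
    labels = []
    for i in range(num_cells):
        if i < hit_index:
            labels.append(FREE)        # the ray passed through this cell
        elif i == hit_index:
            labels.append(OCCUPIED)    # the ray stopped here
        else:
            labels.append(UNKNOWN)     # occluded by the hit
    return labels

labels = label_ray_cells(num_cells=5, cell_size=1.0, hit_range=2.5)
# ["free", "free", "occupied", "unknown", "unknown"]
```

A surface-only representation would keep just the occupied cell; the free and unknown labels are the extra information that planning needs.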
7.3 Simulation and Synthetic Data
Computer graphics enables simulation environments for training and evaluation.
Simulation is important for:
- generating rare scenarios;
- testing safety-critical cases;
- domain randomization;
- controllable data generation;
- embodied AI environments;
- autonomous driving digital twins.
However, simulation-to-real transfer remains challenging because of differences in appearance, physics, sensors, and behavior.
7.4 Differentiable Perception Systems
Differentiable rendering can be integrated into perception systems to provide geometric supervision.
Examples:
- learning depth from view synthesis;
- optimizing pose from image reconstruction;
- training neural scene representations;
- enforcing multi-view consistency;
- learning 3D structure without direct 3D labels.
This is important because 3D annotations are expensive, while images and videos are abundant.
7.5 Occupancy World Models
World models aim to predict how the environment evolves over time.
Computer graphics contributes to this direction by providing tools for:
- 3D scene representation;
- dynamic scene modeling;
- differentiable simulation;
- neural rendering;
- future view and future occupancy prediction.
For my research, graphics is especially relevant to occupancy world models, where the goal is to model both current 3D structure and future scene evolution.
8. Personal Study Plan
My computer graphics study plan has three layers.
8.1 Classical Graphics Layer
Main topics:
- geometry;
- transformations;
- meshes;
- rasterization;
- shading;
- texture mapping;
- ray tracing.
Goal:
- understand the forward rendering pipeline;
- build intuition for 3D representation;
- connect graphics with camera geometry.
8.2 Neural Graphics Layer
Main topics:
- differentiable rendering;
- NeRF;
- dynamic NeRF;
- 3D Gaussian Splatting;
- neural implicit surfaces;
- neural scene representations.
Goal:
- understand how neural networks represent and render 3D scenes;
- learn how image supervision can train 3D models;
- connect neural rendering with 3D perception.
8.3 Autonomous Driving and Embodied Layer
Main topics:
- simulation;
- synthetic data;
- occupancy representation;
- dynamic scene modeling;
- differentiable perception;
- occupancy world models.
Goal:
- connect computer graphics to my research in autonomous driving perception;
- understand how graphics-based representations can support predictive 3D scene understanding;
- explore future links between neural rendering, collaborative perception, and embodied AI.
Closing Remarks
Computer graphics provides the forward model of vision: how 3D worlds produce images. Computer vision solves the inverse problem: how images reveal 3D worlds.
For my PhD preparation, I want to understand both sides.
The most important connections are:
- geometry explains 3D structure;
- rendering explains image formation;
- differentiable rendering connects images and optimization;
- NeRF represents scenes as continuous neural fields;
- 3D Gaussian Splatting provides an efficient explicit scene representation optimized through differentiable rendering;
- graphics-based simulation supports autonomous driving and embodied AI;
- occupancy world models connect 3D perception, prediction, and scene evolution.
This foundation will support my research in 3D perception, semantic occupancy prediction, collaborative perception, and occupancy-based world modeling.