Deep Learning Foundations

Deep learning is the core technical foundation of modern computer vision, autonomous driving perception, and embodied perception. Most recent progress in 2D vision, 3D perception, semantic occupancy prediction, collaborative perception, and world models is built on deep neural networks.

For my PhD preparation, I want to study deep learning from two perspectives:

  1. Foundation: neural networks, optimization, CNNs, Transformers, representation learning, and regularization.
  2. Application: how these models are used in computer vision, 3D perception, autonomous driving, and embodied AI.

This note is my long-term study record for deep learning. It is organized around several major references and themes, including CS231n, Andrew Ng’s Deep Learning Specialization, CNNs, Transformers, and modern vision architectures.


Roadmap

This note is organized into the following chapters:

  1. Neural Network Fundamentals
    Basic neural network structure, activation functions, loss functions, backpropagation, normalization, initialization, and regularization.

  2. Andrew Ng Deep Learning Specialization
    Practical foundations of deep neural networks, hyperparameter tuning, optimization, CNNs, sequence models, and applied deep learning workflows.

  3. CS231n: Deep Learning for Computer Vision
    Image classification, CNNs, optimization, detection, segmentation, recurrent models, attention, generative models, and visual representation learning.

  4. Convolutional Neural Networks
    Convolution, receptive field, padding, stride, pooling, residual networks, feature pyramids, and dense prediction backbones.

  5. Transformers
    Self-attention, multi-head attention, positional encoding, Vision Transformers, cross-attention, efficient attention, and token-based modeling.

  6. Deep Learning for 3D Vision and Autonomous Driving
    BEV perception, multi-view feature lifting, occupancy prediction, temporal modeling, collaborative perception, and occupancy world models.

  7. Training and Debugging Deep Models
    Learning-rate schedules, optimizer choice, overfitting, underfitting, ablations, reproducibility, and practical debugging.


1. Neural Network Fundamentals

A neural network is a parameterized function:

\[f_\theta(x): \mathcal{X} \rightarrow \mathcal{Y},\]

where \(x\) is the input, \(f_\theta(x)\) is the prediction for the target \(y\), and \(\theta\) denotes the learnable parameters.

A typical neural network layer can be written as:

\[h^{(l)} = \sigma(W^{(l)}h^{(l-1)} + b^{(l)}),\]

where:

  • \(W^{(l)}\) is the weight matrix;
  • \(b^{(l)}\) is the bias;
  • \(\sigma\) is a nonlinear activation function;
  • \(h^{(l)}\) is the hidden representation at layer \(l\).

The key idea is that deep networks learn hierarchical representations by composing many simple transformations.
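
As a minimal sketch of the layer equation above, the snippet below computes one fully connected layer in plain Python. It assumes ReLU as the activation, and the weights, bias, and input are arbitrary illustrative numbers, not values from any real model.

```python
def relu(values):
    """Element-wise ReLU, used here as the nonlinearity sigma."""
    return [max(0.0, v) for v in values]


def dense_layer(W, b, h_prev):
    """Compute sigma(W h_prev + b) for one fully connected layer."""
    pre_activation = [
        sum(w_ij * h_j for w_ij, h_j in zip(row, h_prev)) + b_i
        for row, b_i in zip(W, b)
    ]
    return relu(pre_activation)


# Illustrative 2 -> 3 layer; the numbers are arbitrary.
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
b1 = [0.0, 0.1, 0.0]
h0 = [1.0, 2.0]

h1 = dense_layer(W1, b1, h0)
print(h1)
```

Stacking several such calls, each with its own weights, gives the composition of simple transformations described above.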


1.1 Why Nonlinearity Matters

Without nonlinear activation functions, a stack of linear layers collapses into a single linear transformation (biases are omitted here; with them, the stack is still a single affine map):

\[W_3 W_2 W_1 x = W x, \quad W = W_3 W_2 W_1.\]

Therefore, nonlinear activations are necessary for neural networks to approximate complex functions.
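
A small numerical check of this argument, assuming two bias-free 2x2 layers with arbitrary weights: the stacked linear layers match a single layer with the product matrix, and inserting a ReLU between them breaks the equivalence.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    """Multiply a matrix by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [-1.0, 3.0]]
x = [2.0, -3.0]

# Two stacked linear layers ...
stacked = matvec(W2, matvec(W1, x))
# ... equal one linear layer whose matrix is the product W2 W1.
collapsed = matvec(matmul(W2, W1), x)

# A ReLU between the layers makes the overall map nonlinear.
with_relu = matvec(W2, [max(0.0, v) for v in matvec(W1, x)])

print(stacked, collapsed, with_relu)
```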

Common activation functions include:

  • Sigmoid;
  • Tanh;
  • ReLU;
  • Leaky ReLU;
  • GELU;
  • SiLU / Swish.

In modern deep networks, ReLU and GELU are widely used. CNNs often use ReLU-style activations, while Transformers commonly use GELU.


1.2 Loss Functions

A loss function defines the training objective.

For classification, the most common loss is cross entropy:

\[\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log p_k,\]

where \(y_k\) is the one-hot target and \(p_k\) is the predicted probability of class \(k\).

For regression, common losses include mean squared error:

\[\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2,\]

and L1 loss:

\[\mathcal{L}_{L1} = \frac{1}{N}\sum_{i=1}^{N}|\hat{y}_i-y_i|.\]
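
The three losses above can be sketched in a few lines of plain Python. This assumes a one-hot target for cross entropy (so the sum reduces to the log-probability of the true class) and uses made-up logits and regression values for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """With a one-hot target, cross entropy is -log p[target]."""
    return -math.log(probs[target_index])


def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


probs = softmax([2.0, 1.0, 0.1])           # illustrative logits
ce = cross_entropy(probs, target_index=0)

pred, target = [1.0, 2.0, 4.0], [1.0, 2.0, 2.0]
print(ce, mse(pred, target), l1(pred, target))
```

In the regression example a single error of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is why L1 is usually considered less sensitive to outliers than MSE.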

For dense prediction tasks such as semantic segmentation or occupancy prediction, loss functions must handle structured outputs and class imbalance. Common choices include:

  • voxel-wise cross entropy;
  • focal loss;
  • Dice loss;
  • Lovasz loss;
  • class-balanced loss.

In semantic occupancy prediction, the model predicts a 3D voxel grid, so the loss is usually applied across all voxels.
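
To make the class-imbalance point concrete, here is a minimal sketch of focal loss applied per voxel. It assumes the common form with a focusing parameter gamma and no class weighting term, and the probabilities are invented for illustration.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one voxel, given the predicted probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def mean_focal_loss(p_true_per_voxel, gamma=2.0):
    """Average the per-voxel focal loss over all voxels."""
    return sum(focal_loss(p, gamma) for p in p_true_per_voxel) / len(p_true_per_voxel)


# Many easy voxels (e.g. empty space) and one hard voxel; values are illustrative.
voxels = [0.99] * 9 + [0.2]

ce_mean = mean_focal_loss(voxels, gamma=0.0)   # gamma = 0 recovers cross entropy
fl_mean = mean_focal_loss(voxels, gamma=2.0)
hard_share = focal_loss(0.2) / sum(focal_loss(p) for p in voxels)
print(ce_mean, fl_mean, hard_share)
```

With gamma set to 0 the function reduces to voxel-wise cross entropy. With gamma at 2 the well-classified voxels are strongly down-weighted, so the hard voxel dominates the total.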


1.3 Backpropagation

Backpropagation is the algorithm used to compute gradients efficiently through a neural network.

If the network is a composition of functions:

\[y = f_L \circ f_{L-1} \circ \cdots \circ f_1(x),\]

then gradients are computed using the chain rule.

For parameters \(\theta\), gradient descent updates them as:

\[\theta_{t+1}=\theta_t-\eta \nabla_\theta \mathcal{L}(\theta_t),\]

where \(\eta\) is the learning rate.
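
A worked example of the update rule, assuming a toy one-dimensional loss with its minimum at 3 and a learning rate of 0.1 (both chosen only for illustration):

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta, eta = 0.0, 0.1
history = [theta]
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * dL/dtheta
    history.append(theta)

print(history[:4], theta)
```

Each step shrinks the distance to the minimum by a factor of 0.8 here, so the iterate approaches 3 geometrically.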

Understanding backpropagation is important because it helps diagnose:

  • vanishing gradients;
  • exploding gradients;
  • unstable training;
  • dead activations;
  • incorrect loss implementation;
  • broken computation graphs.
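
One practical way to catch an incorrect loss or backward implementation is a numerical gradient check. The sketch below assumes a toy model (a single sigmoid neuron with squared error) and compares its hand-derived chain-rule gradient against central finite differences.

```python
import math


def forward(w, x):
    """Sigmoid neuron: p = sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Squared error of the toy neuron."""
    return 0.5 * (forward(w, x) - y) ** 2


def analytic_grad(w, x, y):
    """Chain rule: dL/dw_i = (p - y) * p * (1 - p) * x_i."""
    p = forward(w, x)
    dz = (p - y) * p * (1.0 - p)
    return [dz * xi for xi in x]


def numeric_grad(w, x, y, eps=1e-5):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads.append((loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps))
    return grads


w, x, y = [0.3, -0.7, 0.5], [1.0, 2.0, -1.5], 1.0   # illustrative values
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(w, x, y), numeric_grad(w, x, y)))
print(max_diff)
```

A large mismatch between the two gradients usually points to a bug in the backward pass or in the loss itself.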

1.4 Initialization

Weight initialization strongly affects training stability.

If weights are too small, signals may vanish. If weights are too large, activations and gradients may explode.

Common initialization methods include:

  • Xavier / Glorot initialization;
  • He initialization;
  • truncated normal initialization;
  • zero initialization of the last layer in a residual branch, so that each block starts as an identity mapping;
  • pretrained initialization.

For deep networks, good initialization helps maintain stable activation and gradient variance across layers.
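
As a rough sketch, the standard deviations used by the normal-distribution variants of Xavier and He initialization can be computed from the layer fan-in and fan-out. The layer sizes and the seed below are arbitrary, and `init_weights` is a hypothetical helper written for this note.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal (suited to ReLU layers): variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, seed=0):
    """Hypothetical helper: sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


W = init_weights(fan_in=512, fan_out=256, std=he_std(512))
print(xavier_std(512, 256), he_std(512), len(W), len(W[0]))
```

He initialization uses a larger variance than Xavier for the same fan-in because ReLU zeroes roughly half of the activations.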

In large perception systems, pretrained backbones are often used because they provide better initial representations and reduce training difficulty.


1.5 Normalization

Normalization layers stabilize training by controlling activation statistics.

Common normalization methods include:

  • Batch Normalization;
  • Layer Normalization;
  • Group Normalization;
  • RMSNorm.

BatchNorm is common in CNNs. LayerNorm is standard in Transformers.

For small-batch 3D perception training, BatchNorm may become unstable, so GroupNorm or LayerNorm can be more suitable.
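
The small-batch issue can be seen in a toy example. This sketch assumes training-mode statistics and omits the learned scale and shift. With a batch of only two samples, BatchNorm maps every feature to roughly plus or minus one, while LayerNorm normalizes each sample on its own.

```python
import math


def standardize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation (biased variance)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch, eps=1e-5):
    """Normalize each sample over its own features; independent of batch size."""
    return [standardize(sample, eps) for sample in batch]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch dimension."""
    columns = [standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


batch = [[1.0, 10.0, 100.0], [3.0, 10.5, 50.0]]   # 2 samples, 3 features (illustrative)
bn = batch_norm(batch)
ln = layer_norm(batch)
print(bn)
print(ln)
```

Because LayerNorm never looks across samples, its output for a given sample is the same whether the batch holds two samples or two hundred.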


1.6 Regularization

Regularization reduces overfitting and improves generalization.

Common methods include:

  • weight decay;
  • dropout;
  • stochastic depth;
  • data augmentation;
  • label smoothing;
  • mixup and cutmix;
  • early stopping;
  • random masking;
  • self-supervised pretraining.

In computer vision, data augmentation is especially important because it teaches models invariance to brightness, scale, crop, flip, and viewpoint changes.

In autonomous driving, augmentation must be used carefully because geometric consistency matters. For example, image transformations must remain consistent with camera parameters and 3D annotations.
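
A minimal sketch of what geometric consistency means for a horizontal flip. It assumes a pinhole camera with x pointing right and z pointing forward, pixel centers at integer coordinates, and invented intrinsics. Flipping the image must be matched by mirroring the 3D annotation and the principal point.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates to pixel coordinates."""
    X, Y, Z = point_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)


def hflip_pixel(u, width):
    """Column of a pixel after a horizontal image flip."""
    return width - 1 - u


def hflip_calibration(point_cam, cx, width):
    """Mirror the 3D point and the principal point to match the flipped image."""
    X, Y, Z = point_cam
    return (-X, Y, Z), width - 1 - cx


width = 1600
fx, fy, cx, cy = 1260.0, 1260.0, 810.0, 450.0   # illustrative intrinsics
point = (2.0, 0.5, 10.0)                        # illustrative 3D point in metres

u, v = project(point, fx, fy, cx, cy)
flipped_point, flipped_cx = hflip_calibration(point, cx, width)
u_flip, v_flip = project(flipped_point, fx, fy, flipped_cx, cy)

print(hflip_pixel(u, width), u_flip)
```

If only the image were flipped, the annotation would project to the original column and no longer land on the object.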


2. Andrew Ng Deep Learning Specialization

Andrew Ng’s Deep Learning Specialization provides a practical and structured introduction to deep learning. It is useful because it explains not only models, but also the engineering workflow of training and debugging neural networks.

The specialization mainly covers:

  • neural networks and deep learning;
  • improving deep neural networks;
  • structuring machine learning projects;
  • convolutional neural networks;
  • sequence models.

2.1 Neural Networks and Deep Learning

The first part introduces the basic structure of neural networks, including forward propagation, loss computation, and backward propagation.

Important ideas:

  • deep networks learn hierarchical representations;
  • each layer transforms features into a more useful space;
  • nonlinear activations enable complex decision boundaries;
  • backpropagation provides efficient gradient computation.

For me, this part is important because it provides the basic language for understanding all modern deep learning architectures.


2.2 Improving Deep Neural Networks

This part focuses on practical training techniques.

Key topics:

  • train/dev/test split;
  • bias and variance diagnosis;
  • regularization;
  • dropout;
  • normalization;
  • gradient checking;
  • mini-batch gradient descent;
  • momentum;
  • RMSProp;
  • Adam;
  • learning-rate decay;
  • hyperparameter search.

This is extremely useful for research. When a model fails, the problem may not be the idea itself. It may instead be caused by a poorly chosen learning rate, unstable normalization, bad initialization, weak augmentation, or incorrect evaluation.


2.3 Structuring Machine Learning Projects

This part emphasizes how to think like an ML engineer and researcher.

Important principles:

  • choose a single-number evaluation metric when possible;
  • establish a strong baseline;
  • analyze errors before adding complexity;
  • understand train-dev mismatch;
  • prioritize improvements based on evidence;
  • avoid changing too many variables at once.

For PhD research, this mindset is essential. A good research project requires controlled experiments, clear baselines, and convincing ablations.


2.4 Convolutional Neural Networks

The specialization introduces CNNs for vision tasks.

Important topics:

  • convolution;
  • padding;
  • stride;
  • pooling;
  • feature hierarchy;
  • classic CNN architectures;
  • object detection;
  • face recognition;
  • neural style transfer.

CNNs remain important even in the Transformer era because they provide strong local inductive bias and computational efficiency.


2.5 Sequence Models

The specialization also covers sequence models:

  • RNN;
  • GRU;
  • LSTM;
  • attention;
  • sequence-to-sequence models.

Although Transformers have replaced many recurrent models, sequence modeling remains important for temporal perception, trajectory prediction, and world models.

For autonomous driving, temporal modeling is essential because a single frame cannot fully explain motion, occlusion, and future scene evolution.


3. CS231n: Deep Learning for Computer Vision

CS231n is one of the most important courses for deep learning-based computer vision. It connects neural network foundations with visual recognition, CNNs, optimization, detection, segmentation, attention, and generative models.

For my PhD preparation, CS231n is important because it provides a strong bridge from general deep learning to computer vision research.


3.1 Image Classification

Image classification is the basic visual recognition task.

Given an image \(x\), the model predicts a class label:

\[\hat{y}=f_\theta(x).\]

Although image classification is simple compared with 3D perception, it introduces many important ideas:

  • feature extraction;
  • classifier heads;
  • softmax loss;
  • data augmentation;
  • overfitting and regularization;
  • evaluation metrics;
  • transfer learning.

Image classification backbones are often reused in detection, segmentation, and 3D perception.


3.2 Optimization in Neural Networks

CS231n provides a clear explanation of optimization for neural networks.

Important topics:

  • gradient descent;
  • stochastic gradient descent;
  • momentum;
  • Adam;
  • learning-rate schedules;
  • weight initialization;
  • normalization;
  • gradient flow;
  • loss landscape intuition.

In research, optimization details often determine whether a method works. A good idea can fail if training is unstable.


3.3 CNN Architectures

CS231n covers classic CNN architectures such as:

  • LeNet;
  • AlexNet;
  • VGG;
  • GoogLeNet / Inception;
  • ResNet;
  • DenseNet.

The most important architecture idea is the residual connection:

\[y = x + F(x).\]

Residual connections make it easier to train very deep networks by improving gradient flow.

Modern vision backbones, 3D perception networks, and Transformer blocks all use residual-style designs.


3.4 Object Detection

Object detection predicts bounding boxes and class labels.

Important paradigms include:

  • two-stage detectors such as Faster R-CNN;
  • one-stage detectors such as YOLO and SSD;
  • anchor-based detection;
  • anchor-free detection;
  • DETR-style detection with Transformers.

Detection introduces structured prediction and localization, which are closer to real perception tasks than image classification.


3.5 Segmentation

Segmentation predicts labels at the pixel level.

Tasks include:

  • semantic segmentation;
  • instance segmentation;
  • panoptic segmentation.

Important architectures include:

  • FCN;
  • U-Net;
  • DeepLab;
  • Mask R-CNN;
  • SegFormer;
  • Mask2Former.

Segmentation is closely related to occupancy prediction because both are dense prediction tasks. The difference is that occupancy prediction extends structured prediction from 2D pixels to 3D voxels.


3.6 Visual Representation Learning

CS231n also introduces representation learning, transfer learning, and self-supervised learning.

Important ideas:

  • pretrained backbones;
  • fine-tuning;
  • feature reuse;
  • representation hierarchy;
  • contrastive learning;
  • masked image modeling.

For autonomous driving, representation learning is important because labeled 3D data is expensive, while unlabeled driving videos are abundant.


4. Convolutional Neural Networks

CNNs are the foundation of deep learning-based computer vision. They remain highly relevant even though Transformers are now widely used.


4.1 Convolution Operation

A convolution computes local weighted sums over an input feature map.

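For a 2D input \(X\) and kernel \(K\), the operation that deep learning frameworks call convolution (strictly speaking cross-correlation, since the kernel is not flipped) can be written as: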

\[Y(i,j)=\sum_m\sum_n K(m,n)X(i+m,j+n).\]

The key properties of convolution are:

  • local connectivity;
  • parameter sharing;
  • translation equivariance;
  • efficient computation.

These properties make CNNs well suited for images.
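
The formula above translates directly into a naive implementation. This sketch assumes a single channel, no padding, and stride 1, and uses a tiny hand-made edge image to illustrate parameter sharing and translation equivariance.

```python
def conv2d(X, K):
    """Valid 2D convolution (no kernel flip), single channel, stride 1."""
    kh, kw = len(K), len(K[0])
    out_h = len(X) - kh + 1
    out_w = len(X[0]) - kw + 1
    return [
        [
            sum(K[m][n] * X[i + m][j + n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


edge_kernel = [[-1, 1]]                      # horizontal difference filter

image = [[0, 0, 1, 1]] * 3                   # vertical edge between columns 1 and 2
shifted = [[0, 1, 1, 1]] * 3                 # same edge moved one pixel to the left

print(conv2d(image, edge_kernel))
print(conv2d(shifted, edge_kernel))
```

The same two weights are reused at every location, and shifting the edge in the input shifts the response in the output by the same amount.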


4.2 Receptive Field

The receptive field of a neuron is the region of the input that affects it.

Deep CNNs gradually increase receptive field size, allowing higher layers to capture larger context.

For perception tasks, receptive field matters because:

  • local features capture texture and edges;
  • larger context captures objects and scene layout;
  • dense prediction requires both fine details and global context.
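
As a rough illustration of how depth and stride grow the receptive field, the sketch below applies the usual recurrence for a chain of layers. It assumes no dilation, and the layer lists are examples rather than a specific published architecture.

```python
def receptive_field(layers):
    """Receptive field of a chain of conv/pool layers given as (kernel, stride) pairs."""
    rf, jump = 1, 1                 # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


three_convs = [(3, 1), (3, 1), (3, 1)]   # three stacked 3x3 convolutions
strided_stem = [(7, 2), (3, 2)]          # a 7x7 stride-2 conv followed by 3x3 stride-2 pooling

print(receptive_field(three_convs), receptive_field(strided_stem))
```

Three stacked 3x3 convolutions cover the same 7x7 input region as a single 7x7 convolution, while strided layers make every later kernel span more input pixels.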

4.3 Padding, Stride, and Pooling

Important convolution parameters:

  • padding: controls output size and boundary handling;
  • stride: controls downsampling;
  • pooling: aggregates local features and reduces spatial resolution.

Downsampling increases receptive field and reduces computation, but it may lose spatial detail. Dense prediction models often use encoder-decoder structures to recover resolution.
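
The effect of padding and stride on resolution follows a simple formula. The sketch below assumes square inputs and kernels with no dilation, and the sizes are common illustrative choices.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


same = conv_output_size(32, kernel=3, stride=1, padding=1)     # resolution preserved
half = conv_output_size(224, kernel=7, stride=2, padding=3)    # downsampled by 2
quarter = conv_output_size(half, kernel=3, stride=2, padding=1)

print(same, half, quarter)
```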


4.4 Residual Networks

ResNet introduced residual learning:

\[y=x+F(x).\]

This helps train deep networks by allowing gradients to flow through identity paths.
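
A toy illustration of why the identity path matters, assuming a scalar signal and a deliberately weak stand-in branch that scales its input by 0.01. A real block would use convolutional or linear layers instead. The same reasoning applies to gradients, since the derivative of the residual form is one plus the derivative of the branch.

```python
def branch(x, scale=0.01):
    """Stand-in for a residual branch F(x); real blocks use conv or linear layers."""
    return scale * x


depth = 50
plain, residual = 1.0, 1.0
for _ in range(depth):
    plain = branch(plain)                     # y = F(x): signal shrinks every layer
    residual = residual + branch(residual)    # y = x + F(x): identity path preserves it

print(plain, residual)
```

After 50 layers the plain stack has driven the signal to essentially zero, while the residual stack keeps it at the same order of magnitude.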

Residual design is now everywhere:

  • CNN backbones;
  • Transformer blocks;
  • 3D perception networks;
  • temporal fusion modules;
  • multi-agent fusion modules.

4.5 Feature Pyramid Networks

Feature Pyramid Networks, or FPN, combine multi-scale features.

Low-level features have high spatial resolution but weak semantics. High-level features have strong semantics but low resolution.

FPN fuses them to support tasks such as detection and segmentation.

In autonomous driving perception, multi-scale features are important for detecting objects of different sizes and building robust BEV representations.


4.6 CNNs in Autonomous Driving

CNNs are used in many autonomous driving modules:

  • image feature extraction;
  • object detection;
  • lane detection;
  • semantic segmentation;
  • depth estimation;
  • BEV feature encoding;
  • occupancy prediction.

Even when the final model uses Transformers, CNNs are often used as image backbones or feature encoders.


5. Transformers

Transformers have become one of the dominant architectures in modern deep learning. They are especially important for vision, 3D perception, multimodal learning, and world models.


5.1 Self-Attention

The core operation of a Transformer is self-attention.

Given queries \(Q\), keys \(K\), and values \(V\), attention is:

\[\mathrm{Attention}(Q,K,V)= \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.\]

Here:

  • \(QK^T\) computes pairwise similarity;
  • softmax converts similarities into attention weights;
  • multiplying by \(V\) aggregates information.

Self-attention allows each token to interact with every other token.
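
The attention formula can be sketched without any library, assuming a single head, no masking, and tiny hand-written matrices chosen only for illustration:

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

matched = attention([[1.0, 0.0]], K, V)[0]   # query aligned with the first key
neutral = attention([[0.0, 0.0]], K, V)[0]   # query with no preference

print(matched, neutral)
```

The aligned query pulls more of the first value into its output, while the neutral query simply averages the values.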


5.2 Multi-Head Attention

Multi-head attention applies several attention operations in parallel:

\[\mathrm{MHA}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O, \quad \mathrm{head}_i=\mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V).\]

Each head can learn different relationships, such as:

  • local structure;
  • global context;
  • semantic similarity;
  • geometric relation;
  • temporal dependency.

This is useful for vision because scenes contain many different types of relationships.


5.3 Positional Encoding

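Self-attention without positional information is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way, so the model cannot tell where a token came from. Transformers therefore need explicit positional information.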

Common positional encodings include:

  • absolute positional encoding;
  • relative positional bias;
  • sinusoidal positional encoding;
  • rotary positional embedding;
  • learned 2D or 3D positional embeddings.

In 3D perception, position is critical. A token’s feature alone is not enough; the model must also know where the token is in image, BEV, voxel, or world coordinates.
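
As one concrete instance from the list above, here is a sketch of the sinusoidal encoding for a 1D position. It assumes the usual base of 10000 with interleaved sine and cosine channels. Extending it to BEV or voxel coordinates, for example by encoding each axis separately, would be a design choice not shown here.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    encoding = []
    for i in range(0, d_model, 2):            # i is the even channel index
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
print(pe0)
print(pe1)
```

Low channel indices oscillate quickly with position and high indices slowly, so nearby and distant positions can both be distinguished.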


5.4 Vision Transformer

Vision Transformer divides an image into patches and treats each patch as a token.

For an image of size \(H \times W\) split into \(P \times P\) patches, the \(N = HW/P^2\) patches are embedded into tokens:

\[X = [x_1, x_2, \ldots, x_N].\]

Then Transformer layers model interactions among tokens.

ViT shows that pure attention-based architectures can work well for vision when trained with enough data and appropriate regularization.
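
The patch-splitting step can be sketched as follows, assuming a single-channel image stored as a list of rows. A real ViT would additionally apply a learned linear projection to each flattened patch and add positional embeddings, both of which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tokens


# Toy 4x4 image whose pixel value encodes its position.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0], tokens[3])
```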


5.5 Cross-Attention

Cross-attention allows one set of tokens to query another set of tokens.

For example:

  • 3D queries can attend to image features;
  • BEV queries can attend to multi-view camera features;
  • ego-agent tokens can attend to neighboring-agent tokens;
  • future tokens can attend to memory tokens.

This makes cross-attention very important for autonomous driving perception.

In my research, cross-attention appears in:

  • lifting image features into 3D tokens;
  • fusing temporal memory;
  • fusing collaborative tokens from neighboring agents.

5.6 Efficient Attention and Token Reduction

Self-attention has quadratic complexity:

\[O(N^2),\]

where \(N\) is the number of tokens.

For high-resolution vision and 3D perception, this can be expensive.
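
Some back-of-the-envelope arithmetic shows the scale of the problem. The image and patch sizes are a common ViT setting, while the voxel grid size is only an illustrative assumption, with one token per voxel.

```python
def num_patch_tokens(height, width, patch):
    """Number of non-overlapping patch tokens in an image."""
    return (height // patch) * (width // patch)


def attention_matrix_entries(num_tokens):
    """Entries in one full N x N self-attention matrix."""
    return num_tokens ** 2


image_tokens = num_patch_tokens(224, 224, 16)
voxel_tokens = 200 * 200 * 16                 # illustrative grid, one token per voxel

print(image_tokens, attention_matrix_entries(image_tokens))
print(voxel_tokens, attention_matrix_entries(voxel_tokens))
```

Going from a few hundred image tokens to several hundred thousand voxel tokens multiplies the attention matrix by roughly seven orders of magnitude, which is why dense 3D attention is rarely computed in full.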

Common efficiency strategies include:

  • sparse attention;
  • window attention;
  • linear attention;
  • token pruning;
  • token merging;
  • low-rank attention;
  • memory tokens.

This connects directly to my research on communication-efficient collaborative perception, where token merging reduces transmitted information while preserving useful scene evidence.


6. Deep Learning for 3D Vision and Autonomous Driving

Deep learning for autonomous driving combines vision, geometry, temporal modeling, and sensor fusion.


6.1 Multi-View Perception

Autonomous vehicles often use multiple cameras. Multi-view perception requires aggregating information from different viewpoints.

Important components:

  • camera intrinsics and extrinsics;
  • feature extraction from each view;
  • view transformation;
  • BEV representation;
  • temporal fusion;
  • multi-camera calibration.

Deep learning models must combine learned features with geometric constraints.
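
The geometric side can be sketched with a pinhole model. This assumes extrinsics that map world coordinates into each camera frame, a camera frame with x right, y down, and z forward, and made-up intrinsics. The two cameras below are hypothetical front and rear views.

```python
def world_to_camera(p_world, R, t):
    """Apply extrinsics: p_cam = R p_world + t."""
    return tuple(sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3))


def camera_to_pixel(p_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    X, Y, Z = p_cam
    if Z <= 0:
        return None
    return (fx * X / Z + cx, fy * Y / Z + cy)


intrinsics = dict(fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)   # illustrative values
front_R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                   # world frame = front camera frame
rear_R = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]                  # rotated 180 degrees about y
t = (0.0, 0.0, 0.0)

point = (1.0, 0.0, 10.0)   # 10 m ahead, 1 m to the right
front_px = camera_to_pixel(world_to_camera(point, front_R, t), **intrinsics)
rear_px = camera_to_pixel(world_to_camera(point, rear_R, t), **intrinsics)
print(front_px, rear_px)
```

The same 3D point lands in the front image but is invisible to the rear camera, which is the kind of constraint a multi-view model must respect when aggregating features.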


6.2 BEV Perception

Bird’s-Eye-View representation maps visual information into a top-down coordinate system.

BEV is useful because it aligns naturally with driving tasks:

  • object detection;
  • lane understanding;
  • motion prediction;
  • planning;
  • occupancy prediction.

Common BEV methods use:

  • lift-splat-shoot style projection;
  • depth-aware feature lifting;
  • Transformer-based cross-attention;
  • temporal BEV fusion.

6.3 Semantic Occupancy Prediction

Semantic occupancy prediction reconstructs a 3D voxel grid with semantic labels.

The prediction target is:

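\[O \in \{0,1,\ldots,K\}^{X\times Y\times Z},\]

where \(X\), \(Y\), and \(Z\) are the grid dimensions and each voxel takes one of \(K+1\) labels (for example, \(K\) semantic classes plus an empty label).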

Compared with object detection, occupancy prediction provides a denser and more complete representation of the scene.

Deep learning challenges include:

  • high memory cost;
  • class imbalance;
  • occlusion;
  • 3D feature lifting;
  • voxel decoding;
  • temporal consistency.

6.4 Collaborative Perception

Collaborative perception allows multiple agents to exchange information.

Deep learning questions include:

  • how to encode messages;
  • how to select useful information;
  • how to align features across poses;
  • how to fuse received information;
  • how to reduce communication cost.

Transformers are useful here because attention naturally supports token interaction and cross-agent fusion.


6.5 Occupancy World Models

Occupancy world models extend occupancy prediction from current-frame reconstruction to future prediction.

The model must learn:

  • current 3D scene structure;
  • temporal dynamics;
  • motion patterns;
  • uncertainty over future states;
  • long-term spatial memory.

This connects deep learning with sequence modeling, representation learning, and embodied perception.


7. Training and Debugging Deep Models

Deep learning research requires careful training and debugging. Many failures come from implementation or optimization issues rather than the high-level idea.


7.1 Learning Rate

Learning rate is one of the most important hyperparameters.

If it is too large:

  • loss may diverge;
  • training becomes unstable;
  • model may fail to converge.

If it is too small:

  • training is slow;
  • model may get stuck;
  • final performance may be poor.

Common schedules:

  • step decay;
  • cosine decay;
  • warmup + cosine;
  • polynomial decay.
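
A minimal sketch of the warmup + cosine schedule from the list above, assuming linear warmup over a fixed number of steps followed by cosine decay to a minimum value. The step counts and base learning rate are illustrative.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
print(schedule[0], schedule[10], schedule[55], schedule[100])
```

Warmup keeps the first updates small while normalization statistics and adaptive optimizer states are still unreliable, and the cosine phase lowers the rate smoothly toward the end of training.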

7.2 Optimizers

Common optimizers:

  • SGD;
  • SGD with momentum;
  • RMSProp;
  • Adam;
  • AdamW.

AdamW is widely used in Transformer-based vision models. It decouples weight decay from the adaptive gradient update, so the decay strength is not rescaled by Adam's per-parameter step sizes.
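
The difference between L2 regularization inside Adam and decoupled weight decay can be seen in a single step. This sketch assumes a scalar parameter, the first update from zero moment estimates, and a zero task gradient so that only the decay acts. The hyperparameters are illustrative.

```python
import math


def adam_step(theta, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """First Adam step (t = 1) for a scalar parameter, with either style of decay."""
    if not decoupled:
        grad = grad + weight_decay * theta       # L2 penalty folded into the gradient
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)                      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += weight_decay * theta           # AdamW: decay bypasses the normalization
    return theta - lr * update


adam_l2 = adam_step(1.0, grad=0.0, weight_decay=0.01, decoupled=False)
adamw = adam_step(1.0, grad=0.0, weight_decay=0.01, decoupled=True)
print(adam_l2, adamw)
```

With L2 inside Adam, the tiny decay gradient is normalized to a full-size step, so the weight drops by about the learning rate regardless of the decay coefficient. With decoupled decay the weight shrinks by exactly the learning rate times the decay coefficient.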


7.3 Overfitting and Underfitting

Symptoms of underfitting:

  • high training loss;
  • high validation loss;
  • poor training performance.

Symptoms of overfitting:

  • low training loss;
  • high validation loss;
  • large train-validation gap.

Possible solutions for underfitting:

  • stronger model;
  • better optimization;
  • improved architecture.

Possible solutions for overfitting:

  • more data;
  • stronger regularization;
  • data augmentation;
  • early stopping.

7.4 Ablation Studies

Ablation studies are essential in research.

A good ablation should answer:

  • Which component contributes to the improvement?
  • Is the comparison fair?
  • Is the improvement due to architecture, training, or more parameters?
  • Does the method still work under different settings?

For my research, ablations are especially important for separating the effects of:

  • representation capacity;
  • temporal memory;
  • communication strategy;
  • token merging;
  • token budget;
  • fusion design.

7.5 Reproducibility

Deep learning experiments are sensitive to details.

Important reproducibility factors:

  • random seed;
  • dataset split;
  • preprocessing;
  • augmentation;
  • optimizer;
  • learning rate;
  • batch size;
  • hardware;
  • evaluation script;
  • checkpoint selection.

For PhD-level research, clear reporting of these details is part of scientific rigor.


Closing Remarks

Deep learning is not just a collection of architectures. It is a framework for learning representations, optimizing large models, and building intelligent perception systems.

For my PhD preparation, I want to understand deep learning at three levels:

  1. Mathematical level: gradients, optimization, generalization, and representation.
  2. Architectural level: CNNs, Transformers, attention, and token-based modeling.
  3. System level: training pipelines, evaluation, ablations, and deployment constraints.

This foundation will support my research in computer vision, autonomous driving perception, collaborative occupancy prediction, and occupancy world models.



