LLM Learning: From Pretraining to Decoder Inference
Large language models can feel mysterious because many different ideas are stacked together:
- data collection;
- tokenization;
- large-scale pretraining;
- Transformer decoder blocks;
- instruction tuning;
- preference alignment;
- inference-time decoding;
- retrieval, tools, and system prompts.
This note is my attempt to organize the full pipeline in one place.
The simplest mental model is:
An LLM learns to predict the next token. Everything else is built around making that next-token predictor useful, controllable, efficient, and connected to external knowledge.
1. The Whole Pipeline
A modern LLM system usually has two major phases: training and inference.
During training, we build a model:
- collect and clean text or multimodal data;
- train a tokenizer;
- pretrain a neural network to predict next tokens;
- optionally continue pretraining on domain data;
- fine-tune it to follow instructions;
- align it with human or AI preferences;
- evaluate safety, reasoning, factuality, coding, math, and domain ability.
During inference, we use the model:
- receive a user prompt;
- optionally retrieve relevant documents;
- format the prompt with system, developer, user, and tool messages;
- run the prefill stage over the prompt;
- repeatedly run the decode stage to generate new tokens;
- post-process the generated text or call tools.
The core neural architecture is usually a decoder-only Transformer. The word “decoder” here can be confusing. In classic sequence-to-sequence models, there is an encoder and a decoder. In most LLMs, there is only the autoregressive decoder stack. It reads previous tokens through causal self-attention and predicts the next token.
2. Text Becomes Tokens
Neural networks do not directly read words or characters. They read integer token IDs.
A tokenizer maps text into a sequence:
\[\text{text} \rightarrow (x_1, x_2, \ldots, x_T),\]
where each \(x_t\) is a token ID from a vocabulary \(V\).
Many LLM tokenizers use BPE, byte-level BPE, SentencePiece, or related subword methods. The tokenizer may split text like this:
Large language models are useful.
into pieces such as:
Large | language | models | are | useful | .
or into smaller subword units depending on the vocabulary.
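To make the subword idea concrete, here is a minimal sketch of the merge loop behind BPE-style tokenizer training. The corpus, its word frequencies, and the helper names most_frequent_pair and merge_pair are made up for illustration; real tokenizers add byte-level handling, pre-tokenization rules, and far larger vocabularies.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): word -> frequency. Each word starts as characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(4):  # learn four merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)       # learned merge rules, in order
print(list(words))  # how each word is now segmented
```

Each learned merge becomes a vocabulary entry, so frequent character sequences such as "est" end up as single tokens while rare words stay split into smaller pieces.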
Important concepts:
- Vocabulary size: the number of possible tokens.
- Token: a unit that may be a word, subword, character, byte, or punctuation.
- Context window: the maximum number of tokens the model can process at once.
- Embedding: a learned vector representation for each token.
After tokenization, the model converts token IDs into vectors:
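\[h_t^{(0)} = E[x_t] + p_t,\]
where \(E[x_t]\) is the token embedding and \(p_t\) represents positional information. Many modern LLMs use rotary positional embeddings, often called RoPE, instead of adding absolute position vectors. With RoPE, \(p_t\) is dropped from the sum above and position is injected inside attention by rotating the query and key vectors according to their positions.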
3. Pretraining: Learning to Predict the Next Token
Pretraining is the main stage where the model absorbs general language, knowledge patterns, code structure, reasoning traces, and world regularities from massive data.
For an autoregressive language model, the probability of a token sequence is factorized as:
\[p_\theta(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).\]
The model is trained to maximize the likelihood of the correct next token, or equivalently to minimize the cross-entropy loss:
\[\mathcal{L}_{\text{pretrain}} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).\]
This is called next-token prediction.
During training, the model can see the full sequence at once, but a causal mask prevents position \(t\) from attending to future tokens \(x_{>t}\). So the model learns:
Given the past, predict what comes next.
This is powerful because many tasks can be written as text completion:
- question answering;
- summarization;
- translation;
- code generation;
- mathematical reasoning;
- dialogue;
- tool-use planning.
Pretraining does not directly teach the model to be a polite assistant. It teaches a general distribution over text. That is why later post-training stages are needed.
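The pretraining loss from this section can be checked by hand on a tiny example. The sketch below assumes a made-up three-token sequence and made-up predicted probabilities; next_token_loss is a hypothetical helper, not part of any library.

```python
import math

def next_token_loss(step_probs, targets):
    """Sum of -log p(x_t | x_<t) over all positions.

    step_probs[t] maps candidate tokens to the probability predicted at
    step t; targets[t] is the token that actually came next.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, targets))

# Made-up predictions for the sequence "the cat sat".
step_probs = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
    {"sat": 0.5, "ran": 0.5},
]
targets = ["the", "cat", "sat"]

loss = next_token_loss(step_probs, targets)
# Chain rule: exp(-loss) is the product of the per-step probabilities.
sequence_probability = math.exp(-loss)

print(loss, sequence_probability)
```

Here the product of the per-step probabilities is 0.5 x 0.25 x 0.5 = 1/16, so the loss equals log 16. Minimizing the loss is the same as raising the probability the model assigns to the observed sequence.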
4. Decoder-Only Transformer
Most current LLMs are stacks of decoder Transformer blocks.
For input hidden states \(H^{(l)}\), one pre-norm Transformer layer can be written approximately as:
\[\tilde{H}^{(l)} = H^{(l)} + \mathrm{Attention}(\mathrm{Norm}(H^{(l)})),\]
\[H^{(l+1)} = \tilde{H}^{(l)} + \mathrm{MLP}(\mathrm{Norm}(\tilde{H}^{(l)})).\]
Each layer has two major parts:
- causal self-attention;
- feed-forward network, often called MLP or FFN.
4.1 Self-Attention
For hidden states \(H\), the model projects them into queries, keys, and values:
\[Q = HW_Q, \quad K = HW_K, \quad V = HW_V.\]
Attention is:
\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V.\]
Here \(d_k\) is the key dimension and \(M\) is the causal mask: it is \(0\) for positions at or before the query and \(-\infty\) for future positions, so future positions receive zero attention weight.
Intuitively:
- query asks: what am I looking for?
- key says: what information do I contain?
- value says: what content should be passed forward?
Self-attention lets each token look back at relevant previous tokens. For example, when generating a closing parenthesis, the model can attend to the earlier opening parenthesis.
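The attention formula above can be written out for a single head in plain Python. This is a minimal sketch with made-up 2-dimensional queries, keys, and values; instead of adding a mask matrix, it simply skips future positions, which has the same effect as adding minus infinity before the softmax.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Position t only scores positions 0..t (the causal mask).
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # Pad with zeros so every row has one weight per position.
        weights.append(w + [0.0] * (len(Q) - t - 1))
        outputs.append(
            [sum(w[s] * V[s][j] for s in range(t + 1)) for j in range(len(V[0]))]
        )
    return outputs, weights

# Made-up inputs: 3 positions, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(x, 3) for x in row])
```

The printed weight matrix is lower triangular: the first position can only attend to itself, while the last position spreads its weight over all three.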
4.2 Multi-Head Attention
Instead of using one attention operation, Transformers use multiple heads:
\[\mathrm{head}_i = \mathrm{Attention}(Q_i,K_i,V_i).\]
The heads are concatenated and projected:
\[\mathrm{MHA}(H) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_m)W_O.\]
Different heads can specialize in different patterns:
- syntax;
- long-range dependency;
- code indentation;
- entity tracking;
- retrieval-like copying;
- reasoning steps.
4.3 Feed-Forward Network
After attention, each token independently passes through an MLP:
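\[\mathrm{MLP}(h) = W_2 \sigma(W_1 h),\]
where \(\sigma\) is an elementwise activation such as GELU. Gated variants such as SwiGLU add a second input projection whose output multiplies the activated branch elementwise before \(W_2\) is applied.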
Attention mixes information across tokens. The MLP transforms information inside each token representation.
4.4 The Final Decoder Output
After the last Transformer layer, the hidden state at position \(t\) is projected to vocabulary logits:
\[z_t = W_{\text{out}} h_t.\]
Then softmax gives a probability distribution over the next token:
\[p_\theta(x_{t+1}=i \mid x_{\le t}) = \frac{\exp(z_{t,i})}{\sum_{j \in V}\exp(z_{t,j})}.\]
The model chooses or samples a token from this distribution. That token is appended to the context, and the process repeats.
This is the final “decoder” behavior at inference time:
prompt tokens -> Transformer decoder -> logits -> next token -> append -> repeat
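The loop above can be sketched in a few lines. Everything model-related here is a stand-in: toy_model is a hypothetical lookup table whose logits depend only on the last token, whereas a real Transformer conditions on the whole context. The point is only the control flow of logits, softmax, pick, append, repeat.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]

# Hypothetical stand-in for the Transformer: one row of logits per last token.
TOY_LOGITS = {
    "the":   [0.0, 0.1, 2.0, 0.3],  # "cat" scores highest
    "cat":   [0.2, 0.0, 0.1, 1.5],  # "sat" scores highest
    "sat":   [3.0, 0.5, 0.0, 0.1],  # "<eos>" scores highest
    "<eos>": [1.0, 0.0, 0.0, 0.0],
}

def toy_model(tokens):
    return TOY_LOGITS[tokens[-1]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))
        # Greedy choice: index of the most probable token.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        next_token = VOCAB[next_id]
        if next_token == "<eos>":  # stop condition
            break
        tokens.append(next_token)  # append and repeat
    return tokens

print(generate(["the"]))
```

Swapping the greedy choice for a sampling step changes the behavior without touching the model, which is why decoding settings matter so much in practice.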
5. Post-Training: From Text Model to Assistant
Pretraining creates a strong base model, but a base model only completes text. To make it useful as an assistant, post-training is needed.
5.1 Supervised Fine-Tuning
Supervised fine-tuning, often called SFT, trains the model on instruction-response examples:
User: Explain self-attention.
Assistant: Self-attention is ...
The loss is still next-token cross-entropy, but now the data format teaches assistant behavior:
\[\mathcal{L}_{\text{SFT}} = - \sum_t \log p_\theta(y_t \mid x, y_{<t}),\]
where \(x\) is the instruction and \(y\) is the target response.
SFT teaches:
- answer format;
- helpfulness;
- following instructions;
- dialogue behavior;
- tool-call syntax if tool examples are included.
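Because the SFT loss sums over response tokens only, a common way to implement it is to compute per-token log-probabilities for the whole sequence and mask out the instruction part. The sketch below assumes made-up probabilities and a hypothetical sft_loss helper; some training setups also include the prompt tokens in the loss, so treat the masking as one convention rather than the only one.

```python
import math

def sft_loss(token_log_probs, is_response):
    """Sum -log p over positions that belong to the assistant response."""
    return -sum(lp for lp, keep in zip(token_log_probs, is_response) if keep)

# Made-up sequence: instruction tokens followed by response tokens.
tokens = ["User:", "Explain", "attention", "Assistant:", "It", "mixes", "tokens"]
# Made-up probability the model assigned to each token given its prefix.
probs = [0.1, 0.2, 0.1, 0.5, 0.5, 0.25, 0.5]
log_probs = [math.log(p) for p in probs]
# Only the last three tokens are part of the target response y.
is_response = [False, False, False, False, True, True, True]

loss = sft_loss(log_probs, is_response)
print(loss)
```

The instruction tokens still influence the loss through conditioning, since every response token is predicted given the full prefix, but the model is not penalized for how well it predicts the instruction itself.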
5.2 Preference Alignment
Preference alignment trains the model to prefer better answers over worse answers.
One common setup is:
prompt x
chosen answer y+
rejected answer y-
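Methods include RLHF and RLAIF, which are often optimized with PPO-style algorithms, and direct preference optimization (DPO). A simplified DPO-style objective for a single preference pair is:
\[\mathcal{L}_{\text{DPO}} = -\log \sigma\left( \beta \left[ \log \frac{\pi_\theta(y^+|x)}{\pi_{\text{ref}}(y^+|x)} - \log \frac{\pi_\theta(y^-|x)}{\pi_{\text{ref}}(y^-|x)} \right]\right).\]
Here \(\pi_\theta\) is the model being trained, \(\pi_{\text{ref}}\) is a frozen reference model (typically the SFT model), \(\beta\) controls how strongly the policy is kept close to the reference, and \(\sigma\) is the logistic sigmoid.
The goal is not just to make the model more knowledgeable. The goal is to make outputs more useful, harmless, honest, and aligned with user intent.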
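The DPO-style objective is easy to evaluate once the four sequence log-probabilities are known. The sketch below assumes made-up log-probabilities and a hypothetical dpo_loss helper; in practice each value is the sum of token log-probabilities of a full response under the trained policy or the frozen reference.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO-style loss from sequence log-probabilities."""
    # Log-ratio of policy to reference for each answer.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)

# Policy identical to the reference: margin is 0, loss is log 2.
loss_at_start = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Policy raises the chosen answer and lowers the rejected one: loss drops.
loss_improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)

# Policy moves the wrong way: loss rises above log 2.
loss_worse = dpo_loss(-6.0, -5.0, -5.0, -6.0)

print(loss_at_start, loss_improved, loss_worse)
```

The loss only depends on how the policy's preference between the two answers differs from the reference's preference, not on the absolute log-probabilities.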
6. Inference: Prefill and Decode
Inference is where many practical LLM concepts appear.
Suppose the user prompt has \(n\) tokens:
\[x_1, x_2, \ldots, x_n.\]
The model must process the prompt and then generate tokens:
\[y_1, y_2, \ldots, y_m.\]
6.1 Prefill
Prefill is the first forward pass over the entire prompt.
During prefill, the model:
- embeds all prompt tokens;
- runs all Transformer layers;
- computes keys and values for every prompt position;
- stores them in the KV cache;
- produces logits for the first generated token.
Prefill is usually compute-heavy because the prompt may be long. However, all prompt positions can be processed in parallel on the GPU.
The latency from request arrival to the first generated token is often called TTFT, or time to first token. Long prompts and retrieval-augmented prompts increase prefill cost.
6.2 KV Cache
In attention, every new token needs to attend to previous keys and values.
Without a cache, we would recompute \(K\) and \(V\) for all previous tokens at every step. That would be wasteful.
The KV cache stores:
\[K_{1:t}, V_{1:t}\]
for each layer and each attention head.
At decode step \(t+1\), the model only computes \(Q_{t+1}, K_{t+1}, V_{t+1}\) for the new token, appends the new key and value to the cache, and then attends to the cached \(K_{1:t+1}, V_{1:t+1}\).
This makes generation much faster, but it also consumes memory. Serving long-context models is often limited by KV cache memory.
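A small sketch can confirm that caching changes the cost but not the result. It assumes a single layer, a single head, made-up 2x2 projection matrices, and fixed input hidden states; KVCache is a hypothetical helper class, not a library API.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(h, W):
    """Row vector times matrix."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]

def attend(q, keys, values):
    scale = math.sqrt(len(q))
    w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

# Made-up projection weights and hidden states (3 tokens, dimension 2).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 1.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With cache: each step projects only the new token's K and V.
cache = KVCache()
cached_outputs = []
for h in hidden:
    cache.append(project(h, W_K), project(h, W_V))
    cached_outputs.append(attend(project(h, W_Q), cache.keys, cache.values))

# Without cache: every step re-projects the whole prefix.
recomputed = []
for t, h in enumerate(hidden):
    ks = [project(x, W_K) for x in hidden[: t + 1]]
    vs = [project(x, W_V) for x in hidden[: t + 1]]
    recomputed.append(attend(project(h, W_Q), ks, vs))

print(cached_outputs == recomputed)
```

Both paths produce the same outputs, but the cached path performs one key and one value projection per step, while the uncached path performs t of each at step t.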
6.3 Decode
After prefill, the model enters the decode loop:
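- sample or select one token from the current logits;
- append it to the context;
- run a one-token forward pass for the new token, which adds its keys and values to the KV cache and produces the next logits;
- repeat until a stop condition is met, such as an end-of-sequence token or a length limit.
The decode stage is sequential because token \(y_t\) depends on every earlier token, including \(y_{t-1}\), so it cannot be computed until those tokens exist. This is why generation speed is usually reported in tokens per second.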
6.4 Sampling
The logits can be converted to probabilities with temperature:
\[p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}.\]
Here \(\tau\) is temperature.
- lower temperature: more deterministic;
- higher temperature: more diverse;
- greedy decoding: choose the highest-probability token;
- top-k sampling: sample only from the top \(k\) tokens;
- nucleus sampling: sample from the smallest set whose cumulative probability exceeds \(p\).
Decoding is not a small detail. The same model can behave differently under different sampling settings.
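The sampling options above can be sketched with the standard library alone. The logits are made up, and the helper names are hypothetical. The nucleus filter here stops once the cumulative probability reaches the threshold; implementations differ slightly on that boundary.

```python
import math
import random

def softmax_with_temperature(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix that covers mass p
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for a 4-token vocabulary
probs = softmax_with_temperature(logits, tau=1.0)

greedy = max(range(len(probs)), key=probs.__getitem__)
rng = random.Random(0)  # fixed seed so runs are repeatable
print("greedy:", greedy)
print("top-k sample:", sample(top_k_filter(probs, 2), rng))
print("top-p sample:", sample(top_p_filter(probs, 0.9), rng))
print("tau=0.5:", [round(x, 3) for x in softmax_with_temperature(logits, 0.5)])
print("tau=2.0:", [round(x, 3) for x in softmax_with_temperature(logits, 2.0)])
```

Lowering the temperature concentrates probability on the top token, while the top-k and nucleus filters remove the low-probability tail before sampling.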
7. RAG: Retrieval-Augmented Generation
RAG connects an LLM with external documents.
The basic pipeline is:
- split documents into chunks;
- embed each chunk into a vector;
- store vectors in a vector database or index;
- embed the user query;
- retrieve similar chunks;
- insert those chunks into the prompt;
- ask the LLM to answer using the retrieved context.
Similarity is often computed by dot product or cosine similarity. Cosine similarity is the dot product of the length-normalized vectors:
\[\mathrm{sim}(q,d) = \frac{q^T d}{\|q\|\|d\|}.\]
With retrieved context \(C\), generation becomes:
\[p(y \mid x, C).\]
RAG is useful because the base model has frozen parameters. It may not know private, recent, or domain-specific information. RAG lets the system provide relevant knowledge at inference time without retraining the model.
But RAG is not magic. Common failure modes include:
- bad chunking;
- retrieving irrelevant documents;
- missing the truly relevant document;
- context too long;
- contradiction between retrieved documents;
- model ignoring evidence;
- poor citation formatting.
Good RAG systems are partly information retrieval systems and partly prompt-engineered LLM systems.
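The retrieval steps of the pipeline can be sketched end to end with a toy embedding. The embed function below is a hypothetical stand-in that uses bag-of-words counts; a real system would call a learned embedding model and a vector index. The chunks and the prompt template are also made up.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())

def cosine(q, d):
    dot = sum(q[w] * d[w] for w in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Made-up document chunks, embedded once and stored as the "index".
chunks = [
    "the kv cache stores keys and values for each layer",
    "lora learns small low-rank matrices",
    "nucleus sampling keeps the smallest set of likely tokens",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, k=1):
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_prompt("what does the kv cache store"))
```

Note that the toy embedding misses the match between "store" and "stores", a small reminder that retrieval quality depends heavily on the embedding model and the chunking.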
8. Important Concepts
8.1 Context Window
The context window is the maximum number of tokens the model can read in one request.
A longer context window helps with long documents, multi-turn dialogue, and retrieval. But it increases prefill cost and KV cache memory.
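The KV cache cost of a longer context can be estimated with simple arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The configuration below is hypothetical and does not describe any particular model; kv_cache_bytes is an illustrative helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size; the factor 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration: 32 layers, 32 KV heads of size 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)
at_32k = kv_cache_bytes(32, 32, 128, seq_len=32768)

print(per_token / 2**20, "MiB per token")
print(at_4k / 2**30, "GiB at 4096 tokens")
print(at_32k / 2**30, "GiB at 32768 tokens")
```

Under these assumptions the cache grows by 0.5 MiB per token, so it scales linearly with context length and with batch size. Using fewer KV heads than query heads shrinks it proportionally.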
8.2 Prompt
A prompt is the input text or message sequence. In chat models, the prompt may contain:
- system message;
- developer message;
- user message;
- assistant history;
- tool results;
- retrieved documents.
Prompt design matters because the model conditions on all previous tokens.
8.3 System Prompt
The system prompt gives high-level behavior instructions. It can define role, style, safety constraints, output format, and tool rules.
It does not change model weights. It only changes the context.
8.4 Hallucination
Hallucination means the model generates plausible but false or unsupported content.
From the model’s perspective, it is predicting likely text. It does not automatically know which statements are grounded in reality. RAG, tool use, verification, and careful prompting can reduce hallucination, but cannot eliminate it completely.
8.5 Embedding
An embedding is a vector representation of a token, sentence, image, or document.
Embedding models are often used for semantic search:
query -> vector -> nearest documents
8.6 LoRA
LoRA, or low-rank adaptation, fine-tunes a model by learning small low-rank matrices instead of updating all parameters.
A weight update can be written as:
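\[W' = W + \Delta W, \quad \Delta W = BA,\]
where \(W \in \mathbb{R}^{d \times k}\) is the frozen pretrained weight, \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are the trainable matrices, and the rank \(r\) is much smaller than \(d\) and \(k\).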
LoRA is memory-efficient and widely used for domain adaptation.
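A tiny example shows both the update itself and why it is cheap. The matrices, their sizes, and the helper names are made up for illustration; the larger parameter count uses a hypothetical 4096 x 4096 weight with rank 8.

```python
def matmul(X, Y):
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

# Frozen pretrained weight W (4 x 4 identity, made up).
d, k, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]

# Trainable low-rank factors: B is d x r, A is r x k.
B = [[1.0], [0.0], [0.0], [0.0]]
A = [[0.0, 0.5, 0.0, 0.0]]

delta_W = matmul(B, A)  # rank-1 update, shape d x k
W_prime = [[W[i][j] + delta_W[i][j] for j in range(k)] for i in range(d)]

for row in W_prime:
    print(row)

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(full, lora, full // lora)
```

For the hypothetical 4096 x 4096 layer, the rank-8 factors hold 65,536 trainable values instead of 16,777,216, a 256-fold reduction for that layer.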
8.7 Quantization
Quantization stores model weights in lower precision, such as INT8 or INT4, instead of FP16 or BF16.
It reduces memory and can speed up inference, but may hurt quality if too aggressive.
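The trade-off can be seen in a minimal sketch of symmetric INT8 quantization with one scale per tensor. This is only one simple scheme, chosen here for illustration; production methods typically use per-channel or per-group scales and more careful calibration. The weights are made up.

```python
def quantize_int8(weights):
    """Symmetric quantization: map the largest magnitude to 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up FP weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print([round(r, 4) for r in restored])
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each value is stored as one byte plus a shared scale. The small weight 0.003 rounds to zero, which illustrates how quality can degrade when the quantization step is large relative to the values being stored.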
8.8 Distillation
Distillation trains a smaller student model to imitate a larger teacher model.
The goal is to keep much of the capability while reducing cost and latency.
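One common way to express "imitate" is to match the teacher's softened output distribution with a KL-divergence loss. The sketch below assumes that formulation, with made-up logits and a hypothetical distillation_loss helper; other distillation setups instead train on text generated by the teacher.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, tau=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [3.0, 1.0, 0.2]       # made-up teacher logits
student_far = [0.5, 2.0, 1.0]   # student disagrees with the teacher
student_near = [2.8, 1.1, 0.3]  # student almost matches

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
loss_same = distillation_loss(teacher, teacher)

print(loss_far, loss_near, loss_same)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. The temperature exposes how the teacher ranks the non-top tokens, which carries more signal than the top token alone.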
8.9 MoE
Mixture-of-Experts models use multiple expert networks but activate only a subset for each token.
This increases total model capacity while keeping per-token compute lower than activating every parameter.
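The routing idea can be sketched as follows. The experts here are trivial scaling functions and the router logits are given directly. In a real model both are learned, and the router logits are computed from the token's hidden state. Renormalizing the gate weights over the selected experts is one common choice, assumed here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, experts, top_k=2):
    """Run only the top_k experts and mix their outputs by gate weight."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    gates = softmax([router_logits[i] for i in ranked])
    out = [0.0] * len(x)
    for gate, i in zip(gates, ranked):
        y = experts[i](x)  # only selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, ranked, gates

def make_scaler(factor):
    return lambda x: [factor * xi for xi in x]

# Four hypothetical experts that scale their input by 1, 2, 3, and 4.
experts = [make_scaler(f) for f in (1.0, 2.0, 3.0, 4.0)]

x = [1.0, 2.0]                         # one token's hidden state (made up)
router_logits = [0.1, 2.0, 0.3, 1.0]   # made-up router scores for this token

out, ranked, gates = moe_layer(x, router_logits, experts, top_k=2)
print(ranked, [round(g, 3) for g in gates], [round(o, 3) for o in out])
```

With four experts and top_k=2, half of the expert parameters are skipped for this token, yet all four remain available to other tokens that route differently.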
8.10 Tool Use
Tool use means the model outputs a structured action such as:
{"tool": "search", "query": "latest occupancy prediction paper"}
The external system runs the tool and returns results to the model. This is how LLMs can interact with calculators, browsers, databases, code interpreters, and APIs.
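The system side of this loop can be sketched as a small dispatcher. The search function is a hypothetical stub, and the JSON shape simply follows the example above; real systems define their own tool schemas and validate arguments more strictly.

```python
import json

def search(query):
    """Hypothetical tool: a real system would call a search backend here."""
    return [f"stub result for: {query}"]

# Registry of tools the model is allowed to call.
TOOLS = {"search": search}

def run_tool_call(model_output):
    """Parse the model's structured action, run the tool, return a tool message."""
    call = json.loads(model_output)
    name = call.pop("tool")
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call)  # remaining fields are the arguments
    return json.dumps({"tool": name, "result": result})

model_output = '{"tool": "search", "query": "latest occupancy prediction paper"}'
tool_message = run_tool_call(model_output)
print(tool_message)

# An unknown tool is reported back instead of raising.
print(run_tool_call('{"tool": "calculator", "expression": "1+1"}'))
```

The returned string would be appended to the context as a tool result, and the model would then continue generating with that result in view.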
9. How I Connect This to My Research
LLMs are not only text systems. Many ideas are directly related to 3D perception and embodied intelligence:
- tokenization is related to compact scene representations;
- attention is related to feature interaction;
- KV cache is related to memory;
- RAG is related to external knowledge retrieval;
- tool use is related to agent action;
- prefill/decode efficiency is related to real-time deployment;
- long-context reasoning is related to temporal scene understanding.
For my own research, the most interesting bridge is token-based representation.
In LLMs, tokens are the units of language modeling. In perception, tokens can become units of space, time, memory, semantics, communication, and planning.
This makes LLM learning useful even outside NLP. It gives a language for thinking about representation, memory, compression, retrieval, and sequential decision making.
10. Summary
The core of an LLM is simple but deep:
\[\text{previous tokens} \rightarrow \text{next-token distribution}.\]
Around this core, modern systems add:
- massive pretraining;
- decoder-only Transformer architecture;
- instruction tuning;
- preference alignment;
- efficient prefill and decode;
- KV cache;
- sampling strategies;
- retrieval-augmented generation;
- tools and agents.
Understanding LLMs means understanding both the mathematical model and the engineering system around it.
The model predicts tokens. The system turns token prediction into an assistant.