Semantic occupancy prediction describes the current 3D state of the world.

But autonomous agents need more than the current state. They need to anticipate what may happen next.

This leads to a natural extension:

Can occupancy prediction become an occupancy world model?

This note explores how current-state occupancy prediction can be extended toward future 4D occupancy forecasting.

1. Current Occupancy vs Future Occupancy

Current occupancy prediction estimates the semantic state of space at time \(t\):

\[\hat{O}_t = f_\theta(x_{1:t}),\]

where \(x_{1:t}\) may include images, LiDAR, BEV features, or multi-agent observations.

Future occupancy prediction estimates the scene states over the next \(H\) time steps, where \(H\) is the prediction horizon:

\[\hat{O}_{t+1:t+H} = f_\theta(x_{1:t}).\]

The difference is important.

Current occupancy asks:

What is the scene now?

Future occupancy asks:

How will the scene evolve?

For autonomous driving and embodied agents, the second question is closer to decision making.
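To make the difference in output shape concrete, here is a minimal sketch of the simplest possible forecaster: a persistence baseline that copies the current grid to every future step. The function name, the grid size, and the label convention (0 for free, 1 for occupied) are assumptions for illustration, not part of any specific model.

```python
import numpy as np

def persistence_forecast(current_occ: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the current (X, Y, Z) label grid for `horizon` future steps."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    # Add a leading time axis and repeat it: output shape is (H, X, Y, Z).
    return np.repeat(current_occ[np.newaxis, ...], horizon, axis=0)

# Toy current occupancy on a 4 x 4 x 2 grid (assumed labels: 0 = free, 1 = occupied).
current = np.zeros((4, 4, 2), dtype=np.int64)
current[1, 2, 0] = 1

future = persistence_forecast(current, horizon=3)
print(current.shape, future.shape)  # (4, 4, 2) (3, 4, 4, 2)
```

A learned forecaster has the same input and output shapes. The only difference is that it has to move the occupied voxels instead of freezing them.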

2. Why World Models Matter

A world model is an internal model that predicts how the environment changes over time.

In reinforcement learning, a world model may predict future latent states and rewards. In autonomous driving perception, a world model can predict future spatial states.

Occupancy is a useful format for world modeling because it represents physical space directly.

An occupancy world model can support:

future collision risk estimation;
trajectory planning;
uncertainty-aware decision making;
simulation of possible scene evolution;
reasoning about occluded dynamic objects.

It gives the agent a structured way to imagine future scenes.
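As one example of the uses listed above, future collision risk can be read directly off a predicted occupancy grid. The sketch below assumes a 2D grid of per-step occupancy probabilities and a planned ego trajectory given as one cell per future step. It also assumes the per-step probabilities are independent, which is a simplification. All names are hypothetical.

```python
from typing import Sequence, Tuple

import numpy as np

def collision_risk(occ_prob: np.ndarray, ego_cells: Sequence[Tuple[int, int]]) -> float:
    """Probability that at least one planned ego cell is occupied.

    occ_prob:  (H, X, Y) predicted occupancy probability for each future step.
    ego_cells: one (x, y) grid cell per future step along the planned path.
    """
    if len(ego_cells) != occ_prob.shape[0]:
        raise ValueError("need exactly one ego cell per future step")
    p_free = 1.0
    for t, (x, y) in enumerate(ego_cells):
        # Independence assumption: multiply the per-step "cell is free" probabilities.
        p_free *= 1.0 - float(occ_prob[t, x, y])
    return 1.0 - p_free

# Toy forecast over two future steps on a 3 x 3 grid.
occ_prob = np.zeros((2, 3, 3))
occ_prob[0, 1, 1] = 0.1
occ_prob[1, 1, 2] = 0.5

risk = collision_risk(occ_prob, [(1, 1), (1, 2)])
print(round(risk, 3))  # 0.55
```

A planner could score several candidate trajectories this way and prefer the one with the lowest risk.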

3. 4D Occupancy

If 3D occupancy describes space, 4D occupancy describes space over time.

It can be written as:

\[O \in \{0,1,\ldots,K\}^{T \times X \times Y \times Z}.\]

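Here, time becomes another dimension: \(T\) is the number of time steps, \(X \times Y \times Z\) is the voxel grid, and each voxel holds one of \(K+1\) labels.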

This representation is powerful because it captures both:

where objects are;
how they move.

For autonomous driving, 4D occupancy can describe vehicles, pedestrians, cyclists, static obstacles, free space, and future unknown regions in a unified format.
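A small NumPy sketch may help make the tensor concrete. It builds a toy 4D label grid and reads off both "where objects are" and "how they move". The grid size and the label convention (0 for free space, 1 for a moving object, 2 for a static obstacle) are assumptions chosen for illustration.

```python
import numpy as np

T, X, Y, Z = 3, 4, 4, 2  # toy number of time steps and grid size
K = 2                    # assumed labels: 0 = free, 1 = moving object, 2 = static obstacle

occ = np.zeros((T, X, Y, Z), dtype=np.int64)

# A moving object advances one cell along x at every time step.
for t in range(T):
    occ[t, t, 1, 0] = 1

# A static obstacle keeps the same voxel at all time steps.
occ[:, 3, 3, 0] = 2

# "Where objects are": indices of non-free voxels at the first time step.
occupied_now = np.argwhere(occ[0] > 0)

# "How they move": voxels whose label changes between consecutive steps.
changed = occ[1:] != occ[:-1]
print(changed.sum(axis=(1, 2, 3)))  # [2 2]
```

Each transition changes two voxels: the one the object leaves and the one it enters. The static obstacle never appears in the change mask.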

4. Motion-Aware Memory

Future occupancy prediction requires memory.

A model cannot predict future scenes from a single frame as reliably as it can from temporal context. It must understand motion, velocity, acceleration, and interaction.

A motion-aware memory should store:

recent occupancy states;
BEV features;
object motion cues;
ego-motion;
temporal uncertainty;
interactions between agents.

Token memory is one possible design. Instead of storing dense feature maps at every time step, the model can store compact tokens that summarize important spatial regions and motion patterns.

This connects naturally to my work on token-based collaborative perception.
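The token memory idea can be sketched as a fixed-length buffer that keeps only a few tokens per time step. The version below is an illustrative assumption, not a description of any published design: it scores each BEV cell by its feature norm, keeps the top-k cells as tokens, and drops the oldest step when the buffer is full. A real system would likely learn the importance score.

```python
from collections import deque

import numpy as np

def select_tokens(bev: np.ndarray, k: int):
    """Keep the k cells of a (X, Y, C) BEV map with the largest feature norm."""
    scores = np.linalg.norm(bev, axis=-1)                 # (X, Y) importance proxy
    flat_idx = np.argsort(scores.ravel())[::-1][:k]       # indices of the top-k cells
    coords = np.stack(np.unravel_index(flat_idx, scores.shape), axis=1)  # (k, 2)
    feats = bev[coords[:, 0], coords[:, 1]]               # (k, C)
    return coords, feats

class TokenMemory:
    """Hypothetical ring buffer of sparse tokens over recent time steps."""

    def __init__(self, max_steps: int, tokens_per_step: int):
        self.k = tokens_per_step
        self.buffer = deque(maxlen=max_steps)  # oldest step is evicted automatically

    def write(self, bev: np.ndarray, timestamp: int) -> None:
        coords, feats = select_tokens(bev, self.k)
        self.buffer.append((timestamp, coords, feats))

    def read(self):
        """Return token positions as (N, 3) rows [t, x, y] and features as (N, C)."""
        if not self.buffer:
            raise ValueError("memory is empty")
        pos = np.concatenate(
            [np.column_stack([np.full(len(c), t), c]) for t, c, _ in self.buffer]
        )
        feats = np.concatenate([f for _, _, f in self.buffer])
        return pos, feats

# Write five random 8 x 8 x 4 BEV maps into a memory that holds three steps.
rng = np.random.default_rng(0)
memory = TokenMemory(max_steps=3, tokens_per_step=5)
for t in range(5):
    memory.write(rng.normal(size=(8, 8, 4)), timestamp=t)

pos, feats = memory.read()
print(pos.shape, feats.shape)  # (15, 3) (15, 4)
```

With these toy sizes the memory stores 15 tokens instead of three dense 8 x 8 maps (192 cells), and each token keeps its time stamp so that a downstream model can reason about motion.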

5. Collaboration for World Models

Single-agent world models are limited by the agent’s own observations.

Collaborative world models can use information from multiple agents to build a more complete and predictive scene representation.

This is especially useful for:

occluded dynamic objects;
intersections;
long-range regions;
crowded traffic scenes;
areas outside the ego field of view.

For example, if another vehicle observes a pedestrian hidden from the ego vehicle, the collaborative world model may predict future occupancy more accurately.

The challenge is that future prediction is even more sensitive to communication quality. A small error in received information can propagate through the forecast and distort the predicted occupancy at every later step.
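Both the benefit and the sensitivity can be illustrated with a toy fusion rule. The sketch below assumes the agents' occupancy probability grids are already aligned in a shared frame and fuses them in log-odds space, treating each agent as independent evidence. This is a standard occupancy-grid technique used here as a stand-in, not a specific collaborative perception method. The one-cell shift simulates a small pose error in the received map.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_occupancy(probs, prior: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Log-odds fusion of aligned per-agent occupancy probability grids."""
    prior_logit = logit(prior)
    fused_logit = prior_logit
    for p in probs:
        # Each agent contributes its evidence relative to the shared prior.
        fused_logit = fused_logit + logit(np.clip(p, eps, 1.0 - eps)) - prior_logit
    return sigmoid(fused_logit)

# Ego view: the pedestrian cell (2, 2) is occluded, so every cell stays at 0.5.
ego = np.full((5, 5), 0.5)

# Neighbor view: it sees the pedestrian at (2, 2) with high confidence.
neighbor = np.full((5, 5), 0.5)
neighbor[2, 2] = 0.9

fused = fuse_occupancy([ego, neighbor])

# Simulated pose error: the received map is shifted by one cell along x.
neighbor_shifted = np.roll(neighbor, 1, axis=0)
fused_shifted = fuse_occupancy([ego, neighbor_shifted])

print(round(float(fused[2, 2]), 3), round(float(fused_shifted[2, 2]), 3))  # 0.9 0.5
```

With correct alignment the hidden pedestrian appears in the fused grid. With a one-cell offset the evidence lands in the wrong voxel, and any forecast that starts from this grid inherits the error.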

6. Uncertainty in Future Prediction

Future prediction is uncertain.

There may be multiple possible futures:

a pedestrian may stop or continue walking;
a vehicle may turn or go straight;
an occluded object may appear or remain hidden;
traffic participants may react to each other.

A deterministic occupancy prediction may be insufficient.

Useful future occupancy models should represent uncertainty, either explicitly or implicitly.

Possible directions include:

probabilistic occupancy distributions;
multi-modal future predictions;
uncertainty maps;
confidence-calibrated semantic occupancy;
scenario-conditioned prediction.

For planning, uncertainty is not a detail. It is part of the decision problem.
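One simple way to turn multi-modal predictions into an uncertainty map is to average several sampled futures and compute a per-voxel entropy. The sketch below assumes the model can produce a set of binary future grids; the two samples encode the "pedestrian stops or continues" case from the list above. The function name and grid size are hypothetical.

```python
import numpy as np

def occupancy_entropy(samples: np.ndarray, eps: float = 1e-12):
    """Per-voxel occupancy probability and binary entropy (in bits).

    samples: (M, ...) array of M sampled binary occupancy grids.
    """
    p = samples.mean(axis=0)               # empirical occupancy probability
    q = np.clip(p, eps, 1.0 - eps)         # avoid log(0)
    entropy = -(q * np.log2(q) + (1.0 - q) * np.log2(1.0 - q))
    return p, entropy

# Two sampled futures for one future step on a 3 x 3 grid.
stops = np.zeros((3, 3))
stops[1, 1] = 1        # the pedestrian stays in cell (1, 1)
continues = np.zeros((3, 3))
continues[1, 2] = 1    # the pedestrian moves on to cell (1, 2)

prob, entropy = occupancy_entropy(np.stack([stops, continues]))
print(prob[1, 1], round(float(entropy[1, 1]), 3))  # 0.5 1.0
```

The two candidate cells carry one bit of uncertainty each, while cells that are free in every sample stay near zero. A planner can treat high-entropy regions more cautiously than a single deterministic grid would suggest.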

7. Evaluation Questions

Evaluating occupancy world models is difficult.

For current occupancy, metrics such as IoU and mIoU are common. For future occupancy, we also need to consider:

prediction horizon;
temporal consistency;
dynamic object quality;
calibration;
safety-critical regions;
performance under occlusion;
usefulness for planning.

A model may have good average mIoU but still fail in rare safety-critical cases.

This makes evaluation an important research problem, not only an implementation detail.
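As a starting point for the prediction-horizon question, mIoU can be reported per future step instead of as a single average. The sketch below is a minimal version under assumed conventions: integer label grids of shape (H, X, Y, Z), and classes that are absent from both prediction and ground truth at a step are skipped. The toy example scores a "nothing moves" prediction against a ground truth in which one object moves.

```python
import numpy as np

def miou_per_step(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """Mean IoU over classes, computed separately for each future step."""
    scores = []
    for t in range(pred.shape[0]):
        ious = []
        for c in range(num_classes):
            p, g = pred[t] == c, target[t] == c
            union = np.logical_or(p, g).sum()
            if union == 0:
                continue  # class absent in both grids at this step
            ious.append(np.logical_and(p, g).sum() / union)
        scores.append(np.mean(ious) if ious else np.nan)
    return np.array(scores)

# Ground truth: an object (label 1) moves from x=1 to x=2 over two steps.
target = np.zeros((2, 4, 1, 1), dtype=np.int64)
target[0, 1] = 1
target[1, 2] = 1

# Prediction: the object is assumed to stay at x=1.
pred = np.zeros((2, 4, 1, 1), dtype=np.int64)
pred[:, 1] = 1

scores = miou_per_step(pred, target, num_classes=2)
print(scores.tolist())  # [1.0, 0.25]
```

The average over both steps is 0.625, which hides the fact that the moving object is missed entirely at the second step. Per-step and per-class breakdowns make this kind of failure visible.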

8. My Research View

I see occupancy world models as a bridge between perception and embodied intelligence.

They connect:

semantic occupancy prediction;
temporal modeling;
motion forecasting;
collaborative perception;
uncertainty reasoning;
planning-oriented representation learning.

For my PhD direction, this is an exciting path because it extends 3D perception from static reconstruction to predictive scene understanding.

The long-term question is:

How can autonomous agents build compact, communicative, and predictive representations of the 3D world?

From Occupancy Prediction to Occupancy World Models