Collaborative 4D Occupancy World Models with Motion-Aware Token Memory

Collaborative occupancy-based world modeling with motion-aware token memory and future scene prediction

Overview

This project explores how collaborative perception can support occupancy-based world modeling for autonomous driving and embodied perception.

Instead of only reconstructing the current 3D scene, this project studies how the occupancy state of a dynamic environment evolves over time. The system aims to combine multi-agent observations, motion-aware token memory, and future occupancy prediction into a unified 4D world modeling framework.

The key idea is:

collaborative perception should not stop at seeing the current world; it should help agents anticipate how the 3D world will change.

The work is currently a manuscript in preparation for CVPR 2027.


Overall Framework

Figure: Collaborative 4D occupancy world model framework.

The framework extends collaborative occupancy prediction from current-frame 3D scene reconstruction to future-oriented 4D occupancy world modeling. Multi-view observations are first converted into tokenized 3D/BEV representations, which are then maintained in a motion-aware token memory. Collaborative fusion integrates complementary information from neighboring agents, while the world-modeling module predicts future occupancy states over multiple time steps.


Motivation

Most semantic occupancy prediction methods focus on reconstructing the current 3D scene. However, autonomous agents need more than static scene understanding. They need to reason about how surrounding objects, free space, and occluded regions may evolve in the near future.

This matters because planning and control depend on future states. A vehicle needs to know not only where a pedestrian is now, but also where the pedestrian and nearby vehicles may be in the next few seconds. Similarly, free space that is currently visible may become occupied, and occluded regions may contain dynamic objects.

Collaborative perception provides a natural foundation for this goal. Different agents observe the environment from complementary viewpoints, which can help recover occluded or uncertain regions and provide richer temporal evidence for future prediction.

This project asks:

  • How can collaborative 3D occupancy prediction be extended toward 4D occupancy world modeling?
  • How can agents maintain compact memory tokens that capture both scene structure and motion dynamics?
  • How can multi-agent observations improve future occupancy prediction in occluded and uncertain regions?
  • How can token-based representations support scalable 4D scene understanding?
  • How can predictive occupancy representations become useful for downstream planning?

Research Goal

The goal is to formulate collaborative 3D occupancy prediction as a step toward 4D occupancy world modeling.

Given historical and current observations from multiple agents, the system aims to predict not only the current semantic occupancy grid, but also the future evolution of occupancy states over time:

[ \hat{O}_{t:t+K} = f_\theta(X_{1:t}^{1:N}) ]

where:

  • (X_{1:t}^{1:N}) denotes historical observations from (N) agents;
  • (\hat{O}_{t:t+K}) denotes current and future occupancy states;
  • (K) is the prediction horizon.

This direction connects four problems:

  • Perception: reconstructing the current 3D semantic scene;
  • Temporal reasoning: modeling how the scene evolves across time;
  • World modeling: predicting future occupancy states as a structured representation of the environment;
  • Planning support: providing predictive spatial representations for autonomous decision-making.

Key Ideas

1. Collaborative 4D Occupancy Modeling

The project extends collaborative occupancy prediction from static 3D scene understanding to future-oriented 4D scene modeling. Multi-agent observations are used to provide complementary spatial and temporal evidence for dynamic scenes.

Instead of treating collaboration as a single-frame feature fusion problem, the framework treats collaboration as a way to improve the agent’s belief about both current and future scene states.

2. Motion-Aware Token Memory

A motion-aware token memory is designed to capture temporal dynamics in compact occupancy representations. Instead of storing dense historical features, the memory maintains structured tokens that represent both scene content and motion-related changes over time.

The memory is expected to support:

  • ego-motion compensation;
  • historical token alignment;
  • dynamic-object motion cues;
  • temporal uncertainty tracking;
  • compact long-range scene context.
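As a minimal sketch of the ego-motion compensation and historical token alignment items above, the following assumes each memory token carries a BEV position expressed in the ego frame at the time it was stored, and that ego poses are planar (x, y, yaw) in a shared world frame. The names `Pose2D`, `Token`, and `align_tokens` are hypothetical and chosen for illustration; they are not the project's API.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose2D:
    """Planar ego pose in a shared world frame (assumption: SE(2) suffices)."""
    x: float
    y: float
    yaw: float  # radians


@dataclass(frozen=True)
class Token:
    """Hypothetical memory token: a BEV position plus an opaque feature."""
    x: float
    y: float
    feature: tuple


def to_world(pose, x, y):
    """Map a point from the ego frame of `pose` into the world frame."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    return pose.x + c * x - s * y, pose.y + s * x + c * y


def to_ego(pose, wx, wy):
    """Map a world-frame point into the ego frame of `pose`."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    dx, dy = wx - pose.x, wy - pose.y
    return c * dx + s * dy, -s * dx + c * dy


def align_tokens(tokens, past_pose, current_pose):
    """Re-express tokens stored in a past ego frame in the current ego frame."""
    aligned = []
    for tok in tokens:
        wx, wy = to_world(past_pose, tok.x, tok.y)
        ex, ey = to_ego(current_pose, wx, wy)
        aligned.append(replace(tok, x=ex, y=ey))
    return aligned


# The ego vehicle moved 2 m forward, so a static point that was
# 5 m ahead should now be 3 m ahead.
past = Pose2D(0.0, 0.0, 0.0)
now = Pose2D(2.0, 0.0, 0.0)
aligned = align_tokens([Token(5.0, 0.0, (1.0,))], past, now)
```

This only compensates for ego motion. Motion cues for dynamic objects would need an additional per-token velocity or flow term, which is left out here.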

3. Future Occupancy Forecasting

The project studies future occupancy forecasting as a bridge between perception, temporal reasoning, and world models. The model aims to predict how occupied, free, and semantic regions evolve in future frames.

For each future step (t+k), the model predicts a structured occupancy field:

[ \hat{O}_{t+k} \in \mathbb{R}^{X \times Y \times Z \times C} ]

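where (C) is the number of semantic occupancy classes and (X \times Y \times Z) is the voxel grid resolution. These grid dimensions are unrelated to the observations (X_{1:t}^{1:N}) defined earlier.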
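One common way to read such a field, stated here as an assumption rather than the project's definition, is to treat the (C) channels at each voxel as unnormalized class scores. Writing (v) for a voxel index, the per-voxel class distribution and the predicted semantic label would then be:

```latex
% Assumption: the C channels of \hat{O}_{t+k} are unnormalized class scores (logits).
p_{t+k}(c \mid v) = \frac{\exp \hat{O}_{t+k}[v, c]}{\sum_{c'=1}^{C} \exp \hat{O}_{t+k}[v, c']},
\qquad
\hat{y}_{t+k}(v) = \arg\max_{c} \; p_{t+k}(c \mid v)
```

Under this reading, free space is simply one of the (C) classes, and the distribution (p_{t+k}(\cdot \mid v)) can double as a per-voxel uncertainty signal.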

4. Token-Based 4D Scene Representation

Token-based representations are explored for scalable 4D scene understanding. Tokens provide a compact carrier for multi-agent observations, temporal memory, and future prediction, making them suitable for long-horizon and communication-aware world modeling.

Tokens are also compatible with selective communication: agents can share only the memory or scene tokens that are expected to improve future prediction.
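Selective communication can be illustrated with a simple greedy rule: rank tokens by an assumed utility score per byte and send them until a bandwidth budget is exhausted. The `SceneToken` record, its `utility` field, and the greedy policy are all hypothetical choices for this sketch. How utility would actually be estimated is an open design question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneToken:
    """Hypothetical token record; `utility` is an assumed per-token score."""
    token_id: int
    size_bytes: int  # assumed to be strictly positive
    utility: float   # e.g. expected gain in future prediction (assumption)


def select_tokens(tokens, budget_bytes):
    """Greedily pick tokens by utility per byte until the budget is spent."""
    ranked = sorted(tokens, key=lambda t: t.utility / t.size_bytes, reverse=True)
    chosen, used = [], 0
    for tok in ranked:
        # Skip tokens that do not fit; smaller ones later in the ranking may.
        if used + tok.size_bytes <= budget_bytes:
            chosen.append(tok)
            used += tok.size_bytes
    return chosen, used


tokens = [
    SceneToken(0, size_bytes=256, utility=0.9),
    SceneToken(1, size_bytes=256, utility=0.1),
    SceneToken(2, size_bytes=512, utility=0.8),
    SceneToken(3, size_bytes=128, utility=0.5),
]
chosen, used = select_tokens(tokens, budget_bytes=512)
```

Greedy selection by utility per byte is a standard knapsack heuristic and is not guaranteed to be optimal. It is used here only because it is easy to reason about.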

5. Uncertainty and Occlusion Reasoning

Future prediction is especially important in regions that are uncertain, occluded, or dynamically changing. The framework therefore considers uncertainty-aware fusion and forecasting, so that the world model can focus on regions where collaboration and memory provide the greatest value.
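One simple form of uncertainty-aware fusion, sketched below under stated assumptions, weights each agent's per-voxel class distribution by how confident it is. Confidence is defined here as the gap between the maximum possible entropy and the agent's own entropy, which is an illustrative choice and not the project's method. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_voxel(agent_probs):
    """Fuse per-agent class distributions for a single voxel.

    Each agent's weight is max-entropy minus its own entropy (assumption),
    so a uniform, uninformative prediction contributes almost nothing.
    """
    num_classes = len(agent_probs[0])
    max_entropy = math.log(num_classes)
    weights = [max(0.0, max_entropy - entropy(p)) for p in agent_probs]
    total = sum(weights)
    if total <= 1e-12:
        # Every agent is maximally uncertain: fall back to a plain average.
        weights = [1.0] * len(agent_probs)
        total = float(len(agent_probs))
    fused = [
        sum(w * p[c] for w, p in zip(weights, agent_probs)) / total
        for c in range(num_classes)
    ]
    return fused, entropy(fused)


# The ego agent is occluded (uniform belief); a neighbor sees class 1.
ego = [1 / 3, 1 / 3, 1 / 3]
neighbor = [0.05, 0.90, 0.05]
fused, fused_entropy = fuse_voxel([ego, neighbor])
```

In this toy case the fused belief follows the neighbor almost exactly, which matches the intuition that collaboration matters most where the ego view is uninformative. The returned entropy could be used to flag voxels that remain uncertain after fusion.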


System Concept

The planned system follows a collaborative 4D occupancy modeling pipeline:

multi-agent observations -> tokenized 3D scene representation -> motion-aware token memory -> collaborative temporal fusion -> future occupancy forecasting -> 4D occupancy world model

In this pipeline, each agent contributes complementary observations. Motion-aware token memory maintains compact temporal context, while the forecasting module predicts the future evolution of the occupancy field.

The pipeline can be interpreted in three stages:

  1. Encode: convert multi-view observations into compact 3D/BEV tokens.
  2. Remember and collaborate: align historical tokens and fuse complementary information from neighboring agents.
  3. Forecast: decode current and future occupancy fields over a prediction horizon.
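The three stages can be sketched as a small skeleton in which each stage is a pluggable function. Everything below is hypothetical scaffolding: `WorldModelPipeline`, the flat-list stand-ins for tokens and occupancy fields, and the toy `mean_fuse` and `persistence_forecast` functions are placeholders for learned modules, and token alignment across frames is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Tokens = List[float]      # placeholder for a set of 3D/BEV tokens
Occupancy = List[float]   # placeholder for one occupancy field


@dataclass
class WorldModelPipeline:
    encode: Callable[[object], Tokens]
    fuse: Callable[[Sequence[Tokens]], Tokens]
    forecast: Callable[[Tokens, int], List[Occupancy]]
    memory_len: int = 4
    memory: List[Tokens] = field(default_factory=list)

    def step(self, ego_obs, neighbor_tokens, horizon):
        # 1. Encode: observations -> compact tokens.
        ego_tokens = self.encode(ego_obs)
        # 2. Remember and collaborate: keep a bounded history, then fuse
        #    it with tokens received from neighboring agents.
        self.memory.append(ego_tokens)
        self.memory = self.memory[-self.memory_len:]
        fused = self.fuse([*self.memory, *neighbor_tokens])
        # 3. Forecast: one field for the current step plus `horizon` future steps.
        return self.forecast(fused, horizon)


def mean_fuse(token_sets):
    """Toy fusion: element-wise mean over all token sets."""
    return [sum(vals) / len(vals) for vals in zip(*token_sets)]


def persistence_forecast(tokens, horizon):
    """Toy forecaster: repeat the fused state for every step (persistence)."""
    return [list(tokens) for _ in range(horizon + 1)]


pipeline = WorldModelPipeline(
    encode=list, fuse=mean_fuse, forecast=persistence_forecast, memory_len=2
)
for obs in ([0.0, 1.0], [1.0, 1.0], [1.0, 0.0]):
    fields = pipeline.step(obs, neighbor_tokens=[[1.0, 1.0]], horizon=3)
```

A persistence forecaster is a natural baseline for the forecasting stage, since any learned model should outperform simply repeating the current state.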

Expected Contributions

  • Formulate collaborative 3D occupancy prediction as a step toward 4D occupancy world modeling.
  • Design motion-aware token memory to capture temporal dynamics in compact occupancy representations.
  • Use multi-agent observations to improve future prediction in occluded, uncertain, and dynamic regions.
  • Study future occupancy forecasting as a bridge between perception, temporal reasoning, and world models.
  • Explore token-based representations for scalable 4D scene understanding.
  • Investigate the accuracy-efficiency trade-off of predictive occupancy modeling under communication constraints.

Evaluation Plan

The evaluation is planned to cover:

  • current-frame semantic occupancy quality;
  • future occupancy forecasting accuracy;
  • performance in occluded and dynamic regions;
  • benefit of collaborative observations compared with ego-only prediction;
  • effect of motion-aware token memory;
  • communication and memory efficiency.

Possible metrics include IoU, mIoU, future mIoU over prediction horizons, dynamic-object occupancy quality, and communication cost when collaboration is bandwidth-limited.
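To make the forecasting metric concrete, the sketch below computes mIoU on flattened voxel labels and then reports it separately for each step of the prediction horizon. The function names are hypothetical, and the convention of skipping classes absent from both prediction and ground truth is an assumption. Benchmarks often accumulate intersections and unions over a whole dataset instead.

```python
def miou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target (flat label lists)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


def future_miou(pred_seq, target_seq, num_classes):
    """mIoU at each step k = 0..K of the prediction horizon."""
    return [miou(p, t, num_classes) for p, t in zip(pred_seq, target_seq)]


# Toy 4-voxel scene with classes 0 = free and 1 = vehicle.
# The prediction is exact at k = 0 but fails to move the vehicle at k = 1.
target_seq = [[0, 0, 1, 1], [0, 1, 1, 0]]
pred_seq = [[0, 0, 1, 1], [0, 0, 1, 1]]
scores = future_miou(pred_seq, target_seq, num_classes=2)
```

Reporting the score per step, rather than one average over the horizon, makes it visible how quickly accuracy degrades as (k) grows.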


Research Significance

This project aims to move beyond current-frame occupancy prediction toward predictive 3D scene understanding. By modeling future occupancy evolution, the system can provide a richer representation for autonomous agents that need to reason about dynamic environments.

The long-term goal is to support efficient and reliable perception systems that can not only understand the current scene, but also anticipate how the surrounding 3D world may change over time.

This direction connects my interests in collaborative perception, semantic occupancy prediction, token memory, and world models.


Status

Manuscript in preparation for CVPR 2027.