
RAVEN

Long-Horizon Reasoning and Navigation with a Visuo-Spatial-Temporal Memory

Yixun Hu*1, Zhicheng Zheng*1, Lihan Zha1, Chunwei Xing2, Rajdeep Singh2, Omar Hossain2, Antonio Loquercio2, Dhruv Shah1
*Equal contribution
1Princeton University, 2University of Pennsylvania

Figure: RAVEN system overview

RAVEN leverages visual embeddings as a long-term memory for robotic question answering and navigation, grounding retrieved observations in space and time.

Abstract

Long-term robot deployment requires a compact, scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval.

In this paper, we propose RAVEN (RetrievAl via Visual Embeddings for Navigation), an agentic system for long-horizon robotic question answering and navigation with visual memory. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries with navigation goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text translation and enables accurate semantic, spatial, and temporal retrieval at scale.

We evaluate RAVEN on the established NaVQA and FindingDory benchmarks, and on our own RAVEN-QA benchmark, which comprises web-sourced, simulated, and real-world videos with diverse question annotations. Across these evaluations, RAVEN consistently outperforms the caption-based baseline. Finally, we demonstrate the practical utility of RAVEN on a Unitree Go1 robot navigating real-world indoor environments.

Multimodal vector memory

Stores compact visual representations directly instead of relying on lossy frame captions.

Spatial and temporal grounding

Indexes every memory with spatial pose and world-clock timestamp for precise retrieval.

Agentic tool use

Lets a VLM agent combine semantic, temporal, spatial, and image-based retrieval tools.

Method: Visuo-Spatial-Temporal Memory

RAVEN operates in two phases: memory building during exploration, and memory querying for grounded answers and navigation goals.

Figure: RAVEN memory building and querying pipeline

During exploration, the robot encodes RGB frames directly into compact latent representations using a pretrained multimodal encoder (e.g., CLIP, SigLIP, QQMM-v2). These embeddings are indexed alongside their corresponding robot pose and world-clock timestamp in a vector database. This triplet memory structure (visuo-spatial-temporal memory) allows for flexible retrieval queries based on semantic similarity, spatial proximity, or temporal windows.
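
The triplet structure described above can be sketched as a small in-memory store. This is only an illustration under stated assumptions, not RAVEN's implementation: the names `MemoryEntry`, `TripletMemory`, and `l2_normalize` are hypothetical, the pose is assumed to be planar `(x, y, yaw)`, the timestamp is assumed to be in seconds, and a plain Python list stands in for the vector database. The embedding is taken as given, since the encoder itself is outside the scope of the sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MemoryEntry:
    """One memory triplet: visual embedding, robot pose, world-clock time."""
    embedding: Tuple[float, ...]       # unit-normalized visual embedding
    pose: Tuple[float, float, float]   # assumed planar pose (x, y, yaw)
    timestamp: float                   # assumed seconds on the world clock


def l2_normalize(vector: Sequence[float]) -> Tuple[float, ...]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(v / norm for v in vector)


class TripletMemory:
    """Hypothetical stand-in for the vector database: an append-only list."""

    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def add(self, embedding, pose, timestamp) -> MemoryEntry:
        # Store the embedding together with where and when it was observed.
        entry = MemoryEntry(l2_normalize(embedding), tuple(pose), float(timestamp))
        self.entries.append(entry)
        return entry


# Example: one frame embedding observed at pose (1, 2, 0).
memory = TripletMemory()
first = memory.add([3.0, 4.0], (1.0, 2.0, 0.0), 1700000000.0)
```

Keeping pose and time next to each embedding is what lets a single entry be matched by any of the three query types, rather than maintaining separate indexes that must be joined afterwards.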

Figure: RGB frames → multimodal encoder (e.g., CLIP) → triplet memory (visual embedding, pose, time) → vector DB → retrieval queries (semantic similarity, spatial proximity, temporal window)

Worked example · Real-world rollout

“Adam is just heading to his daily sports training. He went out with only his backpack. Now he comes back and looks for his clothes. Question: Where are they?”

Retrieval Tools

RAVEN exposes four primitive memory operations to the querying agent.

Text-based retriever: Queries memory using text embeddings from an agent-generated query.
Time-based retriever: Returns consecutive memories starting from a queried timestamp.
Position-based retriever: Finds memory entries spatially closest to a queried location.
Image-based retriever: Retrieves memories from image queries.
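
The four retrievers listed above can be approximated over a toy memory as follows. This is a minimal sketch, not RAVEN's API: the function names, the `Entry` type, the 2D position, and the default `k` are all hypothetical. It also assumes that text and image queries are first encoded into the same embedding space as the stored frames, so both reduce to one cosine-similarity search; a linear scan stands in for the vector database.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    embedding: Tuple[float, ...]    # visual embedding of the frame
    position: Tuple[float, float]   # assumed (x, y) position in the map
    timestamp: float                # assumed seconds on the world clock


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_by_embedding(entries: List[Entry], query: Sequence[float], k: int = 3) -> List[Entry]:
    """Text- or image-based retrieval: top-k entries by cosine similarity
    to a query that has already been embedded (assumed shared space)."""
    return sorted(entries, key=lambda e: cosine(e.embedding, query), reverse=True)[:k]


def retrieve_by_time(entries: List[Entry], start_time: float, k: int = 3) -> List[Entry]:
    """Time-based retrieval: up to k consecutive entries from start_time onward."""
    later = [e for e in entries if e.timestamp >= start_time]
    return sorted(later, key=lambda e: e.timestamp)[:k]


def retrieve_by_position(entries: List[Entry], position: Sequence[float], k: int = 3) -> List[Entry]:
    """Position-based retrieval: the k entries closest to a queried location."""
    return sorted(entries, key=lambda e: math.dist(e.position, position))[:k]


# Toy memory with three frames observed at different places and times.
toy_memory = [
    Entry((1.0, 0.0), (0.0, 0.0), 0.0),
    Entry((0.0, 1.0), (5.0, 0.0), 10.0),
    Entry((0.7, 0.7), (5.0, 5.0), 20.0),
]

best_match = retrieve_by_embedding(toy_memory, (1.0, 0.1), k=1)
after_five = retrieve_by_time(toy_memory, 5.0, k=2)
nearest = retrieve_by_position(toy_memory, (4.0, 1.0), k=1)
```

Because every function returns full entries, the result of one tool can seed another: for example, the timestamp of a semantic match could be passed to `retrieve_by_time` to inspect what the robot saw immediately afterwards.
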
Benchmark Results
RAVEN-QA, FindingDory, and NaVQA

We evaluate RAVEN on three benchmarks.

Methods compared: Embedder Only, ReMEmbR, VLM Only, and RAVEN (ours).

Real-World Deployment Results
Unitree Go1 Indoor Navigation

We deploy RAVEN on a Unitree Go1 quadruped robot across four real-world indoor environments. The qualitative results compare RAVEN with ReMEmbR and VLM-only baselines on dominant objects, secondary objects, reasoning-based queries, and information recall.

Demo Videos
Exploration and Retrieval Rollouts

Sample RAVEN robot rollouts across exploration and query-driven retrieval.

IRoM - Exploration
IRoM - Retrieval
Bowen Hall - Exploration
Bowen Hall - Retrieval
BibTeX
Reference
@misc{hu2026ravenlonghorizonreasoning,
      title={RAVEN: Long-Horizon Reasoning & Navigation with a Visuo-Spatio-Temporal Memory},
      author={Yixun Hu and Zhicheng Zheng and Lihan Zha and Chunwei Xing and Rajdeep Singh and Omar Hossain and Antonio Loquercio and Dhruv Shah},
      year={2026},
      eprint={2606.25206},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.25206},
}