RAVEN

Long-Horizon Reasoning and Navigation with a Visuo-Spatial-Temporal Memory

Yixun Hu*¹, Zhicheng Zheng*¹, Lihan Zha¹, Chunwei Xing², Rajdeep Singh², Omar Hossain², Antonio Loquercio², Dhruv Shah¹

*Equal contribution. ¹Princeton University, ²University of Pennsylvania


RAVEN leverages visual embeddings as a long-term memory for robotic question answering and navigation, grounding retrieved observations in space and time.

Abstract

Long-term robot deployment requires a compact, scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval.

In this paper, we propose RAVEN (RetrievAl via Visual Embeddings for Navigation), an agentic system for long-horizon robotic question answering and navigation with visual memory. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries with navigation goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text translation and enables accurate semantic, spatial, and temporal retrieval at scale.

We evaluate RAVEN on the established NaVQA and FindingDory benchmarks, and on our own RAVEN-QA benchmark, which comprises web-sourced, simulated, and real-world videos with diverse question annotations. Across these evaluations, RAVEN consistently outperforms the caption-based baseline. Finally, we demonstrate the practical utility of RAVEN on a Unitree Go1 robot navigating real-world indoor environments.

Multimodal vector memory

Stores compact visual representations directly instead of relying on lossy frame captions.

Spatial and temporal grounding

Indexes every memory with spatial pose and world-clock timestamp for precise retrieval.

Agentic tool use

Lets a VLM agent combine semantic, temporal, spatial, and image-based retrieval tools.

Method: Visuo-Spatial-Temporal Memory

RAVEN operates in two phases: memory building during exploration, and memory querying for grounded answers and navigation goals.

Figure: RAVEN memory building and querying pipeline.

During exploration, the robot encodes RGB frames directly into compact latent representations using a pretrained multimodal encoder (e.g., CLIP, SigLIP, QQMM-v2). These embeddings are indexed alongside their corresponding robot pose and world-clock timestamp in a vector database. This triplet memory structure (visuo-spatial-temporal memory) allows for flexible retrieval queries based on semantic similarity, spatial proximity, or temporal windows.
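The triplet structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `MemoryEntry` and `VisuoSpatialTemporalMemory` are hypothetical, the pose is assumed to be a planar `(x, y, yaw)` tuple, a plain in-memory list stands in for the vector database, and the embedding is assumed to come from whichever pretrained encoder is in use.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryEntry:
    """One memory item: what was seen, where, and when."""
    embedding: List[float]             # unit-norm visual embedding (encoder output, assumed)
    pose: Tuple[float, float, float]   # (x, y, yaw) in the map frame (assumed layout)
    timestamp: float                   # world-clock time in seconds


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


@dataclass
class VisuoSpatialTemporalMemory:
    """In-memory stand-in for a vector database of (embedding, pose, time) triplets."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, embedding: List[float], pose: Tuple[float, float, float],
            timestamp: float) -> None:
        # Store the normalized embedding together with its spatial and temporal index.
        self.entries.append(MemoryEntry(normalize(embedding), pose, timestamp))

    def __len__(self) -> int:
        return len(self.entries)


# Example: index one frame observed at (1.0, 2.0) facing yaw 0.0 at t = 100 s.
memory = VisuoSpatialTemporalMemory()
memory.add([3.0, 4.0], (1.0, 2.0, 0.0), 100.0)
print(len(memory), memory.entries[0].pose, memory.entries[0].timestamp)
```

Keeping pose and timestamp as plain fields next to the embedding is what lets a single store serve semantic, spatial, and temporal queries without a separate index per modality.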

Worked example · Real-world rollout

“Adam is just heading to his daily sports training. He went out with only his backpack. Now he comes back and looks for his clothes. Question: Where are they?”

Retrieval Tools

RAVEN exposes four primitive memory operations to the querying agent.

Text-based retriever: Queries memory using text embeddings from an agent-generated query.

Time-based retriever: Returns consecutive memories starting from a queried timestamp.

Position-based retriever: Finds memory entries spatially closest to a queried location.

Image-based retriever: Retrieves memories from image queries.
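The four retrieval operations can be illustrated with a brute-force sketch over a small list of memories. Everything here is an assumption made for illustration: the `Entry` type, the function names, the 2D position, and the use of one shared embedding space so that a text query and an image query both reduce to the same nearest-neighbor search. A real deployment would delegate these searches to the vector database rather than sorting a Python list.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class Entry(NamedTuple):
    embedding: Tuple[float, ...]   # visual embedding of the stored frame
    position: Tuple[float, float]  # (x, y) in the map frame (assumed 2D)
    timestamp: float               # world-clock time in seconds


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_by_embedding(memory: List[Entry], query: Sequence[float], k: int = 3) -> List[Entry]:
    """Text- or image-based retrieval: the query is assumed to be already embedded."""
    return sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)[:k]


def retrieve_by_time(memory: List[Entry], start: float, k: int = 3) -> List[Entry]:
    """Return up to k consecutive memories starting from the queried timestamp."""
    later = [e for e in memory if e.timestamp >= start]
    return sorted(later, key=lambda e: e.timestamp)[:k]


def retrieve_by_position(memory: List[Entry], position: Tuple[float, float], k: int = 3) -> List[Entry]:
    """Return the k memories recorded closest to the queried location."""
    return sorted(memory, key=lambda e: math.dist(e.position, position))[:k]


# Toy memory with three observations.
memory = [
    Entry((1.0, 0.0), (0.0, 0.0), 10.0),
    Entry((0.0, 1.0), (5.0, 0.0), 20.0),
    Entry((0.6, 0.8), (1.0, 1.0), 30.0),
]

print(retrieve_by_embedding(memory, (1.0, 0.0), k=1))
print(retrieve_by_time(memory, 15.0, k=2))
print(retrieve_by_position(memory, (4.0, 0.0), k=1))
```

Because each retriever returns full entries, the result of one query (for example, the timestamp of a text match) can seed the next (for example, a time-based lookup of what happened afterwards), which is the kind of chaining an agent would perform.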

Benchmark Results

RAVEN-QA, FindingDory, and NaVQA

We evaluate RAVEN on three benchmarks.

Methods compared: Embedder Only, ReMEmbR, VLM Only, Ours (RAVEN).

Real-World Deployment Results

Unitree Go1 Indoor Navigation

We deploy RAVEN on a Unitree Go1 quadruped robot across four real-world indoor environments. The qualitative results compare RAVEN with ReMEmbR and VLM-only baselines on dominant objects, secondary objects, reasoning-based queries, and information recall.

Demo Videos

Exploration and Retrieval Rollouts

Sample RAVEN robot rollouts across exploration and query-driven retrieval.

IRoM - Exploration

IRoM - Retrieval

Bowen Hall - Exploration

Bowen Hall - Retrieval

BibTeX

Reference

@misc{hu2026ravenlonghorizonreasoning,
      title={RAVEN: Long-Horizon Reasoning \& Navigation with a Visuo-Spatio-Temporal Memory},
      author={Yixun Hu and Zhicheng Zheng and Lihan Zha and Chunwei Xing and Rajdeep Singh and Omar Hossain and Antonio Loquercio and Dhruv Shah},
      year={2026},
      eprint={2606.25206},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.25206},
}