RAVEN
Long-Horizon Reasoning and Navigation with a Visuo-Spatial-Temporal Memory
RAVEN leverages visual embeddings as a long-term memory for robotic question answering and navigation, grounding retrieved observations in space and time.
Long-term robot deployment requires a compact, scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval.
In this paper, we propose RAVEN (RetrievAl via Visual Embeddings for Navigation), an agentic system for long-horizon robotic question answering and navigation with visual memory. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries with navigation goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text translation and enables accurate semantic, spatial, and temporal retrieval at scale.
We evaluate RAVEN on the established NaVQA and FindingDory benchmarks, and on our RAVEN-QA benchmark, which comprises web-sourced, simulated, and real-world videos with diverse question annotations. Across these evaluations, RAVEN consistently outperforms the caption-based baseline. Finally, we demonstrate the practical utility of RAVEN on a Unitree Go1 robot navigating real-world indoor environments.
Stores compact visual representations directly instead of relying on lossy frame captions.
Indexes every memory with spatial pose and world-clock timestamp for precise retrieval.
Lets a VLM agent combine semantic, temporal, spatial, and image-based retrieval tools.
RAVEN operates in two phases: memory building during exploration, and memory querying for grounded answers and navigation goals.
During exploration, the robot encodes RGB frames directly into compact latent representations using a pretrained multimodal encoder (e.g., CLIP, SigLIP, QQMM-v2). These embeddings are indexed alongside their corresponding robot pose and world-clock timestamp in a vector database. This triplet memory structure (visuo-spatial-temporal memory) allows for flexible retrieval queries based on semantic similarity, spatial proximity, or temporal windows.
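The triplet memory described above can be illustrated with a minimal sketch. It assumes a plain in-memory list in place of a real vector database, toy 2-D vectors in place of encoder outputs, and an (x, y, yaw) pose layout; the names `MemoryRecord`, `VisuoSpatialTemporalMemory`, `add`, and `query` are hypothetical and not taken from RAVEN's implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryRecord:
    embedding: List[float]             # visual embedding of one RGB frame
    pose: Tuple[float, float, float]   # (x, y, yaw) in the map frame (assumed layout)
    timestamp: float                   # world-clock time in seconds


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class VisuoSpatialTemporalMemory:
    records: List[MemoryRecord] = field(default_factory=list)

    def add(self, embedding, pose, timestamp) -> None:
        """Memory building: index one frame embedding with its pose and time."""
        self.records.append(MemoryRecord(list(embedding), tuple(pose), float(timestamp)))

    def query(
        self,
        query_embedding: List[float],
        k: int = 3,
        near: Optional[Tuple[float, float]] = None,
        radius: Optional[float] = None,
        time_window: Optional[Tuple[float, float]] = None,
    ) -> List[MemoryRecord]:
        """Memory querying: optional spatial/temporal filters, then semantic ranking."""
        candidates = self.records
        if near is not None and radius is not None:
            candidates = [r for r in candidates if math.dist(r.pose[:2], near) <= radius]
        if time_window is not None:
            t0, t1 = time_window
            candidates = [r for r in candidates if t0 <= r.timestamp <= t1]
        ranked = sorted(
            candidates, key=lambda r: cosine(r.embedding, query_embedding), reverse=True
        )
        return ranked[:k]


# Toy usage: three frames observed at different places and times.
memory = VisuoSpatialTemporalMemory()
memory.add([1.0, 0.0], pose=(0.0, 0.0, 0.0), timestamp=10.0)
memory.add([0.0, 1.0], pose=(5.0, 0.0, 0.0), timestamp=20.0)
memory.add([0.9, 0.1], pose=(5.0, 1.0, 0.0), timestamp=30.0)

# Best semantic match restricted to observations made between t=15 s and t=35 s.
hits = memory.query([1.0, 0.0], k=1, time_window=(15.0, 35.0))
```

In this sketch the pose of the top hit could serve as a navigation goal candidate. A deployed system would replace the linear scan with an approximate nearest-neighbor index.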
“Adam is just heading to his daily sports training. He went out with only his backpack. Now he comes back and looks for his clothes. Question: Where are they?”
RAVEN exposes four primitive memory operations to the querying agent: semantic, temporal, spatial, and image-based retrieval.
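As a rough illustration of how such operations might be exposed as agent tools, the sketch below implements the semantic, temporal, spatial, and image-based retrieval modes mentioned earlier over a toy memory. The tool names, signatures, and the `call_tool` dispatcher are hypothetical, and the hard-coded 2-D embeddings stand in for real encoder outputs; this is not RAVEN's actual interface.

```python
import math

# Each entry: (embedding, (x, y) position, world-clock timestamp in seconds).
MEMORY = [
    ([1.0, 0.0], (0.0, 0.0), 10.0),
    ([0.0, 1.0], (4.0, 0.0), 20.0),
    ([0.8, 0.6], (4.0, 3.0), 30.0),
]


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_semantic(text_embedding, k=1):
    """Indices of the k entries most similar to a text-query embedding."""
    order = sorted(
        range(len(MEMORY)),
        key=lambda i: _cosine(MEMORY[i][0], text_embedding),
        reverse=True,
    )
    return order[:k]


def retrieve_temporal(t_start, t_end):
    """Indices of entries observed within [t_start, t_end]."""
    return [i for i, (_, _, t) in enumerate(MEMORY) if t_start <= t <= t_end]


def retrieve_spatial(xy, radius):
    """Indices of entries observed within `radius` of position `xy`."""
    return [i for i, (_, p, _) in enumerate(MEMORY) if math.dist(p, xy) <= radius]


def retrieve_by_image(index, k=1):
    """Indices of the k entries most similar to a stored frame, excluding itself."""
    ref = MEMORY[index][0]
    others = [i for i in range(len(MEMORY)) if i != index]
    others.sort(key=lambda i: _cosine(MEMORY[i][0], ref), reverse=True)
    return others[:k]


# Hypothetical tool registry that a VLM agent could call by name.
TOOLS = {
    "semantic": retrieve_semantic,
    "temporal": retrieve_temporal,
    "spatial": retrieve_spatial,
    "image": retrieve_by_image,
}


def call_tool(name, **kwargs):
    """Dispatch one tool call issued by the agent."""
    return TOOLS[name](**kwargs)


# The agent can combine tools, e.g. intersect a time window with a spatial region.
recent = set(call_tool("temporal", t_start=15.0, t_end=35.0))
nearby = set(call_tool("spatial", xy=(4.0, 0.0), radius=1.0))
combined = sorted(recent & nearby)
```

Returning indices rather than records keeps the tools composable: the agent can intersect or union result sets before deciding which observation to ground as an answer or navigation goal.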
We evaluate RAVEN on three benchmarks.
We deploy RAVEN on a Unitree Go1 quadruped robot across four real-world indoor environments. The qualitative results compare RAVEN with ReMEmbR and VLM-only baselines on dominant objects, secondary objects, reasoning-based queries, and information recall.
Sample RAVEN robot rollouts across exploration and query-driven retrieval.
@misc{hu2026ravenlonghorizonreasoning,
title={RAVEN: Long-Horizon Reasoning \& Navigation with a Visuo-Spatio-Temporal Memory},
author={Yixun Hu and Zhicheng Zheng and Lihan Zha and Chunwei Xing and Rajdeep Singh and Omar Hossain and Antonio Loquercio and Dhruv Shah},
year={2026},
eprint={2606.25206},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.25206},
}