InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

1Shanghai Jiao Tong University, 2Huawei Noah's Ark Lab
Corresponding Author

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding their responses in retrieved information. As an emerging paradigm, Agentic RAG goes further by introducing autonomous LLM agents into the information-seeking process. However, existing benchmarks fall short in evaluating such systems: they are confined to a static retrieval environment with a fixed, limited corpus and rely on simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols measure information-seeking effectiveness against pre-defined gold sets of documents, making them ill-suited to the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark of challenging questions for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing queries that satisfy the criteria of determinacy, difficulty, and diversity. Building on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics for the accuracy, utility, and compactness of information-seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.
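
The abstract mentions fine-grained metrics for the accuracy, utility, and compactness of information-seeking outcomes. As a rough, non-authoritative illustration of what such metrics could look like, the Python sketch below computes three toy scores over a retrieved evidence set; the Evidence class, the scoring formulas, and the character budget are assumptions made for this example, not the benchmark's actual definitions.

# Illustrative sketch only: the metric definitions below are assumptions,
# not InfoDeepSeek's actual formulas.
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str
    supports_answer: bool  # whether this piece of evidence backs the gold answer

def accuracy(answer: str, gold: str) -> float:
    # Toy answer accuracy: exact match after light normalization.
    return float(answer.strip().lower() == gold.strip().lower())

def utility(evidence: list[Evidence]) -> float:
    # Toy utility: fraction of retrieved evidence that actually supports the answer.
    return sum(e.supports_answer for e in evidence) / len(evidence) if evidence else 0.0

def compactness(evidence: list[Evidence], budget_chars: int = 2000) -> float:
    # Toy compactness: reward evidence sets that stay within a character budget.
    total_chars = sum(len(e.text) for e in evidence)
    return max(0.0, 1.0 - total_chars / budget_chars)

if __name__ == "__main__":
    evidence = [
        Evidence("Paris is the capital of France.", supports_answer=True),
        Evidence("France is a country in Western Europe.", supports_answer=False),
    ]
    print(accuracy("Paris", "paris"), utility(evidence), compactness(evidence))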

Figure 1: Comparison between a traditional RAG benchmark (top) and our InfoDeepSeek (bottom).

Our Dataset

Figure 2: The pipeline for constructing InfoDeepSeek, where questions are carefully designed to satisfy three criteria: determinacy, difficulty, and diversity. To achieve this, we begin by extracting anchor and ordinary knowledge from web sources, and then apply anchor-based and diversity-based combination to construct draft questions. These draft questions pass through two key filtering stages: a determinacy check and a difficulty check. Questions that pass both filters are retained as candidates and subsequently undergo a multi-stage validation process.
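
Figure 2 describes question construction as a chain of extraction, combination, and filtering steps. The Python sketch below mirrors only that control flow; every helper here (extract_knowledge, combine, check_determinacy, check_difficulty, validate) is a named placeholder for the paper's actual procedures and returns trivial stub values so the example runs.

# Placeholder pipeline mirroring the flow in Figure 2; every helper below is a stub,
# not the benchmark's real construction procedure.

def extract_knowledge(source: str) -> tuple[str, str]:
    # Stub: pull an "anchor" fact and an "ordinary" fact out of a web source.
    return source, source

def combine(anchor: str, ordinary: str) -> list[str]:
    # Stub: anchor-based / diversity-based combination into draft questions.
    return [f"Which entity connects '{anchor}' with '{ordinary}'?"]

def check_determinacy(question: str) -> bool:
    # Stub: keep only questions with a single, stable answer.
    return True

def check_difficulty(question: str) -> bool:
    # Stub: keep only questions a baseline agent fails to answer directly.
    return True

def validate(question: str) -> bool:
    # Stub: final multi-stage validation before a question enters the benchmark.
    return True

def build_candidates(web_sources: list[str]) -> list[str]:
    drafts = []
    for source in web_sources:
        anchor, ordinary = extract_knowledge(source)
        for draft in combine(anchor, ordinary):
            if check_determinacy(draft) and check_difficulty(draft):
                drafts.append(draft)
    return [question for question in drafts if validate(question)]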

Table 1: Different question attributes and their ratios in InfoDeepSeek.

Figure 3: Domain distribution of InfoDeepSeek.

Experiments

BibTeX


@article{xi2025infodeepseek,
  title={InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation},
  author={Yunjia Xi and Jianghao Lin and Menghui Zhu and Yongzhao Xiao and Zhuoying Ou and Jiaqi Liu and Tong Wan and Bo Chen and Weiwen Liu and Yasheng Wang and Ruiming Tang and Weinan Zhang and Yong Yu},
  year={2025},
  journal={arXiv preprint arXiv:2505.15872},
  url={https://arxiv.org/abs/2505.15872},
}