We train a 4B-parameter deep research agent using scalable agentic RL in a virtual world environment. LiteResearcher-4B achieves 71.3% on GAIA and 78% on Xbench — matching Claude-4.5-Sonnet and outperforming open-source models up to 8× larger.
Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, while depending on live web search during RL training introduces instability and high cost. Together, these limit the scalability of agentic RL.
LiteResearcher is a training framework that makes agentic RL scalable. By constructing a lightweight virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a tiny search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5-Sonnet). On common benchmarks such as GAIA and Xbench, LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0% respectively, demonstrating that scalable RL training is essential for deep research agents.
LiteResearcher constructs a virtual world whose architecture is identical to the real web's but whose execution is fully isolated. The framework consists of three key components:
(1) Co-constructed Training Data & Corpus: We scale up information sources (32M+ webpages, 1M+ domains) and identify five atomic search capabilities — direct retrieval, aggregation, enumeration, cross-verification, and statistics — to generate diverse, realistic training tasks.
(2) Stable Local Tool Environment: A local search engine (BGE-M3 + Milvus, ~0.15s/query) and local browse tool (PostgreSQL, ~0.17s/page) that enable 73.2M tool calls during training at zero marginal cost.
(3) Difficulty-Aware Curriculum RL: Multi-stage training that progressively increases task difficulty and context length, retaining only partially-solvable instances to maintain consistent training signal.
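To make the instance-retention step in (3) concrete, the sketch below keeps only tasks the current policy solves some but not all of the time, so every retained task still yields a learning signal. The function name, rollout format, and pass-rate thresholds are illustrative assumptions, not the exact values used in training.

```python
def filter_partially_solvable(instances, rollout_results,
                              min_rate=0.1, max_rate=0.9):
    """Keep instances whose empirical pass rate lies within
    [min_rate, max_rate], discarding tasks that are always solved
    (no gradient signal) or never solved (pure noise).

    rollout_results maps instance id -> list of 0/1 success flags
    from multiple rollouts of the current policy."""
    kept = []
    for inst in instances:
        flags = rollout_results[inst["id"]]
        rate = sum(flags) / len(flags)
        if min_rate <= rate <= max_rate:
            kept.append(inst)
    return kept


tasks = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
rollouts = {"a": [1, 1, 1, 1],   # always solved  -> dropped
            "b": [1, 0, 1, 0],   # partially solved -> kept
            "c": [0, 0, 0, 0]}   # never solved   -> dropped
print([t["id"] for t in filter_partially_solvable(tasks, rollouts)])  # ['b']
```

Re-running this filter between training stages keeps the batch composition centered on the frontier of the agent's current capability.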
LiteResearcher-4B consistently outperforms open-source models up to 8× larger and matches or exceeds proprietary systems across eight benchmarks.
| Models | GAIA-Text | Browsecomp | Browsecomp (ZH) | HLE | Frames | Webwalker | Seal-0 | Xbench-DS |
|---|---|---|---|---|---|---|---|---|
| **Commercial Models** | | | | | | | | |
| Claude-4-Sonnet | 68.3 | 12.2 | 29.1 | 20.3 | 80.7 | 61.7 | - | 64.6 |
| Claude-4.5-Sonnet | 71.2 | 19.6 | 40.8 | 24.5 | 85.0 | - | 53.4 | 66.0 |
| Deepseek-V3.2 | 63.5 | 67.6 | 65.0 | 40.8 | 80.2 | - | 38.5 | 71.0 |
| DeepSeek-V3.1 | 63.1 | 30.0 | 49.2 | 29.8 | 83.7 | 61.2 | - | 71.0 |
| Minimax-M2 | 75.7 | 44.0 | 48.5 | 31.8 | - | - | - | 72.0 |
| OpenAI-GPT-5-high | 76.4 | 54.9 | 65.0 | 35.2 | - | - | 51.4 | 77.8 |
| GLM-4.6 | 71.9 | 45.1 | 49.5 | 30.4 | - | - | - | 70.0 |
| Kimi-Researcher | - | - | - | 26.9 | 78.8 | - | 36.0 | 69.0 |
| Kimi-K2-0905 | 60.2 | 7.4 | 22.2 | 21.7 | 58.1 | - | 25.2 | 61.0 |
| **Open-Source Models** | | | | | | | | |
| Mirothinker 8B | 66.4 | 31.1 | 40.2 | 21.5 | 80.6 | 60.6 | 40.4 | 60.6 |
| Tongyi DeepResearch 30B | 70.9 | **43.4** | **46.7** | **32.9** | **90.6** | 72.2 | - | 75.0 |
| ASearcher QWQ v2 32B | 58.7 | - | - | - | 74.5 | - | - | 51.1 |
| WebSailor 30B | 53.2 | - | - | - | - | - | - | 53.3 |
| WebDancer 32B (QwQ) | 51.5 | 3.8 | 18.0 | - | - | 47.9 | - | 38.3 |
| WebExplorer 8B | 50.0 | 15.7 | 32.0 | 17.3 | 75.7 | 62.7 | - | 53.7 |
| DeepMiner 32B | 58.7 | 33.5 | 40.1 | - | - | - | - | 62.0 |
| AFM-RL 32B | 55.3 | 11.1 | - | 18.0 | - | 63.0 | - | - |
| SFR-DeepResearch 20B | 66.0 | - | - | 28.7 | 82.8 | - | - | - |
| AgentCPM-Explore 4B | 63.9 | 24.1 | 29.1 | 19.1 | 82.7 | 68.1 | 40.5 | 70.0 |
| LiteResearcher-4B | **71.3** | 27.5* | 32.5* | 22.0 | 83.1 | **72.7** | **41.8** | **78.0** |
Best open-source results in bold. Results marked * were obtained with a 64k context window and a memory mechanism.
Our difficulty-aware curriculum learning prevents training saturation: after Stage 1 plateaus, Stage 2 with adjusted difficulty yields a further +3.6% GAIA accuracy, demonstrating the importance of progressive curriculum design.
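One way to decide when to advance from one stage to the next is a simple plateau check on held-out scores, as sketched below; the window size and improvement threshold are illustrative assumptions, not our training configuration.

```python
def plateaued(history, window=3, min_delta=0.005):
    """Return True when the best validation score over the most recent
    `window` evaluations improves on the best score before them by less
    than `min_delta` -- a minimal trigger for advancing the curriculum
    to the next difficulty stage."""
    if len(history) < 2 * window:
        return False  # not enough evidence yet
    recent_best = max(history[-window:])
    prior_best = max(history[:-window])
    return recent_best - prior_best < min_delta


# Still improving steadily -> stay in the current stage.
print(plateaued([0.50, 0.55, 0.60, 0.63, 0.64, 0.66]))    # False
# Gains have flattened -> advance to the next stage.
print(plateaued([0.60, 0.63, 0.65, 0.651, 0.652, 0.6515]))  # True
```

Using the best-so-far score in each segment, rather than the latest score, makes the trigger robust to evaluation noise.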