ContextBench
A scientific benchmark evaluating the dynamics of multi-file context retrieval in LLM agents.
- Foundation Models: 4
- Best Pass@1: 53.0%
- Avg. Efficiency: 0.599
- Avg. Line F1: 0.325
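The Line F1 figures above score how closely an agent's retrieved context lines overlap the human-verified gold lines. A minimal sketch of such a line-level F1, assuming lines are identified as (file, line-number) pairs and using the standard precision/recall harmonic mean (the benchmark's exact definition may differ):

```python
def line_f1(retrieved, gold):
    """Line-level F1: harmonic mean of precision and recall over
    (file, line) pairs. Illustrative only; ContextBench's exact
    metric definition may differ in detail."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)  # lines both retrieved and in gold context
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Agent retrieved 3 lines, 2 of which are in the 3-line gold context:
score = line_f1({("a.py", 1), ("a.py", 2), ("b.py", 7)},
                {("a.py", 1), ("a.py", 2), ("a.py", 3)})
print(round(score, 3))  # → 0.667
```

Scoring at line granularity rather than file granularity is what lets the benchmark distinguish an agent that reads a whole file from one that pinpoints the relevant region.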
Benchmark Rankings
| Rank | Model | Pass@1 | Line F1 | Efficiency | Avg. Cost |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 53.0% | 0.344 | 0.658 | $0.76 |
| 2 | GPT-5 | 47.2% | 0.312 | 0.591 | $0.45 |
| 3 | Devstral 2 | 40.2% | 0.332 | 0.616 | $0.91 |
| 4 | Gemini 2.5 Pro | 36.4% | 0.311 | 0.529 | $0.38 |
Dataset Statistics
A repository-level benchmark spanning 8 programming languages that introduces human-verified gold contexts, exposing intermediate context-retrieval signals that evaluation on final task-resolution rate alone misses.
| Language | #Repo | #Task | #File | #Block | #Line |
|---|---|---|---|---|---|
| Python | 20 | 512 | 1,520 | 6,714 | 115,122 |
| Java | 6 | 57 | 262 | 3,030 | 49,057 |
| JavaScript | 9 | 153 | 819 | 3,949 | 87,907 |
| TypeScript | 8 | 119 | 537 | 1,106 | 40,621 |
| Go | 7 | 104 | 679 | 3,000 | 71,596 |
| Rust | 9 | 63 | 272 | 1,842 | 50,402 |
| C | 3 | 68 | 250 | 1,591 | 62,300 |
| C++ | 4 | 60 | 209 | 1,884 | 45,110 |
| Total | 66 | 1,136 | 4,548 | 23,116 | 522,115 |
Construction Pipeline
An overview of the ContextBench construction pipeline. ContextBench is curated through three key steps: Task Deduplication, Task Selection, and Expert Annotation.

1. **Task Deduplication**: Removes exact and near-duplicate tasks from multiple issue-resolution benchmarks using rule-based and embedding-based detection.
2. **Task Selection**: Identifies challenging tasks based on agent solvability and the scope and dispersion of edits in ground-truth patches.
3. **Expert Annotation**: Employs expert developers to trace code dependencies and construct gold contexts, validated through LLM-based patch generation.
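The deduplication step above can be sketched with a lightweight similarity check. The actual pipeline combines rule-based and embedding-based detection; as a self-contained stand-in, this sketch uses character-shingle Jaccard similarity, and the threshold and helper names are illustrative assumptions:

```python
def shingles(text, n=5):
    """Character n-grams of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def near_duplicate(a, b, threshold=0.8):
    """Jaccard similarity of shingle sets; a simplified stand-in for
    the embedding-based detector used in the real pipeline."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold

def dedup(tasks, threshold=0.8):
    """Keep the first occurrence of each near-duplicate cluster."""
    kept = []
    for task in tasks:
        if not any(near_duplicate(task, k, threshold) for k in kept):
            kept.append(task)
    return kept

titles = ["Fix crash on missing config",
          "Fix crash on missing config.",   # near-duplicate
          "Add dark mode toggle"]
print(dedup(titles))  # → ['Fix crash on missing config', 'Add dark mode toggle']
```

Keeping the first occurrence per cluster mirrors the goal of the step: each underlying issue should appear in the benchmark at most once, so scores are not inflated by repeated tasks.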
Data Flow Dynamics
The data flow during the ContextBench construction pipeline, showing the step-by-step evolution from task deduplication to gold context annotation.
