ContextBench

A scientific benchmark evaluating the dynamics of multi-file context retrieval in LLM agents.

Foundation Models: 4
Best Pass@1: 53.0%
Avg. Efficiency: 0.599
Avg. Line F1: 0.325
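The Line F1 metric above can be read as a set overlap between the lines an agent retrieves and the human-verified gold context lines. A minimal illustrative sketch, assuming lines are identified by (file path, line number) pairs; this is not the official scorer:

```python
def line_f1(retrieved, gold):
    """Line-level F1 between retrieved and gold context lines.

    Both arguments are collections of (file_path, line_number) pairs.
    Illustrative sketch only, not ContextBench's official scorer.
    """
    retrieved, gold = set(retrieved), set(gold)
    if not retrieved or not gold:
        return 0.0
    tp = len(retrieved & gold)          # lines both retrieved and gold
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, retrieving lines {1, 2} of `a.py` when the gold context is lines {2, 3} gives precision 0.5 and recall 0.5, hence F1 = 0.5.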

Benchmark Rankings

Raw backbone capabilities, sorted by the primary metric (Pass@1).

Rank  Model              Pass@1  Line F1  Efficiency  Cost
1     Claude Sonnet 4.5  53.0%   0.344    0.658       $0.76
2     GPT-5              47.2%   0.312    0.591       $0.45
3     Devstral 2         40.2%   0.332    0.616       $0.91
4     Gemini 2.5 Pro     36.4%   0.311    0.529       $0.38

Dataset Statistics

A repository-level benchmark spanning 8 programming languages. Its human-verified gold contexts expose the intermediate context-retrieval signals that evaluating only the final task resolution rate misses.

Language    #Repo  #Task  #File  #Block    #Line
Python         20    512  1,520   6,714  115,122
Java            6     57    262   3,030   49,057
JavaScript      9    153    819   3,949   87,907
TypeScript      8    119    537   1,106   40,621
Go              7    104    679   3,000   71,596
Rust            9     63    272   1,842   50,402
C               3     68    250   1,591   62,300
C++             4     60    209   1,884   45,110
Total          66  1,136  4,548  23,116  522,115

Construction Pipeline

An overview of the ContextBench construction pipeline. ContextBench is curated through three key steps: Task Deduplication, Task Selection, and Expert Annotation.

1. Task Deduplication

Removes exact and near-duplicate tasks from multiple issue resolution benchmarks using rule-based and embedding-based detection.
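The embedding-based detection in this step can be illustrated with a greedy similarity filter that keeps a task only if it is sufficiently dissimilar from every task kept so far. A minimal sketch, substituting a bag-of-words cosine for a real embedding model; the threshold value is an assumption:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(tasks, threshold=0.9):
    """Greedy near-duplicate filtering over task descriptions.

    Bag-of-words cosine stands in for the embedding model here;
    a real pipeline would embed each task with a trained encoder.
    """
    kept, vecs = [], []
    for text in tasks:
        vec = Counter(text.lower().split())
        if all(cosine(vec, v) < threshold for v in vecs):
            kept.append(text)
            vecs.append(vec)
    return kept
```

Exact duplicates score 1.0 and are dropped; unrelated tasks score near 0.0 and survive.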

2. Task Selection

Identifies challenging tasks based on agent solvability and the scope and dispersion of edits in ground-truth patches.
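Edit scope and dispersion can be estimated directly from a ground-truth patch, for instance by counting the edited files and hunks in its unified diff. A hypothetical proxy measure, not the paper's actual selection criterion:

```python
def patch_dispersion(patch_text: str) -> dict:
    """Rough scope/dispersion proxy for a unified diff (hypothetical).

    Counts edited files (one "+++ " header per file) and hunks
    (one "@@" header per contiguous edited region).
    """
    lines = patch_text.splitlines()
    files = sum(1 for line in lines if line.startswith("+++ "))
    hunks = sum(1 for line in lines if line.startswith("@@"))
    return {"files": files, "hunks": hunks}
```

Under this proxy, a patch touching many files with many scattered hunks would rank as more challenging than a single-hunk, single-file fix.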

3. Expert Annotation

Employs expert developers who trace code dependencies to construct gold contexts, which are then validated through LLM-based patch generation.
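One cheap sanity check on an annotated gold context is that it must at least cover every file the ground-truth patch edits. A hypothetical helper sketching that check (not the paper's validator, which uses LLM-based patch generation):

```python
def covers_patch(gold_files, patch_text: str) -> bool:
    """Check that a gold context includes every file edited by the
    ground-truth patch. Hypothetical sanity check, not the paper's
    LLM-based validation."""
    edited = {line[4:] for line in patch_text.splitlines()
              if line.startswith("+++ ") and line[4:] != "/dev/null"}
    # Strip the "b/" prefix used by git-style unified diffs.
    edited = {p[2:] if p.startswith("b/") else p for p in edited}
    return edited <= set(gold_files)
```

A gold context that omits an edited file would fail this check immediately, before any expensive LLM validation runs.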

Data Flow Dynamics

The data flow during the ContextBench construction pipeline, showing the step-by-step evolution from task deduplication to gold context annotation.
