ContextBench

A scientific benchmark evaluating the dynamics of multi-file context retrieval in LLM agents.

Foundation Models: 4
Best Pass@1: 53.0%
Avg. Efficiency: 0.599
Avg. Line F1: 0.325
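The Line F1 metric above can be read as a set overlap between the lines an agent retrieves and the human-verified gold context lines. A minimal illustrative sketch, assuming lines are identified by (file path, line number) pairs; this is not the official scorer:

```python
def line_f1(retrieved, gold):
    """Line-level F1 between retrieved and gold context lines.

    Both arguments are collections of (file_path, line_number) pairs.
    Illustrative sketch only, not ContextBench's official scorer.
    """
    retrieved, gold = set(retrieved), set(gold)
    if not retrieved or not gold:
        return 0.0
    tp = len(retrieved & gold)          # lines both retrieved and gold
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, retrieving lines {1, 2} of `a.py` when the gold context is lines {2, 3} gives precision 0.5 and recall 0.5, hence F1 = 0.5.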

Benchmark Rankings

Raw backbone capabilities, sorted by the primary metric (Pass@1).

Rank  Model              Pass@1  Line F1  Efficiency  Cost
1     Claude Sonnet 4.5  53.0%   0.344    0.658       $0.76
2     GPT-5              47.2%   0.312    0.591       $0.45
3     Devstral 2         40.2%   0.332    0.616       $0.91
4     Gemini 2.5 Pro     36.4%   0.311    0.529       $0.38

Dataset Statistics

A repository-level benchmark spanning 8 programming languages. Its human-verified gold contexts expose the intermediate context-retrieval signals that evaluating only the final task resolution rate misses.

Language    #Repo  #Task  #File  #Block    #Line
Python         20    512  1,520   6,714  115,122
Java            6     57    262   3,030   49,057
JavaScript      9    153    819   3,949   87,907
TypeScript      8    119    537   1,106   40,621
Go              7    104    679   3,000   71,596
Rust            9     63    272   1,842   50,402
C               3     68    250   1,591   62,300
C++             4     60    209   1,884   45,110
Total          66  1,136  4,548  23,116  522,115

Construction Pipeline

An overview of the ContextBench construction pipeline. ContextBench is curated through three key steps: Task Deduplication, Task Selection, and Expert Annotation.

1. Task Deduplication

Removes exact and near-duplicate tasks from multiple issue resolution benchmarks using rule-based and embedding-based detection.
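The embedding-based detection in this step can be illustrated with a greedy similarity filter that keeps a task only if it is sufficiently dissimilar from every task kept so far. A minimal sketch, substituting a bag-of-words cosine for a real embedding model; the threshold value is an assumption:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(tasks, threshold=0.9):
    """Greedy near-duplicate filtering over task descriptions.

    Bag-of-words cosine stands in for the embedding model here;
    a real pipeline would embed each task with a trained encoder.
    """
    kept, vecs = [], []
    for text in tasks:
        vec = Counter(text.lower().split())
        if all(cosine(vec, v) < threshold for v in vecs):
            kept.append(text)
            vecs.append(vec)
    return kept
```

Exact duplicates score 1.0 and are dropped; unrelated tasks score near 0.0 and survive.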

2. Task Selection

Identifies challenging tasks based on agent solvability and the scope and dispersion of edits in ground-truth patches.
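Edit scope and dispersion can be estimated directly from a ground-truth patch, for instance by counting the edited files and hunks in its unified diff. A hypothetical proxy measure, not the paper's actual selection criterion:

```python
def patch_dispersion(patch_text: str) -> dict:
    """Rough scope/dispersion proxy for a unified diff (hypothetical).

    Counts edited files (one "+++ " header per file) and hunks
    (one "@@" header per contiguous edited region).
    """
    lines = patch_text.splitlines()
    files = sum(1 for line in lines if line.startswith("+++ "))
    hunks = sum(1 for line in lines if line.startswith("@@"))
    return {"files": files, "hunks": hunks}
```

Under this proxy, a patch touching many files with many scattered hunks would rank as more challenging than a single-hunk, single-file fix.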

3. Expert Annotation

Employs expert developers who trace code dependencies to construct gold contexts, which are then validated through LLM-based patch generation.
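One cheap sanity check on an annotated gold context is that it must at least cover every file the ground-truth patch edits. A hypothetical helper sketching that check (not the paper's validator, which uses LLM-based patch generation):

```python
def covers_patch(gold_files, patch_text: str) -> bool:
    """Check that a gold context includes every file edited by the
    ground-truth patch. Hypothetical sanity check, not the paper's
    LLM-based validation."""
    edited = {line[4:] for line in patch_text.splitlines()
              if line.startswith("+++ ") and line[4:] != "/dev/null"}
    # Strip the "b/" prefix used by git-style unified diffs.
    edited = {p[2:] if p.startswith("b/") else p for p in edited}
    return edited <= set(gold_files)
```

A gold context that omits an edited file would fail this check immediately, before any expensive LLM validation runs.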

Data Flow Dynamics

The data flow during the ContextBench construction pipeline, showing the step-by-step evolution from task deduplication to gold context annotation.
