ContextBench

A scientific benchmark evaluating the dynamics of multi-file context retrieval in LLM agents.

Foundation Models: 4
Best Pass@1: 53.0%
Avg. Efficiency: 0.599
Avg. Line F1: 0.325

Benchmark Rankings

Sorted by primary metric. System view: raw backbone capabilities.
Rank  Model              Pass@1  Line F1  Efficiency  Cost
1     Claude Sonnet 4.5  53.0%   0.344    0.658       $0.76
2     GPT-5              47.2%   0.312    0.591       $0.45
3     Devstral 2         40.2%   0.332    0.616       $0.91
4     Gemini 2.5 Pro     36.4%   0.311    0.529       $0.38
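
The headline figures at the top of the page appear to be simple aggregates of the per-model rows. The sketch below checks that reading. It treats the two decimal values in each row as Line F1 and Efficiency, in that order, and the dollar value as a cost. These labels are assumptions inferred from the summary averages, and the names `ROWS` and `summarize` are hypothetical, chosen only for this illustration.

```python
from statistics import mean

# Rows transcribed from the rankings above.
# Assumed column order: model, Pass@1 (%), Line F1, Efficiency, cost (USD).
ROWS = [
    ("Claude Sonnet 4.5", 53.0, 0.344, 0.658, 0.76),
    ("GPT-5", 47.2, 0.312, 0.591, 0.45),
    ("Devstral 2", 40.2, 0.332, 0.616, 0.91),
    ("Gemini 2.5 Pro", 36.4, 0.311, 0.529, 0.38),
]


def summarize(rows):
    """Recompute the headline figures as plain aggregates over the rows."""
    return {
        "models": len(rows),
        "best_pass_at_1": max(r[1] for r in rows),
        "avg_line_f1": mean(r[2] for r in rows),
        "avg_efficiency": mean(r[3] for r in rows),
    }


summary = summarize(ROWS)

# Published headline values, copied from the top of the page.
PUBLISHED = {"avg_line_f1": 0.325, "avg_efficiency": 0.599}

for key, published in PUBLISHED.items():
    # The page reports three decimals, so allow a rounding tolerance.
    matches = abs(summary[key] - published) < 1e-3
    print(f"{key}: recomputed={summary[key]:.5f} published={published} match={matches}")
```

Under these assumptions, both recomputed averages fall within rounding distance of the published values, which supports the column labeling but does not confirm how the benchmark itself defines or aggregates the metrics.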