ContextBench
A scientific benchmark evaluating the dynamics of multi-file context retrieval in LLM agents.
Foundation Models: 4
Best Pass@1: 53.0%
Avg. Efficiency: 0.599
Avg. Line F1: 0.325
Benchmark Rankings
Sorting by Primary Metric
Showing Raw Backbone Capabilities
| Rank | Model | Pass@1 | Line F1 | Efficiency | Cost (USD) |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 53.0% | 0.344 | 0.658 | $0.76 |
| 2 | GPT-5 | 47.2% | 0.312 | 0.591 | $0.45 |
| 3 | Devstral 2 | 40.2% | 0.332 | 0.616 | $0.91 |
| 4 | Gemini 2.5 Pro | 36.4% | 0.311 | 0.529 | $0.38 |
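
The headline figures at the top of the page appear to be derived from the table rows. The sketch below checks this under two assumptions that the page does not state explicitly: that the second and third numeric columns are Line F1 and Efficiency, and that each "Avg." figure is an unweighted mean over the four models. The variable names are illustrative only.

```python
# Rows transcribed from the rankings table.
# Assumed column order: model, Pass@1 (%), Line F1, Efficiency.
rows = [
    ("Claude Sonnet 4.5", 53.0, 0.344, 0.658),
    ("GPT-5", 47.2, 0.312, 0.591),
    ("Devstral 2", 40.2, 0.332, 0.616),
    ("Gemini 2.5 Pro", 36.4, 0.311, 0.529),
]


def mean(values):
    """Unweighted arithmetic mean (an assumption about how the page averages)."""
    values = list(values)
    return sum(values) / len(values)


model_count = len(rows)
best_pass_at_1 = max(r[1] for r in rows)
avg_line_f1 = mean(r[2] for r in rows)
avg_efficiency = mean(r[3] for r in rows)

print(f"Foundation models: {model_count}")
print(f"Best Pass@1: {best_pass_at_1:.1f}%")
print(f"Avg. Line F1 (unrounded): {avg_line_f1:.5f}")
print(f"Avg. Efficiency (unrounded): {avg_efficiency:.5f}")

# Compare against the headline figures with a tolerance of one unit
# in the third decimal place, to allow for rounding on the page.
assert abs(avg_line_f1 - 0.325) < 1e-3
assert abs(avg_efficiency - 0.599) < 1e-3
```

Under these assumptions the computed means agree with the reported Avg. Line F1 and Avg. Efficiency to within rounding, which supports reading those two columns this way.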