This market will use the variant of the benchmark frozen one week after the initial release (following the public benchmark red-teaming stage to identify flawed/ambiguous questions).
The 5/5 reliability evaluation will use each LLM API provider's default temperature. Where that default cannot be determined unambiguously, we will use a temperature of 0.7.
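For concreteness, here is a minimal Python sketch of how the temperature fallback and the reported metrics could be computed. It assumes that pass@k counts a question as solved when at least one of its k sampled answers is correct, and that k/k reliability counts it only when all k samples are correct. Those definitions are not spelled out in this description, and the function names and toy data below are hypothetical, not part of any official evaluation harness.

```python
from typing import Optional, Sequence

FALLBACK_TEMPERATURE = 0.7  # fallback value stated in the description above


def resolve_temperature(provider_default: Optional[float]) -> float:
    """Return the provider's default temperature, or 0.7 if it is unknown."""
    # None stands in for "default is ambiguous" (an assumption of this sketch).
    return FALLBACK_TEMPERATURE if provider_default is None else provider_default


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where at least one of the k samples is correct."""
    return sum(any(samples) for samples in results) / len(results)


def k_of_k_reliability(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where all k samples are correct."""
    return sum(all(samples) for samples in results) / len(results)


# Toy data: 4 questions, 5 samples each (True = correct answer).
toy_results = [
    [True, True, True, True, True],       # always correct
    [False, True, False, False, False],   # correct once
    [False, False, False, False, False],  # never correct
    [False, False, False, False, False],  # never correct
]

print(resolve_temperature(None))        # 0.7
print(resolve_temperature(1.0))         # 1.0
print(pass_at_k(toy_results))           # 0.5
print(k_of_k_reliability(toy_results))  # 0.25
```

Under these assumed definitions, 5/5 reliability can never exceed pass@5 for the same set of samples, which is consistent with the figures reported in the updates.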
As of Nov 19th 2025, Gemini 3 is the new SotA:
"pass@5: 19% (prev SOTA 10%)
5/5 reliability: 5% (prev 3%)"
https://x.com/JRobertsAI/status/1991163723436663125?s=20
As of May 24th 2025, Claude 4 Opus is the new SotA:
https://x.com/JRobertsAI/status/1926325748303872203
4% pass@1
As of March 28th 2025, Gemini 2.5 Pro is the new SotA: https://x.com/JRobertsAI/status/1905577784300183653
3% pass@1
5% pass@5
1% 5/5 reliability