Claude Opus 4.5's METR-50 time horizon
Dec 24
< 2h: 0.7%
2h-3h: 43%
3h-4h: 42%
4h-5h: 7%
5h-6h: 4%
6h-7h: 1.8%
7h-8h: 0.8%
>=8h: 0.6%

This market will resolve to the highest 50% time horizon, as reported by METR, for the first Claude 4.5 Opus thinking model to appear on METR's graph.

The 50% time horizon is a measure of AI autonomy based on the length of tasks that an AI can do: roughly, it is the time that humans take to complete tasks that an AI system can successfully do 50% of the time. See METR's "Measuring AI Ability to Complete Long Tasks" for the technical definition. At the time of that paper, Claude 3.7 Sonnet, released in February 2025, was the leading model, with a 50% time horizon of 59 minutes.
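As a rough illustration of the definition above, the sketch below fits a logistic curve of success probability against the log of human completion time and reads off the task length at which predicted success is 50%. This is only in the spirit of METR's approach, not their actual pipeline; the task lengths, the success outcomes, and the function names `fit_logistic` and `horizon_50` are all hypothetical.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * x) by plain gradient ascent
    on the mean log-likelihood. A sketch, not a tuned optimizer."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_50(task_minutes, successes):
    """Return the human task length (in minutes) at which the fitted
    success probability equals 50%."""
    logs = [math.log2(t) for t in task_minutes]
    mean = sum(logs) / len(logs)
    centered = [x - mean for x in logs]  # centering helps gradient ascent converge
    a, b = fit_logistic(centered, successes)
    # sigmoid(a + b * (x - mean)) = 0.5  <=>  x = mean - a / b
    return 2 ** (mean - a / b)

# Hypothetical data: human completion time per task and whether the model succeeded.
minutes = [2, 4, 8, 15, 30, 60, 120, 240, 480, 960]
solved = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
print(f"estimated 50% time horizon: {horizon_50(minutes, solved):.0f} minutes")
```

With real evaluations the estimate depends heavily on the task set and on how human times are measured, which is why the resolution criteria below specify which set of tasks counts.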

Left bounds inclusive, right bounds exclusive.
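The bound convention can be sketched as a small function that maps a reported horizon in hours to one of the answer buckets. The function name `bucket_for` is hypothetical and the bucket labels are copied from the answer list; exactly 3.0 hours lands in 3h-4h, and exactly 8.0 hours lands in >=8h.

```python
import math

def bucket_for(hours):
    """Map a 50% time horizon (in hours) to this market's answer bucket.
    Left bounds are inclusive, right bounds are exclusive."""
    if hours < 2:
        return "< 2h"
    if hours >= 8:
        return ">=8h"
    lower = math.floor(hours)
    return f"{lower}h-{lower + 1}h"

# A horizon of 3 hours 55 minutes falls in the 3h-4h bucket.
print(bucket_for(3 + 55 / 60))
```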

Time horizon could vary based on the set of tasks used to measure it, so this market will be based on the time horizon for the most comprehensive set of tasks reported by METR (as of 2025, largely software and engineering tasks). This will be ambiguous if METR stops publishing time horizons across all of their autonomy tasks and only publishes separate results for different subsets; I might N/A in that scenario.

See also:

/Bayesian/gemini-3s-50-time-horizon-per-metr

/Bayesian/gpt5-pros-50-time-horizon-per-metr

/Bayesian/grok-5s-50-time-horizon-per-metr

/Bayesian/r2s-50-time-horizon-per-metr


I will hold. 3 to 3.5 hours seems certain.

https://x.com/GregHBurnham/status/1993509024097292388?s=19

@Amonium at first I thought the same, but notice that the >4h row has n=3, so it is basically meaningless. To me it seems indicative of something lower, around 2h30min?

Claude models outperform on SWE-bench relative to METR. Explained below.

@Bayesian What does n=3 mean?

@MaxLennartson it means there are only 3 problems in SWE-bench Verified that are estimated to take 4h+ for humans to solve, if I'm understanding it right

@Bayesian Is that why you think Claude Opus will have a time horizon of 2h30min?

@MaxLennartson not exactly. More like: I think Opus will have a time horizon of around 2.5 hours, based on the scores for human times under 4 hours. I'm mostly dismissing the >4h row because, with only 3 samples, it is not very informative.
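To illustrate why a row with n=3 carries little information, here is a minimal sketch that computes a Wilson score interval for a success rate. The success count of 2 out of 3 is a hypothetical value chosen for illustration, not a figure from the linked post, and the function name is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: 2 successes out of 3 tasks, versus the same rate on 300 tasks.
for k, n in [(2, 3), (200, 300)]:
    lo, hi = wilson_interval(k, n)
    print(f"{k}/{n}: {lo:.2f} to {hi:.2f}")
```

With three samples the interval spans most of the range from 0 to 1, so the observed rate in that row says little about the true success rate on 4h+ tasks.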

I doubt they'll ever evaluate it. Probably only doing OpenAI's models since they give them free credits

bought Ṁ40 YES

I aim between 3 and 3.3 hours.

SWE-bench used to be well correlated with METR time horizons, and still is for GPT models. By that correlation, the METR horizon for Opus is just over 3.5 hours.

But Claude models have started diverging since the 4 series. Opus might be as low as 2.7 hours.

🤖

Meowdy! Creator's 3:55 guess points to 3h-4h bucket; I’m curious how much Claude 4.5 really stretches the horizon. I’ll dig into this more tonight and pounce on fresh insights!

guessing 3 hours 55 minutes

@jim Would it be possible to split the potential time horizons up? For example, 2h-2.5h, 2.5h-3h, and so on.

@MaxLennartson it's not possible with how this market is set up. One could create a new market with better buckets though.
