Will the next major LLM by OpenAI use a new tokenizer?
89% chance · closes Dec 31
  1. The GPT-2 model used r50k_base: vocab size = 50k

  2. The GPT-3 model used r50k_base: vocab size = 50k

  3. The GPT-3.5 model used cl100k_base: vocab size = 100k

  4. The GPT-4 model used cl100k_base: vocab size = 100k
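
These vocab sizes can be checked directly with OpenAI's tiktoken library. A minimal sketch (o200k_base is included for comparison with the comments below; the exact counts include special tokens, so they differ slightly from the round 50k/100k figures):

```python
import tiktoken  # pip install tiktoken

# Print the exact vocabulary size behind each encoding name.
for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: n_vocab = {enc.n_vocab}")
```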

bought Ṁ1,000 YES

Uses o200k_base, which is not on the list.

bought Ṁ23 YES

Regardless of which release is considered major for the purpose of the question, OpenAI have moved to a new tokenizer with 4o, and all of their big models released since (o1, o3, 4.5, 5) have also used it.

4o uses a different tokenizer (try a word like "gumdrop" and compare):

https://platform.openai.com/tokenizer
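
This is also easy to check locally with tiktoken rather than the web tokenizer, assuming a version recent enough to know about gpt-4o; a minimal sketch:

```python
import tiktoken

# encoding_for_model resolves a model name to its tokenizer;
# "gpt-4o" resolves to o200k_base.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # o200k_base

# The same word encodes to different token ids under the old and new vocabs.
old = tiktoken.get_encoding("cl100k_base")
print(old.encode("gumdrop"))
print(enc.encode("gumdrop"))
```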

Is 4o "major"?

bought Ṁ50 YES

What if there are significantly more new tokens, e.g. representing images or audio, but the tokens representing text are pretty much unchanged?

@firstuserhere So YES if there's a GPT-4.5/5 that uses a tokeniser not on this list, and NO if there's a GPT-4.5/5 that uses a tokeniser that is on this list?

Do you consider GPT-4 Turbo to be a new iteration? What counts as the "next major LLM"?

@oh No, GPT-4 Turbo is part of the same family; it does not qualify as the next major LLM release.
