Will the next major LLM by OpenAI use a new tokenizer?
89% chance · closes Dec 31
  1. The GPT-2 model used r50k_base: vocab size = 50k

  2. The GPT-3 model used r50k_base: vocab size = 50k

  3. The GPT-3.5 model used cl100k_base: vocab size = 100k

  4. The GPT-4 model used cl100k_base: vocab size = 100k
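
These vocab sizes can be checked directly with OpenAI's tiktoken library. A minimal sketch (o200k_base is included for comparison with the comments below; the exact counts include special tokens, so they differ slightly from the round 50k/100k figures):

```python
import tiktoken  # pip install tiktoken

# Print the exact vocabulary size behind each encoding name.
for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: n_vocab = {enc.n_vocab}")
```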

bought Ṁ1,000 YES

Uses o200k_base, which is not on the list.

bought Ṁ23 YES

Regardless of which release is considered major for the purpose of the question, OpenAI have moved to a new tokenizer with 4o, and all of their big models released since (o1, o3, 4.5, 5) have also used it.

4o uses a different tokenizer (try a word like "gumdrop" and compare):

https://platform.openai.com/tokenizer
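
This is also easy to check locally with tiktoken rather than the web tokenizer, assuming a version recent enough to know about gpt-4o; a minimal sketch:

```python
import tiktoken

# encoding_for_model resolves a model name to its tokenizer;
# "gpt-4o" resolves to o200k_base.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # o200k_base

# The same word encodes to different token ids under the old and new vocabs.
old = tiktoken.get_encoding("cl100k_base")
print(old.encode("gumdrop"))
print(enc.encode("gumdrop"))
```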

Is 4o "major"?

bought Ṁ50 YES

What if there are significantly more new tokens, e.g. representing images or audio, but the tokens representing text are pretty much unchanged?

@firstuserhere So YES if there's a GPT-4.5/5 that uses a tokeniser not on this list, and NO if there's a GPT-4.5/5 that uses a tokeniser that is on this list?

Do you consider GPT-4 Turbo to be a new iteration? What counts as the "next major LLM"?

@oh No, GPT-4 Turbo is part of the same family; it does not qualify as the next major LLM release.
