Chapter 2 · Part 1

Tokens, not words

Last chapter we said a model "predicts the next word." That was a white lie to get the idea across. Models don't actually deal in words — or letters. They read and write in tokens: common chunks of text, sometimes a whole word, often just a piece of one.

This sounds like a boring technicality. It isn't — it quietly explains some of ChatGPT's strangest quirks, from why it's bad at spelling to why your API bill looks the way it does.

To you, "Tokenizers don't read words." is a four-word sentence. To a GPT-style tokenizer, it's seven tokens.

Why chunks, not words or letters?

It's a compromise between two bad extremes:

One token per letter would make sequences enormously long and force the model to relearn every word from scratch. Slow and wasteful.
One token per word would need a vocabulary of millions of words and still choke on anything new — names, typos, code, other languages.

So tokenizers learn a middle vocabulary of frequent chunks. Common words ("the", "read") get their own token; rarer words get assembled from pieces ("Token" + "izers"). Spaces are usually glued to the start of the following token, which is why you'll see a leading space inside a chunk.
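
One common way to learn such a vocabulary is byte-pair encoding (BPE), the family of algorithms that tiktoken implements. The sketch below is a toy version under loud assumptions: the three-word corpus and its counts are made up, it works on characters rather than bytes, and real tokenizers add many details this one skips. It only illustrates the core loop of repeatedly gluing together the most frequent adjacent pair.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so the chosen pair becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus_counts, num_merges):
    """Start from single characters and apply num_merges greedy merges."""
    words = {tuple(w): f for w, f in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical word counts, chosen only for illustration
toy_corpus = {"the": 10, "then": 4, "other": 3}
merges, words = learn_merges(toy_corpus, num_merges=2)
print(merges)
print(words)
```

After two merges on this toy corpus, the frequent word "the" has become a single symbol, while "then" is assembled from "the" + "n" and "other" from "o" + "the" + "r". That is the same pattern as common words getting their own token and rarer ones being built from pieces.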

The quirks this explains

Once you know the model sees tokens, several mysteries dissolve:

Spelling & counting letters. Ask how many r's are in "strawberry" and the model may get it wrong. To the model, "strawberry" might be just two or three chunks; it never sees the individual letters, so counting them is genuinely hard.
Cost & context limits. APIs charge per token, and the "context window" (covered in the coming chapters) is measured in tokens, not words. As a rough rule, 1 token ≈ 4 characters ≈ ¾ of a word in English.
Other languages cost more. Languages underrepresented in the tokenizer's training get split into many more tokens, so the same sentence can cost several times as much in, say, Hindi or Thai as in English.
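
The 4-characters-per-token rule of thumb above can be turned into a back-of-the-envelope estimator. This is a rough sketch, not a real tokenizer: the function names are made up for illustration, the price is a placeholder rather than any provider's actual rate, and the rule only holds loosely for English text.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text; English only

def estimate_tokens(text: str) -> int:
    """Rough token count from character length (hypothetical helper)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_cost(text: str, price_per_million_tokens: float) -> float:
    """Rough cost, given an assumed price per million tokens."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000

sentence = "Tokenizers don't read words."
print(len(sentence))                 # 28 characters
print(estimate_tokens(sentence))     # 7, since 28 / 4 = 7

# Placeholder price, chosen only to make the arithmetic visible
print(estimate_cost(sentence, price_per_million_tokens=2.0))
```

For anything that matters, such as billing or fitting a prompt into a context window, count with the actual tokenizer instead. The estimate can be far off for code, unusual names, or non-English text.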

tokens.py — text becomes a list of integers

import tiktoken  # third-party: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5-turbo and GPT-4

ids = enc.encode("Tokenizers don't read words.")
print(ids)            # e.g. [3404, 12509, 1541, 956, 1373, 4339, 13]
print(len(ids))       # 7 tokens, not 4 words

# the model only ever sees those integers — never the letters
print([enc.decode([i]) for i in ids])
# ['Token', 'izers', ' don', "'t", ' read', ' words', '.']

Back to prediction

So when we say the model "predicts the next word," what it really does is pick the next token — the next integer in the sequence — and append it, exactly like Chapter 1, just in this chunked alphabet.

But how does it pick? At each step it doesn't choose one token outright. It assigns a probability to every token in the vocabulary and samples one token according to those probabilities. That distribution is where the real action is, and it's next.
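
As a small preview, one sampling step might look like the sketch below. Everything in it is invented for illustration: the four-token vocabulary, the token ids, and the probabilities are made up, and a real model would compute the probabilities from the tokens seen so far instead of reading them from a table.

```python
import random

# Hypothetical mini-vocabulary: token id -> text chunk (ids are made up)
vocab = {0: " words", 1: " tokens", 2: " books", 3: "."}

# Made-up probabilities a model might assign to each candidate next token
probs = {0: 0.55, 1: 0.30, 2: 0.10, 3: 0.05}

def sample_next(probs, rng):
    """Draw one token id, with likelier ids drawn more often."""
    ids = list(probs)
    weights = [probs[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

rng = random.Random(0)   # seeded so repeated runs give the same draw
sequence = []            # token ids generated so far
sequence.append(sample_next(probs, rng))   # pick a token and append it

print(sequence, [vocab[i] for i in sequence])
```

Because the choice is a weighted draw, not a lookup of the top entry, the same prompt can produce different continuations from one run to the next.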