Chapter 4 · Part 2
Careful or creative
Last chapter we saw the real output of a language model: a score for every word, turned into probabilities the model samples from. But before it draws, you can reshape that distribution with a single dial — temperature. Turn it down and the model plays it safe; turn it up and it takes risks.
Nothing about the model changes. The prompt is the same, the scores are the same. Temperature only decides how boldly the model reads from the probabilities it already produced.
Take the prompt "The sky is" and sweep temperature from cold to hot. At low temperature the top token 'blue' hogs nearly all the probability: the model is careful and predictable.
One number, dividing the scores
Recall the softmax from last chapter: it exponentiates each logit and normalizes
so the probabilities sum to 1. Temperature slips in one step earlier — it
divides every logit by T before the softmax runs.
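Written out, the step above might look like the following, where z_i stands for the logit of token i and p_i for its probability. Both symbols are labels chosen here for illustration, not notation taken from the chapter.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

Setting T = 1 recovers the plain softmax from last chapter.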
- Low T (e.g. 0.3) exaggerates the gaps between scores. The leader pulls even further ahead, the distribution gets peaked, and the model almost always picks the top token. This is "careful": focused, repetitive, near-greedy.
- High T (e.g. 1.5) shrinks the gaps. The distribution flattens, and tokens deep in the tail become real possibilities. This is "creative": surprising, but also more likely to wander off the rails.
- T = 1 leaves the scores untouched: the model's natural distribution.
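To make the three regimes concrete, here is a small sketch that prints the probabilities for a handful of candidate tokens at each temperature. The token list and logit values are made up for illustration (they are not real model output), and `softmax_with_temperature` is a hypothetical helper name.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()  # subtracting the max leaves the result unchanged
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical candidates and scores for the prompt "The sky is".
tokens = ["blue", "clear", "dark", "falling"]
logits = [4.0, 2.5, 2.0, 0.5]

for T in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    row = ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs))
    print(f"T={T}: {row}")
```

With these assumed scores, the share going to 'blue' shrinks as T rises, while the share going to 'falling' grows.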
Seeing it in code
In practice temperature is one argument you pass alongside the prompt. Under the hood it's just that division before softmax.
import numpy as np

# `model` and `tokens` are assumed to be defined already
logits = model.forward(tokens)  # one score per vocab token (~50k)

def sample(logits, T):
    scaled = logits / T             # the whole trick: divide by temperature (T > 0)
    scaled = scaled - scaled.max()  # avoids overflow in exp; probabilities are unchanged
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)

sample(logits, T=0.2)  # careful: almost always the top token
sample(logits, T=1.0)  # natural: the model's own distribution
sample(logits, T=1.6)  # creative: rarer tokens get a real shot

When to turn the dial
There's no single "right" temperature — it depends on what you're asking for.
- Low for tasks with one correct answer: code, math, extracting a fact, following a strict format. You want the safe, predictable token every time.
- High for brainstorming, fiction, or alternate phrasings — when you want variety and the occasional unexpected word.
That same variety is why a temperature set slightly too high feels "unhinged": once the tail has real weight, the model can grab a genuinely wrong word and then has to keep going as if it meant it.
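For the one-correct-answer tasks listed above, the limit of ever-lower temperature is greedy decoding: always take the highest-scoring token. Dividing by T = 0 is undefined, so a sampler has to special-case it. The sketch below shows one way to do that. The function name, the choice to treat T = 0 as argmax, and the logit values are all assumptions for illustration, not any particular library's behavior.

```python
import numpy as np

def sample_with_temperature(logits, T, rng=None):
    """Sample a token index; T == 0 is treated as greedy decoding (argmax)."""
    logits = np.asarray(logits, dtype=float)
    if T < 0:
        raise ValueError("temperature must be non-negative")
    if T == 0:
        return int(np.argmax(logits))  # the limit of ever-lower temperature
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / T
    scaled -= scaled.max()  # numerical stability; probabilities are unchanged
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [4.0, 2.5, 2.0, 0.5]  # hypothetical scores, not real model output
rng = np.random.default_rng(0)
greedy = [sample_with_temperature(logits, 0, rng) for _ in range(5)]
varied = [sample_with_temperature(logits, 1.5, rng) for _ in range(200)]
print(greedy)               # the same index every time
print(sorted(set(varied)))  # the distinct indices drawn at T = 1.5
```

Greedy decoding gives the same token on every run, which suits strict formats. The higher setting spreads draws across several candidates.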
So far we've looked at a single prediction. But the model doesn't see your prompt in isolation — it reads everything in front of it, up to a hard limit. Next: the context window, everything the model can hold in mind at once.