Chapter 3 · Part 2

Learned from the company they keep

We've been treating the embedding space as if it already exists, neatly arranged. But no one sits down and assigns cat its coordinates. The vectors are learned — and the way they're learned is beautifully simple, captured by a famous line from the linguist J.R. Firth:

"You shall know a word by the company it keeps."

The idea: words that show up in the same kinds of contexts probably mean similar things. Cat and dog both appear near pet, fur, vet, fed the…. If we train a model to predict a word's neighbors, words with similar neighbors are pushed toward similar vectors — automatically.
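To make the intuition concrete, here is a minimal sketch that counts neighbors directly instead of learning vectors. The five-sentence corpus, the RADIUS of 2, and the cosine helper are all made up for illustration; real models learn dense vectors from vastly more text.

```python
from collections import Counter, defaultdict
import math

# Toy corpus, invented for this illustration only.
corpus = [
    "i fed the cat at home",
    "i fed the dog at home",
    "the vet saw the cat today",
    "the vet saw the dog today",
    "i parked the car at work",
]

RADIUS = 2  # how many words to look at on each side (an assumption)

# For every word, count which words appear within RADIUS positions of it.
contexts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i, center in enumerate(words):
        lo, hi = max(0, i - RADIUS), min(len(words), i + RADIUS + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[center][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

cat_dog = cosine(contexts["cat"], contexts["dog"])
cat_car = cosine(contexts["cat"], contexts["car"])
print(f"cat~dog: {cat_dog:.2f}   cat~car: {cat_car:.2f}")
# prints: cat~dog: 1.00   cat~car: 0.50
```

In this toy corpus cat and dog share every neighbor, while cat and car overlap only on filler words such as the and at, so the first score is higher. A learned embedding captures the same signal, but compresses it into a short dense vector.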


To build the training data, pick a center word and look at a small window of words around it: that window is the word's context.


Skip-gram: predict the neighbors

This is the recipe behind word2vec, the model that made embeddings famous. In its "skip-gram" form, every word in the corpus becomes a tiny prediction problem: from the center word, guess the words around it. The model starts with random vectors and nudges them, over and over, so that words appearing in similar contexts end up close together in the space.

skipgram.py — the training pairs from one sentence
sentence = ["the", "fluffy", "cat", "sat", "on", "the", "mat"]
RADIUS = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(i - RADIUS, i + RADIUS + 1):
        if j != i and 0 <= j < len(sentence):
            pairs.append((center, sentence[j]))   # (center → context)

# With 'cat' as the center: ('cat', 'the'), ('cat', 'fluffy'), ('cat', 'sat'), ('cat', 'on')
# A model trained to predict these pairs nudges 'cat' toward the words it co-occurs with.
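The "nudging" itself can be sketched in a few dozen lines. What follows is a simplified illustration under stated assumptions, not word2vec's actual implementation: it uses a full softmax over a six-word vocabulary (word2vec is usually trained with cheaper approximations such as negative sampling), and the names DIM, LEARNING_RATE, EPOCHS, center_vecs, and context_vecs are hypothetical choices for this sketch.

```python
import math
import random

sentence = ["the", "fluffy", "cat", "sat", "on", "the", "mat"]
RADIUS = 2
DIM = 8              # vector size, arbitrary for this sketch
LEARNING_RATE = 0.1  # step size, arbitrary for this sketch
EPOCHS = 50

vocab = sorted(set(sentence))
index = {word: k for k, word in enumerate(vocab)}

# (center, context) pairs as vocabulary indices, built the same way as above.
pairs = [
    (index[center], index[sentence[j]])
    for i, center in enumerate(sentence)
    for j in range(i - RADIUS, i + RADIUS + 1)
    if j != i and 0 <= j < len(sentence)
]

rng = random.Random(0)

def random_vectors():
    return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]

center_vecs = random_vectors()   # a word's vector when it is the center
context_vecs = random_vectors()  # a second vector for when it is a neighbor

def softmax_probs(c):
    """P(context word | center word c) over the whole vocabulary."""
    scores = [sum(a * b for a, b in zip(center_vecs[c], u)) for u in context_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean of -log P(context | center) over all training pairs."""
    return sum(-math.log(softmax_probs(c)[o]) for c, o in pairs) / len(pairs)

loss_before = average_loss()
for _ in range(EPOCHS):
    rng.shuffle(pairs)
    for c, o in pairs:
        probs = softmax_probs(c)
        # Gradient of -log P(o | c) with respect to each score:
        # the predicted probability, minus 1 for the true context word.
        errors = [p - (1.0 if k == o else 0.0) for k, p in enumerate(probs)]
        old_center = center_vecs[c][:]
        # Nudge the center vector using the (not yet updated) context vectors.
        for d in range(DIM):
            grad = sum(errors[k] * context_vecs[k][d] for k in range(len(vocab)))
            center_vecs[c][d] -= LEARNING_RATE * grad
        # Nudge every context vector using the center vector from before the update.
        for k in range(len(vocab)):
            for d in range(DIM):
                context_vecs[k][d] -= LEARNING_RATE * errors[k] * old_center[d]
loss_after = average_loss()

print(f"average loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The loss cannot reach zero here, because each center word has several valid neighbors and the model has to spread its probability across them. One sentence is far too little data for meaningful vectors; the point is only to show the shape of the update: predict, measure the error, nudge both vectors.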

The payoff: structure for free

What's remarkable is that nobody told the model what cat means. It only ever saw which words tend to surround which. Yet out the other end comes a space where synonyms cluster, topics group together, and — as we'll see next — meaning even supports arithmetic.

Modern embedding models (the ones behind ChatGPT and semantic search) are far bigger and take a whole sentence into account at once rather than a small window, but they rest on this same foundation: meaning comes from context. Next, the most surprising consequence of that: you can do math on the vectors.