Chapter 2 · Part 1

Closeness is meaning

Last chapter we turned words into points and watched similar ones cluster together. But "close together" was just a feeling we got from the picture. To actually use embeddings, a machine needs to turn closeness into a number — one value that says how related two words are.

The standard measure is cosine similarity, and its trick is to look at the angle between two vectors, not how long they are. Two words pointing the same direction are similar even if one vector is twice as long as the other.

Scroll through a few pairs and watch the angle — and the score — change.

cat and dog point almost the same way: a small angle, cosine near 1 — very similar.

scroll

Why the angle, not the distance?

It would seem natural to just measure straight-line distance between points. The problem: vector length often reflects things we don't care about — how common a word is, how long a document is — rather than its meaning. Two texts can be about exactly the same thing at very different lengths. Cosine similarity throws length away and keeps only direction, which is where the meaning lives.

In code it's a couple of lines — or one call to a library:

similarity.py — cosine similarity between two embeddings
import numpy as np

def cosine(a, b):
  return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat   = model.encode("cat")
dog   = model.encode("dog")
king  = model.encode("king")

cosine(cat, dog)    # ~0.8  — high
cosine(cat, king)   # ~0.1  — low

One number unlocks a lot

This single score is the workhorse of embeddings. "Which of these 10,000 documents is most relevant to my question?" becomes "which has the highest cosine similarity to the question's vector?" "What's similar to this song?" — same operation. We'll build exactly that kind of search in the final chapter.

But first, the obvious question: where do these conveniently-arranged vectors come from? Nobody places them by hand. Next we'll see how a model learns them from raw text — by paying attention to the company each word keeps.