Chapter 5 · Part 3

Embeddings for everything

So far we have only embedded single words, but nothing in the idea is specific to words. If you can turn something into a vector in a space where closeness means similarity, you get all the same superpowers: comparison, search, and vector arithmetic. And it turns out you can embed almost anything: whole sentences, documents, images, audio, products, users.

The real magic is when different kinds of things share one space. Embed the word "dog", the sentence "a puppy playing", and a photo of a dog — and all three can land in the same neighborhood.

Sentences and documents

A single word is easy, but most useful text is longer. Sentence embedding models (like Sentence-BERT) read an entire sentence or paragraph and output one vector summarizing its meaning — so "how do I cancel my order?" and "I'd like a refund" land close together even though they share almost no words. This is what makes embeddings practical for real documents, FAQs and chat messages.
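
As a rough picture of why two sentences with almost no shared words can still land close together, here is a toy sketch that averages hand-made word vectors into one sentence vector. The three-dimensional vectors and the `embed_sentence` helper are invented for this example. Real models such as Sentence-BERT use a trained transformer rather than a lookup table, so read this as an illustration of the idea, not of how those models work internally.

```python
import math

# Hypothetical 3-d word vectors, made up for illustration.
# The axes loosely mean (money back, ordering, weather).
WORD_VECS = {
    "cancel": [0.8, 0.6, 0.0],
    "order":  [0.3, 0.9, 0.0],
    "refund": [0.9, 0.4, 0.0],
    "rain":   [0.0, 0.0, 1.0],
    "today":  [0.0, 0.1, 0.7],
}

def embed_sentence(sentence):
    """Average the vectors of the known words; unknown words are skipped."""
    words = [w.strip("?.,!").lower() for w in sentence.split()]
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    if not vecs:
        raise ValueError("no known words in sentence")
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cancel_q = embed_sentence("How do I cancel my order?")
refund_q = embed_sentence("I'd like a refund")
weather_q = embed_sentence("Will it rain today?")

same_topic = cosine(cancel_q, refund_q)    # no shared words, similar meaning
other_topic = cosine(cancel_q, weather_q)  # unrelated meaning

print(round(same_topic, 2), round(other_topic, 2))
```

The first score comes out much higher than the second, even though the two support questions share no words from the toy vocabulary.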

A shared space for text and images: CLIP

The headline example is CLIP, which OpenAI trained on about 400 million image–caption pairs. It learns two encoders, one for images and one for text, and trains them so that a picture and its own caption produce nearby vectors while mismatched pairs are pushed apart.
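
The training objective can be sketched in a few lines. The version below assumes a symmetric contrastive loss over a batch, which is how CLIP's training is usually described: each image should score highest against its own caption, and each caption against its own image. The tiny 2-d vectors, the fixed `temperature` value, and the function names are made up for illustration, and no real encoders are involved.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric loss; item i in each list is assumed to be a matching pair."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    # sims[i][j] = scaled cosine similarity of image i and caption j
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    n = len(sims)
    # each image must pick out its own caption (rows) ...
    img_to_txt = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # ... and each caption must pick out its own image (columns)
    txt_to_img = sum(cross_entropy([sims[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

images = [[1.0, 0.1], [0.1, 1.0]]      # made-up image vectors
aligned = [[0.9, 0.2], [0.2, 0.9]]     # captions near their own image
shuffled = [[0.2, 0.9], [0.9, 0.2]]    # captions near the wrong image

loss_aligned = contrastive_loss(images, aligned)
loss_shuffled = contrastive_loss(images, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is near zero when each caption sits next to its own image and large when the pairs are mismatched, which is the signal that pulls matching pairs together during training.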

multimodal.py — compare an image to text descriptions

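import clip              # assumes the OpenAI `clip` package
import torch
from PIL import Image

# load a pretrained model and its matching image preprocessing
model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
labels = ["a dog", "a car", "a beach"]

with torch.no_grad():
    image_vec = model.encode_image(image)
    text_vecs = model.encode_text(clip.tokenize(labels))

# the closest caption wins; no training on these labels needed
scores = torch.cosine_similarity(image_vec, text_vecs)
best = labels[scores.argmax().item()]
print(best)        # e.g. "a dog" for a photo of a dog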

One representation to connect them all

Once images, text and more live in a common space, you can mix and match: search photos with words, find captions for pictures, recommend products from a typed description. The vector becomes a universal "meaning handle" for any kind of data.
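
To make the mix-and-match idea concrete, here is a minimal sketch of one index holding items of different kinds. It assumes every item has already been embedded into the same space by some shared-space model. The vectors, the item names, and the `search` helper are all hypothetical.

```python
import math

# Hypothetical index of (kind, name, vector). The vectors are made up and
# stand in for the output of encoders that share one space.
INDEX = [
    ("image",   "dog_park.jpg",          [0.9, 0.1, 0.0]),
    ("image",   "sunset_beach.jpg",      [0.0, 0.2, 0.9]),
    ("text",    "a puppy playing",       [0.8, 0.3, 0.1]),
    ("product", "waterproof beach tent", [0.1, 0.3, 0.8]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, kind=None, top_k=2):
    """Rank items by similarity to the query, optionally within one kind."""
    candidates = [item for item in INDEX if kind is None or item[0] == kind]
    ranked = sorted(candidates,
                    key=lambda item: cosine(query_vec, item[2]),
                    reverse=True)
    return [(k, name) for k, name, _ in ranked[:top_k]]

# Pretend these are the embeddings of two typed queries (made-up values).
dog_query = [1.0, 0.2, 0.0]
beach_query = [0.0, 0.2, 1.0]

any_kind = search(dog_query)                    # photos and text together
products = search(beach_query, kind="product")  # products from a description
print(any_kind)
print(products)
```

Because every item is just a point in the same space, the same `search` function serves photo search, caption lookup, and product recommendation; only the `kind` filter changes.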

We now have everything — words, sentences, images — as comparable points. In the final chapter we put the whole idea to work: finding things by meaning at scale, the engine behind semantic search, recommendations and retrieval-augmented generation.