Chapter 7 · Part 3

The curse of raw pixels

Part 2 ended on a high note: edges and gradients squeeze a lot of meaning out of an image. But step back and look at what we're really handing a machine when we give it the raw picture. It turns out raw pixels are a surprisingly terrible way to describe what's in a photo.

A modest 200×200 colour photo is 120,000 numbers: 200 × 200 pixels, each with a red, a green, and a blue value. That long list is the only thing the computer gets. The trouble is that this list is wildly unstable: nudge the camera or dim the lights and almost every one of those numbers changes, even though, to you, it's obviously the same scene.
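
To put rough numbers on that instability, here is a minimal sketch. It assumes seeded random noise as a stand-in for a real photo (the names `photo`, `dimmed`, and `nudged` are made up for this example), so the exact percentages are illustrative only; a real photo with large flat regions would change somewhat less under a one-pixel nudge.

```python
import numpy as np

# Assumption: seeded random noise stands in for a real 200x200 colour photo.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(200, 200, 3), dtype=np.uint8)

n_values = photo.size  # 200 * 200 * 3 = 120,000

# "Dim the lights": scale every value to 90% brightness (rounded down).
dimmed = (photo * 0.9).astype(np.uint8)

# "Nudge the camera": slide the picture one pixel sideways.
# np.roll wraps the last column around to the front, which is a
# simplification of what a real camera move does.
nudged = np.roll(photo, 1, axis=1)

# Fraction of the 120,000 numbers that are no longer the same.
frac_dimmed = np.mean(photo != dimmed)
frac_nudged = np.mean(photo != nudged)

print(n_values)
print(f"changed by dimming:    {frac_dimmed:.1%}")
print(f"changed by a 1px nudge: {frac_nudged:.1%}")
```

In this toy setup, dimming leaves only the values that were already zero untouched, and a one-pixel nudge keeps a value only where the neighbouring pixel happens to match.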

[Interactive figure: the same landscape shown twice, with a panel showing the pixel-by-pixel difference between the two copies. While the copies are pixel-for-pixel identical, the difference panel is dark; sliding one copy by a few pixels makes the numbers revolt.]

Why this breaks naïve approaches

The most obvious way to recognise an image is to compare it, number by number, against examples you've seen — "which stored picture is this one closest to?" That's the nearest-neighbour idea, and it measures closeness as the distance between two 120,000-long lists.
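
As a rough sketch of that idea, the function below flattens every image into one long list and returns the label of the closest stored example. The function name, the toy 4×4 "images", and their labels are all hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
import numpy as np

def nearest_neighbour(query, stored, labels):
    """Return the label of the stored image closest to `query` in raw-pixel space."""
    q = query.reshape(-1).astype(np.float64)                   # one long list of numbers
    flat = stored.reshape(len(stored), -1).astype(np.float64)  # one row per stored image
    dists = np.linalg.norm(flat - q, axis=1)                   # Euclidean distance to each
    best = int(np.argmin(dists))
    return labels[best], float(dists[best])

# Hypothetical toy data: three tiny 4x4 grey images of uniform brightness.
stored = np.stack([
    np.zeros((4, 4)),
    np.full((4, 4), 128.0),
    np.full((4, 4), 255.0),
])
labels = ["dark", "mid", "bright"]

query = np.full((4, 4), 120.0)
label, dist = nearest_neighbour(query, stored, labels)
print(label, dist)  # mid 32.0  (16 pixels, each off by 8)
```

For a 200×200 colour photo the only change is the length of the lists: each distance is computed over 120,000 numbers instead of 16.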

The visualization above shows why that fails. A tiny shift the eye ignores produces an enormous pixel distance. Meanwhile two genuinely different objects, photographed under the same lighting and framing, can land closer together in pixel space than two photos of the same object. Distance in raw-pixel space simply doesn't track "same thing-ness."
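
The second claim is easy to reproduce on synthetic shapes. The sketch below is an illustration under simple assumptions (white shapes on a black 64×64 canvas, names invented for this example): a square, the same square slid 5 pixels sideways, and a disc drawn in the square's original position.

```python
import numpy as np

# A white 20x20 square on a black 64x64 canvas.
square = np.zeros((64, 64))
square[10:30, 10:30] = 1.0

# The SAME square, slid 5 pixels to the right (it stays clear of the edges).
shifted_square = np.roll(square, 5, axis=1)

# A DIFFERENT object, a disc of radius 10, framed exactly where the square was.
rows, cols = np.mgrid[0:64, 0:64]
disc = ((rows - 19.5) ** 2 + (cols - 19.5) ** 2 <= 10.0 ** 2).astype(np.float64)

d_same_object = np.linalg.norm(square - shifted_square)  # sqrt(200), about 14.1
d_other_object = np.linalg.norm(square - disc)           # only the corners differ

print(d_same_object, d_other_object)
print("nearest to the square:",
      "the disc" if d_other_object < d_same_object else "the shifted square")
```

In pixel space the disc is the closer match, because it overlaps the square almost everywhere, while the shifted square disagrees along two full 5-pixel strips.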

Two problems, stacked

Raw pixels suffer from two curses at once:

  • Too many numbers. 120,000 dimensions for a small thumbnail; millions for a real photo. This is the literal curse of dimensionality: in very high dimensions almost everything is far from almost everything else, the closest example is barely closer than the farthest, and "nearest" stops meaning much.
  • Too fragile. A shift, a rotation, a brighter bulb, a different background — none of these change what the picture is of, yet each rewrites huge swaths of the numbers. We want a description that stays put when the content stays put.
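
pixel_distance.py: why nearest-neighbour on raw pixels fails
import numpy as np

# load_image stands in for whatever image loader you use (it is not defined here);
# it is assumed to return a 200x200x3 array of 0-255 values.
img = load_image("cat.png").astype(np.float64)  # floats: uint8 subtraction would wrap around

a = img.reshape(-1)                       # 120,000 numbers
b = np.roll(img, 3, axis=1).reshape(-1)   # shift the SAME image 3px (edge columns wrap around)

dist = np.linalg.norm(a - b)   # huge, yet it's literally the same cat
print(dist)                    # alignment dominates the distance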
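
The first curse, "too many numbers", can be sketched as well. The snippet below makes a loud assumption: it uses uniformly random points rather than real images, and 200 points per experiment, purely to show the trend. For each dimension it compares the distance from a query to its nearest point with the distance to its farthest point.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(dim, n_points=200):
    """Distance to the nearest point divided by distance to the farthest point."""
    points = rng.random((n_points, dim))  # assumption: uniform random "examples"
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 100, 10_000)}

for dim, ratio in ratios.items():
    # A ratio near 1 means the "nearest" point is barely nearer than the farthest.
    print(f"{dim:>6} dimensions: nearest/farthest = {ratio:.3f}")
```

As the dimension grows, the ratio creeps towards 1, which is the sense in which "nearest" stops carrying much information.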

What we actually want

We want a representation: a shorter list of numbers that captures what's in the image while shrugging off shifts, lighting, and noise. Two photos of the same cat should map to nearby representations; a cat and a car should map far apart — regardless of framing.
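
To make "representation" concrete, here is one deliberately crude, hand-built example: a brightness histogram. It is not the representation this chapter is building towards, and the function name and bin count are assumptions made for the sketch. It also uses random noise in place of a photo and a wrap-around shift, so the match under shifting is exact here; with a real camera move it would only be approximate.

```python
import numpy as np

def histogram_representation(image, bins=16):
    """Describe an image by how often each brightness range occurs (layout is discarded)."""
    counts, _ = np.histogram(image, bins=bins, range=(0, 256))
    return counts / counts.sum()

# Assumption: seeded random noise stands in for a real 200x200 colour photo.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(200, 200, 3)).astype(np.float64)
shifted = np.roll(photo, 3, axis=1)  # same content, slid 3px (wraps at the edge)

pixel_dist = np.linalg.norm(photo - shifted)
repr_dist = np.linalg.norm(
    histogram_representation(photo) - histogram_representation(shifted)
)

print(photo.size, "numbers ->", histogram_representation(photo).size, "numbers")
print("raw-pixel distance:     ", pixel_dist)
print("representation distance:", repr_dist)  # 0.0: the shift only rearranges pixels
```

The histogram is shorter (16 numbers instead of 120,000) and shrugs off the shift, but it fails the other half of the brief: it would still move under a lighting change, and a cat and a car with similar brightness mixes would land close together.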

Chapter 6 already hinted at one: edges are far more stable than raw brightness. But hand-built edge features only go so far. The real question is who designs the representation — and whether a machine could discover a better one than we ever could. That's next.