Chapter 11 · Part 4

How a CNN sees

We've built the machine — convolution layers stacked into a network (Chapter 10). Now the magical part: what does each layer actually learn to look for? When researchers peered inside trained networks, they found something remarkably orderly — a ladder of understanding that climbs from the trivial to the meaningful.

Crucially, nobody designed this ladder. The network was only ever told the final answer ("this is a cat"). The intermediate concepts — edges, textures, parts — emerged on their own, because they happen to be useful stepping stones.

Layer 1 fires on oriented edges: bright bars at every angle. These are the same detectors humans once built by hand.

A hierarchy of understanding

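Each layer's features are built from the previous layer's, like words from letters and sentences from words. The four layers below are a simplification: real networks have many more layers and the stages blur into one another, but the progression holds: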

  • Layer 1 — edges. Tiny filters that respond to light/dark boundaries at specific angles. These closely resemble the Sobel and Gabor filters of Chapter 6; the network rediscovered them on its own.
  • Layer 2 — textures. Edges at different angles combine into repeating patterns: fur, brick, mesh, stripes.
  • Layer 3 — parts. Textures and contours assemble into recognisable pieces — an eye, a wheel, a corner of a window.
  • Layer 4 — objects. Parts in the right arrangement become a whole: two eyes above a nose above a mouth is a face.
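To make the bottom rung of the ladder concrete, here is a minimal sketch of what an edge filter does. It uses a hand-written Sobel kernel as a stand-in for a learned layer-1 filter. The toy image and the helper name `correlate2d_valid` are assumptions made for illustration, not part of any trained network.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels: hand-designed detectors for vertical and horizontal edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Toy 6x6 image: dark left half, bright right half (a single vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

resp_x = correlate2d_valid(image, sobel_x)
resp_y = correlate2d_valid(image, sobel_y)

print(resp_x[0])             # [0. 4. 4. 0.]  strong response where dark meets bright
print(np.abs(resp_y).max())  # 0.0            the horizontal-edge detector stays silent
```

Each filter fires only on the pattern it is tuned to. A second layer would take response maps like these as its input, which is how edges at different angles can combine into textures.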

Why depth is the point

This is why "deep learning" is deep. A shallow model has to leap straight from pixels to "cat" in one step — nearly impossible. A deep model breaks that leap into a staircase of small, learnable steps, each only slightly more abstract than the last.
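One way to put numbers on the staircase is to track how much of the input image a single unit can see at each depth, a quantity usually called its receptive field. The sketch below uses the standard recurrence for stacked convolution and pooling layers. The particular stack (four 3x3 convolutions, each followed by 2x2 pooling) is a hypothetical example, not a network from this book.

```python
def receptive_fields(layers):
    """Return the receptive-field width (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # a raw pixel sees itself; neighbouring units are 1 pixel apart
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each extra tap reaches `jump` pixels further
        jump *= stride                  # striding spreads later units further apart
        sizes.append(rf)
    return sizes

# Hypothetical stack: four 3x3 convolutions, each followed by 2x2 pooling (stride 2).
stack = [(3, 1), (2, 2)] * 4
print(receptive_fields(stack))  # [3, 4, 8, 10, 18, 22, 38, 46]
```

A first-layer unit sees a 3-pixel-wide patch, enough for an edge. By the last layer a unit sees 46 pixels, enough for a part or a small object. No single step has to make a big jump.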

And because every step is learned from data, the network builds whatever intermediate concepts the task rewards — which for natural images turns out to be suspiciously close to how biological vision is organised, from the retina up through the visual cortex.

peek_inside.py — visualising what a layer responds to
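# Pseudocode: model.conv_layers, layer.filters, maximize_activation and show
# are illustrative names, not a real library API.
# For each filter in each layer, find the input patch that maximally activates it.
for layer in model.conv_layers:
    for filt in layer.filters:
        patch = maximize_activation(model, layer, filt)
        show(patch)  # layer 1 -> edges, deeper -> textures, parts, objects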
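The listing above leaves `maximize_activation` undefined. As a rough sketch of the idea, here is a runnable version for the simplest possible case: a single linear filter, with a Sobel kernel standing in for a learned one. The signature, step count and learning rate are assumptions chosen for illustration. For a real deep network the gradient would come from backpropagation through the model rather than being written out by hand.

```python
import numpy as np

def maximize_activation(filt, steps=200, lr=0.1, seed=0):
    """Gradient ascent on an input patch to maximise one filter's response.

    The response is sum(filt * patch), so its gradient with respect to the
    patch is simply `filt`. The patch is rescaled to unit length after every
    step so that it cannot grow without bound.
    """
    rng = np.random.default_rng(seed)
    patch = rng.normal(size=filt.shape)  # start from random noise
    patch /= np.linalg.norm(patch)
    for _ in range(steps):
        grad = filt                      # d(sum(filt * patch)) / d(patch)
        patch = patch + lr * grad
        patch /= np.linalg.norm(patch)
    return patch

# Stand-in for a learned layer-1 filter: a vertical-edge detector.
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])

best_patch = maximize_activation(edge_filter)

# Cosine similarity between the optimised patch and the filter itself.
cosine = np.sum(best_patch * edge_filter) / np.linalg.norm(edge_filter)
print(cosine > 0.99)  # True: the noise has turned into an edge
```

For a single linear filter the best possible patch is just the filter's own pattern, which is why layer-1 visualisations look like edges. Deeper filters depend on everything beneath them, so their best patches come out as textures, parts and objects.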

One thing still missing

We can now explain what a trained CNN knows and why its features form a hierarchy. But we've been quietly assuming the weights are already set to good values. Where did those millions of numbers actually come from?

They were learned: nudged, one tiny correction at a time, from a mountain of examples. That process, the engine under everything in Part 4, is the subject of the final chapter.