Chapter 10 · Part 4

Convolutional networks

We now have two ingredients. From Chapter 5: a kernel that slides across an image and produces a feature map. From Chapter 9: a neuron whose weights are learned. A convolutional network is what you get when you combine them — a sliding kernel whose numbers are learned, repeated into stacks, layer after layer.

Why not just wire every pixel to a neuron? Because a 200×200 colour image has 200 × 200 × 3 = 120,000 inputs (Chapter 7), and a fully-connected first layer of similar width would need billions of weights. Convolution shares one small set of weights across every position, so a filter that finds an edge in the corner also finds it in the centre. The result is far fewer weights, and an object that moves within the frame no longer defeats the detector.
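To make the gap concrete, here is a rough back-of-the-envelope count. The layer sizes are assumptions chosen for illustration, not figures from an earlier chapter: the dense layer is given as many units as it has inputs, and the convolutional layer is given eight 3×3 filters.

```python
# Weight counts for the first layer of a network fed a 200x200 colour image.
height, width, channels = 200, 200, 3
n_inputs = height * width * channels            # 120,000 input values

# Assumption: the dense layer has as many units as it has inputs.
dense_units = n_inputs
dense_params = n_inputs * dense_units + dense_units    # weights + biases

# Assumption: a conv layer with 8 filters, each 3x3 across all 3 channels.
n_filters, kernel = 8, 3
conv_params = n_filters * (kernel * kernel * channels) + n_filters

print(f"inputs:       {n_inputs:,}")
print(f"dense params: {dense_params:,}")
print(f"conv params:  {conv_params:,}")
```

Under these assumptions the dense layer needs about 14.4 billion parameters and the convolutional layer needs 224. A narrower dense layer would shrink the first number, but it stays orders of magnitude above the second.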

Scroll to send an image through a small CNN and watch its shape transform.

It starts as the raw image — 28×28 pixels, one tensor.


The shape of a CNN

Every convolutional network follows the same rhythm, and the visualization above traces it:

  • Convolution applies a bank of learned filters, turning one tensor into a deeper one — each filter contributes one feature map (one channel).
  • Pooling (or a strided conv) shrinks the spatial size, usually by half. It throws away exactly where a feature was while keeping whether it was there — buying a little translation-invariance.
  • Repeat. Width and height fall; channel depth climbs. Pixels turn into increasingly abstract features.
  • A dense head at the end flattens whatever's left and scores the classes.
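The rhythm can be followed with plain arithmetic. The sketch below assumes unpadded, stride-1 convolutions and non-overlapping pooling; the helper names and the layer sizes (3×3 kernels, 8 then 16 filters, 2×2 pools) are illustrative choices rather than anything fixed by the text.

```python
def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - kernel + 1

def pool_out(size, window):
    """Spatial size after non-overlapping pooling (leftover edge dropped)."""
    return size // window

# (layer kind, kernel or window size, output channels); hypothetical sizes.
layers = [("conv", 3, 8), ("pool", 2, None), ("conv", 3, 16), ("pool", 2, None)]

size, channels = 28, 1            # a 28x28 single-channel image
shapes = [(size, size, channels)]
for kind, arg, out_channels in layers:
    if kind == "conv":
        size, channels = conv_out(size, arg), out_channels
    else:
        size = pool_out(size, arg)    # pooling leaves channel depth unchanged
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)

flattened = size * size * channels    # what the dense head receives
print("flattened length:", flattened)
```

Width and height fall from 28 to 5 while depth climbs from 1 to 16, so the dense head sees 400 values in this configuration.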

Why this beats a plain network

A fully-connected network could in principle learn to see, but it would have to learn the same edge detector separately for every location, from far more data. The CNN bakes in two priors that match how images actually work:

  • Locality: a pixel's meaning depends mostly on its neighbours, so each filter only looks at a small window.
  • Translation equivariance: a cat is a cat wherever it appears, so the same weights are reused everywhere. Shift the input and the feature map shifts with it.
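Both priors can be seen in a few lines of plain Python. This is a minimal sketch, not how a framework implements convolution: `correlate2d` and `shift_right` are hypothetical helpers, the image is a made-up vertical edge, and the kernel is written by hand to stand in for a learned filter.

```python
def correlate2d(image, kernel):
    """Slide kernel over image (no padding, stride 1); return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Locality: each output value only sees a kh x kw window.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def shift_right(grid, n):
    """Shift every row n columns to the right, filling with zeros."""
    return [[0] * n + row[:len(row) - n] for row in grid]

# Hypothetical 5x6 image with a vertical edge: dark (0) left, bright (1) right.
image = [[0, 0, 1, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge kernel standing in for a learned filter.
kernel = [[-1, 0, 1]] * 3

response = correlate2d(image, kernel)
shifted_response = correlate2d(shift_right(image, 1), kernel)

print("original:", response[0])
print("shifted: ", shifted_response[0])
```

The same nine weights respond to the edge in both images, and moving the edge one pixel to the right moves the response one pixel to the right. Away from the image border, that is what equivariance means.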
tiny_cnn.py — the convolution/pool rhythm
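# Assumes TensorFlow's bundled Keras; shape comments assume the default "valid" padding.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential([
  Conv2D(8,  3, activation="relu", input_shape=(28, 28, 1)),  # -> 26x26x8
  MaxPool2D(2),                                               # -> 13x13x8
  Conv2D(16, 3, activation="relu"),                           # -> 11x11x16
  MaxPool2D(2),                                               # -> 5x5x16
  Flatten(),                                                  # -> 400 values
  Dense(10, activation="softmax"),  # scores for 10 classes
])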
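As a rough check on how small this model is, the parameters can be counted by hand. The sketch assumes one bias per filter or unit and unpadded convolutions, which is what Keras does by default; the helper names are made up for this example.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel conv layer."""
    return filters * (kernel * kernel * in_channels) + filters

def dense_params(n_inputs, units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return n_inputs * units + units

conv1 = conv_params(kernel=3, in_channels=1, filters=8)
conv2 = conv_params(kernel=3, in_channels=8, filters=16)
# After two unpadded 3x3 convs and two 2x2 pools, 28x28 shrinks to 5x5.
flat = 5 * 5 * 16
head = dense_params(flat, units=10)
total = conv1 + conv2 + head    # pooling and flatten layers add no parameters

print("conv1:", conv1)
print("conv2:", conv2)
print("dense:", head)
print("total:", total)
```

The two convolutional layers hold 80 and 1,168 parameters and the dense head holds 4,010, for 5,258 in total. Most of the weights sit in the head, even though the convolutions do most of the seeing.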

What the layers are actually finding

We keep saying "deeper features," but what are they? In a trained network, the first layer tends to learn edge detectors, just like the filters in Chapter 8. The next layer combines edges into textures and corners, the one after that into parts, and the deepest layers into whole objects.

That climb — from edges to objects — is one of the most beautiful results in deep learning, and it's the next chapter.