Chapter 12 · Part 4
Training: how it learns
Every chapter in Part 4 leaned on the same promise: the weights are learned. A neuron's weights, a filter's nine numbers, the millions of parameters in a deep CNN: all of them start as random noise and somehow end up detecting edges, parts, and cats. This final chapter explains how that happens.
The whole of training is one idea, repeated millions of times: measure how wrong you are, then change every weight a tiny bit in the direction that makes you less wrong. That measure-and-nudge loop is gradient descent, and it's simpler than it sounds.
Scroll to roll a ball down a loss curve — the picture of learning.
Every possible setting of the weights has a loss — how wrong the network is. Plot it and you get a landscape.
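To make the landscape idea concrete, here is a minimal sketch for a model with a single weight, so the landscape is just a curve. The toy data, the one-weight model `y_hat = w * x`, and the choice of mean squared error are all assumptions made for illustration; they are not the course's CNN or its loss.

```python
# Toy data, made up for illustration: y is exactly 2 * x,
# so the best possible weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss(w):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Sample the landscape: one loss value for each candidate weight.
candidates = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0
landscape = [(w, loss(w)) for w in candidates]

for w, value in landscape:
    print(f"w = {w:.1f}  loss = {value:6.2f}")

# The bottom of the valley is the weight with the lowest loss.
best_w, best_loss = min(landscape, key=lambda pair: pair[1])
```

With one weight you can afford to try every candidate like this. With millions of weights you cannot, which is why training follows the slope instead of searching the whole landscape.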
Loss, gradient, step
Three pieces, and you've seen them all already:
- Loss measures how wrong the network's predictions are on the training examples. High loss = bad guesses. It's the same loss bar that fell in Chapter 8.
- Gradient is the slope of the loss with respect to every weight at once: it points in the direction in which loss increases fastest. It is computed efficiently by backpropagation, which is just the chain rule run backwards through the layers.
- Step nudges each weight a little way against its gradient, scaled by the learning rate. Too big and the ball overshoots the valley; too small and it crawls.
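The three pieces can be sketched on a loss with a single weight. Everything here is an illustrative assumption: the bowl-shaped loss `(w - 3) ** 2`, the starting point, and the three learning rates were picked only to show a sensible step, a crawl, and an overshoot.

```python
def loss(w):
    # A one-weight bowl whose lowest point is at w = 3 (illustrative choice).
    return (w - 3.0) ** 2


def gradient(w):
    # Slope of the loss at w: d/dw (w - 3)^2 = 2 * (w - 3).
    return 2.0 * (w - 3.0)


def descend(w, learning_rate, steps):
    """Repeat the step: move w against its gradient, scaled by the learning rate."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


start = 0.0
good = descend(start, learning_rate=0.1, steps=50)    # settles close to 3
tiny = descend(start, learning_rate=0.001, steps=50)  # has barely left the start
huge = descend(start, learning_rate=1.1, steps=50)    # overshoots further each step

for name, w in [("good", good), ("tiny", tiny), ("huge", huge)]:
    print(f"{name}: w = {w:.4f}  loss = {loss(w):.4f}")
```

In this sketch each step multiplies the distance to the valley floor by `1 - 2 * learning_rate`. At 0.1 that factor is 0.8 and the ball settles; at 1.1 it is -1.2, so every step jumps across the valley and lands further away than before.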
Why this is all you need
It feels too simple to work — and yet rolling downhill on the loss is how every neural network learns, from this tiny CNN to the largest models. The structure we spent the course building decides what the weights can represent; gradient descent decides which values they actually take.
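Here is a minimal runnable sketch of that loop on a problem small enough to follow by hand: a single sigmoid neuron with two inputs. The toy dataset, the learning rate, the number of passes, and the helper names `predict`, `cross_entropy`, and `mean_loss` are assumptions for illustration. The weights start at zero so the run is reproducible, whereas a real network starts from random noise. The pseudocode that follows shows the same loop shape for the CNN.

```python
import math

# Made-up toy data: two "pixel" values per example, label 0 or 1.
dataset = [
    ([0.1, 0.2], 0),
    ([0.2, 0.1], 0),
    ([0.8, 0.9], 1),
    ([0.9, 0.7], 1),
]

weights = [0.0, 0.0]  # zeros for reproducibility (assumption, see above)
bias = 0.0
learning_rate = 0.5


def predict(x):
    """Forward pass of one sigmoid neuron: weighted sum, then squash to (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p and one label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_loss():
    return sum(cross_entropy(predict(x), y) for x, y in dataset) / len(dataset)


loss_before = mean_loss()

for epoch in range(200):                      # many times over
    for x, y in dataset:
        p = predict(x)                        # forward pass: make a guess
        error = p - y                         # d(loss)/dz for sigmoid + cross-entropy
        grads = [error * xi for xi in x]      # chain rule: d(loss)/dw = error * input
        for j, g in enumerate(grads):
            weights[j] -= learning_rate * g   # one small step downhill
        bias -= learning_rate * error

loss_after = mean_loss()
print(f"loss before: {loss_before:.3f}  loss after: {loss_after:.3f}")
```

The `error * xi` line is backpropagation in miniature. With only one layer the chain rule has a single link, and a deep network repeats the same multiplication layer by layer.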
for images, labels in dataset:                 # many times over
    preds = model(images)                      # forward pass: make guesses
    loss = cross_entropy(preds, labels)        # how wrong those guesses are
    grads = backprop(loss, model.weights)      # slope of the loss, per weight
    for w, g in zip(model.weights, grads):
        w -= learning_rate * g                 # one small step downhill
From pixels to predictions
Step back and look at the whole staircase we climbed. An image was just a grid of numbers (Chapter 1). We learned to operate on those numbers, then to convolve them into features (Chapter 5). We saw why raw pixels fail and why learned features beat hand-designed ones (Chapters 7–8). We built the learner out of neurons, stacked them into a CNN, and watched a hierarchy of meaning emerge (Chapters 9–11). And now, with gradient descent, we know how all those numbers were found.
That's the entire arc: a machine that turns pixels into predictions — not by being told the rules of seeing, but by discovering them, one small step downhill at a time.