Chapter 2 · Part 1

Serial vs parallel

Here is the single idea that explains why three different kinds of chip exist. Imagine you have a thousand simple sums to do. You could hire one brilliant mathematician who does each sum in a flash, one after another, or a thousand schoolchildren, each handed a single sum to do at the same moment. For one hard, twisty problem the mathematician wins. For a thousand independent sums, the crowd finishes in roughly the time it takes one child to do one sum.

That's serial vs parallel, and the matrix math from Chapter 1 is the thousand-independent-sums case — the crowd's dream job.
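
To put rough numbers on the analogy, here is a minimal Python sketch. The function name finish_time and the per-sum speeds (1 second for the mathematician, 30 seconds for each child) are made up for illustration, and the model assumes the sums split evenly with no cost for handing them out. Only the shape of the result matters.

```python
import math

def finish_time(tasks, workers, seconds_per_task):
    """Time to clear a batch of independent tasks, assuming an even split
    across workers and zero hand-out cost (an idealization)."""
    rounds = math.ceil(tasks / workers)  # each worker does one task per round
    return rounds * seconds_per_task

# Hypothetical speeds, chosen only to illustrate the analogy.
MATHEMATICIAN = {"workers": 1, "seconds_per_task": 1}
CLASSROOM = {"workers": 1000, "seconds_per_task": 30}

serial_batch = finish_time(1000, **MATHEMATICIAN)   # 1000 rounds of 1 s each
parallel_batch = finish_time(1000, **CLASSROOM)     # 1 round of 30 s
serial_single = finish_time(1, **MATHEMATICIAN)     # one sum, one fast worker
parallel_single = finish_time(1, **CLASSROOM)       # one sum, 999 idle children

print("batch of 1000:", serial_batch, "s serial,", parallel_batch, "s parallel")
print("single sum:   ", serial_single, "s serial,", parallel_single, "s parallel")
```

With these assumed speeds the crowd clears the batch in 30 seconds against 1000 for the mathematician. For a single sum the mathematician still takes 1 second and the crowd still takes 30, because extra workers cannot help with one task.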

[Interactive demo: one fast worker races many simple ones on the same batch of 24 independent multiply-adds, run two ways.]

Latency vs throughput

The race exposes two different things you might optimize for:

Latency — how fast can you finish one task? This is the brilliant mathematician's strength, and it's what you want for a single, urgent, possibly branchy job.
Throughput — how many tasks can you finish per second overall? This is the crowd's strength, and it's what you want for a giant batch of independent work.

A CPU is built for latency (a few fast, clever cores). A GPU and a TPU are built for throughput (a sea of simple arithmetic units). Neither is "better"; they're answers to different questions.
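
The two measures can be worked out for a batch of 24 independent multiply-adds. This is a sketch under stated assumptions: the helper latency_and_throughput is hypothetical, the speeds (one worker at 1 second per task, 24 workers at 4 seconds per task) are invented for illustration, and hand-off cost is ignored.

```python
import math

def latency_and_throughput(tasks, workers, seconds_per_task):
    """Return (latency of one task, throughput over the whole batch).

    Assumes independent tasks, an even split, and no hand-off cost.
    """
    latency = seconds_per_task                              # time to finish one task
    total = math.ceil(tasks / workers) * seconds_per_task   # time to finish the batch
    throughput = tasks / total                              # tasks finished per second
    return latency, throughput

# Hypothetical numbers: one fast worker vs 24 workers that are each 4x slower.
fast_latency, fast_throughput = latency_and_throughput(24, workers=1, seconds_per_task=1.0)
crowd_latency, crowd_throughput = latency_and_throughput(24, workers=24, seconds_per_task=4.0)

print("fast worker:", fast_latency, "s per task,", fast_throughput, "tasks/s")
print("crowd:      ", crowd_latency, "s per task,", crowd_throughput, "tasks/s")
```

Under these assumptions the fast worker has the lower latency (1 second against 4) while the crowd has the higher throughput (6 tasks per second against 1), so which design wins depends on which measure you care about.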

Embarrassingly parallel

Matrix math is what computer scientists cheerfully call embarrassingly parallel: the work splits into independent pieces with almost no coordination needed. Each output element's dot product doesn't care what the others are doing, so you can hand them out to thousands of workers and collect the answers at the end.
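
The independence of each output element can be sketched in plain Python. The function names below are illustrative, not from any library, and the thread pool only demonstrates the structure of handing out one dot product per job; in standard CPython, threads will not actually speed up this kind of pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, col):
    """One output element: depends only on its own row and column."""
    return sum(a * b for a, b in zip(row, col))

def matmul_serial(A, B):
    cols = list(zip(*B))  # columns of B
    return [[dot(row, col) for col in cols] for row in A]

def matmul_parallel(A, B, workers=4):
    cols = list(zip(*B))
    # One independent job per output element; no job reads another's result.
    jobs = [(row, col) for row in A for col in cols]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flat = list(pool.map(lambda job: dot(*job), jobs))  # map keeps job order
    n = len(cols)
    # Collect the answers at the end: regroup the flat list into rows.
    return [flat[i:i + n] for i in range(0, len(flat), n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
serial_result = matmul_serial(A, B)
parallel_result = matmul_parallel(A, B)
print(serial_result)
print(parallel_result)
```

Both versions give the same matrix. The only coordination in the parallel one is splitting the jobs at the start and regrouping the results at the end.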

Where we're headed

So the workload (Chapter 1) is a mountain of identical, independent multiply-adds, and the way to go fast (Chapter 2) is to do them in parallel. Now we can finally look at the chips — three different bets on the serial-vs-parallel spectrum. We start with the generalist that prizes latency: the CPU.