Chapter 1 · Part 1

It's all matrix math

Why does AI need special chips at all? Your laptop already has a perfectly good processor. The answer starts not with the hardware but with the workload — what a neural network actually spends its time doing. And once you look, it's astonishingly repetitive: underneath the buzzwords, a neural net is mostly one operation, repeated trillions of times — multiply two numbers and add the result to a running total.

That operation is called multiply-accumulate (MAC). Chain MACs together and you get a dot product; arrange dot products in a grid and you get matrix multiplication. Every layer you met in the neural networks course, and every attention block in an LLM, is at heart a big matrix multiply.

A layer takes an input vector and multiplies it by a grid of weights.
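To make that concrete, here is a minimal sketch of one layer written as explicit multiply-accumulates, with a counter for how many it takes. The function name `layer_forward` and the tiny 2×3 weight grid are made up for illustration, and real layers usually add a bias and an activation function that this sketch leaves out.

```python
def layer_forward(weights, inputs):
    """Multiply a grid of weights (a list of rows) by an input vector.

    Returns the output vector and the number of multiply-accumulates used.
    """
    outputs = []
    mac_count = 0
    for row in weights:              # one output element per row of weights
        total = 0.0                  # the running total
        for w, x in zip(row, inputs):
            total += w * x           # one multiply-accumulate (MAC)
            mac_count += 1
        outputs.append(total)
    return outputs, mac_count


# Illustrative numbers only: 3 inputs feeding 2 outputs.
weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
inputs = [1.0, 0.5, 2.0]

outputs, macs = layer_forward(weights, inputs)
print(outputs, macs)  # [8.0, 18.5] 6
```

The MAC count is simply rows × columns. A layer with 4,096 inputs and 4,096 outputs (sizes chosen only as an example) would already need 4,096 × 4,096 = 16,777,216 MACs for a single input vector.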

One operation, at unimaginable scale

This matters because of sheer volume. A single large-language-model response can require trillions of multiply-accumulates, and training a model takes millions of times more than that. No amount of clever code makes a trillion multiplications free; you simply need hardware that can do enormous numbers of them, fast.
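A rough back-of-the-envelope version of that claim, as a sketch. It assumes a common rule of thumb: generating one token uses each model weight in about one multiply-accumulate. The model size, response length, and the one-billion-MACs-per-second rate are hypothetical numbers picked for illustration, not measurements of any particular system.

```python
def macs_per_response(num_parameters, num_tokens):
    """Rough estimate: about one MAC per weight for every generated token.

    This is an approximation that ignores details such as attention over
    the context, so treat the result as an order of magnitude.
    """
    return num_parameters * num_tokens


params = 7_000_000_000   # hypothetical 7-billion-parameter model
tokens = 500             # hypothetical response length

total = macs_per_response(params, tokens)
print(f"{total:,} MACs")                 # 3,500,000,000,000 MACs
print(f"{total / 1e12:.1f} trillion")    # 3.5 trillion

# Assumed rate for comparison: one billion MACs per second, one at a time.
seconds = total / 1_000_000_000
print(f"{seconds / 60:.0f} minutes")     # 58 minutes
```

Under these assumptions a single response would take about an hour if the MACs ran one after another at that rate, which is why raw throughput is the thing to optimize.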

The two things that make it tractable

Two properties of this workload explain why GPUs and TPUs handle it so well, and we'll build on both:

It's the same operation, over and over. Multiply-add, everywhere. You don't need a chip that can do anything — you need one that does this incredibly fast.
The operations are independent. Each output element is its own dot product, computed from the inputs and weights alone, so no output has to wait for any other.
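The independence property can be checked directly with a small sketch: compute the outputs in order, in reverse order, and as separate tasks handed to a thread pool, then confirm all three agree. The helper `dot` and the numbers are illustrative. The thread pool is used only to show that the tasks can be handed out separately; in standard Python it is not a demonstration of speed.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, inputs):
    """One output element: a chain of multiply-accumulates."""
    total = 0.0
    for w, x in zip(row, inputs):
        total += w * x
    return total


# Illustrative numbers only: 2 inputs feeding 3 outputs.
weights = [[1.0, 2.0],
           [3.0, 4.0],
           [5.0, 6.0]]
inputs = [10.0, 1.0]

# 1. One output after another, first to last.
in_order = [dot(row, inputs) for row in weights]

# 2. Last to first: the order makes no difference.
reversed_order = [None] * len(weights)
for i in reversed(range(len(weights))):
    reversed_order[i] = dot(weights[i], inputs)

# 3. Each output as its own task; map() returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    pooled = list(pool.map(lambda row: dot(row, inputs), weights))

print(in_order)                              # [12.0, 34.0, 56.0]
print(in_order == reversed_order == pooled)  # True
```

No output needs the value of any other output, which is exactly what lets a chip compute many of them at the same time.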

That second property has a name — parallelism — and it's the hinge the entire rest of this course turns on. Next: serial vs parallel, and why it decides everything about how a chip is built.