Chapter 5 · Part 2
The TPU: built for one thing
A GPU is a generalist among parallel chips. The TPU (Tensor Processing Unit, Google's design) takes the opposite bet: give up most of that flexibility and build a chip whose silicon is devoted overwhelmingly to matrix multiplication, done as fast and as power-efficiently as its designers can manage. It's an ASIC: an application-specific integrated circuit, built around one job.
Its heart is a beautiful structure called a systolic array.
The array is a grid of tiny multiply-accumulate cells, each holding one weight that stays fixed in place.
The systolic array
The name comes from systole, the contraction phase of a heartbeat, because data is pumped rhythmically through the grid. The clever part is what doesn't move: the weights stay put inside the cells. On a CPU or GPU, a matrix multiply means repeatedly fetching operands from registers, caches, or memory and writing results back, and that data movement, not the math, is often the real cost. The systolic array sidesteps much of it: load the weights once, then stream activations through. Each cell multiplies the activation passing by with its stored weight, adds the result to a running sum, and hands both along to its neighbors, so each value gets reused across a whole row or column of cells without another trip to memory.
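To make the pumping concrete, here is a minimal sketch of a weight-stationary systolic array in Python with NumPy. It is a toy model, not a description of how any real TPU is wired: the function name `systolic_matmul`, the direction of flow (activations moving right, partial sums moving down), and the one-cell-per-cycle timing are all assumptions chosen for illustration.

```python
import numpy as np


def systolic_matmul(X, W):
    """Toy cycle-by-cycle model of a weight-stationary systolic array.

    Computes X @ W for X of shape (M, K) and W of shape (K, N).
    Cell (k, n) holds W[k, n] for the whole run. Activations enter at the
    left edge and move one cell right per cycle; partial sums move one
    cell down per cycle and leave from the bottom row.
    (Hypothetical layout for illustration, not a real chip's design.)
    """
    X = np.asarray(X)
    W = np.asarray(W)
    M, K = X.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    dtype = np.result_type(X, W)
    act = np.zeros((K, N), dtype=dtype)   # activation sitting in each cell
    psum = np.zeros((K, N), dtype=dtype)  # partial sum leaving each cell
    Y = np.zeros((M, N), dtype=dtype)

    cycles = M + K + N - 2  # cycles until the last result drains out
    for t in range(cycles):
        # Activations shift one cell to the right.
        new_act = np.zeros_like(act)
        new_act[:, 1:] = act[:, :-1]
        # Feed the left edge, skewed so row k starts k cycles late.
        for k in range(K):
            m = t - k
            if 0 <= m < M:
                new_act[k, 0] = X[m, k]

        # Partial sums shift one cell down; the top row starts from zero.
        psum_in = np.zeros_like(psum)
        psum_in[1:, :] = psum[:-1, :]

        # Every cell does one multiply-accumulate with its fixed weight.
        psum = psum_in + new_act * W
        act = new_act

        # Finished sums appear at the bottom row, one column per cycle later.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]
    return Y, cycles


X = np.array([[1, 2], [3, 4]])
W = np.array([[5, 6], [7, 8]])
Y, cycles = systolic_matmul(X, W)
print(Y)       # [[19 22]
               #  [43 50]]
print(cycles)  # 4
```

Notice that `W` is never re-fetched inside the loop: in this model each of the K×N weights is loaded once, while a straightforward triple-loop matmul reads a weight operand for every one of its M×K×N multiplies. That reuse is the point of the design.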
The trade: throughput per watt, not flexibility
By concentrating on matmul, a TPU can deliver very high matrix throughput per watt and per dollar, which is why Google fills data centers with them for training and serving its largest models. The cost is flexibility: a TPU can't render graphics, run your operating system, or pivot to a wildly different algorithm. It's a specialist. (GPUs sit in between: more specialized than a CPU, far more general than a TPU. The "NPUs" and other AI accelerators in phones and laptops follow the same specialize-for-matmul idea.)
Where we're headed
We now have three chips spanning the whole spectrum: the flexible, low-latency CPU; the parallel, programmable GPU; and the specialized, blistering-throughput TPU. So which do you actually reach for — and what's the catch that all three share? Finally: which chip when.