Chapter 2 · Part 1
Turning pixels into objects
A camera frame is just a grid of brightness numbers. Before the car can do anything useful, it has to answer "what's in this picture?" — which is exactly the computer-vision problem the convolutional-network course is about. Here it runs in real time, on eight feeds at once, thirty-plus times a second. This step is called perception.
Start with a raw camera frame — to the computer, just pixels.
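To make "just pixels" concrete, here is a minimal sketch of a frame as a grid of brightness numbers, along with the frame load implied by eight feeds. The 4x6 size, the 0-255 brightness range, and the flat 30 frames per second are assumptions for illustration; real frames are far larger and usually carry three color channels.

```python
# A toy grayscale "frame": each number is one pixel's brightness.
# The 0-255 range and the 4x6 size are assumptions for illustration.
frame = [
    [12, 15, 14, 200, 210, 13],
    [11, 14, 190, 220, 215, 12],
    [10, 13, 185, 225, 205, 14],
    [12, 12, 13, 14, 15, 13],
]

height = len(frame)
width = len(frame[0])
brightest = max(max(row) for row in frame)

NUM_CAMERAS = 8          # from the text: eight feeds
FRAMES_PER_SECOND = 30   # assumed lower bound; the text says "thirty-plus"

# Total frames the perception step has to handle every second.
total_frames_per_second = NUM_CAMERAS * FRAMES_PER_SECOND

# Time between consecutive frames from a single camera, in milliseconds.
ms_between_frames = 1000 / FRAMES_PER_SECOND

print(f"frame is {height}x{width}, brightest pixel = {brightest}")
print(f"{total_frames_per_second} frames per second across all cameras")
print(f"a new frame arrives from each camera every {ms_between_frames:.1f} ms")
```

Nothing in the grid says "car" or "road"; those labels are what the perception step has to add.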
Three jobs at once
Modern perception nets are multi-task: one shared backbone (a big convolutional or transformer network) feeds several "heads," each answering a different question about the same frame:
- Detection — draw a box around each object and label it (car, truck, pedestrian, cyclist, traffic light) with a confidence score.
- Semantic segmentation — classify every pixel: road, lane marking, sidewalk, sky. This is how the car knows where it can drive, not just where objects are.
- Attributes — beyond "car," read its state: brake lights on? turn signal? what color is that traffic light?
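The shared-backbone idea can be sketched in a few lines. This is a toy, not a real network: `backbone`, `detection_head`, `segmentation_head`, `attribute_head`, and `perceive` are hypothetical names, and the "bright pixel" rules inside them are made-up stand-ins for learned layers. The point it illustrates is structural: features are computed once per frame, and every head reads the same features.

```python
def backbone(frame):
    # Toy stand-in for the shared network: scale brightness to 0..1.
    return [[pixel / 255 for pixel in row] for row in frame]


def detection_head(features):
    # Toy rule: one box around all cells above 0.5, scored by their mean.
    cells = [(r, c) for r, row in enumerate(features)
             for c, value in enumerate(row) if value > 0.5]
    if not cells:
        return []
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    score = sum(features[r][c] for r, c in cells) / len(cells)
    box = (min(rows), min(cols), max(rows), max(cols))
    return [{"label": "object", "box": box, "confidence": round(score, 2)}]


def segmentation_head(features):
    # Toy rule: give every pixel one of two made-up classes.
    return [["bright" if value > 0.5 else "dark" for value in row]
            for row in features]


def attribute_head(features, detections):
    # Toy rule: call a detected object "lit" if its box is very bright.
    attributes = []
    for det in detections:
        r0, c0, r1, c1 = det["box"]
        values = [features[r][c] for r in range(r0, r1 + 1)
                  for c in range(c0, c1 + 1)]
        attributes.append({"lit": sum(values) / len(values) > 0.8})
    return attributes


def perceive(frame):
    features = backbone(frame)  # computed once, shared by all heads
    detections = detection_head(features)
    return {
        "detections": detections,
        "segmentation": segmentation_head(features),
        "attributes": attribute_head(features, detections),
    }


frame = [
    [10, 10, 10],
    [10, 250, 240],
    [10, 10, 10],
]
output = perceive(frame)
print(output["detections"])
print(output["segmentation"])
print(output["attributes"])
```

Sharing the backbone means the expensive part of the computation is paid once per frame rather than once per question.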
Confidence, not certainty
Notice that every detection carries a number like 0.98. The network never says "that is a pedestrian"; it says "I'm 98% sure." Downstream planning treats a confident detection differently from a shaky one, and a low-confidence "maybe a person" near the road is still treated cautiously. Driving safely means acting sensibly under uncertainty, a theme that runs through the rest of the stack.
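One way to picture how planning might use those scores is a small decision rule. This is a sketch under stated assumptions, not how any particular planner works: the `Detection` fields, the two thresholds, the `VULNERABLE` set, and the response strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float  # 0.0 to 1.0, as reported by the detection head
    near_road: bool


# Assumed thresholds, chosen only for illustration.
HIGH_CONFIDENCE = 0.9
LOW_CONFIDENCE = 0.3

# Assumed set of classes that get extra caution.
VULNERABLE = {"pedestrian", "cyclist"}


def planning_response(det: Detection) -> str:
    if det.confidence >= HIGH_CONFIDENCE:
        return "treat as present"
    # A shaky "maybe a person" near the road still earns caution.
    if (det.label in VULNERABLE and det.near_road
            and det.confidence >= LOW_CONFIDENCE):
        return "slow down and keep watching"
    if det.confidence >= LOW_CONFIDENCE:
        return "track, no action yet"
    return "ignore for now"


for det in [
    Detection("pedestrian", 0.98, near_road=True),
    Detection("pedestrian", 0.40, near_road=True),
    Detection("car", 0.40, near_road=False),
    Detection("car", 0.10, near_road=False),
]:
    print(det.label, det.confidence, "->", planning_response(det))
```

The same 0.40 score leads to different behavior depending on what the object might be and where it is, which is the sense in which the score informs the decision rather than making it.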
Where we're headed
We can now label objects in each of the eight images. But each detection lives in its own camera's flat 2D frame — the front camera and the side camera don't share a coordinate system, and none of them know true distance. To drive, the car needs all eight stitched into one consistent 3D picture of the space around it. Next: the bird's-eye view.
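To see why a single flat image cannot give true distance, consider an idealized pinhole camera. The sketch below is an assumption-laden illustration: the `project` function, the focal length of 1000 pixels, and the image center at (640, 360) are made up for the example. Two points at different distances along the same line of sight land on exactly the same pixel.

```python
def project(point, focal_px=1000.0, cx=640.0, cy=360.0):
    """Project a 3D point (x, y, z) in camera coordinates onto the image.

    Idealized pinhole model; focal_px, cx and cy are assumed values.
    z is the distance in front of the camera and must be positive.
    """
    x, y, z = point
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return (u, v)


near_point = (1.0, 0.5, 10.0)  # 10 m ahead
far_point = (2.0, 1.0, 20.0)   # 20 m ahead, on the same ray from the camera

print(project(near_point))
print(project(far_point))
```

Both calls return the same pixel, so the pixel position alone cannot say whether the object is 10 m or 20 m away. Recovering depth takes extra information, such as views from several cameras with known positions.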