Chapter 3 · Part 1
Images as tensors
We've seen that an image is a grid of numbers (Chapter 1), and that colour needs three of those grids stacked together (Chapter 2). Now let's give that object its real name and look at its shape — because shape is the first thing any AI model asks about an image.
A stack of number-grids like this is called a tensor.
Our colour photo from the last chapter.
Height, width, channels
A tensor is just a grid of numbers that can have more than two dimensions. For a colour image, there are three:
- Height — how many rows of pixels.
- Width — how many columns.
- Channels — how many numbers per pixel (3 for RGB, 1 for grayscale).
We write the shape as (H, W, C). Our little photo is (120, 120, 3). Note the order: height comes first, the same row-before-column convention from Chapter 1.
```python
import numpy as np

gray = np.zeros((120, 120))            # 2D tensor: H, W (grayscale)
color = np.zeros((120, 120, 3))        # 3D tensor: H, W, C (RGB)
batch = np.zeros((32, 120, 120, 3))    # 4D tensor: N, H, W, C (32 images)

print(color.shape)   # (120, 120, 3)
print(color.size)    # 43200 -> total numbers
```
The `batch` line hints at where this is going: models rarely look at one image at a time. They process a batch of them at once, which adds a fourth dimension. The tensor is the universal container.
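As a minimal sketch of how that fourth dimension appears, assuming the images already share one (H, W, C) shape, NumPy's `np.stack` can join them along a new leading axis. The variable names and the random pixel values below are placeholders for illustration, not data from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stand-in "photos", each with shape (H, W, C) = (120, 120, 3).
# Random values are placeholders; real images would be loaded from files.
images = [
    rng.integers(0, 256, size=(120, 120, 3), dtype=np.uint8)
    for _ in range(3)
]

# Stack along a new first axis to get (N, H, W, C).
batch = np.stack(images, axis=0)
print(batch.shape)  # (3, 120, 120, 3)

# Indexing follows the same order as the shape: image, row, column, channel.
first_image = batch[0]           # one whole image, shape (120, 120, 3)
top_left_pixel = batch[0, 0, 0]  # the three channel values of one pixel, shape (3,)
red_channel = batch[0, :, :, 0]  # one grid of numbers, shape (120, 120)
print(first_image.shape, top_left_pixel.shape, red_channel.shape)
```

Stacking only works when every image has the same shape, which is one reason models insist on a fixed input size.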
More numbers, more detail
If an image is just a block of numbers, then a natural question is: how many numbers do we actually need? Two knobs control that — resolution (how many pixels) and bit depth (how many brightness levels each pixel can take).
Full resolution: plenty of pixels, smooth colour.
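The two knobs can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions: the image is a random 8-bit placeholder, and the helper names `reduce_resolution` and `reduce_bit_depth` are hypothetical, not functions from any library. The downsampling here simply drops pixels, which is cruder than what an image editor would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 8-bit RGB image with the same shape as the course photo.
photo = rng.integers(0, 256, size=(120, 120, 3), dtype=np.uint8)

def reduce_resolution(image, step):
    """Keep every `step`-th row and column (crude downsampling, no smoothing)."""
    return image[::step, ::step]

def reduce_bit_depth(image, bits):
    """Keep only the top `bits` bits of each 8-bit value, zeroing the rest."""
    shift = 8 - bits
    return (image >> shift) << shift

small = reduce_resolution(photo, 4)  # 120 / 4 = 30 rows and columns
coarse = reduce_bit_depth(photo, 2)  # 2 bits -> at most 4 levels per channel

print(small.shape)             # (30, 30, 3)
print(photo.size, small.size)  # 43200 2700
print(np.unique(coarse))       # values drawn from 0, 64, 128, 192
```

Lowering the resolution shrinks the count of numbers in the tensor. Lowering the bit depth leaves the count unchanged but reduces how many distinct values each number can take, and so how many bits are needed to store it.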
Why this framing matters
Calling an image a tensor isn't pedantry: it's the bridge to machine learning. A model doesn't see a "photo"; it sees a tensor of a fixed shape and does arithmetic on it. Resizing, cropping, normalising, and stacking into batches are all just operations that slice, rescale, or rearrange that block of numbers.
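To make that concrete, here is a hedged sketch of three common preparation steps written as plain array operations. The 64 by 64 target size, the crop window, and the helper `resize_nearest` are assumptions chosen for illustration; real pipelines usually rely on an image library with better interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 8-bit RGB image, shape (H, W, C) = (120, 120, 3).
photo = rng.integers(0, 256, size=(120, 120, 3), dtype=np.uint8)

# Crop: slice a range of rows and columns; the channel axis is untouched.
crop = photo[10:110, 20:100]  # shape (100, 80, 3)

# Normalise: convert to float and scale 0..255 down to 0.0..1.0.
normalised = photo.astype(np.float32) / 255.0

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize: pick one source pixel per output pixel."""
    in_h, in_w = image.shape[:2]
    rows = (np.arange(out_h) * in_h) // out_h  # source row for each output row
    cols = (np.arange(out_w) * in_w) // out_w  # source column for each output column
    return image[rows[:, None], cols[None, :]]

# Resize to a hypothetical fixed input size that a model might expect.
resized = resize_nearest(photo, 64, 64)

print(crop.shape)        # (100, 80, 3)
print(normalised.dtype)  # float32
print(resized.shape)     # (64, 64, 3)
```

None of these steps needs to know the array is a picture. Each one is indexing or arithmetic on the (H, W, C) block.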
From here on, every operation in the course, whether filtering, convolution, or feeding pixels to a network, is an operation on this (H, W, C) block. Next, in Part 2, we start doing things to it, beginning with simple pixel math.