Chapter 3 · Part 1

Images as tensors

We've seen that an image is a grid of numbers (Chapter 1), and that colour needs three of those grids stacked together (Chapter 2). Now let's give that object its real name and look at its shape — because shape is the first thing any AI model asks about an image.

A stack of number-grids like this is called a tensor. Scroll to pull our photo apart into the tensor it really is.

Our colour photo from the last chapter.


Height, width, channels

A tensor is just a grid of numbers that can have more than two dimensions. For a colour image, there are three:

Height — how many rows of pixels.
Width — how many columns.
Channels — how many numbers per pixel (3 for RGB, 1 for grayscale).

We write the shape as (H, W, C). Our little photo is (120, 120, 3). Note the order — height comes first, the same row-before-column convention from Chapter 1.
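To make the (H, W, C) order concrete, here is a minimal sketch of unpacking a shape and indexing into it. The array `photo` is a hypothetical stand-in filled with zeros, not the actual photo from this chapter.

```python
import numpy as np

# Hypothetical stand-in for the photo: a (120, 120, 3) block of zeros.
photo = np.zeros((120, 120, 3), dtype=np.uint8)

# The shape unpacks in (H, W, C) order.
height, width, channels = photo.shape

# Indexing follows the same order: row first, then column, then channel.
photo[10, 50] = (255, 0, 0)      # set the pixel at row 10, column 50 to pure red
pixel = photo[10, 50]            # one pixel: an array of 3 numbers
red_channel = photo[:, :, 0]     # one channel: a 2D (H, W) grid

print(height, width, channels)          # 120 120 3
print(pixel.shape, red_channel.shape)   # (3,) (120, 120)
```

Leaving off the last index gives a whole pixel, and slicing the first two axes with `:` gives a whole channel, which is exactly the stack of grids from Chapter 2.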

tensor.py — every image has a shape

import numpy as np

gray  = np.zeros((120, 120))           # 2D tensor: H, W        (grayscale)
color = np.zeros((120, 120, 3))        # 3D tensor: H, W, C     (RGB)
batch = np.zeros((32, 120, 120, 3))    # 4D tensor: N, H, W, C  (32 images)

print(color.shape)   # (120, 120, 3)
print(color.size)    # 43200  ->  total numbers (120 * 120 * 3)

The batch line hints at where this is going: models rarely look at one image at a time. They process a batch of N images at once, which adds a fourth dimension in front of the other three. The tensor is the universal container.
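In practice a batch is usually built from separate images rather than created empty. A minimal sketch, assuming 32 hypothetical placeholder images that all share the same shape:

```python
import numpy as np

# Hypothetical data: 32 separate images, each of shape (120, 120, 3).
images = [np.zeros((120, 120, 3)) for _ in range(32)]

# np.stack joins them along a new leading axis, giving (N, H, W, C).
batch = np.stack(images, axis=0)

print(batch.shape)      # (32, 120, 120, 3)
print(batch[0].shape)   # (120, 120, 3)  -> indexing the first axis returns one image
print(batch.size)       # 1382400        -> 32 * 43200
```

Stacking only works when every image has the same (H, W, C), which is one reason models insist on a fixed input shape.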

More numbers, more detail

If an image is just a block of numbers, then a natural question is: how many numbers do we actually need? Two knobs control that — resolution (how many pixels) and bit depth (how many brightness levels each pixel can take).
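Both knobs can be sketched in a few lines of NumPy. This is a rough illustration, not how a real image library does it: the photo is a hypothetical array of random 8-bit values, resolution is reduced by simply keeping every 4th pixel, and bit depth is reduced by rounding each value down to one of 4 levels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the photo: random 8-bit values, shape (120, 120, 3).
photo = rng.integers(0, 256, size=(120, 120, 3), dtype=np.uint8)

# Knob 1, resolution: keep every 4th row and column (crude downsampling).
small = photo[::4, ::4]             # shape (30, 30, 3)

# Knob 2, bit depth: 2 bits per value means only 4 brightness levels.
bits = 2
step = 256 // (2 ** bits)           # 64
coarse = (small // step) * step     # every value becomes 0, 64, 128 or 192

print(photo.size, small.size)               # 43200 2700
print(photo.size * 8, small.size * bits)    # 345600 5400  -> total bits needed
```

Turning both knobs together shrinks the image from 345,600 bits to 5,400, a factor of 64.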

Scroll to turn both knobs down and watch the picture — and the number count — collapse.

Full resolution: plenty of pixels, smooth colour.


Why this framing matters

Calling an image a tensor isn't pedantry — it's the bridge to machine learning. A model doesn't see a "photo"; it sees a tensor of a fixed shape and does arithmetic on it. Resize, crop, normalise, stack into batches — these are all just reshaping numbers.
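Each of those four operations can be sketched as plain array arithmetic. The example below assumes a hypothetical random photo, and the "resize" is a deliberately crude one that keeps every second pixel; real resizing normally interpolates between pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the photo: random 8-bit values, shape (120, 120, 3).
photo = rng.integers(0, 256, size=(120, 120, 3), dtype=np.uint8)

# Crop: slice a range of rows and columns.
crop = photo[20:100, 20:100]                       # (80, 80, 3)

# Resize (crude): keep every second row and column.
resized = crop[::2, ::2]                           # (40, 40, 3)

# Normalise: map 0..255 integers to 0.0..1.0 floats.
normalised = resized.astype(np.float32) / 255.0

# Stack: combine same-shaped images into a batch.
batch = np.stack([normalised, normalised])         # (2, 40, 40, 3)

print(crop.shape, resized.shape, batch.shape)      # (80, 80, 3) (40, 40, 3) (2, 40, 40, 3)
```

At every step the result is still a tensor; only its shape or its values have changed.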

From here on, every operation in the course — filtering, convolution, feeding pixels to a network — is an operation on this (H, W, C) block. Next, in Part 2, we start doing things to it, beginning with simple pixel math.