Convolution Operation: The Foundation of CNNs

Interactive guide to convolution in CNNs: visualize sliding windows, kernels, stride, padding, and feature detection with step-by-step demos.

The convolution operation is the core building block of every convolutional neural network. It gives CNNs their ability to automatically detect patterns in spatial data, from simple edges in early layers to complex objects in deeper layers. Unlike a fully connected layer, which gives every input pixel its own weight to every output and ignores spatial arrangement, convolution exploits the spatial structure of images by applying the same small filter across all positions, sharing parameters and preserving locality.

Three properties make convolution particularly powerful for vision tasks. Sparse connectivity means each output neuron depends on only a small local patch of the input, not the entire image. Parameter sharing means the same filter weights are reused at every spatial position, dramatically reducing the number of learnable parameters. Translation equivariance means a feature that can be detected in one part of the image can be detected anywhere, because the same filter scans every position.

The Sliding Window Analogy

The easiest way to understand convolution is through a physical analogy. Imagine holding a small flashlight over a large painting. The flashlight illuminates only a small patch at a time. You examine that patch, write down a summary number, then slide the flashlight to the next position and repeat. After scanning the entire painting, your collection of summary numbers forms a new, smaller image called a feature map. The flashlight is the kernel, the painting is the input, and the scanning process is convolution.

The Flashlight on a Painting

Imagine shining a small flashlight across a painting. At each position, the flashlight illuminates a 3x3 patch, computes a weighted summary of what it sees, then slides to the next position. That summary becomes one pixel in the output feature map. The demo pairs a 7x7 "painting" with a 3x3 edge-detection kernel (center minus neighbors).

Input (7x7 painting):

 20  30  50  80 120 150 180
 25  40  60  90 130 160 190
 30  50 100 150 170 180 200
 35  55 110 200 220 190 170
 30  50 100 180 200 170 140
 25  40  70 120 150 130 110
 20  30  50  80 100 100  90

Kernel (3x3 flashlight):

-1 -1 -1
-1 +8 -1
-1 -1 -1

Scanning the 7x7 input with this kernel at stride 1 and no padding visits 25 positions, performs 9 multiplications at each stop, and produces a 5x5 output feature map. At position (0,0), element-wise multiply then sum:

20×(-1) + 30×(-1) + 50×(-1) + 25×(-1) + 40×(+8) + 60×(-1) + 30×(-1) + 50×(-1) + 100×(-1) = -45
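
To make the scan concrete, here is a minimal Python/NumPy sketch (conv2d is a helper written for this page, not a library function) that reproduces the demo: it slides the 3x3 kernel over the 7x7 painting and fills in the full 5x5 feature map.

import numpy as np

# The 7x7 "painting" and 3x3 edge-detection kernel from the demo above.
image = np.array([
    [20, 30,  50,  80, 120, 150, 180],
    [25, 40,  60,  90, 130, 160, 190],
    [30, 50, 100, 150, 170, 180, 200],
    [35, 55, 110, 200, 220, 190, 170],
    [30, 50, 100, 180, 200, 170, 140],
    [25, 40,  70, 120, 150, 130, 110],
    [20, 30,  50,  80, 100, 100,  90],
], dtype=float)
kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
], dtype=float)

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation: slide the kernel, multiply element-wise, sum."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)   # one output pixel per position
    return out

feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (5, 5)
print(feature_map[0, 0])   # -45.0, the position (0,0) value worked out above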

Mathematical Definition

In mathematical notation, discrete two-dimensional convolution (technically cross-correlation, which is what deep learning frameworks implement) takes an input I and a kernel K and produces an output at position (i, j) by:

(I \star K)(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n)

In words: place the kernel's top-left corner at position (i, j) of the input. Multiply each kernel weight with the corresponding input value. Sum all those products. That sum is the output value at (i, j).

True mathematical convolution flips the kernel before multiplying, but in deep learning this distinction does not matter because the kernel weights are learned from data. Whether the kernel is flipped or not, gradient descent finds the optimal weights. Every major framework, including PyTorch and TensorFlow, implements cross-correlation and calls it "convolution."
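
As a quick check of that claim, the snippet below (assuming PyTorch is installed) applies an asymmetric 2x2 kernel with F.conv2d; the result matches unflipped cross-correlation, not flipped convolution. The tensors here are made up for illustration.

import torch
import torch.nn.functional as F

x = torch.arange(1., 10.).reshape(1, 1, 3, 3)      # (batch, channels, H, W): values 1..9
k = torch.tensor([[1., 0.],
                  [0., -1.]]).reshape(1, 1, 2, 2)  # (out_ch, in_ch, kH, kW)

# Cross-correlation at (0,0): 1*1 + 2*0 + 4*0 + 5*(-1) = -4.
# True convolution would flip the kernel first and give +4 instead.
print(F.conv2d(x, k))   # tensor([[[[-4., -4.], [-4., -4.]]]])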

Convolution Calculator

The best way to internalize convolution is to compute it by hand. Step through every position of a 5x5 input grid with a 3x3 kernel. At each position, the highlighted cells show which input values participate, and the element-wise products are summed into a single output value. Switch between kernels to see how different weight patterns produce different output feature maps from the same input.

(Interactive calculator: stepping a 3x3 edge-detect kernel over a 5x5 input gives 9 output positions and 81 total multiplications; each step highlights the current patch, its element-wise products, and the resulting output value.)

The edge detection kernel subtracts all 8 neighbors from 8 times the center. Uniform regions produce zero; sharp transitions produce large positive values. This is a discrete Laplacian operator.

Stride and Padding

Two hyperparameters control the geometry of convolution: stride and padding.

Stride determines how many positions the kernel jumps between consecutive applications. A stride of 1 means the kernel slides one pixel at a time, producing the densest possible output. A stride of 2 means the kernel skips every other position, halving the output dimensions and acting as a built-in downsampling operator. Many modern architectures use strided convolutions instead of pooling layers for spatial reduction.

Padding adds extra values (typically zeros) around the border of the input before convolution. Without padding ("valid" mode), the output shrinks because the kernel cannot be centered on border pixels. With "same" padding, enough zeros are added to keep the output the same size as the input when stride is 1. This is the most common setting in practice because it lets you stack many convolutional layers without the spatial dimensions shrinking to nothing.

The output spatial size follows a precise formula:

\text{Output Size} = \left\lfloor \frac{\text{Input} + 2 \times \text{Padding} - \text{Kernel}}{\text{Stride}} \right\rfloor + 1
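
The formula is easy to turn into a sanity-check helper (a minimal Python sketch; conv_output_size is a name invented here):

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one spatial axis: floor((input + 2*padding - kernel) / stride) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(7, 3))                        # 5  -> "valid": 7x7 shrinks to 5x5
print(conv_output_size(56, 3, stride=1, padding=1))  # 56 -> "same" padding preserves size
print(conv_output_size(32, 3, stride=2, padding=1))  # 16 -> stride 2 halves the dimensions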

Stride and Padding Explorer

Adjust stride and padding to see how they reshape the output. Stride controls how far the kernel jumps between positions. Padding adds zeros around the border to control the output dimensions.

(Interactive explorer: at the default setting of a 7x7 input, 3x3 kernel, stride 1, and no padding, the output is 5x5 and the kernel visits 25 positions.)

Valid padding uses no padding at all. The kernel only visits positions where it fits entirely within the input. This shrinks the output by (kernel_size - 1) on each axis: a 7x7 input with a 3x3 kernel and stride 1 produces a 5x5 output.

Kernel Feature Detection

Each kernel acts as a specialized feature detector. The specific pattern of positive and negative weights determines what spatial pattern the kernel responds to most strongly. A horizontal edge kernel has negative weights on top and positive weights on bottom, so it produces a large output wherever the input transitions from dark (top) to light (bottom). A blur kernel averages all values equally, smoothing out local variation. A sharpen kernel amplifies the center relative to its neighbors, exaggerating contrast.
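
Written out as weight matrices, those three detectors might look like the following (these are common hand-designed choices; the exact values vary by convention):

import numpy as np

horizontal_edge = np.array([[-1, -1, -1],   # negative weights on top...
                            [ 0,  0,  0],
                            [ 1,  1,  1]])  # ...positive on the bottom: responds to dark-to-light transitions
blur = np.full((3, 3), 1 / 9)               # equal weights: a local average that smooths variation
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])          # amplify the center relative to its neighbors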

In a trained CNN, the network learns its own kernels through backpropagation. The first layer typically learns simple edge detectors remarkably similar to hand-designed Sobel or Gabor filters. Deeper layers learn increasingly complex detectors: corners, textures, parts of objects, and eventually entire object categories.

Kernel Feature Detector

Different kernels act as specialized feature detectors. An edge kernel responds strongly where intensity changes rapidly. A blur kernel smooths everything. Try each combination to see what each kernel “sees” in each pattern.

(Interactive explorer: for each kernel-and-pattern pairing it reports the feature map's maximum activation, minimum activation, and dynamic range; the default view applies the vertical edge kernel, which responds to left-right intensity transitions, to the vertical edge pattern.)

Convolution Variants and Parameter Comparison

Not all convolutions are created equal. The standard convolution connects every input channel to every output channel through the full spatial kernel, making it the most expressive but also the most expensive. Several variants trade some expressiveness for dramatic efficiency gains.

Depthwise convolution applies a separate kernel to each input channel independently, with no cross-channel mixing. It is extremely cheap but cannot combine information from different channels.

Depthwise separable convolution chains a depthwise convolution with a 1x1 pointwise convolution. The depthwise step handles spatial filtering, and the pointwise step handles channel mixing. This factorization achieves nearly the same representational power as standard convolution at roughly one-eighth to one-twelfth the computational cost, depending on the kernel size and channel dimensions. MobileNet, EfficientNet, and most mobile-optimized architectures rely on this variant.

Grouped convolution splits the input channels into groups, applying independent convolutions within each group. Standard convolution is the special case with one group; depthwise convolution is the special case where the number of groups equals the number of channels. AlexNet originally used grouped convolutions to split computation across two GPUs, and ResNeXt later showed that many groups with narrow channels can outperform fewer groups with wide channels.

1x1 (pointwise) convolution uses a kernel of size one, performing no spatial filtering at all. It acts as a per-pixel fully connected layer, mixing information across channels. Bottleneck architectures like ResNet use 1x1 convolutions to reduce channel dimensions before expensive 3x3 convolutions, then expand them again afterward.

Transposed convolution (sometimes misleadingly called "deconvolution") upsamples the input by inserting zeros between elements and then convolving. It is the standard learnable upsampling operator in decoder networks, GANs, and semantic segmentation models like U-Net.
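
The parameter counts behind these trade-offs are easy to verify. The sketch below (assuming PyTorch) builds each variant for a 64-input-channel, 128-output-channel, 3x3 layer and counts learnable parameters; the totals line up with the comparison that follows.

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

cin, cout, k = 64, 128, 3

variants = {
    "standard":   nn.Conv2d(cin, cout, k, padding=1),
    "depthwise":  nn.Conv2d(cin, cin, k, padding=1, groups=cin),
    "separable":  nn.Sequential(nn.Conv2d(cin, cin, k, padding=1, groups=cin),  # spatial filtering
                                nn.Conv2d(cin, cout, kernel_size=1)),           # channel mixing
    "pointwise":  nn.Conv2d(cin, cout, kernel_size=1),
    "grouped_g4": nn.Conv2d(cin, cout, k, padding=1, groups=4),
    "transposed": nn.ConvTranspose2d(cin, cout, k, padding=1),
}

for name, layer in variants.items():
    print(f"{name:>10}: {n_params(layer):,} parameters")
# standard: 73,856   depthwise: 640   separable: 8,960
# pointwise: 8,320   grouped_g4: 18,560   transposed: 73,856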

Convolution Type Comparison

Compare parameter counts and computational cost across convolution variants. The figures below are for a layer with 64 input channels, 128 output channels, a 3x3 kernel, and a 56x56 feature map; the interactive version lets you adjust these dimensions to see how the costs scale.

  • Standard Conv -- Full connectivity between all input and output channels. Most expressive but most expensive. Params: 73.9K, FLOPs: 462.4M.
  • Depthwise Conv -- Each input channel is filtered independently with its own kernel. Extremely cheap but no cross-channel mixing. Params: 640 (0.01x), FLOPs: 3.6M (0.01x).
  • Depthwise Separable -- Depthwise conv followed by a 1x1 pointwise conv. Nearly as expressive as standard conv at a fraction of the cost. Params: 9.0K (0.12x), FLOPs: 55.0M (0.12x).
  • 1x1 Conv (Pointwise) -- Mixes channels without spatial filtering. Used for channel reduction, expansion, and feature recombination. Params: 8.3K (0.11x), FLOPs: 51.4M (0.11x).
  • Grouped Conv (g=4) -- Splits channels into groups, each processed independently. Reduces parameters by a factor of g. Params: 18.6K (0.25x), FLOPs: 115.6M (0.25x).
  • Transposed Conv -- Upsamples by inserting zeros between input elements, then convolving. Used in decoders and generators. Params: 73.9K (1.00x), FLOPs: 462.4M (1.00x).

Output Size Formula: Output = floor((Input + 2*Padding - Kernel) / Stride) + 1. Example: (56 + 2*1 - 3) / 1 + 1 = 56 (same padding, stride 1).

Efficiency winner: Depthwise Separable

Used in MobileNet, EfficientNet, and most mobile architectures. Achieves nearly the same accuracy as standard convolution at 12% of the parameter cost. The key insight: spatial filtering and channel mixing can be separated.

When to use 1x1 convolutions

Pointwise (1x1) convolutions are the workhorse of modern architectures. They reduce or expand channel dimensions (bottleneck design), mix information across channels, and add nonlinearity without spatial filtering. ResNet and Inception use them extensively.
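
A bottleneck block built from these pieces might be sketched as follows (channel counts are illustrative, loosely ResNet-style; real blocks also add normalization and a skip connection):

import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # 1x1 reduce: 256 -> 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 3x3 spatial filtering on the narrow tensor
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),            # 1x1 expand: 64 -> 256 channels
)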

Multi-Channel Convolution

Real images have multiple channels (three for RGB, or hundreds of feature maps in intermediate layers), and convolution handles this naturally. A single convolutional filter is not a 2D matrix but a 3D tensor with dimensions Cin × k × k, where Cin is the number of input channels and k is the spatial kernel size. The filter performs 2D convolution independently on each input channel, then sums the results across channels to produce a single 2D output feature map.

To produce multiple output channels, the layer uses multiple filters. If the layer has Cout filters, the full weight tensor has shape Cout × Cin × k × k, and the output is a 3D tensor with Cout channels. Each output channel captures a different pattern in the input.

The total parameter count is Cout × Cin × k² + Cout (the last term is the bias, one per output channel). For a typical layer with 64 input channels, 128 output channels, and a 3x3 kernel, that is 73,856 parameters. A fully connected layer between the same two feature maps (64 and 128 channels at 56x56 resolution) would require billions of parameters, making convolution's parameter sharing essential for scaling to large images.
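
A quick shape check (assuming PyTorch) makes the bookkeeping concrete for exactly this layer:

import torch
import torch.nn as nn

layer = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
x = torch.randn(1, 64, 56, 56)                      # (batch, Cin, H, W)

print(layer.weight.shape)                           # torch.Size([128, 64, 3, 3]) = Cout x Cin x k x k
print(layer.bias.shape)                             # torch.Size([128]): one bias per output channel
print(layer(x).shape)                               # torch.Size([1, 128, 56, 56])
print(sum(p.numel() for p in layer.parameters()))   # 73856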

Common Pitfalls

Forgetting That Output Shrinks

Without padding, a 3x3 kernel on a 32x32 input produces a 30x30 output. Stack ten such layers and the spatial dimensions drop to 12x12. Modern architectures almost always use "same" padding to prevent this silent shrinkage, but it catches beginners off guard when debugging dimension mismatches.

Confusing Channel Ordering

PyTorch uses NCHW ordering (batch, channels, height, width), while TensorFlow defaults to NHWC (batch, height, width, channels). Passing a tensor in the wrong format produces silently wrong results because the convolution operates on the wrong axes. Always verify the expected data format of your framework.
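
A one-line permute converts between the two layouts (PyTorch tensors assumed; the shapes are illustrative):

import torch

x_nchw = torch.randn(8, 3, 224, 224)   # PyTorch default: (batch, channels, height, width)
x_nhwc = x_nchw.permute(0, 2, 3, 1)    # TensorFlow default: (batch, height, width, channels)
print(x_nchw.shape, x_nhwc.shape)      # torch.Size([8, 3, 224, 224]) torch.Size([8, 224, 224, 3])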

Ignoring Receptive Field Growth

Each layer's kernel sees only a small local patch of its input, but stacking layers compounds the receptive field. Two 3x3 layers have an effective receptive field of 5x5, and three have 7x7. A deep stack of small kernels is more efficient and more nonlinear than a single large kernel with the same receptive field. VGG demonstrated this principle by replacing 7x7 kernels with three stacked 3x3 layers, using fewer parameters while adding more nonlinearity.
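
For a stack of stride-1 convolutions, each k x k layer grows the receptive field by k - 1 on each axis, which a few lines of Python make explicit (receptive_field is a helper written for this page):

def receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1   # each layer extends the field by (k - 1) pixels per axis
    return rf

print(receptive_field([3, 3]))     # 5: two 3x3 layers
print(receptive_field([3, 3, 3]))  # 7: three 3x3 layers
print(receptive_field([7]))        # 7: one 7x7 layer, same field, more parameters per filter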

Checkerboard Artifacts in Transposed Convolutions

Transposed convolutions with stride greater than 1 can produce checkerboard patterns in the output when the kernel size is not divisible by the stride. The overlapping regions of the upsampled output receive more contributions than non-overlapping regions, creating visible grid-like artifacts. Using a kernel size that is a multiple of the stride (for example, 4x4 with stride 2) or replacing transposed convolutions with nearest-neighbor upsampling followed by a standard convolution avoids this issue.
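
Both remedies are easy to express in PyTorch (a sketch; the channel count is illustrative). The first keeps the transposed convolution but uses a kernel size divisible by the stride; the second swaps it for nearest-neighbor upsampling followed by a standard convolution. Both double the spatial resolution.

import torch.nn as nn

channels = 64
upsample_transposed = nn.ConvTranspose2d(channels, channels,
                                         kernel_size=4, stride=2, padding=1)  # kernel 4 is a multiple of stride 2
upsample_resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),              # fixed upsampling, no learned weights
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # learnable refinement after resizing
)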

Key Takeaways

  1. Convolution is element-wise multiply and sum. A small kernel slides across the input, computing a weighted sum at each position. The output feature map captures where the kernel's pattern appears in the input.

  2. Stride controls downsampling, padding controls size. Stride greater than 1 reduces spatial dimensions. Same padding preserves them. The output size formula is deterministic: floor((input + 2*padding - kernel) / stride) + 1.

  3. Different kernels detect different features. Edge detectors, blur, and sharpen are just specific weight patterns. Trained CNNs learn their kernels automatically, building from simple edges to complex objects across layers.

  4. Depthwise separable convolutions are the efficiency champion. Factoring spatial filtering from channel mixing achieves nearly the same accuracy at a fraction of the cost. Every mobile-optimized architecture uses this factorization.

  5. Parameter sharing is the key advantage. A 3x3 kernel has only 9 weights regardless of input size. This makes convolution scale gracefully to large images while a fully connected layer would require millions of parameters for every pixel.

Related Topics

  • Dilated Convolutions -- Expanding receptive fields without increasing kernel size or losing resolution
  • Feature Pyramid Networks -- Multi-scale feature extraction built on top of convolutional backbones
  • Receptive Field -- Understanding exactly how much of the input each neuron can "see"
  • Batch Normalization -- Stabilizing activations between convolutional layers for faster training
  • Skip Connections -- Residual connections that enable training of very deep convolutional networks
