Convolution Operations: The Heart of CNNs

Convolutional Neural Networks (CNNs) are designed to process grid-like data, especially images. The key operation is the convolution.

What is a Convolution?

A convolution slides a small matrix (called a kernel or filter) across the input image. At each position it multiplies the kernel element-wise with the image patch beneath it and sums the products into a single output value.

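Input (5x5)     Kernel (3x3)      Output (3x3)
1 2 3 4 5       1 0 -1
2 3 4 5 6   *   2 0 -2     =      -8 -8 -8
3 4 5 6 7       1 0 -1            -8 -8 -8
4 5 6 7 8                         -8 -8 -8
5 6 7 8 9

Every output value is -8 here: within each 3x3 patch the left column is 2 less than the right column, and the kernel weights that left-minus-right difference by 1 + 2 + 1 = 4.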
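
The output values for the example above can be checked with a few lines of NumPy. This is a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`) and assuming the kernel is applied without flipping; the variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

image = np.array([[1, 2, 3, 4, 5],
                  [2, 3, 4, 5, 6],
                  [3, 4, 5, 6, 7],
                  [4, 5, 6, 7, 8],
                  [5, 6, 7, 8, 9]])

kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

# Every 3x3 patch of the image; the result has shape (3, 3, 3, 3)
patches = sliding_window_view(image, kernel.shape)

# Multiply each patch by the kernel and sum over the two patch axes
result = np.einsum("ijmn,mn->ij", patches, kernel)

print(result.shape)  # (3, 3)
print(result)        # every entry is -8
```

The output is constant because this particular input has the same horizontal slope everywhere; an image with real edges would give values that vary by position.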

Why Convolutions for Images?

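1. Translation Equivariance: The same kernel responds to a pattern wherever it appears, so a cat is detected regardless of its position (often loosely called translation invariance)
2. Parameter Sharing: The same kernel is used across the entire image (fewer parameters)
3. Local Connectivity: Each output depends only on a small region of the input (its receptive field)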
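
The three properties above can be illustrated numerically. The sketch below is only an illustration: the helper `slide`, the 8x8 test image, and the layer sizes (224x224 input, 3 input channels, 16 output channels, 3x3 kernels) are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def slide(image, kernel):
    """Slide the kernel over the image (no padding, stride 1). Hypothetical helper."""
    patches = sliding_window_view(image, kernel.shape)
    return np.einsum("ijmn,mn->ij", patches, kernel)

kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])

# 1. Position: moving the pattern 2 pixels right moves the response 2 pixels right
image = np.zeros((8, 8))
image[2:4, 1:3] = 1.0                  # a small bright blob
shifted = np.roll(image, 2, axis=1)    # same blob, 2 columns to the right
out = slide(image, kernel)
out_shifted = slide(shifted, kernel)
shift_matches = np.array_equal(out_shifted[:, 2:], out[:, :-2])
print(shift_matches)                   # True

# 2. Parameter sharing: weights of a 3x3 conv layer (3 -> 16 channels) plus biases
conv_params = 3 * 3 * 3 * 16 + 16
print(conv_params)                     # 448

# The same mapping as a fully connected layer on a 224x224 image
dense_inputs = 224 * 224 * 3
dense_outputs = 224 * 224 * 16
dense_params = dense_inputs * dense_outputs + dense_outputs
print(dense_params > 10**11)           # True: more than 100 billion parameters

# 3. Local connectivity: input values feeding one output value
inputs_per_output = 3 * 3 * 3
print(inputs_per_output)               # 27
```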

Mathematical Definition

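(I * K)[i,j] = Σm Σn I[i+m, j+n] × K[m,n]

Here m and n range over the kernel's rows and columns. Strictly speaking, this formula is cross-correlation (the kernel is not flipped); deep learning libraries compute it under the name convolution.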
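
The formula above can be transcribed almost literally into nested loops. The sketch below also contrasts it with textbook convolution, which flips the kernel before sliding it. The function names and the small 3x3 input are assumptions made for illustration.

```python
import numpy as np

def cross_correlate(I, K):
    """Literal form of the formula: sum over m, n of I[i+m, j+n] * K[m, n]."""
    kh, kw = K.shape
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    out[i, j] += I[i + m, j + n] * K[m, n]
    return out

def true_convolve(I, K):
    """Textbook convolution: flip the kernel on both axes, then slide it."""
    return cross_correlate(I, np.flip(K))

I = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
K = np.array([[1, 2],
              [3, 4]])

corr = cross_correlate(I, K)
conv = true_convolve(I, K)

print(corr)  # [[37. 47.] [67. 77.]]
print(conv)  # [[23. 33.] [53. 63.]]
```

In a CNN the kernel values are learned, so whether the kernel is flipped makes no practical difference: the network simply learns the flipped weights.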

NumPy Implementation

import numpy as np

def convolve2d(image, kernel):
    """Simple 2D convolution (no padding, stride=1)"""
    h, w = image.shape
    kh, kw = kernel.shape
    output_h = h - kh + 1
    output_w = w - kw + 1

    output = np.zeros((output_h, output_w))

    for i in range(output_h):
        for j in range(output_w):
            # Extract patch and compute dot product
            patch = image[i:i+kh, j:j+kw]
            output[i, j] = np.sum(patch * kernel)

    return output

# Edge detection kernel (Sobel)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Placeholder grayscale image (assumed here; substitute any 2D array)
image = np.random.rand(8, 8)

# Apply convolution
edges = convolve2d(image, sobel_x)
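
The implementation above covers only the no-padding, stride-1 case. A possible generalization with zero padding and a configurable stride is sketched below; `convolve2d_general` and the synthetic step-edge image are hypothetical names and data introduced for this example, not part of any library.

```python
import numpy as np

def convolve2d_general(image, kernel, stride=1, padding=0):
    """Sliding-window sum with zero padding and a stride (hypothetical helper)."""
    if padding > 0:
        # Surround the image with a border of zeros on all four sides
        image = np.pad(image, padding, mode="constant", constant_values=0)

    h, w = image.shape
    kh, kw = kernel.shape
    output_h = (h - kh) // stride + 1
    output_w = (w - kw) // stride + 1

    output = np.zeros((output_h, output_w))
    for i in range(output_h):
        for j in range(output_w):
            # Top-left corner of the current patch in the (padded) image
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Synthetic test image: dark left half, bright right half (a vertical edge)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

valid = convolve2d_general(image, sobel_x)                        # shape (4, 4)
same = convolve2d_general(image, sobel_x, padding=1)              # shape (6, 6)
strided = convolve2d_general(image, sobel_x, stride=2, padding=1) # shape (3, 3)

print(valid[0])  # [0. 4. 4. 0.]: the response peaks where the edge is
```

With a 3x3 kernel, padding=1 keeps the output the same size as the input, and stride=2 roughly halves each spatial dimension.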

PyTorch Convolutions

import torch
import torch.nn as nn

# 2D Convolution layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=16,    # Number of filters
    kernel_size=3,      # 3x3 kernels
    stride=1,           # Move 1 pixel at a time
    padding=1           # Keep same spatial size
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)  # Shape: (1, 16, 224, 224)
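
The output shape above follows from the kernel size, stride, and padding. The sketch below is a plain-Python helper (`conv_output_size` is a hypothetical name, not a PyTorch function) that follows the output-size formula given in the `nn.Conv2d` documentation, applied to one spatial dimension.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# The layer above: 3x3 kernel, stride 1, padding 1 keeps 224 -> 224
same = conv_output_size(224, kernel_size=3, stride=1, padding=1)
print(same)    # 224

# Without padding the output shrinks by kernel_size - 1
no_pad = conv_output_size(224, kernel_size=3)
print(no_pad)  # 222

# Stride 2 halves the spatial size
halved = conv_output_size(224, kernel_size=3, stride=2, padding=1)
print(halved)  # 112
```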