Backpropagation
How neural networks learn by propagating errors backward through layers.
A Simple Neural Network
Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.
Our goal: adjust these weights so the network makes accurate predictions.
- Neurons (nodes) compute a weighted sum, then apply an activation
- Weights are the learnable parameters
- This network has 6 weights to learn (4 input → hidden, 2 hidden → output)
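The setup above can be sketched in NumPy. The specific weight values here are illustrative assumptions, not values from the diagram:

```python
import numpy as np

# Illustrative weights for the 2-2-1 network described above.
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])   # input -> hidden: 4 weights
W2 = np.array([0.7, 0.5])     # hidden -> output: 2 weights

n_weights = W1.size + W2.size  # 6 learnable parameters in total
```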
Forward Pass: Computing the Output
Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.
This is a forward pass — inputs in, prediction out.
- Each neuron computes: output = activation(Σ weights × inputs)
- Forward pass is just matrix multiplication + activation
- The final output is the network's prediction
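As a minimal sketch of the forward pass, assuming a sigmoid activation and illustrative inputs and weights:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])             # inputs x1, x2 (assumed values)
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])          # input -> hidden weights
W2 = np.array([0.7, 0.5])            # hidden -> output weights

h = sigmoid(x @ W1)                  # hidden activations
y_hat = sigmoid(h @ W2)              # the network's prediction
```

Each line is exactly the bullet above: a matrix multiply followed by an activation.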
How Wrong Are We?
We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.
The larger the loss, the worse the prediction. Our job: minimize this loss.
- Loss measures prediction error
- Common losses: MSE, cross-entropy
- Training = minimizing loss
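The worked numbers above translate directly to code:

```python
# Squared-error loss for the prediction vs. target from the text.
y_hat, y = 0.7, 1.0
loss = (y - y_hat) ** 2   # (1.0 - 0.7)^2 = 0.09
```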
Gradients Flow Backward
Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight.
We start at the output and work backward. The gradient ∂L/∂ŷ tells us how changes in the output affect the loss.
- Gradient = direction of steepest loss increase
- Negative gradient = direction to reduce loss
- Computed via calculus (chain rule)
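For the squared-error loss above, the starting gradient ∂L/∂ŷ has a closed form:

```python
# d/dy_hat of (y - y_hat)^2 is -2 * (y - y_hat).
y_hat, y = 0.7, 1.0
dL_dyhat = -2.0 * (y - y_hat)   # negative: increasing y_hat would reduce the loss
```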
Chain Rule Through Layers
The chain rule lets us decompose the gradient through each layer. The gradient for w₃ combines the gradient from the output with the gradient through the activation.
At the hidden layer, gradients split — each hidden neuron receives gradients from all connections leading forward.
- Chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w
- Gradients accumulate through layers
- This is why it's called "backpropagation"
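A hedged sketch of the full backward pass for the 2-2-1 network, assuming sigmoid activations and illustrative weights. Each backward line is one application of the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])                 # assumed inputs
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])              # input -> hidden
W2 = np.array([0.7, 0.5])                # hidden -> output
y = 1.0                                  # target

# Forward pass, keeping the intermediates backprop needs
h = sigmoid(x @ W1)                      # hidden activations
y_hat = sigmoid(h @ W2)                  # prediction

# Backward pass: output to input
dL_dyhat = -2.0 * (y - y_hat)            # dL/dy_hat for squared error
delta2 = dL_dyhat * y_hat * (1 - y_hat)  # through the output sigmoid
dL_dW2 = delta2 * h                      # gradients for hidden -> output weights
delta1 = delta2 * W2 * h * (1 - h)       # gradient splits across hidden neurons
dL_dW1 = np.outer(x, delta1)             # gradients for input -> hidden weights
```

A standard sanity check on hand-derived gradients is to compare them against finite differences; the analytic values above agree with that check.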
Learning: Adjusting Weights
Finally, we update weights in the opposite direction of the gradient. If a weight contributed to increasing the loss, we decrease it. The learning rate (α) controls step size.
After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.
- Update rule: w = w - α × gradient
- Learning rate is a hyperparameter
- A full pass over the training data is one epoch; training runs many
- This is gradient descent
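Putting the pieces together, gradient descent is a loop of forward pass, backward pass, and update. This is a sketch under the same assumptions as before (sigmoid activations, illustrative inputs, a learning rate of 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])                       # assumed inputs
y = 1.0                                        # target
alpha = 0.5                                    # learning rate (assumed)
W1 = np.array([[0.4, 0.3], [0.2, 0.6]])
W2 = np.array([0.7, 0.5])

losses = []
for _ in range(1000):
    # Forward pass
    h = sigmoid(x @ W1)
    y_hat = sigmoid(h @ W2)
    losses.append((y - y_hat) ** 2)
    # Backward pass (chain rule)
    delta2 = -2.0 * (y - y_hat) * y_hat * (1 - y_hat)
    dW2 = delta2 * h
    delta1 = delta2 * W2 * h * (1 - h)
    dW1 = np.outer(x, delta1)
    # Update: step against the gradient
    W2 -= alpha * dW2
    W1 -= alpha * dW1
```

With each repetition the prediction moves toward the target and the loss shrinks, exactly as the text describes.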