Backpropagation
How neural networks learn by propagating errors backward through layers.
A Simple Neural Network
Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.
Our goal: adjust these weights so the network makes accurate predictions.
- Neurons (nodes) compute a weighted sum, then apply an activation
- Weights are the learnable parameters
- This network has 6 weights to learn (4 input → hidden, 2 hidden → output)
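The setup above can be sketched in NumPy. The specific weight values here are illustrative assumptions, not values from the diagram:

```python
import numpy as np

# Illustrative weights for the 2-2-1 network described above.
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])   # input -> hidden: 4 weights
W2 = np.array([0.7, 0.5])     # hidden -> output: 2 weights

n_weights = W1.size + W2.size  # 6 learnable parameters in total
```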
Forward Pass: Computing the Output
Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.
This is a forward pass — inputs in, prediction out.
- Each neuron computes: output = activation(Σ weights × inputs)
- Forward pass is just matrix multiplication + activation
- The final output is the network's prediction
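As a minimal sketch of the forward pass, assuming a sigmoid activation and illustrative inputs and weights:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])             # inputs x1, x2 (assumed values)
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])          # input -> hidden weights
W2 = np.array([0.7, 0.5])            # hidden -> output weights

h = sigmoid(x @ W1)                  # hidden activations
y_hat = sigmoid(h @ W2)              # the network's prediction
```

Each line is exactly the bullet above: a matrix multiply followed by an activation.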
How Wrong Are We?
We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.
The larger the loss, the worse the prediction. Our job: minimize this loss.
- Loss measures prediction error
- Common losses: MSE, cross-entropy
- Training = minimizing loss
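The worked numbers above translate directly to code:

```python
# Squared-error loss for the prediction vs. target from the text.
y_hat, y = 0.7, 1.0
loss = (y - y_hat) ** 2   # (1.0 - 0.7)^2 = 0.09
```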
Gradients Flow Backward
Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight.
We start at the output and work backward. The gradient ∂L/∂ŷ tells us how changes in the output affect the loss.
- Gradient = direction of steepest loss increase
- Negative gradient = direction to reduce loss
- Computed via calculus (chain rule)
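For the squared-error loss above, the starting gradient ∂L/∂ŷ has a closed form:

```python
# d/dy_hat of (y - y_hat)^2 is -2 * (y - y_hat).
y_hat, y = 0.7, 1.0
dL_dyhat = -2.0 * (y - y_hat)   # negative: increasing y_hat would reduce the loss
```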
Chain Rule Through Layers
The chain rule lets us decompose the gradient through each layer. The gradient for w₃ combines the gradient from the output with the gradient through the activation.
At the hidden layer, gradients split — each hidden neuron receives gradients from all connections leading forward.
- Chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w
- Gradients accumulate through layers
- This is why it's called "backpropagation"
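A hedged sketch of the full backward pass for the 2-2-1 network, assuming sigmoid activations and illustrative weights. Each backward line is one application of the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])                 # assumed inputs
W1 = np.array([[0.4, 0.3],
               [0.2, 0.6]])              # input -> hidden
W2 = np.array([0.7, 0.5])                # hidden -> output
y = 1.0                                  # target

# Forward pass, keeping the intermediates backprop needs
h = sigmoid(x @ W1)                      # hidden activations
y_hat = sigmoid(h @ W2)                  # prediction

# Backward pass: output to input
dL_dyhat = -2.0 * (y - y_hat)            # dL/dy_hat for squared error
delta2 = dL_dyhat * y_hat * (1 - y_hat)  # through the output sigmoid
dL_dW2 = delta2 * h                      # gradients for hidden -> output weights
delta1 = delta2 * W2 * h * (1 - h)       # gradient splits across hidden neurons
dL_dW1 = np.outer(x, delta1)             # gradients for input -> hidden weights
```

A standard sanity check on hand-derived gradients is to compare them against finite differences; the analytic values above agree with that check.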
Learning: Adjusting Weights
Finally, we update weights in the opposite direction of the gradient. If a weight contributed to increasing the loss, we decrease it. The learning rate (α) controls step size.
After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.
- Update rule: w = w - α × gradient
- Learning rate is a hyperparameter
- A full pass over the training data is one epoch; training runs many
- This is gradient descent
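Putting the pieces together, gradient descent is a loop of forward pass, backward pass, and update. This is a sketch under the same assumptions as before (sigmoid activations, illustrative inputs, a learning rate of 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])                       # assumed inputs
y = 1.0                                        # target
alpha = 0.5                                    # learning rate (assumed)
W1 = np.array([[0.4, 0.3], [0.2, 0.6]])
W2 = np.array([0.7, 0.5])

losses = []
for _ in range(1000):
    # Forward pass
    h = sigmoid(x @ W1)
    y_hat = sigmoid(h @ W2)
    losses.append((y - y_hat) ** 2)
    # Backward pass (chain rule)
    delta2 = -2.0 * (y - y_hat) * y_hat * (1 - y_hat)
    dW2 = delta2 * h
    delta1 = delta2 * W2 * h * (1 - h)
    dW1 = np.outer(x, delta1)
    # Update: step against the gradient
    W2 -= alpha * dW2
    W1 -= alpha * dW1
```

With each repetition the prediction moves toward the target and the loss shrinks, exactly as the text describes.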