Backpropagation

How neural networks learn by propagating errors backward through layers.

Read this as: how does error assign credit or blame to each weight?
Failure Trap
Treating backprop as insight into meaning rather than a gradient computation over parameters.
Decision Rule
Follow loss to gradients to updates; debug training by checking each link in that chain.
Figure: Backpropagation in a small neural network. A 2-input, 2-hidden-neuron, 1-output network with 4 learnable weights (w₁ to w₄). Data flows forward to a prediction (x₁ = 0.5, x₂ = 0.8, h₁ = 0.6, h₂ = 0.4, ŷ = 0.7), a loss measures the error (L = (y − ŷ)² = (1.0 − 0.7)² = 0.09, prediction too low), then gradients flow backward through the chain rule, with two factors for the output weights w₃, w₄ and three factors for the hidden weights w₁, w₂, and the weights are updated by gradient descent (w ← w − α · ∂L/∂w, α = 0.1). After the update the prediction improves to 0.85 and the loss falls from 0.09 to 0.02.

A Simple Neural Network

Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.

Our goal: adjust these weights so the network makes accurate predictions.

  • Neurons (nodes) compute a weighted sum of their inputs, then apply an activation function
  • Weights are the learnable parameters
  • This network has 4 weights to learn, labeled w₁ to w₄ in the figure

Forward Pass: Computing the Output

Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.

This is a forward pass — inputs in, prediction out.

  • Each neuron computes: output = activation(Σ weights × inputs)
  • A forward pass is matrix multiplication followed by an activation, repeated layer by layer
  • The final output is the network's prediction
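
The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the network behind the figure: the sigmoid activation, the wiring (w1 feeds h1 from x1, w2 feeds h2 from x2, and w3, w4 feed a linear output), and the weight values are all assumptions chosen to give exactly four weights.

```python
import math

def sigmoid(z):
    """Logistic activation; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    """Forward pass for a hypothetical 4-weight network.

    Assumed wiring (not stated in the text): w1 connects x1 to h1,
    w2 connects x2 to h2, and w3, w4 connect h1, h2 to a linear output.
    """
    x1, x2 = x
    w1, w2, w3, w4 = w
    h1 = sigmoid(w1 * x1)      # hidden neuron 1
    h2 = sigmoid(w2 * x2)      # hidden neuron 2
    y_hat = w3 * h1 + w4 * h2  # linear output, no activation
    return h1, h2, y_hat

# Illustrative weights only; they are not taken from the figure.
weights = (0.4, -0.3, 0.5, 1.0)
h1, h2, y_hat = forward((0.5, 0.8), weights)
print(f"h1={h1:.3f} h2={h2:.3f} y_hat={y_hat:.3f}")
```

Because the weights are made up, the printed values will not match the figure's 0.6, 0.4, and 0.7; the point is only the order of operations.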

How Wrong Are We?

We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.

The larger the loss, the worse the prediction. Our job: minimize this loss.

  • Loss measures prediction error
  • Common losses: MSE, cross-entropy
  • Training = minimizing loss
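
The loss arithmetic above is small enough to check directly. The sketch below computes the squared error for the example values (y = 1.0, ŷ = 0.7) and, as an assumed extension, averages it over a batch to get MSE; the function names are illustrative.

```python
def squared_error(y, y_hat):
    """Squared error for a single example."""
    return (y - y_hat) ** 2

def mean_squared_error(targets, predictions):
    """MSE: the average squared error over a batch of examples."""
    total = sum(squared_error(y, p) for y, p in zip(targets, predictions))
    return total / len(targets)

loss = squared_error(1.0, 0.7)
print(round(loss, 4))  # 0.09
```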

Output Weights: A Two-Factor Chain

Now the central question: how does each weight contribute to the loss? The answer is the gradient, the derivative of the loss with respect to each weight, computed with the chain rule.

The output weights w₃ and w₄ sit right next to the prediction, so their chain is short — just two factors: ∂L/∂w₃ = ∂L/∂ŷ × ∂ŷ/∂w₃. There is no hidden activation between them and the loss.

  • Start at the output and work backward
  • Output-weight gradient = 2 factors (loss → output → weight)
  • Negative gradient = the direction that reduces loss
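
To make the two-factor chain concrete, here is a worked calculation using the example values y = 1.0 and ŷ = 0.7 together with the hidden activations h₁ = 0.6 and h₂ = 0.4 from the figure. It assumes the output is a plain weighted sum, ŷ = w₃h₁ + w₄h₂, so that ∂ŷ/∂w₃ is simply h₁.

```python
# Values from the worked example.
y, y_hat = 1.0, 0.7
h1, h2 = 0.6, 0.4

# Factor 1: how the loss L = (y - y_hat)^2 changes with the prediction.
dL_dyhat = -2.0 * (y - y_hat)

# Factor 2: how the prediction changes with each output weight.
# Assumes a linear output y_hat = w3*h1 + w4*h2, so each derivative
# is just the hidden activation feeding that weight.
dyhat_dw3 = h1
dyhat_dw4 = h2

dL_dw3 = dL_dyhat * dyhat_dw3
dL_dw4 = dL_dyhat * dyhat_dw4
print(round(dL_dw3, 4), round(dL_dw4, 4))  # -0.36 -0.24
```

Both gradients are negative, so increasing w₃ and w₄ would reduce the loss, which fits a prediction that is too low.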

Hidden Weights: A Three-Factor Chain

Hidden weights w₁ and w₂ are one layer deeper, so their signal must pass through the hidden neuron's activation before it reaches the loss. That adds a third link to the chain: ∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂h₁ × ∂h₁/∂w₁.

This is the key idea: chain depth grows with distance from the output. Deeper weights reuse the gradients already computed for the layers above them — which is exactly why it's called "backpropagation."

  • Output weights: 2 factors · hidden weights: 3 factors
  • Each layer reuses the gradient from the layer above it
  • Errors propagate backward, layer by layer
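
The three-factor chain can be checked numerically. The sketch below assumes a sigmoid hidden activation, a linear output, and made-up values for the weights; it multiplies the three factors and compares the result with a finite-difference estimate, a common way to test a hand-written gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values; H2_TERM stands in for the fixed contribution w4*h2.
X1, W3, H2_TERM, Y = 0.5, 0.5, 0.4, 1.0

def loss_for(w1):
    """Loss as a function of w1 alone; everything else is held fixed."""
    h1 = sigmoid(w1 * X1)
    y_hat = W3 * h1 + H2_TERM
    return (Y - y_hat) ** 2

w1 = 0.4
h1 = sigmoid(w1 * X1)
y_hat = W3 * h1 + H2_TERM

dL_dyhat = -2.0 * (Y - y_hat)    # factor 1: loss -> output
dyhat_dh1 = W3                   # factor 2: output -> hidden activation
dh1_dw1 = h1 * (1.0 - h1) * X1   # factor 3: sigmoid derivative times input
dL_dw1 = dL_dyhat * dyhat_dh1 * dh1_dw1

# Central finite difference as an independent check on the chain rule.
eps = 1e-6
numeric = (loss_for(w1 + eps) - loss_for(w1 - eps)) / (2 * eps)
print(dL_dw1, numeric)
```

Note that the first factor, dL_dyhat, is the same quantity used for the output weights; that reuse is what the backward pass saves.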

Learning: Adjusting Weights

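Finally, we update each weight in the direction opposite to its gradient. If increasing a weight would increase the loss (positive gradient), we decrease it; if increasing it would reduce the loss (negative gradient), we increase it. The learning rate (α) controls the step size.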

After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.

  • Update rule: w ← w − α × ∂L/∂w
  • Learning rate is a hyperparameter
  • Each update is one iteration; one full pass over the training data is an epoch
  • This is gradient descent
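
Putting the pieces together, one training step is a forward pass, a backward pass, and an update. The loop below is a minimal sketch under the same assumptions as before (sigmoid hidden units, linear output, one input per hidden neuron, illustrative starting weights); `step` is a hypothetical helper, and only α = 0.1, the inputs, and the target come from the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(w, x, y, alpha):
    """One forward pass, backward pass, and gradient-descent update."""
    w1, w2, w3, w4 = w
    x1, x2 = x
    # Forward pass.
    h1, h2 = sigmoid(w1 * x1), sigmoid(w2 * x2)
    y_hat = w3 * h1 + w4 * h2
    loss = (y - y_hat) ** 2
    # Backward pass: dL/dy_hat is computed once and reused by every weight.
    d_yhat = -2.0 * (y - y_hat)
    grads = (
        d_yhat * w3 * h1 * (1.0 - h1) * x1,  # dL/dw1 (three factors)
        d_yhat * w4 * h2 * (1.0 - h2) * x2,  # dL/dw2 (three factors)
        d_yhat * h1,                         # dL/dw3 (two factors)
        d_yhat * h2,                         # dL/dw4 (two factors)
    )
    # Update: w <- w - alpha * gradient.
    new_w = tuple(wi - alpha * gi for wi, gi in zip(w, grads))
    return new_w, loss

w = (0.4, -0.3, 0.5, 1.0)  # illustrative starting weights
losses = []
for _ in range(50):
    w, loss = step(w, (0.5, 0.8), 1.0, alpha=0.1)
    losses.append(loss)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

With a single training example, the loss should shrink steadily; a real training run would cycle through many examples, and too large a learning rate can make the loss oscillate or grow instead.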