Read this as: how does error assign credit or blame to each weight?
- Failure Trap: treating backprop as insight into meaning rather than a gradient computation over parameters.
- Decision Rule: follow loss to gradients to updates; debug training by checking each link in that chain.
A Simple Neural Network
Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.
Our goal: adjust these weights so the network makes accurate predictions.
- Neurons (nodes) compute weighted sums plus activation
- Weights are the learnable parameters
- This walkthrough tracks 4 weights (w₁ to w₄); a fully connected 2-2-1 network would have 6 weights, plus biases
Forward Pass: Computing the Output
Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.
This is a forward pass — inputs in, prediction out.
- Each neuron computes: output = activation(Σ weights × inputs)
- Forward pass is just matrix multiplication + activation
- The final output is the network's prediction
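The forward pass described above can be sketched in a few lines of plain Python. The wiring is an assumption, since the text names only four weights: here w1 and w2 each feed one hidden neuron, w3 and w4 combine the hidden outputs linearly, and the activation is a sigmoid. The input and weight values are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3, w4):
    # Assumed wiring: each hidden neuron sees one weighted input.
    h1 = sigmoid(w1 * x1)
    h2 = sigmoid(w2 * x2)
    # Assumed linear output: a weighted sum of the hidden activations.
    y_hat = w3 * h1 + w4 * h2
    return h1, h2, y_hat

# Hypothetical inputs and weights, chosen only for illustration.
h1, h2, y_hat = forward(x1=1.0, x2=0.5, w1=0.4, w2=-0.6, w3=0.7, w4=0.3)
print(h1, h2, y_hat)
```

The hidden activations are returned alongside the prediction because the backward pass needs them later.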
How Wrong Are We?
We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.
The larger the loss, the worse the prediction. Our job: minimize this loss.
- Loss measures prediction error
- Common losses: MSE, cross-entropy
- Training = minimizing loss
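As a quick check of the arithmetic above, here is a minimal sketch of the squared-error loss using the same numbers (ŷ = 0.7, y = 1.0). The function name `squared_error` is illustrative, not taken from any library.

```python
def squared_error(y, y_hat):
    """Squared difference between the target and the prediction."""
    return (y - y_hat) ** 2

loss = squared_error(y=1.0, y_hat=0.7)
print(round(loss, 4))  # 0.09
```

Rounding is used only to hide floating-point noise in the printed value.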
Output Weights: A Two-Factor Chain
Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight, computed with the chain rule.
The output weights w₃ and w₄ sit right next to the prediction, so their chain is short — just two factors: ∂L/∂w₃ = ∂L/∂ŷ × ∂ŷ/∂w₃. There is no hidden activation between them and the loss.
- Start at the output and work backward
- Output-weight gradient = 2 factors (loss → output → weight)
- Negative gradient = the direction that reduces loss
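The two-factor chain can be written out directly. This sketch assumes the squared-error loss from earlier, so ∂L/∂ŷ = 2(ŷ - y), and a linear output neuron, so ∂ŷ/∂w₃ is simply the hidden activation h₁. The value h = 0.6 is hypothetical.

```python
def output_weight_grad(y, y_hat, h):
    # Factor 1: how the loss changes with the prediction (squared error).
    dL_dyhat = 2.0 * (y_hat - y)
    # Factor 2: how the prediction changes with the weight.
    # For an assumed linear output y_hat = w3*h1 + w4*h2, this is just h.
    dyhat_dw = h
    return dL_dyhat * dyhat_dw

grad_w3 = output_weight_grad(y=1.0, y_hat=0.7, h=0.6)
print(round(grad_w3, 4))  # approximately -0.36
```

The gradient is negative here, which means that increasing w₃ would lower the loss.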
Hidden Weights: A Three-Factor Chain
Hidden weights w₁ and w₂ are one layer deeper, so their signal must pass through the hidden neuron's activation before it reaches the loss. That adds a third link to the chain: ∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂h₁ × ∂h₁/∂w₁.
This is the key idea: chain depth grows with distance from the output. Deeper weights reuse the gradients already computed for the layers above them — which is exactly why it's called "backpropagation."
- Output weights: 2 factors · hidden weights: 3 factors
- Each layer reuses the gradient from the layer above it
- Errors propagate backward, layer by layer
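A minimal sketch of the three-factor chain for w₁, checked against a finite-difference estimate. The assumptions are a sigmoid hidden activation, a linear output, and squared-error loss. The `rest` parameter is a hypothetical stand-in for the w₄·h₂ term, which does not depend on w₁.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, x1, w3, rest, y):
    # rest stands in for the w4*h2 contribution, held fixed.
    y_hat = w3 * sigmoid(w1 * x1) + rest
    return (y - y_hat) ** 2

def hidden_weight_grad(w1, x1, w3, rest, y):
    h1 = sigmoid(w1 * x1)
    y_hat = w3 * h1 + rest
    dL_dyhat = 2.0 * (y_hat - y)     # loss -> output
    dyhat_dh1 = w3                   # output -> hidden activation
    dh1_dw1 = h1 * (1.0 - h1) * x1   # hidden activation -> weight (sigmoid derivative)
    return dL_dyhat * dyhat_dh1 * dh1_dw1

# Hypothetical values, for illustration only.
args = dict(x1=1.0, w3=0.7, rest=0.2, y=1.0)
analytic = hidden_weight_grad(w1=0.4, **args)

# Central finite difference: nudge w1 and watch the loss change.
eps = 1e-6
numeric = (loss_fn(0.4 + eps, **args) - loss_fn(0.4 - eps, **args)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)
```

Comparing the analytic gradient with a numerical one is a common way to check each link of the chain when debugging training.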
Learning: Adjusting Weights
Finally, we update each weight in the direction opposite to its gradient. If increasing a weight would increase the loss (positive gradient), we decrease it; if the gradient is negative, we increase it. The learning rate (α) controls step size.
After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.
- Update rule: w = w - α × gradient
- Learning rate is a hyperparameter
- Each update is one iteration; one full pass over the training data is an epoch
- This is gradient descent
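Putting the pieces together, the update rule can be sketched as a small training loop on a single example. The wiring (one input per hidden neuron, sigmoid hidden activations, linear output), the starting weights, and the learning rate are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, x, y, alpha):
    w1, w2, w3, w4 = w
    x1, x2 = x
    # Forward pass.
    h1, h2 = sigmoid(w1 * x1), sigmoid(w2 * x2)
    y_hat = w3 * h1 + w4 * h2
    loss = (y - y_hat) ** 2
    # Backward pass: one shared upstream factor, reused by every weight.
    d = 2.0 * (y_hat - y)
    grads = [
        d * w3 * h1 * (1.0 - h1) * x1,  # dL/dw1 (three factors)
        d * w4 * h2 * (1.0 - h2) * x2,  # dL/dw2 (three factors)
        d * h1,                         # dL/dw3 (two factors)
        d * h2,                         # dL/dw4 (two factors)
    ]
    # Update rule: w = w - alpha * gradient.
    new_w = [wi - alpha * gi for wi, gi in zip(w, grads)]
    return new_w, loss

w = [0.4, -0.6, 0.7, 0.3]  # hypothetical starting weights
losses = []
for _ in range(50):
    w, loss = train_step(w, x=(1.0, 0.5), y=1.0, alpha=0.1)
    losses.append(loss)
print(losses[0], losses[-1])
```

With a small enough learning rate the recorded loss should shrink from the first step to the last; a rate that is too large can make it oscillate or grow instead.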