Backpropagation

Read this as: how does error assign credit or blame to each weight?

Failure Trap: Treating backprop as insight into meaning rather than a gradient computation over parameters.
Decision Rule: Follow loss to gradients to updates; debug training by checking each link in that chain.

A Simple Neural Network

Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.

Our goal: adjust these weights so the network makes accurate predictions.

Neurons (nodes) compute a weighted sum of their inputs, then apply an activation
Weights are the learnable parameters
This walkthrough tracks 4 weights (w₁ to w₄); a fully connected 2-2-1 network would have 6, plus biases

Forward Pass: Computing the Output

Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.

This is a forward pass — inputs in, prediction out.

Each neuron computes: output = activation(Σ weights × inputs)
Forward pass is just matrix multiplication + activation
The final output is the network's prediction
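The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the explainer's own implementation: the activation is taken to be a sigmoid (the text does not name one), the output neuron is linear, biases are omitted, and each hidden neuron reads a single input so that exactly four weights w1 to w4 exist. All input and weight values are made up.

```python
import math

def sigmoid(z):
    """Logistic activation (an assumed choice; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3, w4):
    """Forward pass of the assumed 4-weight network: inputs in, prediction out."""
    h1 = sigmoid(w1 * x1)        # hidden neuron 1: weighted input, then activation
    h2 = sigmoid(w2 * x2)        # hidden neuron 2: weighted input, then activation
    y_hat = w3 * h1 + w4 * h2    # linear output: weighted sum of hidden activations
    return h1, h2, y_hat

# Hypothetical inputs and weights, chosen only for illustration.
h1, h2, y_hat = forward(x1=1.0, x2=0.5, w1=0.4, w2=-0.2, w3=0.6, w4=0.3)
print(h1, h2, y_hat)
```

The function returns the hidden activations along with the prediction because the backward pass needs them later.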

How Wrong Are We?

We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.

The larger the loss, the worse the prediction. Our job: minimize this loss.

Loss measures prediction error
Common losses: MSE, cross-entropy
Training = minimizing loss
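The squared-error arithmetic above can be checked directly. The sketch below uses the values from the text (ŷ = 0.7, y = 1.0); the helper names and the second example in the batch are hypothetical and only show how the per-example error averages into MSE.

```python
def squared_error(y, y_hat):
    """Squared error for a single prediction."""
    return (y - y_hat) ** 2

def mean_squared_error(targets, predictions):
    """MSE: the average squared error over several examples."""
    errors = [squared_error(y, p) for y, p in zip(targets, predictions)]
    return sum(errors) / len(errors)

loss = squared_error(1.0, 0.7)
print(round(loss, 4))  # 0.09

# The second (target, prediction) pair is made up for illustration.
batch_loss = mean_squared_error([1.0, 0.0], [0.7, 0.2])
print(batch_loss)
```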

Output Weights: A Two-Factor Chain

Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight, computed with the chain rule.

The output weights w₃ and w₄ sit right next to the prediction, so their chain is short — just two factors: ∂L/∂w₃ = ∂L/∂ŷ × ∂ŷ/∂w₃. There is no hidden activation between them and the loss.

Start at the output and work backward
Output-weight gradient = 2 factors (loss → output → weight)
Negative gradient = the direction that reduces loss
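As a rough sketch of the two-factor chain, the code below computes the output-weight gradients for a linear output ŷ = w₃h₁ + w₄h₂ with squared-error loss. The prediction and target come from the text; the hidden activations h1 and h2 are made-up values, and the function name is hypothetical.

```python
def output_weight_grads(y, y_hat, h1, h2):
    """Gradients of L = (y - y_hat)^2 with respect to w3 and w4."""
    dL_dyhat = -2.0 * (y - y_hat)   # factor 1: how the loss changes with the prediction
    dyhat_dw3 = h1                  # factor 2: y_hat = w3*h1 + w4*h2, so d(y_hat)/d(w3) = h1
    dyhat_dw4 = h2                  # likewise d(y_hat)/d(w4) = h2
    return dL_dyhat * dyhat_dw3, dL_dyhat * dyhat_dw4

# y and y_hat are from the text; h1 and h2 are hypothetical activations.
grad_w3, grad_w4 = output_weight_grads(y=1.0, y_hat=0.7, h1=0.6, h2=0.5)
print(grad_w3, grad_w4)
```

Both gradients come out negative here, which means increasing w₃ or w₄ would lower the loss. That matches the picture: the prediction is below the target, so the output weights should grow.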

Hidden Weights: A Three-Factor Chain

Hidden weights w₁ and w₂ are one layer deeper, so their signal must pass through the hidden neuron's activation before it reaches the loss. That adds a third link to the chain: ∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂h₁ × ∂h₁/∂w₁.

This is the key idea: chain length grows with distance from the output. Weights farther from the output reuse the gradients already computed for the layers closer to it (here, ∂L/∂ŷ), which is why the method is called "backpropagation."

Output weights: 2 factors · hidden weights: 3 factors
Each layer reuses the gradient from the layer after it (the one closer to the output)
Errors propagate backward, layer by layer
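One way to make the three-factor chain concrete is to compute ∂L/∂w₁ by hand and compare it with a finite-difference estimate. The sketch assumes a sigmoid hidden activation, a linear output, no biases, and one input per hidden neuron; none of these are stated in the text, and all numeric values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(x1, x2, y, w1, w2, w3, w4):
    """Squared-error loss of the assumed 4-weight network."""
    h1 = sigmoid(w1 * x1)
    h2 = sigmoid(w2 * x2)
    y_hat = w3 * h1 + w4 * h2
    return (y - y_hat) ** 2

def grad_w1(x1, x2, y, w1, w2, w3, w4):
    """dL/dw1 as a product of three chain-rule factors."""
    h1 = sigmoid(w1 * x1)
    h2 = sigmoid(w2 * x2)
    y_hat = w3 * h1 + w4 * h2
    dL_dyhat = -2.0 * (y - y_hat)     # factor 1: shared with the output weights
    dyhat_dh1 = w3                    # factor 2: output depends on h1 through w3
    dh1_dw1 = h1 * (1.0 - h1) * x1    # factor 3: sigmoid derivative times the input
    return dL_dyhat * dyhat_dh1 * dh1_dw1

# Hypothetical inputs, target, and weights.
args = dict(x1=1.0, x2=0.5, y=1.0, w1=0.4, w2=-0.2, w3=0.6, w4=0.3)
analytic = grad_w1(**args)

# Central finite difference: nudge w1 slightly in both directions.
eps = 1e-6
plus = dict(args, w1=args["w1"] + eps)
minus = dict(args, w1=args["w1"] - eps)
numeric = (loss_fn(**plus) - loss_fn(**minus)) / (2 * eps)

print(analytic, numeric)
```

Comparing a hand-derived gradient against a numerical estimate like this is a common way to check one link of the loss-to-gradient chain when debugging training.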

Learning: Adjusting Weights

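Finally, we update each weight in the direction opposite to its gradient. If increasing a weight would increase the loss (positive gradient), we decrease it; if increasing it would reduce the loss (negative gradient), we increase it. The learning rate (α) controls the step size.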

After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.

Update rule: w = w - α × gradient
Learning rate is a hyperparameter
One update = one iteration; one full pass over the training data = one epoch
This is gradient descent
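Putting the pieces together, a single-example gradient descent loop might look like the sketch below. It carries the same assumptions as a simplified 4-weight network (sigmoid hidden units, linear output, no biases, one input per hidden neuron), and the starting weights, learning rate, and step count are arbitrary choices, not values from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, x1, x2, y, lr):
    """One forward pass, one backward pass, one update. Returns (new weights, loss before update)."""
    w1, w2, w3, w4 = w
    # Forward pass
    h1 = sigmoid(w1 * x1)
    h2 = sigmoid(w2 * x2)
    y_hat = w3 * h1 + w4 * h2
    loss = (y - y_hat) ** 2
    # Backward pass: dL/dy_hat is computed once and reused by every weight
    dL_dyhat = -2.0 * (y - y_hat)
    grads = [
        dL_dyhat * w3 * h1 * (1.0 - h1) * x1,  # w1: three factors
        dL_dyhat * w4 * h2 * (1.0 - h2) * x2,  # w2: three factors
        dL_dyhat * h1,                         # w3: two factors
        dL_dyhat * h2,                         # w4: two factors
    ]
    # Update rule: w = w - lr * gradient
    new_w = [wi - lr * gi for wi, gi in zip(w, grads)]
    return new_w, loss

w = [0.4, -0.2, 0.6, 0.3]   # hypothetical starting weights
losses = []
for step in range(50):
    w, loss = train_step(w, x1=1.0, x2=0.5, y=1.0, lr=0.1)
    losses.append(loss)

print(losses[0], losses[-1])
```

With a single training example the loss simply shrinks toward zero; real training averages gradients over many examples, and a learning rate that is too large can make the loss grow instead of fall.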