Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.
Visual Overview
The Learning Problem and Gradient Intuition
THE LEARNING PROBLEM

We have:
• Input data
• Desired outputs (labels)
• A loss function (measures how wrong we are)
• Millions of weights to adjust

We need:
• Which direction to adjust each weight
• How much to adjust each weight

The answer: compute the GRADIENT of the loss with respect to each weight, then move each weight in the opposite direction to decrease the loss.

GRADIENT INTUITION

Picture the loss plotted against a single weight. The current position sits partway up the curve, the gradient points uphill, and one update step moves the weight downhill toward the minimum:

update = weight - learning_rate × gradient
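To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional quadratic loss. The specific loss function, starting point, and learning rate are made up for illustration.

```python
# Minimal gradient descent on a one-dimensional quadratic loss.
# loss(w) = (w - 3)^2 has its minimum at w = 3 (values chosen for illustration).

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # d/dw of (w - 3)^2 is 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 5.0              # current position, as in the diagram
learning_rate = 0.1

for step in range(25):
    w = w - learning_rate * gradient(w)   # step opposite to the gradient

print(round(w, 3), round(loss(w), 6))     # w is close to 3.0, loss is close to 0
```

Each step moves the weight a fraction of the way toward the minimum; a larger learning rate takes bigger steps but can overshoot.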
The Chain Rule
Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.
Chain Rule
CHAIN RULE

If y = f(g(x)), then:

dy/dx = (dy/dg) × (dg/dx)

"Derivative of the outer × derivative of the inner."

SIMPLE EXAMPLE

y = (2x + 1)²

Let g = 2x + 1, so y = g².

dy/dg = 2g (derivative of the square)
dg/dx = 2 (derivative of 2x + 1)

dy/dx = 2g × 2 = 2(2x + 1) × 2 = 4(2x + 1)
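The worked example can be verified numerically. The sketch below (not part of the original text) compares the chain-rule result 4(2x + 1) against a finite-difference estimate at an arbitrary point.

```python
# Numerical check of the chain-rule example: y = (2x + 1)^2, dy/dx = 4(2x + 1).

def y(x):
    return (2 * x + 1) ** 2

def dy_dx_analytic(x):
    return 4 * (2 * x + 1)

def dy_dx_numeric(x, h=1e-6):
    # central finite-difference approximation of the derivative
    return (y(x + h) - y(x - h)) / (2 * h)

x = 1.5
print(dy_dx_analytic(x))            # 16
print(round(dy_dx_numeric(x), 6))   # ~16.0
```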
Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
This is why it’s called “back” propagation — gradients propagate from the output back to the input.
The Vanishing Gradient Problem
Deep networks had a critical issue: gradients got exponentially smaller in earlier layers.
Vanishing Gradient Problem
GRADIENT FLOW IN A 4-LAYER NETWORK

Forward pass:
Input → [L1] → [L2] → [L3] → [L4] → Loss

Backward pass (gradients):
dL/dW1 ← dL/dW2 ← dL/dW3 ← dL/dW4 ← dL

At each layer, gradients multiply:
dL/dW1 = (local_1) × (local_2) × (local_3) × (local_4)

If each local gradient is less than 1:
0.5 × 0.5 × 0.5 × 0.5 = 0.0625 ← 16× smaller!

For 10 layers with local gradients of 0.5:
0.5^10 ≈ 0.001 ← the gradient is practically zero

Early layers stop learning.

WHY SIGMOID CAUSES THIS

Sigmoid: s(x) = 1 / (1 + e^(−x))
Gradient: s'(x) = s(x) × (1 − s(x))

The maximum gradient is 0.25 (at x = 0), and it is usually much smaller. Since s'(x) is always at most 0.25, multiplying many of these factors together makes the gradient vanish.
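The arithmetic is easy to reproduce. The sketch below multiplies out local gradients of 0.5 at a few depths (the depths chosen are arbitrary) and evaluates the sigmoid's maximum gradient.

```python
import math

# Multiplying local gradients of 0.5 across increasing depth (numbers from the
# diagram above): the product shrinks exponentially with the number of layers.
for depth in (4, 10, 50):
    print(f"{depth} layers: gradient scaled by {0.5 ** depth:.2e}")
# 4 layers:  6.25e-02
# 10 layers: 9.77e-04
# 50 layers: 8.88e-16

# Sigmoid's local gradient s(x) * (1 - s(x)) peaks at x = 0.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = sigmoid(0.0)
print(s * (1 - s))   # 0.25, the largest gradient a sigmoid can ever pass back
```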
Solutions to Vanishing Gradients
ReLU Activation
ReLU Gradient
RELU GRADIENT

ReLU(x) = max(0, x)

Gradient:
x > 0: gradient = 1
x < 0: gradient = 0

When the neuron is active (x > 0), the gradient is 1, so nothing shrinks: 1 × 1 × 1 × 1 = 1.

Problem: "dead neurons". If a neuron is always in the x < 0 region, its gradient is always 0 and it never learns.
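For reference, here is ReLU and its gradient written out in plain Python (a sketch only; deep-learning frameworks derive this automatically through autograd).

```python
# ReLU and its gradient in plain Python.

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 where the neuron is active, 0 where it is inactive ("dead" if always so)
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu_grad(2.5))     # 2.5 1.0 -> gradient passes through intact
print(relu(-1.0), relu_grad(-1.0))   # 0.0 0.0 -> no gradient flows back
```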
Skip Connections (Residual Connections)
Residual Connections
RESIDUAL CONNECTION

Standard layer:
output = f(input)

Residual layer:
output = input + f(input)
(the added "input" term is the skip connection)

Gradient flow:
d(output)/d(input) = 1 + df/d(input)
(the identity path always contributes the 1)

Even if f's gradient vanishes, the "1" remains. Gradients have a highway to flow through.

GRADIENT HIGHWAY VISUAL

Without residuals:
Input → [L1] → [L2] → [L3] → [L4] → Output
Gradients must pass through every layer, shrinking at each one.

With residuals:
Input ──────────────────────────→ (+) → Output
   └──→ [L1] → [L2] → [L3] → [L4] ──┘
Gradients can skip layers entirely; the "residual stream" flows directly from input to output.
This is why transformers work. Every attention layer and FFN layer is residual:
# Transformer layer (simplified)
x = x + attention(x)   # residual around attention
x = x + ffn(x)         # residual around FFN
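A toy scalar model makes the effect of the skip connection visible. In the sketch below (an illustration, not production code), each layer's function f has an assumed local gradient, and the residual version multiplies factors of (1 + local) instead of the bare local gradients.

```python
# Toy comparison of gradient flow through 10 stacked layers, where each layer's
# own function f has an assumed local gradient `local` (values are illustrative).

def grad_plain(num_layers, local):
    # output = f(f(...f(x))): the chain rule multiplies the bare local gradients
    g = 1.0
    for _ in range(num_layers):
        g *= local
    return g

def grad_residual(num_layers, local):
    # output = x + f(x) per layer: each factor becomes (1 + local)
    g = 1.0
    for _ in range(num_layers):
        g *= 1.0 + local
    return g

for local in (0.5, 0.01):
    print(f"local={local}: plain {grad_plain(10, local):.2e}, "
          f"residual {grad_residual(10, local):.2f}")
# local=0.5:  plain 9.77e-04, residual 57.67
# local=0.01: plain 1.00e-20, residual 1.10
```

Roughly speaking, the 0.5 case also hints at why transformers apply normalization inside each block: the identity path keeps gradients from vanishing, and normalization keeps the residual stream's scale under control.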
Why This Matters for Modern Models
Understanding LoRA
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.
LoRA Gradient Flow
LORA GRADIENT FLOW

Base model weights: frozen (no gradients computed)
LoRA adapters: trainable

W_new = W_base + A × B
(W_base is frozen; A and B are trained)

Only A and B receive gradients. Much smaller → much faster training.
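A minimal PyTorch sketch (assuming PyTorch is available; the dimensions and initialization are illustrative) shows that gradients land only on the adapter matrices:

```python
import torch

# Hypothetical dimensions; in a real model d would be the layer's hidden size.
d, r = 16, 2
W_base = torch.randn(d, d, requires_grad=False)   # frozen base weight
A = torch.randn(d, r, requires_grad=True)         # trainable adapter
B = torch.zeros(r, d, requires_grad=True)         # trainable adapter (starts at 0)

x = torch.randn(1, d)
W_new = W_base + A @ B            # effective weight: frozen base plus adapter product
loss = (x @ W_new).sum()
loss.backward()

print(W_base.grad)                 # None: the frozen weight gets no gradient
print(A.grad.shape, B.grad.shape)  # torch.Size([16, 2]) torch.Size([2, 16])
```

Starting B at zero means W_new equals W_base before training, so fine-tuning begins from the base model's behavior.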
Understanding Training Instability
Diagnosing Training Issues
DIAGNOSING TRAINING ISSUES

Loss not decreasing:
• Gradients too small? (vanishing)
• Learning rate too low?
• Bad initialization?

Loss explodes (goes to NaN):
• Gradients too large? (exploding)
• Learning rate too high?
• Missing normalization?

Loss oscillates wildly:
• Learning rate too high?
• Batch size too small?
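A practical first step for any of these symptoms is to log the size of the gradients themselves. The sketch below is one way to do that with PyTorch (assuming it is available); the model and batch are throwaway placeholders.

```python
import torch
import torch.nn.functional as F

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all gradients currently stored on the model's parameters."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# Throwaway model and batch, just to exercise the check.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.Sigmoid(),
    torch.nn.Linear(8, 1),
)
x, target = torch.randn(32, 8), torch.randn(32, 1)
F.mse_loss(model(x), target).backward()

print(f"global grad norm: {global_grad_norm(model):.4f}")
# Rule of thumb: a norm collapsing toward zero points at vanishing gradients;
# a norm (or the loss) blowing up points at exploding gradients or too high a learning rate.
```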