
Backpropagation

How neural networks learn: gradients, chain rule, vanishing gradients, and residual connections

TL;DR

Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.

Visual Overview

The Learning Problem and Gradient Intuition
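Training is an optimization problem: a loss function measures how wrong the network's predictions are, and the gradient of that loss with respect to each weight says which direction (and roughly how far) to nudge the weight to reduce the error. A toy sketch on a single weight (the numbers and learning rate below are arbitrary illustrations):

# Gradient-descent intuition on one weight: prediction = w * x, loss = (prediction - target)^2
w, x, target, lr = 0.0, 2.0, 6.0, 0.1

for step in range(20):
    pred = w * x
    loss = (pred - target) ** 2
    grad = 2 * (pred - target) * x   # dLoss/dw
    w -= lr * grad                   # step against the gradient

print(round(w, 3))                   # approaches 3.0, since 3.0 * 2.0 == 6.0

Backpropagation is the bookkeeping that computes this same per-weight gradient when there are millions of weights arranged in layers.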

The Chain Rule

Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.

Chain rule: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). Each layer in the network contributes one local derivative to this product.

Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
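A quick numerical sanity check (the composition y = (3x + 1)^2 is a toy example chosen for illustration): the hand-derived chain-rule gradient matches a finite-difference estimate.

# Chain rule on y = f(g(x)) with g(x) = 3x + 1 and f(u) = u^2
def g(x): return 3 * x + 1
def f(u): return u ** 2

def dy_dx(x):
    return 2 * g(x) * 3              # (dy/dg) * (dg/dx)

x, eps = 2.0, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(dy_dx(x), round(numeric, 3))   # both ~= 42.0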


Forward Pass vs Backward Pass

Training has two phases per batch:

Forward pass: inputs flow through the layers to produce activations, predictions, and the loss.
Backward pass: the gradient of the loss flows from the output back through the same layers, giving each weight its gradient via the chain rule.

This is why it’s called “back” propagation — gradients propagate from the output back to the input.
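A minimal sketch of one training step (assuming PyTorch is available; the toy model and batch shapes are arbitrary):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)   # one batch of 16 examples

# Forward pass: inputs flow through the layers to produce predictions and a loss.
loss = nn.functional.mse_loss(model(x), y)

# Backward pass: gradients of the loss flow from the output back to every weight.
loss.backward()
print(model[0].weight.grad.shape)               # every trainable parameter now has a .grad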


The Vanishing Gradient Problem

Deep networks had a critical issue: gradients got exponentially smaller in earlier layers.

Why it happens: with sigmoid or tanh activations, each layer multiplies the gradient by a local derivative smaller than 1 (at most 0.25 for sigmoid). Stack many layers and the chain-rule product shrinks exponentially, so the earliest layers receive almost no learning signal.
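A back-of-the-envelope illustration (assuming sigmoid activations near their steepest point, where the derivative peaks at 0.25):

# Chain-rule product of per-layer sigmoid derivatives across 20 layers.
sigmoid_grad_max = 0.25   # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) <= 0.25

grad = 1.0
for layer in range(20):
    grad *= sigmoid_grad_max

print(f"{grad:.1e}")      # ~9.1e-13: the first layers see essentially no signal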

Solutions to Vanishing Gradients

ReLU Activation

ReLU(x) = max(0, x) has gradient 1 for positive inputs and 0 otherwise. Gradients pass through active units at full strength instead of being shrunk, which is a main reason ReLU replaced sigmoid in deep networks. The trade-off: units stuck on the zero side get no gradient at all ("dead ReLUs").
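In code (taking the gradient at exactly zero to be 0, the usual convention):

# ReLU and its gradient.
def relu(x):      return max(0.0, x)
def relu_grad(x): return 1.0 if x > 0 else 0.0

print(relu_grad(2.5), relu_grad(-1.0))   # 1.0 passes the gradient through, 0.0 blocks it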

Skip Connections (Residual Connections)

A residual (skip) connection computes y = x + f(x) instead of y = f(x). Since d/dx [x + f(x)] = 1 + f'(x), the gradient always has an identity path back through the "+ x" term no matter how small f'(x) becomes, creating a gradient highway from the loss straight to the earliest layers.

This is why transformers work. Every attention layer and FFN layer is residual:

# Transformer layer (simplified)
x = x + attention(x)    # residual around attention
x = x + ffn(x)          # residual around FFN
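A small experiment (assuming PyTorch; the depth, the tanh "layer", and the 0.5 scale are arbitrary stand-ins) showing how much gradient reaches the input of a 50-layer stack with and without skip connections:

import torch

def deep_stack(x, depth, residual):
    for _ in range(depth):
        out = torch.tanh(0.5 * x)           # a weak layer with local derivative < 1
        x = x + out if residual else out    # residual keeps an identity path open
    return x

for residual in (False, True):
    x = torch.ones(1, requires_grad=True)
    deep_stack(x, depth=50, residual=residual).sum().backward()
    print(f"residual={residual}: gradient at input = {x.grad.item():.2e}")

# Without residuals the gradient collapses (on the order of 1e-15); with them it stays O(1).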

Why This Matters for Modern Models

Understanding LoRA

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.

During the backward pass, gradients still flow through the frozen base weights on their way to earlier layers, but only the small adapter matrices receive updates; the frozen parameters get no stored gradient and never change.
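A minimal sketch of the idea (illustrative only, not the peft library's implementation; the class name, rank r, and alpha scaling are stand-ins chosen here):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                 # frozen: no gradient is stored
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

After loss.backward(), only A and B have .grad populated; the frozen base weights never change, which is why LoRA checkpoints stay small.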

Understanding Training Instability

Diagnosing Training Issues

Loss shooting up or turning NaN usually means exploding gradients: lower the learning rate, add normalization, or clip gradient norms. Loss that barely moves points to vanishing gradients or dead ReLUs: inspect per-layer gradient norms to see where the signal dies.
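A diagnostic sketch (assuming PyTorch; report_grad_norms is a hypothetical helper, and max_norm=1.0 is an arbitrary choice):

import torch

def report_grad_norms(model: torch.nn.Module):
    # Call after loss.backward(): tiny norms in early layers suggest vanishing gradients,
    # huge or NaN norms suggest exploding gradients.
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name:40s} grad norm = {p.grad.norm().item():.3e}")

# Common remedy for exploding gradients: clip the global norm before optimizer.step().
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)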

Key Formulas Summary

Backprop Essentials

Chain rule: for y = f(g(x)), dy/dx = (dy/dg) · (dg/dx)
Gradient descent update: w ← w - lr · ∂L/∂w
Sigmoid derivative: sigmoid'(x) = sigmoid(x) · (1 - sigmoid(x)) ≤ 0.25 (the source of vanishing gradients)
Residual connection: d/dx [x + f(x)] = 1 + df/dx (the identity term keeps gradients flowing)

When This Matters

Situation | Concept to apply
Model isn't learning | Check for vanishing gradients, dead ReLUs
Training explodes to NaN | Gradients exploding, reduce LR or add norm
Understanding LoRA | Only adapter params receive gradients
Understanding residual connections | Gradient highways for deep networks
Understanding transformer architecture | Residual stream is the core design
Debugging fine-tuning | Gradients to frozen params = 0

See It In Action

Interview Notes

Interview relevance: ~65% of ML interviews
Production impact: understanding training dynamics
Performance: foundation for LoRA and fine-tuning