Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.
Visual Overview
The Learning Problem and Gradient Intuition
THE LEARNING PROBLEM

We have:
• Input data
• Desired outputs (labels)
• A loss function (measures how wrong we are)
• Millions of weights to adjust

We need:
• Which direction to adjust each weight
• How much to adjust each weight

The answer: compute the GRADIENT of the loss with respect to each weight, then move each weight in the opposite direction to decrease the loss.

GRADIENT INTUITION

Picture the loss plotted against a single weight. The current position sits partway up the curve, the gradient points uphill, and one update step moves the weight downhill toward the minimum:

update = weight - learning_rate × gradient
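To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional quadratic loss. The specific loss function, starting point, and learning rate are made up for illustration.

```python
# Minimal gradient descent on a one-dimensional quadratic loss.
# loss(w) = (w - 3)^2 has its minimum at w = 3 (values chosen for illustration).

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # d/dw of (w - 3)^2 is 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 5.0              # current position, as in the diagram
learning_rate = 0.1

for step in range(25):
    w = w - learning_rate * gradient(w)   # step opposite to the gradient

print(round(w, 3), round(loss(w), 6))     # w is close to 3.0, loss is close to 0
```

Each step moves the weight a fraction of the way toward the minimum; a larger learning rate takes bigger steps but can overshoot.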
The Chain Rule
Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.
Chain Rule
CHAIN RULE

If y = f(g(x)), then:

dy/dx = (dy/dg) × (dg/dx)

"Derivative of the outer × derivative of the inner."

SIMPLE EXAMPLE

y = (2x + 1)²

Let g = 2x + 1, so y = g².

dy/dg = 2g (derivative of the square)
dg/dx = 2 (derivative of 2x + 1)

dy/dx = 2g × 2 = 2(2x + 1) × 2 = 4(2x + 1)
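The worked example can be verified numerically. The sketch below (not part of the original text) compares the chain-rule result 4(2x + 1) against a finite-difference estimate at an arbitrary point.

```python
# Numerical check of the chain-rule example: y = (2x + 1)^2, dy/dx = 4(2x + 1).

def y(x):
    return (2 * x + 1) ** 2

def dy_dx_analytic(x):
    return 4 * (2 * x + 1)

def dy_dx_numeric(x, h=1e-6):
    # central finite-difference approximation of the derivative
    return (y(x + h) - y(x - h)) / (2 * h)

x = 1.5
print(dy_dx_analytic(x))            # 16
print(round(dy_dx_numeric(x), 6))   # ~16.0
```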
Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
This is why it’s called “back” propagation — gradients propagate from the output back to the input.
The Vanishing Gradient Problem
Deep networks had a critical issue: gradients got exponentially smaller in earlier layers.
Vanishing Gradient Problem
GRADIENT FLOW IN A 4-LAYER NETWORK

Forward pass:
Input → [L1] → [L2] → [L3] → [L4] → Loss

Backward pass (gradients):
dL/dW1 ← dL/dW2 ← dL/dW3 ← dL/dW4 ← dL

At each layer, gradients multiply:
dL/dW1 = (local_1) × (local_2) × (local_3) × (local_4)

If each local gradient is less than 1:
0.5 × 0.5 × 0.5 × 0.5 = 0.0625 ← 16× smaller!

For 10 layers with local gradients of 0.5:
0.5^10 ≈ 0.001 ← the gradient is practically zero

Early layers stop learning.

WHY SIGMOID CAUSES THIS

Sigmoid: s(x) = 1 / (1 + e^(−x))
Gradient: s'(x) = s(x) × (1 − s(x))

The maximum gradient is 0.25 (at x = 0), and it is usually much smaller. Since s'(x) is always at most 0.25, multiplying many of these factors together makes the gradient vanish.
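The arithmetic is easy to reproduce. The sketch below multiplies out local gradients of 0.5 at a few depths (the depths chosen are arbitrary) and evaluates the sigmoid's maximum gradient.

```python
import math

# Multiplying local gradients of 0.5 across increasing depth (numbers from the
# diagram above): the product shrinks exponentially with the number of layers.
for depth in (4, 10, 50):
    print(f"{depth} layers: gradient scaled by {0.5 ** depth:.2e}")
# 4 layers:  6.25e-02
# 10 layers: 9.77e-04
# 50 layers: 8.88e-16

# Sigmoid's local gradient s(x) * (1 - s(x)) peaks at x = 0.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = sigmoid(0.0)
print(s * (1 - s))   # 0.25, the largest gradient a sigmoid can ever pass back
```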
Solutions to Vanishing Gradients
ReLU Activation
ReLU Gradient
RELU GRADIENT

ReLU(x) = max(0, x)

Gradient:
x > 0: gradient = 1
x < 0: gradient = 0

When the neuron is active (x > 0), the gradient is 1, so nothing shrinks: 1 × 1 × 1 × 1 = 1.

Problem: "dead neurons". If a neuron is always in the x < 0 region, its gradient is always 0 and it never learns.
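For reference, here is ReLU and its gradient written out in plain Python (a sketch only; deep-learning frameworks derive this automatically through autograd).

```python
# ReLU and its gradient in plain Python.

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 where the neuron is active, 0 where it is inactive ("dead" if always so)
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu_grad(2.5))     # 2.5 1.0 -> gradient passes through intact
print(relu(-1.0), relu_grad(-1.0))   # 0.0 0.0 -> no gradient flows back
```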
Skip Connections (Residual Connections)
Residual Connections
RESIDUAL CONNECTION

Standard layer:
output = f(input)

Residual layer:
output = input + f(input)
(the added "input" term is the skip connection)

Gradient flow:
d(output)/d(input) = 1 + df/d(input)
(the identity path always contributes the 1)

Even if f's gradient vanishes, the "1" remains. Gradients have a highway to flow through.

GRADIENT HIGHWAY VISUAL

Without residuals:
Input → [L1] → [L2] → [L3] → [L4] → Output
Gradients must pass through every layer, shrinking at each one.

With residuals:
Input ──────────────────────────→ (+) → Output
   └──→ [L1] → [L2] → [L3] → [L4] ──┘
Gradients can skip layers entirely; the "residual stream" flows directly from input to output.
This is why transformers work. Every attention layer and FFN layer is residual:
# Transformer layer (simplified)
x = x + attention(x)   # residual around attention
x = x + ffn(x)         # residual around FFN
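A toy scalar model makes the effect of the skip connection visible. In the sketch below (an illustration, not production code), each layer's function f has an assumed local gradient, and the residual version multiplies factors of (1 + local) instead of the bare local gradients.

```python
# Toy comparison of gradient flow through 10 stacked layers, where each layer's
# own function f has an assumed local gradient `local` (values are illustrative).

def grad_plain(num_layers, local):
    # output = f(f(...f(x))): the chain rule multiplies the bare local gradients
    g = 1.0
    for _ in range(num_layers):
        g *= local
    return g

def grad_residual(num_layers, local):
    # output = x + f(x) per layer: each factor becomes (1 + local)
    g = 1.0
    for _ in range(num_layers):
        g *= 1.0 + local
    return g

for local in (0.5, 0.01):
    print(f"local={local}: plain {grad_plain(10, local):.2e}, "
          f"residual {grad_residual(10, local):.2f}")
# local=0.5:  plain 9.77e-04, residual 57.67
# local=0.01: plain 1.00e-20, residual 1.10
```

Roughly speaking, the 0.5 case also hints at why transformers apply normalization inside each block: the identity path keeps gradients from vanishing, and normalization keeps the residual stream's scale under control.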
Why This Matters for Modern Models
Understanding LoRA
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.
LoRA Gradient Flow
LORA GRADIENT FLOW

Base model weights: frozen (no gradients computed)
LoRA adapters: trainable

W_new = W_base + A × B
(W_base is frozen; A and B are trained)

Only A and B receive gradients. Much smaller → much faster training.
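A minimal PyTorch sketch (assuming PyTorch is available; the dimensions and initialization are illustrative) shows that gradients land only on the adapter matrices:

```python
import torch

# Hypothetical dimensions; in a real model d would be the layer's hidden size.
d, r = 16, 2
W_base = torch.randn(d, d, requires_grad=False)   # frozen base weight
A = torch.randn(d, r, requires_grad=True)         # trainable adapter
B = torch.zeros(r, d, requires_grad=True)         # trainable adapter (starts at 0)

x = torch.randn(1, d)
W_new = W_base + A @ B            # effective weight: frozen base plus adapter product
loss = (x @ W_new).sum()
loss.backward()

print(W_base.grad)                 # None: the frozen weight gets no gradient
print(A.grad.shape, B.grad.shape)  # torch.Size([16, 2]) torch.Size([2, 16])
```

Starting B at zero means W_new equals W_base before training, so fine-tuning begins from the base model's behavior.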
Understanding Training Instability
Diagnosing Training Issues
DIAGNOSING TRAINING ISSUES

Loss not decreasing:
• Gradients too small? (vanishing)
• Learning rate too low?
• Bad initialization?

Loss explodes (goes to NaN):
• Gradients too large? (exploding)
• Learning rate too high?
• Missing normalization?

Loss oscillates wildly:
• Learning rate too high?
• Batch size too small?
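A practical first step for any of these symptoms is to log the size of the gradients themselves. The sketch below is one way to do that with PyTorch (assuming it is available); the model and batch are throwaway placeholders.

```python
import torch
import torch.nn.functional as F

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all gradients currently stored on the model's parameters."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# Throwaway model and batch, just to exercise the check.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.Sigmoid(),
    torch.nn.Linear(8, 1),
)
x, target = torch.randn(32, 8), torch.randn(32, 1)
F.mse_loss(model(x), target).backward()

print(f"global grad norm: {global_grad_norm(model):.4f}")
# Rule of thumb: a norm collapsing toward zero points at vanishing gradients;
# a norm (or the loss) blowing up points at exploding gradients or too high a learning rate.
```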