TL;DR
Regularization reduces overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving.
Visual Overview
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Training loss:    ↓ decreasing nicely                    │
│  Validation loss:  ↓ then ↑ starts increasing             │
│                                                           │
│  Loss                                                     │
│    │                                                      │
│  2 │                                                      │
│    │            _____ val loss (starts going up)          │
│  1 │ ___________                                          │
│    │            __________ train loss (keeps going down)  │
│  0 └────────────────────────                              │
│    0        epochs       100                              │
│                                                           │
│  Model memorized training data.                           │
│  Doesn't generalize to new data.                          │
│                                                           │
└───────────────────────────────────────────────────────────┘
When overfitting happens:
- Small dataset, large model
- Training too long
- Model has too much capacity for the task
- No regularization
Dropout
Randomly “drops” (zeros out) neurons during training. This forces the network not to rely on any single neuron.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Training (dropout=0.1):                                  │
│    Randomly zero out 10% of activations each forward pass │
│                                                           │
│    Before: [0.5, 0.3, 0.8, 0.2, 0.6]                      │
│    Mask:   [1,   1,   0,   1,   1  ]  ← 0.8 dropped       │
│    After:  [0.5, 0.3, 0.0, 0.2, 0.6]                      │
│                                                           │
│  Inference:                                               │
│    No dropout. Use all neurons.                           │
│    Classic dropout: scale activations by                  │
│    (1 - dropout_rate) to compensate.                      │
│    Inverted dropout (most frameworks): kept activations   │
│    are scaled by 1 / (1 - dropout_rate) during training,  │
│    so inference needs no scaling.                         │
│                                                           │
└───────────────────────────────────────────────────────────┘
WHY IT WORKS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Without dropout:                                         │
│    Network can rely on specific neurons                   │
│    "Neuron 47 always detects cats"                        │
│    If neuron 47 is wrong, whole prediction fails          │
│                                                           │
│  With dropout:                                            │
│    Any neuron might be missing                            │
│    Network must build redundant representations           │
│    Multiple neurons learn to detect cats                  │
│    More robust predictions                                │
│                                                           │
└───────────────────────────────────────────────────────────┘
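The masking step above can be sketched in plain Python. This is a minimal illustration, not a framework implementation: the function name `dropout` and its `rng` parameter are made up for the example. It uses the inverted variant (the behavior PyTorch documents for `nn.Dropout`), where survivors are scaled up during training so that inference is a plain pass-through.

```python
import random

def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training:
        # Inference: use every activation unchanged.
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Each activation survives with probability `keep`.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5, 0.3, 0.8, 0.2, 0.6]
train_out = dropout(acts, rate=0.1, rng=random.Random(0))  # some entries may be 0.0
eval_out = dropout(acts, rate=0.1, training=False)         # identical to acts
```

Because survivors are divided by the keep probability, the expected value of each activation is the same in training and inference, which is why no correction is needed at evaluation time.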
Typical dropout values:
| Model type | Dropout rate |
|---|---|
| Transformers | 0.1 (10%) |
| Older MLPs | 0.5 (50%) |
| CNNs | 0.25-0.5 |
| Fine-tuning | 0.1 or lower |
Where to apply:
- After attention layers
- After FFN layers
- Before final classification layer
- Usually NOT inside the attention computation itself (though some implementations also apply dropout to the attention weights)
Weight Decay (L2 Regularization)
Penalizes large weights by adding the sum of their squares, scaled by a coefficient λ, to the loss.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Standard loss:                                           │
│    L = task_loss                                          │
│                                                           │
│  With weight decay (L2 regularization):                   │
│    L = task_loss + λ × Σ(w²)                              │
│                                                           │
│    λ = weight decay coefficient (typically 0.01)          │
│    w = all model weights                                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
WHY IT WORKS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Large weights = model is very confident about features   │
│                = likely memorizing training data          │
│                                                           │
│  Penalizing large weights:                                │
│    • Keeps weights small                                  │
│    • Model can't "overfit" to any single feature          │
│    • Smoother, more generalizable function                │
│                                                           │
│  Small weights = "softer" decision boundaries             │
│                                                           │
└───────────────────────────────────────────────────────────┘
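As a worked example of the formula above, the penalty and its gradient can be computed by hand. This is only a sketch: the helper names and the `task_loss` value are hypothetical, and a real model would apply the same arithmetic to every weight tensor.

```python
def l2_penalty(weights, lam):
    """λ × Σ(w²) over a flat list of weights."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty for each weight: 2 × λ × w."""
    return [2.0 * lam * w for w in weights]

weights = [0.5, -2.0, 0.1]
task_loss = 0.40   # hypothetical value from a forward pass
lam = 0.01

total_loss = task_loss + l2_penalty(weights, lam)  # 0.40 + 0.01 × 4.26
grad = l2_penalty_grad(weights, lam)               # one entry per weight
```

The gradient of the penalty is proportional to the weight itself, so the largest weight (-2.0 here) is pulled toward zero hardest while the small ones are barely touched.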
AdamW vs Adam with Weight Decay
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Adam with L2 (WRONG):                                    │
│    gradient = task_gradient + λ × w                       │
│    m, v = update_momentum(gradient)                       │
│    w = w - lr × m / sqrt(v)                               │
│                                                           │
│  Problem: Weight decay is entangled with adaptive LR      │
│  Params with large gradients get less regularization      │
│                                                           │
│  AdamW (CORRECT):                                         │
│    gradient = task_gradient     ← No λ here               │
│    m, v = update_momentum(gradient)                       │
│    w = w - lr × m / sqrt(v) - lr × λ × w                  │
│                                    ↑                      │
│                        Decay applied separately           │
│                                                           │
│  Weight decay is truly decoupled.                         │
│  This is what you should use.                             │
│                                                           │
└───────────────────────────────────────────────────────────┘
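The difference between the two update rules can be seen on a single scalar weight. The sketch below is an illustration under stated assumptions, not any library's implementation: `adam_step` and `new_state` are hypothetical names, and the code adds the bias correction and epsilon term that the pseudocode above leaves out.

```python
import math

def new_state():
    """Fresh optimizer state for one scalar weight."""
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update for a single scalar weight.

    decoupled=False: L2 style, λ × w is folded into the gradient.
    decoupled=True:  AdamW style, lr × λ × w is subtracted separately.
    """
    if not decoupled:
        grad = grad + weight_decay * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w
    return w_new

# Same weight and decay coefficient, very different task-gradient scales.
shrink = {}
for g in (0.01, 100.0):
    w_plain = adam_step(1.0, g, new_state(), weight_decay=0.0)
    w_l2 = adam_step(1.0, g, new_state(), weight_decay=0.1, decoupled=False)
    w_adamw = adam_step(1.0, g, new_state(), weight_decay=0.1, decoupled=True)
    # Extra shrinkage caused by the decay term under each scheme.
    shrink[g] = (w_plain - w_l2, w_plain - w_adamw)
```

On this first step Adam's normalization divides the whole gradient, decay term included, by its own magnitude, so the L2 version shrinks the weight by almost nothing extra. The AdamW version shrinks it by exactly `lr × λ × w` regardless of the gradient scale. A single step is an extreme case, but it shows the entanglement the box describes.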
Typical values:
- Language models: 0.01 - 0.1
- Vision models: 0.0001 - 0.01
- Fine-tuning: 0.01 (same as pre-training usually)
Early Stopping
Stop training when validation loss stops improving.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Monitor:  validation loss (or another metric)            │
│  Patience: how many epochs to wait for improvement        │
│                                                           │
│  Loss                                                     │
│    │                                                      │
│  2 │                                                      │
│    │            _____ val loss                            │
│  1 │ ___________                                          │
│    │            __________ train loss                     │
│  0 └────────────────────────                              │
│    0   10   20   30   40   50                             │
│                 ↑                                         │
│             STOP HERE                                     │
│    (val loss stopped improving)                           │
│                                                           │
└───────────────────────────────────────────────────────────┘
Implementation:
best_val_loss = float('inf')
patience_counter = 0
patience = 5  # epochs to wait

# train_one_epoch, evaluate, save_checkpoint, and load_checkpoint are
# placeholders for your own training utilities.
for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_checkpoint()  # Save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break

# Load best checkpoint for final model
load_checkpoint()
Typical patience values:
- Fine-tuning: 2-3 epochs
- Training from scratch: 5-10 epochs
- Large models: 3-5 epochs
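The patience logic can also be packaged as a small helper and exercised against a recorded validation curve. This is a sketch: the class name `EarlyStopper`, its `min_delta` parameter (the smallest drop that counts as an improvement), and the loss values are all assumptions made for illustration.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when patience runs out."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta        # assumption: minimum meaningful improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch       # this is where a checkpoint would be saved
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation curve: improves, then drifts upward.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.64, 0.66, 0.70]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
```

With `patience=3`, training stops three epochs after the best loss at epoch 3, and the checkpoint to restore is the one from `stopper.best_epoch`, not the last epoch that ran.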
Label Smoothing
Don’t use hard labels (0 or 1). Spread some probability to other classes.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Hard labels (no smoothing):                              │
│    True class = 2: [0, 0, 1, 0]                           │
│                                                           │
│  Soft labels (smoothing = 0.1):                           │
│    True class = 2: [0.033, 0.033, 0.9, 0.033]             │
│                                                           │
│  10% of probability spread to other classes.              │
│                                                           │
└───────────────────────────────────────────────────────────┘
Why it works:
- Prevents model from being overconfident
- Encourages model to keep some probability for alternatives
- Acts as regularization on the output distribution
Typical value: 0.1 (10% smoothing)
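The soft labels shown above can be reproduced in a few lines. This sketch follows the convention used in this section, where the true class keeps `1 - smoothing` and the remainder is split across the other classes. Some libraries instead mix the one-hot target with a uniform distribution over all classes, which for four classes would give 0.925 rather than 0.9. The function names and the example predictions are hypothetical.

```python
import math

def smooth_labels(true_class, num_classes, smoothing=0.1):
    """True class keeps 1 - smoothing; the rest is split evenly
    across the other num_classes - 1 classes."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if i == true_class else off_value
            for i in range(num_classes)]

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

hard = [0.0, 0.0, 1.0, 0.0]
soft = smooth_labels(true_class=2, num_classes=4, smoothing=0.1)

confident = [0.001, 0.001, 0.997, 0.001]  # nearly all mass on the true class
hedged = [0.03, 0.03, 0.91, 0.03]         # leaves some mass for alternatives

loss_hard = (cross_entropy(hard, confident), cross_entropy(hard, hedged))
loss_soft = (cross_entropy(soft, confident), cross_entropy(soft, hedged))
```

With hard labels the near-certain prediction has the lower loss. With smoothed labels the hedged prediction wins, which is the overconfidence penalty described above.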
Combining Techniques
Regularization techniques stack. Use multiple together.
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  1. Dropout: 0.1 after attention and FFN                  │
│  2. Weight decay: 0.01 with AdamW                         │
│  3. Early stopping: patience=3 on val loss                │
│  4. Label smoothing: 0.1 (for classification)             │
│                                                           │
│  For fine-tuning, often reduce dropout (model already     │
│  regularized).                                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
Common combinations:
| Scenario | Regularization |
|---|---|
| Pre-training large LLM | Weight decay 0.1, dropout 0.1 |
| Fine-tuning | Weight decay 0.01, dropout 0.1, early stopping |
| Small dataset | Dropout 0.3, weight decay 0.1, data augmentation |
| Large dataset | Minimal — dropout 0.1, weight decay 0.01 |
Debugging Regularization
STILL OVERFITTING DESPITE REGULARIZATION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Symptoms:                                                │
│    • Added dropout, weight decay                          │
│    • Train/val gap still large                            │
│                                                           │
│  Causes:                                                  │
│    • Regularization too weak                              │
│    • Model still too large for data                       │
│    • No data augmentation                                 │
│                                                           │
│  Debug steps:                                             │
│    1. Increase dropout (0.1 → 0.3)                        │
│    2. Increase weight decay (0.01 → 0.1)                  │
│    3. Add data augmentation                               │
│    4. Use smaller model                                   │
│    5. Get more data                                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
UNDERFITTING (TRAIN LOSS HIGH)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  Symptoms:                                                │
│    • Training loss not decreasing enough                  │
│    • Model can't fit training data                        │
│                                                           │
│  Causes:                                                  │
│    • Too much regularization                              │
│    • Dropout too high                                     │
│    • Weight decay too strong                              │
│    • Model too small                                      │
│                                                           │
│  Debug steps:                                             │
│    1. Reduce dropout (0.3 → 0.1)                          │
│    2. Reduce weight decay (0.1 → 0.01)                    │
│    3. Remove early stopping temporarily                   │
│    4. Use larger model                                    │
│                                                           │
└───────────────────────────────────────────────────────────┘
When This Matters
| Situation | What to apply |
|---|---|
| Fine-tuning on small dataset | All: dropout, weight decay, early stopping |
| Model overfitting | Increase dropout, weight decay |
| Model underfitting | Decrease regularization |
| Classification task | Add label smoothing |
| Training from scratch | Moderate regularization, data augmentation |
| Using AdamW | Set the optimizer's weight_decay parameter (do not add an L2 term to the loss) |
| Evaluating model | Ensure dropout is OFF (model.eval()) |
Production signal