
Regularization

Dropout, weight decay, early stopping, and label smoothing to prevent overfitting

TL;DR

Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving. Label smoothing softens hard targets so the model does not become overconfident. Use all of these when fine-tuning on small datasets.


The Overfitting Problem

Common causes of overfitting:

  • Small dataset, large model
  • Training too long
  • Model has too much capacity for the task
  • No regularization

Dropout

Dropout randomly “drops” (zeros out) neurons during training, forcing the network not to rely on any single neuron.


Typical dropout values:

  • Transformers: 0.1 (10%)
  • Older MLPs: 0.5 (50%)
  • CNNs: 0.25-0.5
  • Fine-tuning: 0.1 or lower

Where to apply:

  • After attention layers
  • After FFN layers
  • Before final classification layer
  • NOT inside attention computation itself
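A minimal sketch of this placement, assuming PyTorch; the block below is illustrative, not taken from any specific library:

import torch.nn as nn

class Block(nn.Module):
    """Illustrative transformer block showing typical dropout placement (p=0.1)."""
    def __init__(self, d_model=512, n_heads=8, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop1, self.drop2 = nn.Dropout(p_drop), nn.Dropout(p_drop)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)             # no dropout passed inside the attention computation
        x = self.norm1(x + self.drop1(attn_out))     # dropout after the attention layer
        x = self.norm2(x + self.drop2(self.ffn(x)))  # dropout after the FFN layer
        return x

Dropout is only active in model.train(); calling model.eval() disables it automatically.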

Weight Decay (L2 Regularization)

Penalizes large weights. In the classic L2 formulation, the squared sum of the weights is added to the loss, scaled by a small coefficient.
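A minimal sketch of that classic formulation, assuming PyTorch; the model, data, and the 0.01 coefficient are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)                      # stand-in model
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

task_loss = F.cross_entropy(model(x), y)
# L2 penalty: squared sum of all weights, scaled by the decay coefficient
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = task_loss + 0.01 * l2_penalty
loss.backward()

With Adam-family optimizers, prefer the optimizer's weight_decay parameter instead, as discussed next.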


AdamW vs Adam with Weight Decay

Adding an L2 term to the loss and using Adam couples the penalty to Adam's adaptive per-parameter scaling, so the effective decay varies across parameters. AdamW decouples weight decay by shrinking the weights directly at each update step; this is the standard choice for transformers, set via the optimizer's weight_decay parameter rather than in the loss.
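A minimal sketch, assuming PyTorch's torch.optim.AdamW; the learning rate and decay values are illustrative:

import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in model

# Decoupled weight decay: the weights are shrunk directly at each step,
# independent of Adam's adaptive per-parameter scaling.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)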

Typical values:

  • Language models: 0.01 - 0.1
  • Vision models: 0.0001 - 0.01
  • Fine-tuning: 0.01 (usually the same as pre-training)

Early Stopping

Stop training when validation loss stops improving.


Implementation:

best_val_loss = float('inf')
patience_counter = 0
patience = 5  # epochs to wait

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_checkpoint()  # Save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break

# Load best checkpoint for final model
load_checkpoint()

Typical patience values:

  • Fine-tuning: 2-3 epochs
  • Training from scratch: 5-10 epochs
  • Large models: 3-5 epochs

Label Smoothing

Don’t train on hard labels (1 for the true class, 0 for everything else). Spread a small amount of probability to the other classes.


Why it works:

  • Prevents model from being overconfident
  • Encourages model to keep some probability for alternatives
  • Acts as regularization on the output distribution

Typical value: 0.1 (10% smoothing)
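A minimal sketch, assuming PyTorch's label_smoothing argument on CrossEntropyLoss (available in recent versions); the manual targets below show what the smoothing does:

import torch
import torch.nn as nn

num_classes, eps = 4, 0.1
logits = torch.randn(8, num_classes)
targets = torch.randint(0, num_classes, (8,))

# Built-in: the true class keeps 1 - eps + eps/K probability, the rest share eps
loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, targets)

# The same soft targets written out explicitly
soft = torch.full((8, num_classes), eps / num_classes)
soft[torch.arange(8), targets] = 1 - eps + eps / num_classes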


Combining Techniques

Regularization techniques stack. Use multiple together.


Common combinations:

  • Pre-training a large LLM: weight decay 0.1, dropout 0.1
  • Fine-tuning: weight decay 0.01, dropout 0.1, early stopping
  • Small dataset: dropout 0.3, weight decay 0.1, data augmentation
  • Large dataset: minimal (dropout 0.1, weight decay 0.01)
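A hedged sketch of the fine-tuning combination above, assuming PyTorch; the sizes, learning rate, and model are placeholders:

import torch
import torch.nn as nn

model = nn.Sequential(          # stand-in for a pretrained encoder plus new head
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Dropout(0.1),            # dropout 0.1 before the classification layer
    nn.Linear(768, 2),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Early stopping: wrap this in the training loop from the Early Stopping section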

Debugging Regularization

Compare training and validation loss. A large gap (training loss far lower) means overfitting: increase dropout or weight decay. Both losses high means underfitting: reduce regularization.

When This Matters

  • Fine-tuning on a small dataset: all of them (dropout, weight decay, early stopping)
  • Model overfitting: increase dropout and weight decay
  • Model underfitting: decrease regularization
  • Classification task: add label smoothing
  • Training from scratch: moderate regularization plus data augmentation
  • Using AdamW: set the weight_decay parameter (not an L2 term in the loss)
  • Evaluating a model: make sure dropout is off (model.eval())
Interview Notes

  • 💼 Interview relevance: 60% of ML interviews
  • 🏭 Production impact: every fine-tuning job
  • Performance: preventing overfitting on small datasets