Write-Ahead Log (WAL)

A technique where changes are written to a durable log before being applied to the database, enabling crash recovery and replication in database systems

TL;DR

Write-Ahead Logging (WAL) is a database technique where all changes are first written to a durable append-only log before modifying the actual database. This enables crash recovery (replay the log), the atomicity and durability parts of ACID, and replication (ship the log to replicas). Used in one form or another by PostgreSQL, MySQL, Redis, and most production databases.

Visual Overview

Write-Ahead Logging Overview
WITHOUT WAL (Naive Approach - BROKEN)

  Transaction: Transfer $100                    
                                                
  1. UPDATE accounts SET balance=900 WHERE id=1 
      (write to disk)                          
  2. CRASH! ⚡                                   
  3. UPDATE accounts SET balance=1100 WHERE id=2
     (never executed)                           
                                                
  Result: Money disappears! Data corrupted


WITH WAL (Write-Ahead Logging)

 Transaction: Transfer $100                     
                                                
 STEP 1: Write to WAL first (sequential write)  
        
  WAL: [UPDATE id=1 balance=900]              
  WAL: [UPDATE id=2 balance=1100]             
  WAL: [COMMIT]                               
        
  (fsync - durable to disk)                    
 STEP 2: Acknowledge transaction               
                                                
 STEP 3: Apply to database (async, can crash)   
 - UPDATE id=1 balance=900                      
 - UPDATE id=2 balance=1100                     
                                                
 If crash after STEP 2:                         
  Replay WAL on restart                       
  Transaction completes successfully           


WAL RECOVERY AFTER CRASH:

 Database crashes during transaction            
                                               
 On restart:                                    
 1. Read WAL from last checkpoint               
 2. Replay all committed transactions           
 3. Roll back uncommitted transactions          
 4. Database state restored                    
                                                
 Timeline:                                      
 T0: Begin transaction                          
 T1: Write to WAL                               
 T2: fsync (durable)                            
 T3: CRASH
 T4: Restart, replay WAL                        
 T5: Database consistent again                  


Core Explanation

What is Write-Ahead Logging?

Write-Ahead Logging (WAL) is a technique where all modifications to the database are first written to a durable log before being applied to the actual data pages. This ensures that:

  1. Durability: Changes survive crashes (written to log = durable)
  2. Atomicity: All-or-nothing transactions (log has full transaction)
  3. Crash Recovery: Replay log to restore consistent state
  4. Replication: Ship log to replicas for consistency

Key Principle: “Log First, Then Apply”

RULE: No data page can be written to disk until its log record is on disk

Why?

- Log record = intent to change
- Data page = actual change
- If data page written first and crash occurs:
  - Don't know if change was committed or not
  - Cannot undo/redo

With WAL:

- Log written first = have complete transaction history
- Can replay (redo) committed transactions
- Can rollback (undo) uncommitted transactions
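One way to picture the rule is as a guard on page writes: each page remembers the LSN of its latest change, and it may go to disk only once the log is durable up to that LSN. The sketch below is illustrative only. `MiniBufferManager`, `page_lsn`, and `flushed_lsn` are hypothetical names, and no real I/O is performed.

```python
class WalRuleViolation(Exception):
    """Raised when a page would reach disk before its log record."""


class MiniBufferManager:
    """Tracks how far the log is durable and guards page writes (toy model)."""

    def __init__(self):
        self.flushed_lsn = 0        # highest LSN known to be durable in the log
        self.page_lsn = {}          # page id -> LSN of the latest change to it
        self.pages_on_disk = set()  # pages we have "written" (no real I/O)

    def log_change(self, page_id, lsn):
        # A change was logged in memory; remember which LSN touched the page.
        self.page_lsn[page_id] = lsn

    def flush_log(self, up_to_lsn):
        # Pretend the log has been fsync'ed up to this LSN.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def write_page(self, page_id):
        # The WAL rule: the page's log record must already be on disk.
        if self.page_lsn.get(page_id, 0) > self.flushed_lsn:
            raise WalRuleViolation(f"{page_id}: log not durable yet")
        self.pages_on_disk.add(page_id)


bm = MiniBufferManager()
bm.log_change("page-7", lsn=101)
try:
    bm.write_page("page-7")      # log not flushed yet: rejected
except WalRuleViolation as exc:
    print("blocked:", exc)
bm.flush_log(up_to_lsn=101)
bm.write_page("page-7")          # allowed now
```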

WAL Components

1. Log Records

Log Record Structure:

 LSN: Log Sequence Number (unique ID)   
 Transaction ID: Which transaction?     
 Operation: INSERT/UPDATE/DELETE        
 Before Image: Old value (for undo)     
 After Image: New value (for redo)      
 Prev LSN: Previous record in this txn  


Example Log Records:
LSN=100: TXN=42 BEGIN
LSN=101: TXN=42 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=42 UPDATE accounts id=2 OLD=1000 NEW=1100
LSN=103: TXN=42 COMMIT

Each record contains enough info to:

- REDO: Apply change (using NEW value)
- UNDO: Rollback change (using OLD value)
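The record layout above maps naturally onto a small data class. This is a minimal sketch, assuming a key/value table stands in for the database; `LogRecord`, `redo`, and `undo` are illustrative names, not any particular engine's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class LogRecord:
    lsn: int                        # Log Sequence Number (unique, increasing)
    txn_id: int                     # which transaction wrote the record
    op: str                         # "BEGIN", "UPDATE" or "COMMIT"
    key: Optional[str] = None       # row identifier for UPDATE records
    before: Any = None              # old value, used by undo
    after: Any = None               # new value, used by redo
    prev_lsn: Optional[int] = None  # previous record of the same transaction


def redo(rec: LogRecord, table: Dict[str, Any]) -> None:
    """Re-apply the change using the after image."""
    if rec.op == "UPDATE":
        table[rec.key] = rec.after


def undo(rec: LogRecord, table: Dict[str, Any]) -> None:
    """Roll the change back using the before image."""
    if rec.op == "UPDATE":
        table[rec.key] = rec.before


rec = LogRecord(lsn=101, txn_id=42, op="UPDATE", key="accounts:1",
                before=1000, after=900, prev_lsn=100)
table = {"accounts:1": 1000}
redo(rec, table)   # table["accounts:1"] is now 900
undo(rec, table)   # table["accounts:1"] is back to 1000
```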

2. WAL Buffer

In-Memory Buffer for Log Records:

  Application writes to database     
                                    
  Generate log records               
                                    
  Append to WAL Buffer (in memory)   
                                    
  On commit: fsync() to disk         
                                    
  WAL File on disk (durable)         


Flush Triggers:

- Transaction commits (fsync immediately)
- WAL buffer fills up
- Checkpoint operation
- Periodic flush (e.g., every 1 second)
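Two of the flush triggers (commit and a full buffer) can be sketched in a few lines. This is an assumption-laden toy: `WalBuffer` and `capacity_bytes` are hypothetical names, records are raw bytes, and the checkpoint and timer triggers are left out.

```python
import os
import tempfile


class WalBuffer:
    """In-memory buffer that is flushed on commit or when it fills up."""

    def __init__(self, path, capacity_bytes=4096):
        self.file = open(path, "ab")
        self.capacity = capacity_bytes
        self.pending = []        # records not yet written to the file
        self.pending_bytes = 0
        self.flush_count = 0     # how many fsync() calls were issued

    def append(self, record: bytes, commit: bool = False):
        self.pending.append(record)
        self.pending_bytes += len(record)
        # Trigger 1: a commit must be durable before it is acknowledged.
        # Trigger 2: the buffer is full.
        if commit or self.pending_bytes >= self.capacity:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.file.write(b"".join(self.pending))
        self.file.flush()               # Python buffer -> OS
        os.fsync(self.file.fileno())    # OS cache -> stable storage
        self.pending.clear()
        self.pending_bytes = 0
        self.flush_count += 1

    def close(self):
        self.flush()
        self.file.close()


workdir = tempfile.mkdtemp()
buf = WalBuffer(os.path.join(workdir, "wal.bin"), capacity_bytes=64)
buf.append(b"UPDATE accounts id=1 900\n")   # stays in memory
buf.append(b"COMMIT\n", commit=True)        # forces write + fsync
print("flushes so far:", buf.flush_count)
buf.close()
```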

3. Data Pages (Database Files)

Actual database storage (modified later):

  WAL written to disk (durable)     
                                   
  Background process applies changes
  to data pages (async)             
                                   
  Data pages written to disk        
  (can be delayed for performance)  


Why delay?

- Sequential WAL writes are fast (~600 MB/s; illustrative, hardware-dependent)
- Random data page writes are slow (~100 MB/s; illustrative, hardware-dependent)
- Write to WAL immediately, data pages later

4. Checkpoints

Checkpoint = Flush all dirty pages to disk

  Checkpoint Process:                   
  1. Mark checkpoint start in WAL       
  2. Write all modified data pages      
     to disk (can take seconds/minutes) 
  3. Mark checkpoint complete in WAL    
  4. Record checkpoint LSN              
                                        
  Purpose:                              
  - Limit recovery time (don't replay   
    entire WAL, just since checkpoint)  
  - Allow WAL truncation (delete old    
    log files before checkpoint)        
                                        
  Frequency: Every 5-15 minutes         


Recovery after crash:

- Find last checkpoint LSN
- Replay WAL from that point forward
- Much faster than replaying full history
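Finding the replay starting point can be sketched as a scan for the last completed checkpoint. The record shape here (dicts with `type`, `lsn`, and a `start_lsn` field on the completion marker) is an assumption made for illustration. Replay begins at the start marker of that checkpoint, because pages changed while the checkpoint was running may not have been flushed.

```python
def redo_start_lsn(log):
    """LSN of the start marker of the last *completed* checkpoint (0 if none)."""
    start = 0
    for rec in log:
        if rec["type"] == "CHECKPOINT_END":
            start = rec["start_lsn"]
    return start


def records_to_replay(log):
    """UPDATE records that recovery still has to look at."""
    start = redo_start_lsn(log)
    return [r for r in log if r["lsn"] > start and r["type"] == "UPDATE"]


log = [
    {"lsn": 1, "type": "UPDATE", "key": "a", "value": 1},
    {"lsn": 2, "type": "CHECKPOINT_START"},
    {"lsn": 3, "type": "UPDATE", "key": "b", "value": 2},
    {"lsn": 4, "type": "CHECKPOINT_END", "start_lsn": 2},
    {"lsn": 5, "type": "UPDATE", "key": "c", "value": 3},
    {"lsn": 6, "type": "CHECKPOINT_START"},  # crash before this one completed
    {"lsn": 7, "type": "UPDATE", "key": "d", "value": 4},
]
print([r["lsn"] for r in records_to_replay(log)])  # [3, 5, 7]
```

The record at LSN 1 is skipped because the completed checkpoint already flushed its page; the unfinished checkpoint at LSN 6 is ignored.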

How WAL Enables Crash Recovery

Recovery Algorithm (ARIES-style):

Phase 1: ANALYSIS
- Scan WAL from last checkpoint
- Identify committed transactions
- Identify uncommitted transactions
- Build transaction table

Phase 2: REDO

- Replay all operations (committed + uncommitted)
- Restore database to state at crash time
- "Redo everything"

Phase 3: UNDO

- Roll back uncommitted transactions
- Use "before images" from log records
- Leave only committed transactions

Result: Database in consistent state 

Example Recovery:

WAL Contents:
LSN=100: TXN=1 BEGIN
LSN=101: TXN=1 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=1 COMMIT 
LSN=103: TXN=2 BEGIN
LSN=104: TXN=2 UPDATE accounts id=2 OLD=1000 NEW=800
LSN=105: CRASH ⚡ (TXN=2 not committed)

Recovery Process:

1. ANALYSIS:
 - TXN=1 is committed
 - TXN=2 is not committed

2. REDO:
 - Apply LSN=101: accounts id=1 → 900
 - Apply LSN=104: accounts id=2 → 800

3. UNDO:
 - Roll back LSN=104: accounts id=2 → 1000 (use OLD value)

Final State:

- accounts id=1 = 900 (TXN=1 committed) 
- accounts id=2 = 1000 (TXN=2 rolled back) 
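The three phases can be traced on this exact log with a short script. It is a deliberately simplified sketch: it assumes the log starts at the last checkpoint, and it leaves out parts of real ARIES such as compensation log records and per-page LSN checks. The dict-based record format is an assumption for illustration.

```python
def recover(log, pages):
    """Simplified ANALYSIS / REDO / UNDO over a list of log records."""
    # Phase 1: ANALYSIS - which transactions committed, which did not?
    committed = {r["txn"] for r in log if r["op"] == "COMMIT"}
    losers = {r["txn"] for r in log} - committed

    # Phase 2: REDO - repeat history for every update, committed or not.
    for r in log:
        if r["op"] == "UPDATE":
            pages[r["key"]] = r["new"]

    # Phase 3: UNDO - walk backwards, restoring before images of losers.
    for r in reversed(log):
        if r["op"] == "UPDATE" and r["txn"] in losers:
            pages[r["key"]] = r["old"]

    return committed, losers


log = [
    {"lsn": 100, "txn": 1, "op": "BEGIN"},
    {"lsn": 101, "txn": 1, "op": "UPDATE", "key": "accounts:1", "old": 1000, "new": 900},
    {"lsn": 102, "txn": 1, "op": "COMMIT"},
    {"lsn": 103, "txn": 2, "op": "BEGIN"},
    {"lsn": 104, "txn": 2, "op": "UPDATE", "key": "accounts:2", "old": 1000, "new": 800},
    # crash here: TXN=2 never wrote a COMMIT record
]
pages = {"accounts:1": 1000, "accounts:2": 1000}
committed, losers = recover(log, pages)
print(pages)  # {'accounts:1': 900, 'accounts:2': 1000}
```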

WAL for Replication

Streaming Replication:

Primary → Replica Replication via WAL

  Primary Database:                     
  1. Client writes data                 
  2. Generate WAL records               
  3. Write WAL to local disk            
  4. Stream WAL to replicas (async)     
  5. Acknowledge client                 
                                        
  Replica Database:                     
  1. Receive WAL stream from primary    
  2. Apply WAL records
     (same as crash recovery)
  3. Replay operations → same state
                                        
  Result: Replicas eventually consistent
  with primary (usually <1s lag)        


Synchronous Replication:

- Wait for replica to acknowledge WAL write
- No loss of acknowledged commits if the primary fails (RPO=0), provided the replica survives
- Higher latency (wait for network + replica)

Asynchronous Replication:

- Don't wait for replica
- Lower latency
- Possible data loss if primary fails
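The difference between the two modes comes down to when the primary acknowledges a write. The following in-process model is a sketch only: `Primary`, `Replica`, `ship_pending`, and `lag` are hypothetical names, and a direct method call stands in for the network.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied_lsn = 0

    def apply(self, record):
        # Replaying is idempotent: records already applied are skipped.
        if record["lsn"] <= self.applied_lsn:
            return
        self.data[record["key"]] = record["value"]
        self.applied_lsn = record["lsn"]


class Primary:
    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.data = {}
        self.lsn = 0
        self.unsent = []   # WAL records not yet shipped (async mode only)

    def write(self, key, value):
        self.lsn += 1
        record = {"lsn": self.lsn, "key": key, "value": value}
        self.data[key] = value
        if self.synchronous:
            self.replica.apply(record)   # wait for the replica, then ack
        else:
            self.unsent.append(record)   # ack now, ship later
        return self.lsn

    def ship_pending(self):
        for record in self.unsent:
            self.replica.apply(record)
        self.unsent.clear()

    def lag(self):
        """Replication lag measured in log records."""
        return self.lsn - self.replica.applied_lsn


async_primary = Primary(Replica(), synchronous=False)
async_primary.write("accounts:1", 900)
async_primary.write("accounts:2", 1100)
print("async lag before shipping:", async_primary.lag())   # 2
async_primary.ship_pending()
print("async lag after shipping:", async_primary.lag())    # 0
```

If the asynchronous primary failed before `ship_pending()` ran, the two unsent records would be the data lost; in synchronous mode the lag is zero at every acknowledgement.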

WAL Performance Optimization

1. Group Commit

Batch multiple transactions into single fsync():

  Transaction 1: Write to WAL buffer
  Transaction 2: Write to WAL buffer
  Transaction 3: Write to WAL buffer
                                   
  Single fsync() for all three      
                                   
  All three transactions durable    


Benefit:

- fsync() is expensive (~5-10ms on spinning disks; typically far less on SSDs)
- Amortize cost across multiple transactions
- Throughput: 1000s of transactions/sec possible
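A back-of-the-envelope calculation shows where the throughput figure comes from. It assumes commits are limited only by fsync latency, which ignores CPU, locking, and network costs; `commits_per_second` is an illustrative helper.

```python
def commits_per_second(fsync_ms: float, batch_size: int) -> float:
    """Upper bound on commit rate when every fsync covers `batch_size` commits."""
    fsyncs_per_second = 1000.0 / fsync_ms
    return fsyncs_per_second * batch_size


print(commits_per_second(fsync_ms=5, batch_size=1))    # 200.0
print(commits_per_second(fsync_ms=5, batch_size=50))   # 10000.0
```

With a 5 ms fsync, one commit per fsync caps out at 200 commits per second, while batches of 50 raise the ceiling to 10,000.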

2. Sequential Writes

WAL is append-only (sequential writes):
- Modern disks: Sequential ~600 MB/s
- Random writes: ~100 MB/s (6x slower)

WAL exploits sequential write performance:

- Append records to end of log
- No random seeks
- Result: Very high write throughput

3. Write Caching

Use battery-backed write cache (NVRAM):
- fsync() writes to NVRAM (fast)
- NVRAM persists to disk later
- Survives power failure
- Latency: <1ms (vs 5-10ms for disk)

Used in enterprise storage systems

Real Systems Using WAL

System     | WAL Name               | Log Format             | Use Case          | Replication
-----------+------------------------+------------------------+-------------------+-------------------------------------
PostgreSQL | WAL (Write-Ahead Log)  | Binary, page-based     | OLTP database     | Streaming replication (WAL shipping)
MySQL      | Binary Log (binlog)    | Row-based or statement | OLTP database     | Master-slave replication
Redis      | AOF (Append-Only File) | Text commands          | In-memory cache   | AOF rewrite, RDB snapshots
MongoDB    | Oplog (Operations Log) | BSON documents         | Document database | Replica set synchronization
SQLite     | WAL Mode               | Binary                 | Embedded database | No replication
Kafka      | Partition log          | Message-based          | Message broker    | Topic replication

Case Study: PostgreSQL WAL

PostgreSQL WAL Architecture

  1. Client executes: UPDATE users SET ...    
                                             
  2. Generate WAL record (XLogInsert)         
                                             
  3. Write to WAL buffer in memory            
                                             
  4. On COMMIT: fsync WAL to disk             
                                             
  5. Acknowledge transaction to client       
                                             
  6. Background writer writes dirty pages to data files
                                             
  7. Checkpoint flushes dirty pages           


WAL File Structure:

- Files: 000000010000000000000001 (24 hex digits; 16 MB segments by default)
- Location: pg_wal/ directory
- Retention: Keep files needed for recovery + replication
- Archiving: Copy to backup storage for point-in-time recovery
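PostgreSQL segment file names are 24 hexadecimal digits: 8 for the timeline ID, then the segment number split into two 8-digit parts. The helper below is a sketch of that mapping, assuming the default 16 MB segment size; `wal_segment_name` is an illustrative function, not part of any PostgreSQL client library.

```python
def wal_segment_name(timeline: int, lsn: int,
                     segment_size: int = 16 * 1024 * 1024) -> str:
    """Name of the WAL segment file that contains byte position `lsn`."""
    segments_per_xlogid = 0x100000000 // segment_size   # 256 for 16 MB segments
    segno = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,   # high part of the segment number
        segno % segments_per_xlogid,    # low part of the segment number
    )


# LSN 0/1000000 (16 MB into the log) on timeline 1:
print(wal_segment_name(1, 0x1000000))     # 000000010000000000000001
# LSN 1/0 (4 GB into the log) on timeline 1:
print(wal_segment_name(1, 0x100000000))   # 000000010000000100000000
```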

Configuration:
wal_level = replica # Enable replication
fsync = on # Ensure durability
synchronous_commit = on # Wait for fsync
wal_buffers = 16MB # WAL buffer size
checkpoint_timeout = 5min # Checkpoint frequency
max_wal_size = 1GB # Trigger checkpoint

Case Study: Redis AOF

Redis Append-Only File (AOF):

  Every write command logged to AOF:    
                                        
  SET key1 "value1"                     
  INCR counter                          
  LPUSH list "item"                     
                                        
  On restart: Replay AOF commands       


AOF Fsync Policies:

1. always: fsync after every command (safest, slowest)
2. everysec: fsync every 1 second (default, good balance)
3. no: Let OS decide (fastest, least safe)

AOF Rewrite:

- Problem: AOF grows without bound (100 SETs on key1 leave 100 log entries)
- Solution: Rewrite AOF with current state only
- Background process creates new AOF
- Atomic switch to new file

Example:
Before rewrite (100 commands):
SET key1 "a"
SET key1 "b"
SET key1 "c"
... (97 more)

After rewrite (1 command):
SET key1 "final_value"
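The effect of a rewrite can be modelled as "replay the log into a dict, then emit one command per surviving key". This is a conceptual sketch only: Redis itself produces the new file from its in-memory dataset in a child process rather than by re-reading the old AOF, and `rewrite_aof` with its tuple-based command format is hypothetical.

```python
def rewrite_aof(commands):
    """Collapse a command history into the minimal commands for the same state."""
    state = {}
    for op, key, *args in commands:
        if op == "SET":
            state[key] = args[0]
        elif op == "INCR":
            state[key] = str(int(state.get(key, "0")) + 1)
        elif op == "DEL":
            state.pop(key, None)
    # One SET per key that still exists is enough to rebuild the state.
    return [("SET", key, value) for key, value in state.items()]


history = [
    ("SET", "key1", "a"),
    ("SET", "key1", "b"),
    ("INCR", "counter"),
    ("INCR", "counter"),
    ("SET", "tmp", "x"),
    ("DEL", "tmp"),
    ("SET", "key1", "final_value"),
]
print(rewrite_aof(history))
# [('SET', 'key1', 'final_value'), ('SET', 'counter', '2')]
```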

Case Study: MySQL Binary Log

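MySQL Binlog:

  Logs all database changes
  Format: Row-based (default) or
          Statement-based

  Uses:
  - Replication (master → slave)
  - Point-in-time recovery
  - Audit trail

  Note: InnoDB's write-ahead log for crash
  recovery is its separate redo log; the
  binlog serves the uses listed above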


Row-Based Replication:

- Log actual row changes (binary format)
- Example: "Change row id=42 col1=100"
- Benefit: Safe, deterministic

Statement-Based Replication:

- Log SQL statements
- Example: "UPDATE users SET ... WHERE ..."
- Benefit: Compact log size
- Problem: Non-deterministic functions (NOW(), RAND())
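The non-determinism problem is easy to reproduce in miniature. In this sketch a statement that calls a random function is re-evaluated on the replica, while the row-based path ships the values the primary actually wrote. Seeding two `random.Random` instances differently is an assumption standing in for "two servers with different internal state".

```python
import random


def apply_statement(table, rng):
    """Statement-based: re-run 'UPDATE t SET v = RAND()' on this server."""
    for key in table:
        table[key] = rng.random()


def apply_rows(table, row_changes):
    """Row-based: install the exact values the primary produced."""
    for key, value in row_changes:
        table[key] = value


primary = {"row1": 0.0}
apply_statement(primary, random.Random(1))

replica_stmt = {"row1": 0.0}
apply_statement(replica_stmt, random.Random(2))   # same SQL, different result

replica_row = {"row1": 0.0}
apply_rows(replica_row, list(primary.items()))    # copies the primary's values

print("statement-based matches primary:", replica_stmt == primary)
print("row-based matches primary:", replica_row == primary)
```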

Configuration:
binlog_format = ROW
sync_binlog = 1 # Sync to disk on commit
binlog_expire_logs_seconds = 604800 # 7 days retention

When to Use WAL

Use WAL When:

Use Case | Scenario | Requirement | Solution | Trade-off
Durability Required (ACID Compliance) | Financial transaction database | No data loss after commit | WAL with synchronous fsync | Higher write latency (~5-10ms)
Crash Recovery Needed | Database server can crash unexpectedly | Automatic recovery to consistent state | WAL enables automatic replay on restart | Recovery time proportional to log size (use checkpoints)
Replication Required | Multi-datacenter database deployment | Keep replicas in sync with primary | Stream WAL to replicas | Network bandwidth for WAL shipping

Alternatives to WAL:

Shadow Paging

Instead of WAL, use copy-on-write:
- Never modify pages in place
- Create new versions on update
- Atomic switch to new version

Example: LMDB database
Benefit: No separate log
Trade-off: More complex, fragmentation
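The copy-on-write idea can be shown at whole-file granularity: write a complete new version next to the old one, make it durable, then switch atomically. Real shadow paging copies only the modified pages and swaps a root pointer, so treat this as a coarse sketch; `cow_update` and `read_current` are hypothetical helpers.

```python
import json
import os
import tempfile


def read_current(path):
    """Return the current version, or an empty store if none exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def cow_update(path, key, value):
    """Never modify the live file in place: build a new version, then switch."""
    new_version = dict(read_current(path))
    new_version[key] = value
    tmp_path = path + ".new"
    with open(tmp_path, "w") as f:
        json.dump(new_version, f)
        f.flush()
        os.fsync(f.fileno())        # new version is durable before the switch
    os.replace(tmp_path, path)      # atomic rename: readers see old or new


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "db.json")
cow_update(db_path, "balance:1", 900)
cow_update(db_path, "balance:2", 1100)
print(read_current(db_path))
```

A crash before `os.replace` leaves the old version untouched, which is why no separate log is needed in this scheme.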

In-Memory Only (No Durability)

Don't persist to disk at all:
- Example: Pure in-memory cache (Memcached)
- Benefit: Maximum performance
- Trade-off: Data lost on crash

Interview Application

Common Interview Question

Q: “How does a database ensure durability while maintaining good performance?”

Strong Answer:

“Databases use Write-Ahead Logging (WAL) to guarantee durability without sacrificing performance:

WAL Mechanism:

  1. Before modifying data pages, write a log record describing the change
  2. Append to sequential WAL file (fast: sequential writes ~600 MB/s)
  3. fsync() WAL to disk before acknowledging transaction (durable)
  4. Modify data pages later in background (can be delayed)

Why This Works:

  • Sequential writes are 6x faster than random writes (WAL vs data pages)
  • Group commit: Batch multiple transactions into single fsync() call
  • Crash recovery: Replay WAL to reconstruct lost data page changes
  • Separation of concerns: Durability (WAL) vs performance (delayed page writes)

Performance Optimizations:

  1. Group commit: fsync() every 10-100ms for batch of transactions
  2. WAL buffering: Accumulate records in memory before disk write
  3. Checkpoints: Periodically flush data pages, truncate old WAL
  4. Write caching: Use battery-backed cache for sub-millisecond fsync

Trade-offs:

  • Latency: Synchronous fsync adds ~5-10ms per transaction
  • Throughput: Group commit amortizes cost → 1000s txn/sec possible
  • Storage: WAL requires additional disk space (usually under 10% overhead)
  • Recovery time: Must replay WAL from last checkpoint (use frequent checkpoints)

Real Example: PostgreSQL can sustain 10,000+ TPS with WAL on suitable hardware by:

  • Group commit (commit_delay defaults to 0, yet concurrent commits can still share one WAL flush)
  • 16MB WAL segments with efficient buffering
  • Background writer applies changes to data pages
  • 5-minute checkpoint intervals balance recovery time vs overhead”

Code Example

Simple WAL Implementation

import os
import struct
import json
from typing import Dict, Any

class WriteAheadLog:
    """
    Simplified WAL implementation for educational purposes
    """
    def __init__(self, wal_path: str, data_path: str):
        self.wal_path = wal_path
        self.data_path = data_path
        self.data: Dict[str, Any] = {}
        self.wal_file = None
        self.lsn = 0  # Log Sequence Number

        # Recovery on initialization
        self.recover()

    def recover(self):
        """Replay WAL to restore consistent state"""
        if not os.path.exists(self.wal_path):
    # ... omitted: keep concept snippets short
    # Checkpoint (write to disk, truncate WAL)
    db.checkpoint()

    db.close()

# Simulating crash recovery:
# 1. Run script once (creates WAL)
# 2. Kill process before checkpoint
# 3. Restart script
# 4. WAL automatically replayed, data restored ✓
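Since the listing above elides its method bodies, here is a separate, self-contained sketch of the same idea that can be run end to end. It is not a reconstruction of the omitted code: `TinyWal`, the one-JSON-record-per-line format, and the snapshot layout are all assumptions chosen for brevity.

```python
import json
import os
import tempfile


class TinyWal:
    """Key/value store: log first (fsync), then apply; replay on startup."""

    def __init__(self, wal_path, data_path):
        self.wal_path, self.data_path = wal_path, data_path
        self.data, self.lsn = {}, 0
        self.recover()
        self.wal_file = open(wal_path, "a")

    def recover(self):
        # Start from the last checkpoint, if there is one.
        if os.path.exists(self.data_path):
            with open(self.data_path) as f:
                snapshot = json.load(f)
            self.data, self.lsn = snapshot["data"], snapshot["lsn"]
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break   # torn final record from a crash: stop replay
                if rec["lsn"] > self.lsn:   # skip what the checkpoint covers
                    self.data[rec["key"]] = rec["value"]
                    self.lsn = rec["lsn"]

    def set(self, key, value):
        self.lsn += 1
        rec = {"lsn": self.lsn, "key": key, "value": value}
        self.wal_file.write(json.dumps(rec) + "\n")
        self.wal_file.flush()
        os.fsync(self.wal_file.fileno())   # durable before we apply or ack
        self.data[key] = value

    def checkpoint(self):
        tmp = self.data_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"data": self.data, "lsn": self.lsn}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.data_path)    # atomic snapshot switch
        self.wal_file.truncate(0)          # old records are no longer needed

    def close(self):
        self.wal_file.close()


workdir = tempfile.mkdtemp()
wal_path = os.path.join(workdir, "wal.log")
data_path = os.path.join(workdir, "data.json")

db = TinyWal(wal_path, data_path)
db.set("balance:1", 900)
db.set("balance:2", 1100)
# Simulated crash: no checkpoint, no close. A new instance replays the WAL.
recovered = TinyWal(wal_path, data_path)
print(recovered.data)
```

Storing the LSN inside the snapshot lets recovery skip records the checkpoint already covers, which matters if a crash lands between the snapshot switch and the truncation.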

WAL with Group Commit

import time
import threading
from queue import Queue
from typing import List

class WALWithGroupCommit:
    """
    WAL with group commit optimization
    """
    def __init__(self, wal_path: str, commit_delay_ms: int = 10):
        self.wal_path = wal_path
        self.commit_delay_ms = commit_delay_ms / 1000.0
        self.wal_file = open(wal_path, 'a')

        # Pending commits waiting for group commit
        self.pending_commits: Queue = Queue()

        # Start group commit background thread
        self.group_commit_thread = threading.Thread(
            target=self._group_commit_worker,
            daemon=True
        )
    # ... omitted: keep concept snippets short
    print(f"Transaction {txn_id} durable")

# Run 100 concurrent transactions
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    futures = [executor.submit(transaction, i) for i in range(100)]
    concurrent.futures.wait(futures)

# Output: Group commit batches transactions together
# Instead of 100 fsync() calls, maybe only 10-20 needed
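For a version that runs on its own, the sketch below uses a leader/follower variant of group commit rather than the timer-driven worker thread in the listing above: whichever committer reaches the flush lock first issues one fsync() for everything appended so far, and committers that find their record already covered return without touching the disk. `GroupCommitLog` and its fields are hypothetical names.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    def __init__(self, path):
        self.file = open(path, "a")
        self.append_lock = threading.Lock()   # serializes appends to the buffer
        self.flush_lock = threading.Lock()    # only one flusher at a time
        self.appended = 0      # records appended so far
        self.durable = 0       # records known to be on stable storage
        self.fsync_calls = 0

    def commit(self, record: str) -> None:
        """Return only once `record` is durable."""
        with self.append_lock:
            self.file.write(record + "\n")
            self.appended += 1
            my_seq = self.appended
        with self.flush_lock:
            if self.durable >= my_seq:
                return    # an earlier leader's fsync already covered this record
            with self.append_lock:
                self.file.flush()          # push buffered records to the OS
                target = self.appended     # everything up to here gets synced
            os.fsync(self.file.fileno())
            self.fsync_calls += 1
            self.durable = target


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "wal.log")
log = GroupCommitLog(log_path)

threads = [
    threading.Thread(target=log.commit, args=(f"txn {i} COMMIT",))
    for i in range(100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{log.durable} commits made durable with {log.fsync_calls} fsync() calls")
```

How many fsync() calls are saved depends on how much the threads overlap, so the count varies from run to run; the slower the fsync, the more commits pile up behind each leader.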

Used In Systems:

  • PostgreSQL, MySQL, SQLite (database WAL)
  • Redis (AOF - Append-Only File)
  • Kafka (partition log is essentially WAL)

Explained In Detail:

  • Database Internals Deep Dive - WAL implementation details

Quick Self-Check

  • Can explain WAL in 60 seconds?
  • Understand why log is written before data pages?
  • Know how WAL enables crash recovery (REDO/UNDO)?
  • Can explain checkpointing and WAL truncation?
  • Understand group commit optimization?
  • Know how WAL enables replication?

Production signal

Why this concept matters

Interview: 65% of database interviews
Production: PostgreSQL, MySQL, Redis
Performance: ACID durability
Scale: Crash recovery