A technique where changes are written to a durable log before being applied to the database, enabling crash recovery and replication in database systems
Write-Ahead Logging (WAL) is a database technique where all changes are first written to a durable append-only log before modifying the actual database. This enables crash recovery (replay the log), ACID guarantees (durability), and replication (ship the log to replicas). Used by PostgreSQL, MySQL, Redis, and virtually all production databases.
Visual Overview
Write-Ahead Logging Overview
WITHOUT WAL (Naive Approach - BROKEN)
  Transaction: Transfer $100

  1. UPDATE accounts SET balance=900  WHERE id=1
     ↓ (write to disk)
  2. CRASH! ⚡
  3. UPDATE accounts SET balance=1100 WHERE id=2   (never executed)

  Result: Money disappears! Data corrupted ✕

WITH WAL (Write-Ahead Logging)
  Transaction: Transfer $100

  STEP 1: Write to WAL first (sequential write)
      WAL: [UPDATE id=1 balance=900]
      WAL: [UPDATE id=2 balance=1100]
      WAL: [COMMIT]
      ↓ (fsync - durable on disk)
  STEP 2: Acknowledge transaction ✓
  STEP 3: Apply to database (async, can crash)
      - UPDATE id=1 balance=900
      - UPDATE id=2 balance=1100

  If crash after STEP 2:
      → Replay WAL on restart ✓
      → Transaction completes successfully

WAL RECOVERY AFTER CRASH:
  Database crashes during transaction
      ↓
  On restart:
  1. Read WAL from last checkpoint
  2. Replay all committed transactions
  3. Roll back uncommitted transactions
  4. Database state restored ✓

  Timeline:
  T0: Begin transaction
  T1: Write to WAL
  T2: fsync (durable)
  T3: CRASH ⚡
  T4: Restart, replay WAL
  T5: Database consistent again
Core Explanation
What is Write-Ahead Logging?
Write-Ahead Logging (WAL) is a technique where all modifications to the database are first written to a durable log before being applied to the actual data pages. This ensures that:
Durability: Changes survive crashes (written to log = durable)
Atomicity: All-or-nothing transactions (log has full transaction)
Crash Recovery: Replay log to restore consistent state
Replication: Ship log to replicas for consistency
Key Principle: “Log First, Then Apply”
RULE: No data page can be written to disk until its log record is on disk
Why?
- Log record = intent to change
- Data page = actual change
- If data page written first and crash occurs:
  → Don't know if the change was committed or not
  → Cannot undo/redo

With WAL:
- Log written first = have complete transaction history
- Can replay (redo) committed transactions
- Can rollback (undo) uncommitted transactions
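A minimal sketch of this ordering invariant, assuming a toy key-value store (demo.wal and demo.data are throwaway file names, not any real database's layout): the log record must reach disk before the data file is touched.

import json
import os

def apply_change(log_file, data_path, key, value):
    """Honour the WAL rule: the log record is durable before the data file changes."""
    # 1. Log first: append the intent and force it to stable storage
    log_file.write(json.dumps({"op": "SET", "key": key, "value": value}) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())   # after this point the change can always be redone

    # 2. Then apply: rewrite the (tiny, illustrative) data file
    data = {}
    if os.path.exists(data_path):
        with open(data_path) as f:
            data = json.load(f)
    data[key] = value
    with open(data_path, "w") as f:
        json.dump(data, f)        # if we crash here, recovery replays the log

with open("demo.wal", "a") as wal:
    apply_change(wal, "demo.data", "account_1", 900)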
WAL Components
1. Log Records
Log Record Structure:
  LSN: Log Sequence Number (unique ID)
  Transaction ID: Which transaction?
  Operation: INSERT/UPDATE/DELETE
  Before Image: Old value (for undo)
  After Image: New value (for redo)
  Prev LSN: Previous record in this txn

Example Log Records:
LSN=100: TXN=42 BEGIN
LSN=101: TXN=42 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=42 UPDATE accounts id=2 OLD=1000 NEW=1100
LSN=103: TXN=42 COMMIT
Each record contains enough info to:
- REDO: Apply change (using NEW value)
- UNDO: Rollback change (using OLD value)
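A log record can be modeled as a small structure carrying both images, so the same record supports redo and undo. A minimal sketch (field names are illustrative, not any particular engine's on-disk layout):

from dataclasses import dataclass
from typing import Optional, Dict, Any

@dataclass
class LogRecord:
    lsn: int                        # Log Sequence Number (unique, increasing)
    txn_id: int                     # owning transaction
    op: str                         # 'BEGIN', 'UPDATE', 'INSERT', 'DELETE', 'COMMIT', ...
    key: Optional[str] = None       # which row/page is touched
    before: Optional[Any] = None    # old value (for UNDO)
    after: Optional[Any] = None     # new value (for REDO)
    prev_lsn: Optional[int] = None  # previous record of the same transaction

def redo(record: LogRecord, data: Dict[str, Any]):
    """Reapply a change using the after image."""
    if record.op == 'UPDATE':
        data[record.key] = record.after

def undo(record: LogRecord, data: Dict[str, Any]):
    """Roll a change back using the before image."""
    if record.op == 'UPDATE':
        data[record.key] = record.before

# LSN=101: TXN=42 UPDATE accounts id=1 OLD=1000 NEW=900
rec = LogRecord(lsn=101, txn_id=42, op='UPDATE', key='accounts:1',
                before=1000, after=900, prev_lsn=100)
db = {'accounts:1': 1000}
redo(rec, db)   # db['accounts:1'] == 900
undo(rec, db)   # db['accounts:1'] == 1000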
2. WAL Buffer
In-Memory Buffer for Log Records:
  Application writes to database
      ↓
  Generate log records
      ↓
  Append to WAL buffer (in memory)
      ↓
  On commit: fsync() to disk
      ↓
  WAL file on disk (durable)

Flush Triggers:
- Transaction commits (fsync immediately)
- WAL buffer fills up
- Checkpoint operation
- Periodic flush (e.g., every 1 second)
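The buffer logic can be sketched as follows: records accumulate in memory and are forced to disk on commit or when the buffer fills up (the periodic timer is omitted for brevity; the class, threshold, and file name are illustrative, not a real engine's API).

import os
import json

class WALBuffer:
    def __init__(self, path: str, max_buffered: int = 1024):
        self.file = open(path, 'a')
        self.buffered = 0                 # records appended since last fsync
        self.max_buffered = max_buffered  # flush trigger: buffer fills up

    def append(self, record: dict, is_commit: bool = False):
        self.file.write(json.dumps(record) + '\n')   # lands in the in-memory buffer
        self.buffered += 1
        # flush triggers: commit record, or buffer full
        if is_commit or self.buffered >= self.max_buffered:
            self.flush()

    def flush(self):
        self.file.flush()
        os.fsync(self.file.fileno())      # now durable on disk
        self.buffered = 0

wal = WALBuffer('buffer_demo.wal')
wal.append({'lsn': 1, 'op': 'UPDATE', 'key': 'x', 'value': 1})
wal.append({'lsn': 2, 'op': 'COMMIT'}, is_commit=True)   # forces an fsync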
3. Data Pages (Database Files)
Actual database storage (modified later):
  WAL written to disk (durable)
      ↓
  Background process applies changes to data pages (async)
      ↓
  Data pages written to disk (can be delayed for performance)
Why delay?
- Sequential WAL writes are fast (600 MB/s)
- Random data page writes are slow (100 MB/s)
- Write to WAL immediately, data pages later
4. Checkpoints
Checkpoint = Flush all dirty pages to disk
  Checkpoint Process:
  1. Mark checkpoint start in WAL
  2. Write all modified data pages to disk (can take seconds/minutes)
  3. Mark checkpoint complete in WAL
  4. Record checkpoint LSN

  Purpose:
  - Limit recovery time (don't replay entire WAL, just since checkpoint)
  - Allow WAL truncation (delete old log files before checkpoint)

  Frequency: Every 5-15 minutes

Recovery after crash:
- Find last checkpoint LSN
- Replay WAL from that point forward
- Much faster than replaying full history
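Why checkpoints bound recovery time can be seen in a few lines: recovery only scans the log from the last completed checkpoint forward. A sketch over already-parsed records (the 'CHECKPOINT_COMPLETE' operation name and record shape are illustrative):

def records_to_replay(wal_records):
    """Return only the records written after the most recent completed checkpoint."""
    start = 0
    for i, rec in enumerate(wal_records):
        if rec['op'] == 'CHECKPOINT_COMPLETE':
            start = i + 1            # everything before this is already on the data pages
    return wal_records[start:]

wal = [
    {'lsn': 90,  'op': 'UPDATE', 'key': 'a', 'value': 1},
    {'lsn': 95,  'op': 'CHECKPOINT_COMPLETE'},   # dirty pages flushed up to here
    {'lsn': 101, 'op': 'UPDATE', 'key': 'b', 'value': 2},
    {'lsn': 102, 'op': 'COMMIT'},
]
print(records_to_replay(wal))   # only LSN 101 and 102 need to be replayed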
How WAL Enables Crash Recovery
Recovery Algorithm (ARIES-style):
Phase 1: ANALYSIS
- Scan WAL from last checkpoint
- Identify committed transactions
- Identify uncommitted transactions
- Build transaction table
Phase 2: REDO
- Replay all operations (committed + uncommitted)
- Restore database to state at crash time
- "Redo everything"
Phase 3: UNDO
- Roll back uncommitted transactions
- Use "before images" from log records
- Leave only committed transactions
Result: Database in consistent state ✓
Example Recovery:
WAL Contents:
LSN=100: TXN=1 BEGIN
LSN=101: TXN=1 UPDATE accounts id=1 OLD=1000 NEW=900
LSN=102: TXN=1 COMMIT ✓
LSN=103: TXN=2 BEGIN
LSN=104: TXN=2 UPDATE accounts id=2 OLD=1000 NEW=800
CRASH ⚡ (TXN=2 never wrote a COMMIT record)
Recovery Process:
1. ANALYSIS:
- TXN=1 is committed
- TXN=2 is not committed
2. REDO:
- Apply LSN=101: accounts id=1 → 900
- Apply LSN=104: accounts id=2 → 800
3. UNDO:
- Roll back LSN=104: accounts id=2 → 1000 (use OLD value)
Final State:
- accounts id=1 = 900 (TXN=1 committed) ✓
- accounts id=2 = 1000 (TXN=2 rolled back) ✓
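The three phases can be sketched directly against the example above. This is a simplification of ARIES (no dirty page table, no compensation log records); the record shapes are illustrative.

def recover(wal, data):
    # Phase 1: ANALYSIS - find which transactions committed
    committed = {r['txn'] for r in wal if r['op'] == 'COMMIT'}
    started   = {r['txn'] for r in wal if r['op'] == 'BEGIN'}
    losers    = started - committed          # uncommitted at crash time

    # Phase 2: REDO - replay every update (committed or not)
    for r in wal:
        if r['op'] == 'UPDATE':
            data[r['key']] = r['new']

    # Phase 3: UNDO - roll back loser transactions, newest first, using before images
    for r in reversed(wal):
        if r['op'] == 'UPDATE' and r['txn'] in losers:
            data[r['key']] = r['old']
    return data

wal = [
    {'op': 'BEGIN',  'txn': 1},
    {'op': 'UPDATE', 'txn': 1, 'key': 'accounts:1', 'old': 1000, 'new': 900},
    {'op': 'COMMIT', 'txn': 1},
    {'op': 'BEGIN',  'txn': 2},
    {'op': 'UPDATE', 'txn': 2, 'key': 'accounts:2', 'old': 1000, 'new': 800},
    # crash here: TXN=2 never committed
]
print(recover(wal, {}))   # {'accounts:1': 900, 'accounts:2': 1000}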
WAL for Replication
Streaming Replication:
Primary → Replica Replication via WAL
  Primary Database:
  1. Client writes data
  2. Generate WAL records
  3. Write WAL to local disk
  4. Stream WAL to replicas (async)
  5. Acknowledge client

  Replica Database:
  1. Receive WAL stream from primary
  2. Apply WAL records (same as recovery)
  3. Replay operations → same state

  Result: Replicas eventually consistent with primary (usually <1s lag)

Synchronous Replication:
- Wait for replica to acknowledge WAL write
- Guarantees zero data loss (RPO=0)
- Higher latency (wait for network + replica)
Asynchronous Replication:
- Don't wait for replica
- Lower latency
- Possible data loss if primary fails
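The difference between the two modes comes down to when the primary acknowledges the client: after its own fsync (async) or only after at least one replica has also confirmed the WAL record (sync). A toy in-process sketch, with no real networking or fsyncs (class names are illustrative):

class Replica:
    def __init__(self):
        self.wal = []

    def receive(self, record) -> bool:
        self.wal.append(record)      # replica appends (and would fsync) the record
        return True                  # acknowledgement sent back to the primary

class Primary:
    def __init__(self, replicas, synchronous: bool):
        self.wal = []
        self.pending = []            # records not yet shipped (async mode)
        self.replicas = replicas
        self.synchronous = synchronous

    def commit(self, record) -> str:
        self.wal.append(record)      # 1. local WAL write + fsync
        if self.synchronous:
            # 2a. sync mode: wait for every replica ack before acking the client (RPO = 0)
            for r in self.replicas:
                assert r.receive(record)
        else:
            # 2b. async mode: queue for later shipping; the client is acked immediately,
            #     so this record can be lost if the primary dies before it is shipped
            self.pending.append(record)
        return "ack to client"

    def ship_pending(self):
        """Background sender in a real system."""
        while self.pending:
            record = self.pending.pop(0)
            for r in self.replicas:
                r.receive(record)

primary = Primary([Replica()], synchronous=True)
print(primary.commit({'lsn': 1, 'op': 'UPDATE', 'key': 'x', 'value': 1}))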
WAL Performance Optimization
1. Group Commit
Batch multiple transactions into single fsync():
  Transaction 1: Write to WAL buffer
  Transaction 2: Write to WAL buffer
  Transaction 3: Write to WAL buffer
      ↓
  Single fsync() for all three
      ↓
  All three transactions durable

Benefit:
- fsync() is expensive (~5-10ms)
- Amortize cost across multiple transactions
- Throughput: 1000s of transactions/sec possible
2. Sequential Writes
WAL is append-only (sequential writes):
- Modern disks: Sequential ~600 MB/s
- Random writes: ~100 MB/s (6x slower)
WAL exploits sequential write performance:
- Append records to end of log
- No random seeks
- Result: Very high write throughput
3. Write Caching
Use battery-backed write cache (NVRAM):
- fsync() writes to NVRAM (fast)
- NVRAM persists to disk later
- Survives power failure
- Latency: <1ms (vs 5-10ms for disk)
Used in enterprise storage systems
Real Systems Using WAL
| System | WAL Name | Log Format | Use Case | Replication |
|---|---|---|---|---|
| PostgreSQL | WAL (Write-Ahead Log) | Binary, page-based | OLTP database | Streaming replication (WAL shipping) |
| MySQL | Binary Log (binlog) | Row-based or statement-based | OLTP database | Master-slave replication |
| Redis | AOF (Append-Only File) | Text commands | In-memory cache | AOF rewrite, RDB snapshots |
| MongoDB | Oplog (Operations Log) | BSON documents | Document database | Replica set synchronization |
| SQLite | WAL Mode | Binary | Embedded database | No replication |
| Kafka | Partition log | Message-based | Message broker | Topic replication |
Case Study: PostgreSQL WAL
PostgreSQL WAL Architecture:
  1. Client executes: UPDATE users SET ...
      ↓
  2. Generate WAL record (XLogInsert)
      ↓
  3. Write to WAL buffer in memory
      ↓
  4. On COMMIT: fsync WAL to disk
      ↓
  5. Acknowledge transaction to client ✓
      ↓
  6. Background writer applies changes to data pages
      ↓
  7. Checkpoint flushes dirty pages

WAL File Structure:
- Files: 000000010000000000000001 (16 MB segments)
- Location: pg_wal/ directory
- Retention: Keep files needed for recovery + replication
- Archiving: Copy to backup storage for point-in-time recovery
Configuration:
wal_level = replica # Enable replication
fsync = on # Ensure durability
synchronous_commit = on # Wait for fsync
wal_buffers = 16MB # WAL buffer size
checkpoint_timeout = 5min # Checkpoint frequency
max_wal_size = 1GB # Trigger checkpoint
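To watch the WAL advance on a live server, PostgreSQL's built-in functions can be queried. A small sketch assuming a local PostgreSQL 10+ instance and the psycopg2 driver (the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # placeholder DSN
cur = conn.cursor()

# Current insert position in the WAL (advances as transactions commit)
cur.execute("SELECT pg_current_wal_lsn()")
lsn = cur.fetchone()[0]

# Name of the 16 MB segment file in pg_wal/ that holds that LSN
cur.execute("SELECT pg_walfile_name(pg_current_wal_lsn())")
segment = cur.fetchone()[0]

print(f"current WAL LSN: {lsn}, segment file: {segment}")
cur.close()
conn.close()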
Case Study: Redis AOF
Redis Append-Only File (AOF):
  Every write command logged to AOF:

  SET key1 "value1"
  INCR counter
  LPUSH list "item"

  On restart: Replay AOF commands

AOF Fsync Policies:
1. always: fsync after every command (safest, slowest)
2. everysec: fsync every 1 second (default, good balance)
3. no: Let OS decide (fastest, least safe)
AOF Rewrite:
- Problem: AOF grows forever (SET key1 100 times)
- Solution: Rewrite AOF with current state only
- Background process creates new AOF
- Atomic switch to new file
Example:
Before rewrite (100 commands):
SET key1 "a"
SET key1 "b"
SET key1 "c"
... (97 more)
After rewrite (1 command):
SET key1 "final_value"
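The rewrite idea (log the current state rather than its history) can be sketched as follows; the file names and command parsing are illustrative, not Redis internals:

import os

def replay(aof_path):
    """Rebuild the key space by replaying SET commands from an AOF-style log."""
    state = {}
    with open(aof_path) as f:
        for line in f:
            cmd, key, value = line.rstrip('\n').split(' ', 2)
            if cmd == 'SET':
                state[key] = value
    return state

def rewrite(aof_path):
    """Compact the log: one SET per key, reflecting only the final value."""
    state = replay(aof_path)
    tmp_path = aof_path + '.rewrite'
    with open(tmp_path, 'w') as f:
        for key, value in state.items():
            f.write(f'SET {key} {value}\n')
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, aof_path)   # atomic switch to the compacted file

with open('demo.aof', 'w') as f:
    f.write('SET key1 "a"\nSET key1 "b"\nSET key1 "c"\n')
rewrite('demo.aof')
print(open('demo.aof').read())   # SET key1 "c"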
Case Study: MySQL Binary Log
MySQL Binlog:
  Logs all database changes
  Format: Row-based (default) or statement-based

  Uses:
  - Replication (master → slave)
  - Point-in-time recovery
  - Audit trail

Row-Based Replication:
- Log actual row changes (binary format)
- Example: "Change row id=42 col1=100"
- Benefit: Safe, deterministic

Statement-Based Replication:
- Log SQL statements
- Example: "UPDATE users SET ... WHERE ..."
- Benefit: Compact log size
- Problem: Non-deterministic functions (NOW(), RAND())
Configuration:
binlog_format = ROW
sync_binlog = 1 # Sync to disk on commit
binlog_expire_logs_seconds = 604800 # 7 days retention
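Why row-based is the safer default can be seen with a tiny simulation: re-executing a statement that calls a non-deterministic function produces different results on each replica, while copying the logged row image does not. The "replicas" here are plain Python dicts, purely illustrative:

import random

def apply_statement(replica):
    # statement-based: each replica re-executes "UPDATE ... SET token = RAND()"
    replica['token'] = random.random()        # different value on every replica!

def apply_row_image(replica, row_image):
    # row-based: the primary logs the actual new value; replicas just copy it
    replica.update(row_image)

primary = {'token': random.random()}
replica_a, replica_b = {}, {}

apply_statement(replica_a)
apply_statement(replica_b)
print(replica_a == replica_b)                  # almost certainly False - replicas diverge

row_image = {'token': primary['token']}
apply_row_image(replica_a, row_image)
apply_row_image(replica_b, row_image)
print(replica_a == replica_b == {'token': primary['token']})   # True - deterministic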
When to Use WAL
Use WAL When:
Durability Required (ACID Compliance)
Scenario: Financial transaction database
Requirement: No data loss after commit
Solution: WAL with synchronous fsync
Trade-off: Higher write latency (~5-10ms)
Crash Recovery Needed
Scenario: Database server can crash unexpectedly
Requirement: Automatic recovery to consistent state
Solution: WAL enables automatic replay on restart
Trade-off: Recovery time proportional to log size (use checkpoints)
Replication Required
Scenario: Multi-datacenter database deployment
Requirement: Keep replicas in sync with primary
Solution: Stream WAL to replicas
Trade-off: Network bandwidth for WAL shipping
Alternatives to WAL:
Shadow Paging
Instead of WAL, use copy-on-write:
- Never modify pages in place
- Create new versions on update
- Atomic switch to new version
Example: LMDB database
Benefit: No separate log
Trade-off: More complex, fragmentation
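A toy version of the copy-on-write idea (not LMDB's actual design): every update builds a new version of the data, and the "current" pointer is switched atomically, so there is no separate log to replay.

import copy

class ShadowPagedStore:
    def __init__(self):
        self.versions = [{}]      # version 0: empty database
        self.current = 0          # pointer to the live version

    def update(self, key, value):
        # copy-on-write: never modify the live version in place
        new_version = copy.deepcopy(self.versions[self.current])
        new_version[key] = value
        self.versions.append(new_version)
        # the "commit" is a single pointer switch; a crash before this line
        # simply leaves the previous consistent version as current
        self.current = len(self.versions) - 1

    def get(self, key):
        return self.versions[self.current].get(key)

db = ShadowPagedStore()
db.update('account_1', 900)
db.update('account_2', 1100)
print(db.get('account_1'), db.get('account_2'))   # 900 1100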
In-Memory Only (No Durability)
Don't persist to disk at all:
- Example: Pure in-memory cache (Memcached)
- Benefit: Maximum performance
- Trade-off: Data lost on crash
Interview Application
Common Interview Question
Q: “How does a database ensure durability while maintaining good performance?”
Strong Answer:
“Databases use Write-Ahead Logging (WAL) to guarantee durability without sacrificing performance:
WAL Mechanism:
Before modifying data pages, write a log record describing the change
Append to sequential WAL file (fast: sequential writes ~600 MB/s)
fsync() WAL to disk before acknowledging transaction (durable)
Modify data pages later in background (can be delayed)
Why This Works:
Sequential writes are 6x faster than random writes (WAL vs data pages)
Group commit: Batch multiple transactions into single fsync() call
Crash recovery: Replay WAL to reconstruct lost data page changes
Separation of concerns: Durability (WAL) vs performance (delayed page writes)
Performance Optimizations:
Group commit: fsync() every 10-100ms for batch of transactions
WAL buffering: Accumulate records in memory before disk write
Checkpoints: Periodically flush data pages, truncate old WAL
Write caching: Use battery-backed cache for sub-millisecond fsync
Trade-offs:
Latency: Synchronous fsync adds ~5-10ms per transaction
Throughput: Group commit amortizes cost → 1000s txn/sec possible
Storage: WAL requires additional disk space (usually under 10% overhead)
Recovery time: Must replay WAL from last checkpoint (use frequent checkpoints)
Real Example:
PostgreSQL achieves 10,000+ TPS with WAL by:
Group commit (commit_delay=0 by default; concurrent commits still share a single fsync)
16MB WAL segments with efficient buffering
Background writer applies changes to data pages
5-minute checkpoint intervals balance recovery time vs overhead”
Code Example
Simple WAL Implementation
import os
import json
from typing import Dict, Any

class WriteAheadLog:
    """
    Simplified WAL implementation for educational purposes
    """
    def __init__(self, wal_path: str, data_path: str):
        self.wal_path = wal_path
        self.data_path = data_path
        self.data: Dict[str, Any] = {}
        self.wal_file = None
        self.lsn = 0  # Log Sequence Number

        # Recovery on initialization
        self.recover()

    def recover(self):
        """Load the last checkpointed state, then replay WAL to restore consistency"""
        # Load the data file written by the last checkpoint (if any)
        if os.path.exists(self.data_path):
            with open(self.data_path, 'r') as f:
                self.data = json.load(f)

        if not os.path.exists(self.wal_path):
            return

        print("Performing crash recovery...")
        with open(self.wal_path, 'r') as f:
            for line in f:
                record = json.loads(line.strip())
                self._replay_record(record)
        print(f"Recovery complete. Replayed up to LSN {self.lsn}")

    def _replay_record(self, record):
        """Apply a single log record"""
        self.lsn = record['lsn']
        if record['op'] == 'BEGIN':
            pass  # Transaction start
        elif record['op'] == 'SET':
            self.data[record['key']] = record['value']
        elif record['op'] == 'DELETE':
            self.data.pop(record['key'], None)
        elif record['op'] == 'COMMIT':
            pass  # Transaction end

    def begin_transaction(self) -> int:
        """Start a new transaction"""
        txn_id = self.lsn + 1
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'BEGIN'
        })
        return txn_id

    def set(self, key: str, value: Any, txn_id: int):
        """Set a key-value pair (write to WAL first)"""
        # STEP 1: Write to WAL (not yet durable)
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'SET',
            'key': key,
            'value': value,
            'old_value': self.data.get(key)  # For undo
        })
        # STEP 2: Apply to in-memory data
        self.data[key] = value

    def delete(self, key: str, txn_id: int):
        """Delete a key (write to WAL first)"""
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'DELETE',
            'key': key,
            'old_value': self.data.get(key)
        })
        if key in self.data:
            del self.data[key]

    def commit(self, txn_id: int):
        """Commit transaction (make durable)"""
        # Write COMMIT record
        self._write_log({
            'lsn': self.lsn + 1,
            'txn_id': txn_id,
            'op': 'COMMIT'
        })
        # CRITICAL: fsync() to ensure durability
        self._sync_wal()
        print(f"Transaction {txn_id} committed (durable)")

    def _write_log(self, record):
        """Append record to WAL"""
        if self.wal_file is None:
            self.wal_file = open(self.wal_path, 'a')
        self.lsn = record['lsn']
        self.wal_file.write(json.dumps(record) + '\n')
        # Note: Not yet synced to disk (buffered)

    def _sync_wal(self):
        """Force WAL to disk (fsync)"""
        if self.wal_file:
            self.wal_file.flush()
            os.fsync(self.wal_file.fileno())
            # Now durable! Survives crash

    def checkpoint(self):
        """Write data pages to disk, truncate WAL"""
        print("Starting checkpoint...")
        # Write current state to data file
        with open(self.data_path, 'w') as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())

        # Truncate WAL (all data now in data file)
        if self.wal_file:
            self.wal_file.close()
            self.wal_file = None
        with open(self.wal_path, 'w') as f:
            f.truncate()
        print("Checkpoint complete. WAL truncated")

    def get(self, key: str) -> Any:
        """Read a key (from in-memory data)"""
        return self.data.get(key)

    def close(self):
        """Clean shutdown"""
        if self.wal_file:
            self.wal_file.close()


# Usage Example
if __name__ == '__main__':
    db = WriteAheadLog('db.wal', 'db.data')

    # Transaction 1: Transfer money
    txn1 = db.begin_transaction()
    db.set('account_1', 900, txn1)
    db.set('account_2', 1100, txn1)
    db.commit(txn1)  # fsync() here - durable!

    print(f"Account 1: {db.get('account_1')}")
    print(f"Account 2: {db.get('account_2')}")

    # Simulate crash (don't checkpoint)
    # If you kill the process here and restart:
    # WAL replay will restore state!

    # Checkpoint (write to disk, truncate WAL)
    db.checkpoint()
    db.close()

# Simulating crash recovery:
# 1. Run script once (creates WAL)
# 2. Kill process before checkpoint
# 3. Restart script
# 4. WAL automatically replayed, data restored ✓
WAL with Group Commit
import os
import json
import time
import threading
from queue import Queue
from typing import List

class WALWithGroupCommit:
    """
    WAL with group commit optimization
    """
    def __init__(self, wal_path: str, commit_delay_ms: int = 10):
        self.wal_path = wal_path
        self.commit_delay_ms = commit_delay_ms / 1000.0
        self.wal_file = open(wal_path, 'a')

        # Pending commits waiting for group commit
        self.pending_commits: Queue = Queue()

        # Start group commit background thread
        self.group_commit_thread = threading.Thread(
            target=self._group_commit_worker,
            daemon=True
        )
        self.group_commit_thread.start()

    def write_record(self, record: dict):
        """Write log record (buffered, not yet durable)"""
        self.wal_file.write(json.dumps(record) + '\n')

    def commit_async(self) -> threading.Event:
        """
        Commit transaction asynchronously.
        Returns event that will be set when durable.
        """
        commit_event = threading.Event()
        self.pending_commits.put(commit_event)
        return commit_event

    def _group_commit_worker(self):
        """Background thread that performs group commits"""
        while True:
            time.sleep(self.commit_delay_ms)

            # Collect all pending commits
            commits_to_sync: List[threading.Event] = []
            while not self.pending_commits.empty():
                commits_to_sync.append(self.pending_commits.get())

            if commits_to_sync:
                # Single fsync for all pending commits!
                self.wal_file.flush()
                os.fsync(self.wal_file.fileno())
                print(f"Group commit: {len(commits_to_sync)} transactions synced")

                # Notify all waiting transactions
                for event in commits_to_sync:
                    event.set()

    def commit(self):
        """Synchronous commit (waits for fsync)"""
        event = self.commit_async()
        event.wait()  # Block until group commit completes


# Usage
wal = WALWithGroupCommit('db.wal', commit_delay_ms=10)

# Multiple transactions can commit concurrently
# All will be synced in a single fsync()
def transaction(txn_id):
    wal.write_record({'txn_id': txn_id, 'op': 'UPDATE'})
    wal.commit()  # Waits for group commit
    print(f"Transaction {txn_id} durable")

# Run 100 concurrent transactions
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    futures = [executor.submit(transaction, i) for i in range(100)]
    concurrent.futures.wait(futures)

# Output: Group commit batches transactions together
# Instead of 100 fsync() calls, maybe only 10-20 are needed