Log-based storage uses an append-only, immutable sequence of records (a log) as its primary data structure. Records are written sequentially to the end of the log and never modified in place. This design enables high write throughput, simple crash recovery, and built-in time travel, and it underpins systems such as Apache Kafka, database write-ahead logs, and distributed consensus protocols.
Visual Overview
LOG-BASED STORAGE STRUCTURE:
Append-Only Log (Partition 0)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│Offset 0 │Offset 1 │Offset 2 │Offset 3 │Offset 4 │Offset 5 │
│ Record A│ Record B│ Record C│ Record D│ Record E│ Record F│
│ Time: T0│ Time: T1│ Time: T2│ Time: T3│ Time: T4│ Time: T5│
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
                                                            ↑
                                             New writes append here
KEY PROPERTIES:
├── Append-only: New records added to end, never modified
├── Immutable: Records never changed once written
├── Ordered: Each record has a monotonically increasing offset
├── Sequential: Physical writes are sequential on disk
└── Time-indexed: Records naturally ordered by time
DISK LAYOUT (Segment-Based):
Partition Directory:
├── 00000000000000000000.log (Segment 1: offsets 0-999)
├── 00000000000000000000.index (Index for segment 1)
├── 00000000000001000000.log (Segment 2: offsets 1000-1999)
├── 00000000000001000000.index (Index for segment 2)
└── 00000000000002000000.log (Active segment, offsets 2000+)
Each segment file:
- Fixed max size (e.g., 1 GB)
- Sequential writes (O(1) append)
- Independent deletion (retention)
Core Explanation
What is Log-Based Storage?
A log in distributed systems is an append-only, totally-ordered sequence of records. Think of it as an immutable array where:
Append-only: Records added to the end, never inserted in middle
Immutable: Once written, records never change
Ordered: Each record has a unique sequential offset (0, 1, 2, …)
Durable: Persisted to disk before acknowledging write
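To make these properties concrete, here is a minimal in-memory sketch of an append-only log (illustrative only: the class and method names are assumptions, and a real log would write each record to disk and fsync before acknowledging, per the durability property above):

import java.util.ArrayList;
import java.util.List;

// Minimal illustrative append-only log: records are only ever added at the end,
// each append returns a monotonically increasing offset, and existing entries
// are never modified. (In-memory only; a real log persists to disk and fsyncs
// before acknowledging the write.)
public class AppendOnlyLog<T> {
    private final List<T> records = new ArrayList<>();

    // Append a record and return its offset (0, 1, 2, ...).
    public synchronized long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record at a given offset; never mutates the log.
    public synchronized T read(long offset) {
        return records.get((int) offset);
    }

    // Offset that the next appended record will receive.
    public synchronized long nextOffset() {
        return records.size();
    }
}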
Traditional vs Log-Based Storage
Traditional Database (In-Place Updates):
┌──────────────┐
│ User Table   │
├──────────────┤
│ id: 123      │ ← UPDATE changes this record in place
│ name: "John" │ ← Old value lost forever
│ age: 30      │ ← No history preserved
└──────────────┘

Log-Based Storage (Append-Only):

┌────────────────────────────────────────────────┐
│ Log: User Events                               │
├────────────────────────────────────────────────┤
│ [0] UserCreated(id=123, name="John", age=25)   │
│ [1] UserUpdated(id=123, age=26)                │
│ [2] UserUpdated(id=123, age=30)                │ ← New event appended
└────────────────────────────────────────────────┘
                        ↑
   Complete history preserved, can replay to any point
Traditional B-Tree Index (Random Writes):
Write "user_456" →
1. Seek to index page (random disk seek)
2. Read page into memory
3. Modify page
4. Write page back (random write)
Result: 10ms per write

Log-Based Storage (Sequential Writes):
Write "user_456" →
1. Append to end of current segment file
2. Flush to disk sequentially
Result: 0.01ms per write (100x-1000x faster!)
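As a rough sketch of what that sequential path looks like with Java NIO: every append goes to the end of the current segment file, so the drive never has to seek. The file name and flush policy here are assumptions; real systems batch records and fsync on a configurable interval rather than per write.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sequential append: writes always land at the end of the
// current segment file, so the physical I/O pattern stays sequential.
public class SegmentAppender implements AutoCloseable {
    private final FileChannel channel;

    public SegmentAppender(Path segmentFile) throws IOException {
        this.channel = FileChannel.open(segmentFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Append a record and return the byte position it was written at.
    public long append(byte[] record) throws IOException {
        long position = channel.size();
        channel.write(ByteBuffer.wrap(record));
        // channel.force(false);  // fsync; real systems batch this for throughput
        return position;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}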
Log Segments and Retention
Why Segments?
Instead of one giant log file, logs are split into segments:
WHY SEGMENTS?
Problem with Single Log File:
┌─────────────────────────────────────┐
│ single-log-file.log (100 GB)        │
│                                     │
│ Can't delete old data without       │
│ rewriting entire file!              │
└─────────────────────────────────────┘

Solution with Segments:
├── segment-0.log (1 GB)            ← Delete this entire file (old data)
├── segment-1.log (1 GB)            ← Keep
├── segment-2.log (1 GB)            ← Keep
└── segment-3.log (500 MB, active)  ← Currently writing
Deletion = O(1) file delete, no rewriting!
Segment Management:
// Segment configuration
segment.bytes = 1073741824          // 1 GB per segment
segment.ms    = 604800000           // New segment every 7 days (whichever comes first)

// Retention strategies
log.retention.bytes = 107374182400  // Keep 100 GB total
log.retention.ms    = 604800000     // OR keep 7 days (whichever hits first)

// Deletion process (runs periodically)
for (Segment segment : segments) {
    if (segment.isExpired() || totalSize > maxSize) {
        segment.delete();  // Simple file deletion, O(1)
    }
}
Indexing for Fast Reads
The Challenge:
Problem: How to find offset 12,345,678 quickly in a 100 GB log?
Naive approach:
- Read sequentially from beginning
- Time: O(n), could take minutes!
The Solution: Sparse Index
SPARSE INDEX STRUCTURE:
Segment: 00000000000010000000.log (offsets 10M-11M)
Index File: 00000000000010000000.index
┌──────────────────────────────────────┐
│ Offset      →  File Position         │
├──────────────────────────────────────┤
│ 10,000,000  →  byte 0                │ ← Index every N records
│ 10,010,000  →  byte 52,428,800       │   (e.g., every 10K)
│ 10,020,000  →  byte 104,857,600      │
│ 10,030,000  →  byte 157,286,400      │
│ ...                                  │
└──────────────────────────────────────┘
To find offset 10,025,000:
1. Binary search index → Find 10,020,000 at byte 104,857,600
2. Seek to byte 104,857,600
3. Scan sequentially 5,000 records (fast, in memory)
Result: O(log n) index lookup + small sequential scan
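A simplified sketch of that lookup using an in-memory sorted map (illustrative; a real index file is memory-mapped and stores offsets relative to the segment's base offset):

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sparse index: maps every Nth offset to its byte position in the
// segment file. Lookup = floor search (O(log n)) + short sequential scan.
public class SparseIndex {
    private final NavigableMap<Long, Long> offsetToPosition = new TreeMap<>();

    public void addEntry(long offset, long bytePosition) {
        offsetToPosition.put(offset, bytePosition);
    }

    // Returns the byte position to start scanning from for the target offset.
    public long lookup(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToPosition.floorEntry(targetOffset);
        if (floor == null) {
            return 0L;  // target precedes the first indexed entry: scan from the start
        }
        // Caller then scans forward in the segment file from this position
        // until it reaches targetOffset (at most N records away).
        return floor.getValue();
    }
}

With one entry per 10,000 records, the scan after the floor lookup touches at most 10,000 records, and those bytes are usually already in the OS page cache.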
Time-Travel and Replay
Core Feature of Log Storage:
LOG WITH TIME-INDEXED DATA:
[0] UserCreated(id=123) at 2025-01-01 10:00
[1] UserUpdated(id=123, age=26) at 2025-01-02 11:00
[2] UserUpdated(id=123, age=30) at 2025-01-03 12:00
[3] UserDeleted(id=123) at 2025-01-04 13:00
QUERIES:
- "What was user 123's state on 2025-01-02?"
→Replay log up to offset 1 → {id=123, name="John", age=26}
- "Rebuild entire database from scratch"
→Replay log from offset 0 →Full reconstruction
- "Reprocess last 7 days with new business logic"
→seekToTimestamp(7 days ago) →Replay with new code
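A minimal sketch of the first query, rebuilding user 123's state by folding events up to a chosen offset (the event names follow the example above; everything else, including the field-map representation of state, is an assumption):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative replay: fold events in offset order to reconstruct state
// as of a chosen offset ("what was the state at that point in time?").
public class UserStateReplayer {

    record Event(long offset, String type, Map<String, Object> fields) {}

    // Replay events [0..upToOffset] and return the reconstructed user state.
    static Map<String, Object> replay(List<Event> log, long upToOffset) {
        Map<String, Object> state = new HashMap<>();
        for (Event event : log) {
            if (event.offset() > upToOffset) break;   // stop at the requested point in time
            switch (event.type()) {
                case "UserCreated", "UserUpdated" -> state.putAll(event.fields());
                case "UserDeleted" -> state.clear();
                default -> { /* ignore unknown event types */ }
            }
        }
        return state;
    }
}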
Production Use Case:
Bug Recovery with Log Storage
Bug discovered in analytics pipeline:
1. Pipeline processed 1 million events with incorrect logic
2. Traditional system: Data corrupted, hard to fix
3. Log-based system:
- Keep log intact
- seekToTimestamp(bug_introduced_time)
- Replay events with corrected code
- Output to new table
- Validate and switch
Result: Zero data loss, safe recovery
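With Kafka's consumer API, the "seek to the time the bug was introduced and replay" step looks roughly like the sketch below, using offsetsForTimes plus seek (the topic name, group id, and downstream reprocessing call are assumptions; error handling is omitted):

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

// Illustrative reprocessing: resolve the offset closest to a timestamp on each
// partition, seek there, and replay records through corrected logic.
public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-replay");  // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        long bugIntroducedTime = System.currentTimeMillis() - Duration.ofDays(7).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("analytics-events")) {  // assumed topic
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            consumer.assign(partitions);

            // Map each partition to the earliest offset at or after the timestamp, then seek there.
            Map<TopicPartition, Long> timestamps = new HashMap<>();
            partitions.forEach(tp -> timestamps.put(tp, bugIntroducedTime));
            consumer.offsetsForTimes(timestamps).forEach((tp, offsetAndTs) -> {
                if (offsetAndTs != null) consumer.seek(tp, offsetAndTs.offset());
            });

            // Replay with corrected logic, writing to a new output (validate before switching over).
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) break;  // simplified stop condition; a real job tracks end offsets
                records.forEach(r -> reprocessWithCorrectedLogic(r.value()));
            }
        }
    }

    static void reprocessWithCorrectedLogic(String event) { /* assumed downstream logic */ }
}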
Compaction: Log Cleanup with Retention
Time-Based Retention:
Delete segments older than N days:
Day 1: [Seg 0][Seg 1][Seg 2][Seg 3]
Day 8: [Seg 1][Seg 2][Seg 3][Seg 4] ← Seg 0 deleted (> 7 days old)
Day 15: [Seg 2][Seg 3][Seg 4][Seg 5] ← Seg 1 deleted
Size-Based Retention:
Keep only last 100 GB:
[Seg 0: 20GB][Seg 1: 20GB][Seg 2: 20GB][Seg 3: 20GB][Seg 4: 30GB]
Total: 110 GB → Delete Seg 0 → 90 GB ✓
Advantages:
✓ Extremely high write throughput (sequential I/O)
✓ Simple crash recovery (just find last valid offset)
✓ Built-in audit trail and time-travel
✓ Easy replication (just copy log segments)
✓ Immutability eliminates update anomalies
Disadvantages:
✕ Slow point reads without good indexing
✕ Space amplification (old versions kept until deletion)
✕ Range queries require scanning
✕ Compaction overhead for key-based retention
Real Systems Using This
Apache Kafka
Implementation: Partitioned, replicated logs as primary abstraction
Scale: 7+ trillion messages/day at LinkedIn
Segments: 1 GB segments, time or size-based retention
Typical Setup: 7 day retention, 100+ partitions
Database Write-Ahead Logs (WAL)
PostgreSQL WAL: All changes written to log before data files
MySQL Binlog: Replication and point-in-time recovery
Redis AOF: Append-only file for durability
Purpose: Crash recovery, replication, backups
Distributed Consensus (Raft, Paxos)
Implementation: Replicated log of commands
Purpose: Ensure all nodes apply same operations in same order
Examples: etcd, Consul, ZooKeeper
Event Sourcing Systems
EventStore: Specialized event sourcing database
Axon Framework: CQRS/ES on top of logs
Purpose: Complete audit trail, temporal queries
When to Use Log-Based Storage
✓ Perfect Use Cases
High-Throughput Writes
Scenario: Ingesting millions of events per second
Why logs: Sequential writes max out disk bandwidth
Example: IoT sensor data, clickstream analytics
Audit and Compliance
Scenario: Financial transactions requiring complete audit trail
Why logs: Immutable, ordered history of all changes
Example: Banking transactions, healthcare records
Event Sourcing / CQRS
Scenario: Need to rebuild state from historical events
Why logs: Natural event stream with replay capability
Example: E-commerce order processing, booking systems
Stream Processing
Scenario: Real-time data pipelines and transformations
Why logs: Natural fit for streaming frameworks
Example: Fraud detection, real-time recommendations
✕ When NOT to Use
Point Queries Without Indexing
Problem: "Find user by email address"
Issue: Must scan entire log or build secondary index
Alternative: Use indexed database (B-tree, hash index)
Frequent Updates to Same Key
Problem: Updating same sensor reading every 10ms
Issue: Millions of log entries for same key
Alternative: In-memory cache + periodic snapshots
Need to Delete Individual Records (GDPR)
Problem: "Delete all data for user_123"
Issue: Logs are append-only, can't delete from middle
Alternative: Tombstone records + compaction, or use mutable database
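A sketch of the tombstone approach with a Kafka producer: on a topic configured with cleanup.policy=compact, a record whose value is null marks the key for deletion, and compaction later removes every earlier record for that key (and eventually the tombstone itself, after delete.retention.ms). The topic name below is an assumption.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Illustrative GDPR-style deletion on a compacted topic: a null value ("tombstone")
// tells the compactor to discard all earlier records for the key.
public class TombstoneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Null value = tombstone for key "user_123" on an assumed compacted topic.
            producer.send(new ProducerRecord<>("user-profiles", "user_123", null));
            producer.flush();
        }
    }
}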
Interview Application
Common Interview Question 1
Q: “Why does Kafka achieve such high throughput compared to traditional message queues?”
Strong Answer:
“Kafka’s high throughput comes from its log-based storage design that exploits sequential I/O:
Sequential writes: All writes append to the end of a log segment file, achieving 100K+ writes/sec on HDDs vs ~100/sec for random writes in traditional MQs
Zero-copy transfers: Kafka uses sendfile() to transfer data from disk → OS page cache → network socket without copying to application memory