Mechanisms by which message producers receive confirmation that their messages were successfully persisted, enabling reliability tradeoffs between latency and durability
TL;DR
Producer acknowledgments (acks) control when Kafka considers a message successfully written. Options are acks=0 (no confirmation), acks=1 (leader confirms), and acks=all (all in-sync replicas confirm), trading latency against durability guarantees. Critical for balancing performance against data safety in message brokers.
Visual Overview
ACKS = 0 (Fire and Forget)
┌──────────────────────────────────────────────────
│ Producer ──message──▶ Kafka Leader
│                       (don't wait)
│ Immediate return ✓
│ Latency: <1ms
│
│ Risk: Message may be lost if:
│   - Network failure before reaching leader
│   - Leader crashes before writing to disk
│   - Leader crashes before replication
│
│ Use case: Metrics, logs (lossy OK)
└──────────────────────────────────────────────────
ACKS = 1 (Leader Acknowledgment)
┌──────────────────────────────────────────────────
│ Producer ──message──▶ Kafka Leader
│                           │
│                      Write to log
│                           │
│ Producer ◀──── ACK ───────┘
│ Latency: 5-10ms
│
│ Meanwhile (async):
│   Leader ──replicate──▶ Follower 1
│   Leader ──replicate──▶ Follower 2
│
│ Risk: Message lost if leader crashes
│       before replication completes
│
│ Use case: Most production workloads (default)
└──────────────────────────────────────────────────
ACKS = ALL (Full Quorum)
┌──────────────────────────────────────────────────
│ Producer ──message──▶ Kafka Leader
│                           │
│                      Write to log
│                           │
│            Replicate to all ISR replicas
│                           │
│            Follower 1: written ✓
│            Follower 2: written ✓
│                           │
│ Producer ◀──── ACK ───────┘
│ Latency: 10-50ms (network + replication)
│
│ Risk: Effectively none
│       (unless all ISR replicas fail simultaneously)
│
│ Use case: Financial transactions, orders
└──────────────────────────────────────────────────
TIMELINE COMPARISON:
┌──────────────────────────────────────────────────
│ acks=0:
│   T0:  Send
│   T1:  Return (1ms) ✓
│
│ acks=1:
│   T0:  Send
│   T5:  Leader writes
│   T10: Return (10ms) ✓
│
│ acks=all:
│   T0:  Send
│   T5:  Leader writes
│   T15: Follower 1 writes
│   T20: Follower 2 writes
│   T25: Return (25ms) ✓
└──────────────────────────────────────────────────
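The timeline comparison can be played with in a toy model. Everything here is illustrative: the hop latencies are invented round numbers, and real clusters replicate via follower fetches with variable lag.

```javascript
// Toy model of producer-observed latency for each ack level.
// leaderMs: producer -> leader round trip; followerLagsMs: per-follower
// replication delays (all numbers illustrative, not measurements).
function ackLatency(acks, leaderMs, followerLagsMs) {
  if (acks === 0) return 1; // returns as soon as the message is handed off
  if (acks === 1) return leaderMs; // wait for the leader's append + ACK
  // acks=all (-1): also wait for the slowest in-sync follower
  return leaderMs + Math.max(...followerLagsMs);
}

console.log(ackLatency(0, 10, [5, 15]));  // 1   (fire and forget)
console.log(ackLatency(1, 10, [5, 15]));  // 10  (leader only)
console.log(ackLatency(-1, 10, [5, 15])); // 25  (slowest follower at +15ms)
```

Note how acks=all is bounded by the slowest member of the ISR, which is why a single lagging follower inflates tail latency for every producer.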
Core Explanation
What are Producer Acknowledgments?
Producer acknowledgments (acks) control when a Kafka producer considers a write operation successful. This determines:
- When producer receives confirmation that message is safe
- How many replicas must persist the message
- Trade-off between latency and durability
Three Levels:
- acks=0: No acknowledgment (fire-and-forget)
- acks=1: Leader acknowledgment (the long-time default; Kafka 3.0+ producers default to acks=all)
- acks=all: Full ISR acknowledgment (safest)
acks=0: No Acknowledgment
Behavior:
Producer sends message, immediately considers it sent
Leader receives message (maybe)
No confirmation sent back
Result:
- Highest throughput (no waiting)
- Lowest latency (<1ms)
- Zero durability guarantee
When Message Can Be Lost:
1. Network failure before reaching broker
   Producer → [network drops packet] → Leader (never arrives)
2. Leader crash before writing to disk
   Producer → Leader (in memory) → [crash] ✗
3. Leader crash before replication
   Producer → Leader (written) → [crash before replicating] ✗
Probability of loss: workload- and cluster-dependent; anything in flight during a network blip or broker crash is simply gone.
Configuration:
// In kafkajs, acks and compression are per-send() options,
// not producer constructor options:
const producer = kafka.producer();
await producer.send({
  topic,
  messages,
  acks: 0, // no acknowledgment
  compression: CompressionTypes.GZIP, // often paired with acks=0 for max throughput
});
Use Cases:
✅ Log aggregation (OK to lose some logs)
✅ Metrics collection (OK to lose some data points)
✅ IoT sensor data (high volume, natural redundancy)
✅ Clickstream tracking (lossy acceptable)
❌ Financial transactions
❌ User-facing data (messages, posts)
❌ Critical business events
acks=1: Leader Acknowledgment
Behavior:
Producer sends message
Leader writes to local log (durable on leader disk)
Leader sends ACK to producer
Producer considers message sent ✓
Meanwhile (asynchronous):
Leader replicates to followers (background)
Result:
- Good throughput
- Moderate latency (5-10ms)
- Durability: Survives producer/network failure
- Risk: Lost if leader fails before replication
When Message Can Be Lost:
Scenario: Leader fails before replication
T0: Producer → Leader (message written to leader)
T1: Leader → ACK → Producer ✓
T2: Producer moves on
T3: Leader crashes ⚡ (before replicating)
T4: Follower promoted to new leader
T5: Message is GONE ✗ (was only on the failed leader)
Probability: low; only messages acked inside the replication window when a leader dies
Window of vulnerability: the follower fetch lag, typically tens to hundreds of milliseconds
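The vulnerability window can be made concrete with a small simulation. This is a sketch, not kafkajs API: it flags messages that the leader acked but had not yet replicated when it died, and the timestamps and lag figure are invented.

```javascript
// Under acks=1, a message is lost if the leader acked it but crashed before
// the followers fetched it. Model replication as a fixed lag after the ack.
function lostUnderAcksOne(messages, crashAtMs, replicationLagMs) {
  return messages.filter(
    (m) => m.ackedAt <= crashAtMs && m.ackedAt + replicationLagMs > crashAtMs
  );
}

const acked = [
  { id: "a", ackedAt: 100 }, // replicated long before the crash -> safe
  { id: "b", ackedAt: 480 }, // acked, still inside the lag window -> LOST
  { id: "c", ackedAt: 520 }, // never acked; the producer retries it -> not lost
];

console.log(lostUnderAcksOne(acked, 500, 50).map((m) => m.id)); // ["b"]
```

Only message "b" is silently lost: the producer saw a successful ack and moved on, so no retry ever happens for it.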
Configuration:
const producer = kafka.producer({
  retry: {
    retries: 3, // retry on failure
  },
});
// acks and timeout are per-send() options in kafkajs:
await producer.send({
  topic,
  messages,
  acks: 1,        // leader acknowledgment
  timeout: 30000, // 30s ack timeout
});
Use Cases:
✅ Most production workloads (default choice)
✅ High-throughput messaging
✅ Real-time analytics
✅ Event streaming
Balance between performance and safety
acks=all: Full ISR Acknowledgment
Behavior:
Producer sends message
Leader writes to local log
Leader waits for ALL in-sync replicas (ISR) to acknowledge
All ISR replicas write to their logs
Leader sends ACK to producer
Producer considers message sent ✓
Result:
- Lower throughput
- Higher latency (10-50ms)
- Maximum durability
- Message replicated before acknowledgment
In-Sync Replicas (ISR):
ISR = Set of replicas that are "caught up" with leader
Example:
- Leader: Broker 1
- Followers: Broker 2 (in sync), Broker 3 (lagging)
- ISR = {Broker 1, Broker 2}
acks=all waits for: Broker 1 + Broker 2
If Broker 2 falls behind (network issue):
ISR = {Broker 1} (just leader)
acks=all waits for: Broker 1 only (no followers!)
This is why min.insync.replicas is critical!
min.insync.replicas:
Configuration: Minimum ISR size required for writes
min.insync.replicas=2 (recommended for acks=all)
- Requires at least 2 replicas in ISR
- If ISR shrinks to 1, producer gets error
- Prevents data loss when only leader is alive
Example with 3 replicas:
┌──────────────────────────────────────────────────
│ Normal: ISR = {Leader, Follower1, Follower2}
│   acks=all waits for all three; the ack is sent
│   once every ISR member has the message
│
│ Follower1 fails: ISR = {Leader, Follower2}
│   acks=all waits for Leader + Follower2 ✓
│
│ Follower2 also fails: ISR = {Leader}
│   acks=all REJECTS writes ✗
│   (ISR size 1 < min.insync.replicas 2)
└──────────────────────────────────────────────────
Protection: Cannot lose data if leader fails,
because message is on at least 2 replicas
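The broker-side decision can be sketched in a few lines. This is a simplification of what Kafka actually does (followers fetch asynchronously and the leader tracks each one's lag against `replica.lag.time.max.ms`); the replica objects and helper names are invented for illustration.

```javascript
// A follower stays in the ISR while its lag is within replica.lag.time.max.ms
// (broker default: 30000). acks=all writes are accepted only while the ISR
// has at least min.insync.replicas members.
function inSyncReplicas(replicas, maxLagMs = 30000) {
  return replicas.filter((r) => r.lagMs <= maxLagMs);
}

function acceptsWrite(replicas, minInsyncReplicas, maxLagMs = 30000) {
  return inSyncReplicas(replicas, maxLagMs).length >= minInsyncReplicas;
}

const replicas = [
  { broker: 1, lagMs: 0 },     // leader
  { broker: 2, lagMs: 120 },   // caught-up follower
  { broker: 3, lagMs: 45000 }, // lagging follower, falls out of the ISR
];

console.log(inSyncReplicas(replicas).map((r) => r.broker)); // [1, 2]
console.log(acceptsWrite(replicas, 2));      // true  (ISR size 2 >= 2)
console.log(acceptsWrite([replicas[0]], 2)); // false (NOT_ENOUGH_REPLICAS)
```

The last case is exactly the "ISR shrinks to just the leader" scenario above: the broker chooses to fail the write rather than ack something that lives on a single machine.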
Configuration:
const producer = kafka.producer({
  retry: {
    retries: 5,
  },
});
await producer.send({
  topic,
  messages,
  acks: -1,       // -1 means "all" (wait for the full ISR)
  timeout: 30000,
});
// Topic/broker configuration (set server-side, not in kafkajs):
//   min.insync.replicas=2   at least 2 replicas must ack
//   replication.factor=3    total of 3 replicas
Use Cases:
✅ Financial transactions
✅ E-commerce orders
✅ User-generated content (posts, messages)
✅ Critical business events
✅ Regulatory/compliance data
Anywhere data loss is unacceptable
Real Systems Using Producer Acks
| System | Default acks | Typical Config | Rationale |
|---|---|---|---|
| Kafka Streams | acks=all | acks=all, min.insync.replicas=2 | State stores require durability |
| Netflix (Keystone) | acks=1 | acks=1, replication=3 | High throughput, tolerate rare loss |
| | acks=all | acks=all, min.insync.replicas=2 | Business-critical events |
| Uber | acks=1 | acks=1 (logs), acks=all (trips) | Mixed based on data criticality |
| Confluent Cloud | acks=all | acks=all, min.insync.replicas=2 | Default for safety |
Case Study: Kafka at LinkedIn
LinkedIn's Kafka usage (origin of Kafka):
- 100+ billion messages/day
- 1000s of topics
- Multi-datacenter deployment
Acknowledgment Strategy:
┌──────────────────────────────────────────────────
│ Critical data (jobs, connections):
│   - acks=all
│   - min.insync.replicas=2
│   - replication.factor=3
│   → Latency: 20-30ms
│   → Zero data loss
│
│ Metrics/logs (high volume):
│   - acks=1
│   - replication.factor=2
│   → Latency: 5-10ms
│   → Acceptable loss rate: <0.1%
│
│ Analytics events (ultra-high volume):
│   - acks=0
│   - compression=gzip
│   → Latency: 1-2ms
│   → Loss rate: 1-2% (acceptable)
└──────────────────────────────────────────────────
Lesson: Different acks for different data criticality
When to Use Each Ack Level
acks=0: Fire and Forget
Use When:
✅ High throughput required (100k+ msg/sec)
✅ Data loss is acceptable (logs, metrics)
✅ Data has natural redundancy (sensor arrays)
✅ Ultra-low latency required (<1ms)
Example: IoT sensor network
- 1000 sensors sending data every second
- If 1% of readings lost, still have 99%
- Aggregate statistics still accurate
acks=1: Leader Only
Use When:
✅ Good balance of performance and safety
✅ Occasional loss acceptable during failures
✅ High throughput with moderate durability
✅ Default choice for most workloads
Example: User activity tracking
- Click events, page views, etc.
- Occasional loss during broker failure OK
- Still maintain 99%+ delivery
acks=all: Full Replication
Use When:
✅ Zero data loss required
✅ Regulatory/compliance requirements
✅ Financial or critical business data
✅ Can tolerate higher latency (10-50ms)
Example: E-commerce order placement
- User places order (creates Kafka event)
- Order must not be lost
- OK to wait 20-30ms for full replication
- Worth latency cost for safety
Hybrid Approach
Different Topics, Different Acks:
// In kafkajs, acks is a per-send() option, so a single producer can use a
// different level for each topic:
const producer = kafka.producer();

// Critical orders: acks=all
await producer.send({ topic: "orders", messages: orderMessages, acks: -1, timeout: 30000 });

// Analytics events: acks=1
await producer.send({ topic: "analytics", messages: analyticsMessages, acks: 1, timeout: 10000 });

// Metrics: acks=0
await producer.send({
  topic: "metrics",
  messages: metricMessages,
  acks: 0,
  compression: CompressionTypes.GZIP,
});
Interview Application
Common Interview Question
Q: "How would you ensure zero data loss in a Kafka-based order processing system?"
Strong Answer:
"To ensure zero data loss for orders, I'd configure producers with acks=all and proper ISR settings:
Producer Configuration:
- acks=all (or acks=-1)
- min.insync.replicas=2
- replication.factor=3
- retries=MAX_INT (effectively infinite retries)
- enable.idempotence=true (so retries cannot create duplicates)
- max.in.flight.requests.per.connection=1 (strict ordering; with idempotence, up to 5 in flight is still safe)
How This Prevents Loss:
- acks=all: Producer waits for full replication before considering write successful
- min.insync.replicas=2: Requires at least 2 replicas (leader + 1 follower) to acknowledge
- replication.factor=3: Total of 3 copies across brokers
- Result: Message on ≥2 replicas before ACK
Failure Scenarios:
- Network failure: Producer retries until successful
- Leader failure: Message already on follower (promoted to new leader)
- Follower failure: Still have leader + other follower (meets min ISR)
- Leader + Follower fail: Third replica exists, can rebuild ISR
Only lose data if: All 3 replicas fail simultaneously (extremely rare)
Trade-offs:
- Latency: 20-30ms vs 5-10ms for acks=1
- Throughput: Lower (wait for replication)
- Availability: May reject writes if ISR < 2
Worth It: For orders where data loss = lost revenue + angry customers
Monitoring: Alert if ISR falls below min.insync.replicas"
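The "only if all three replicas fail simultaneously" claim can be sanity-checked with a back-of-envelope calculation. The per-broker downtime figure is an invented illustration, and the independence assumption is optimistic (correlated failures such as a rack or AZ outage are the real risk):

```javascript
// If each of 3 brokers is independently unavailable 0.1% of the time,
// the chance that all replicas are down at the same instant is the
// product of the individual probabilities:
const perBrokerDowntime = 0.001;
const replicationFactor = 3;
const allDown = perBrokerDowntime ** replicationFactor;
console.log(allDown); // ~1e-9, i.e. roughly 30ms of exposure per year
```

This is why the practical failure mode for acks=all is not simultaneous crashes but a shrunken ISR rejecting writes, which is an availability problem rather than a durability one.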
Code Example
Producer with Different Ack Levels
const { Kafka, CompressionTypes } = require("kafkajs");
const kafka = new Kafka({
clientId: "my-producer",
brokers: ["kafka1:9092", "kafka2:9092", "kafka3:9092"],
});
// Configuration 1: acks=0 (Fire and Forget)
async function sendMetrics() {
  const producer = kafka.producer();
  await producer.connect();
  const start = Date.now();
  // In kafkajs, acks and compression are per-send() options:
  await producer.send({
    topic: "metrics",
    messages: [{ value: JSON.stringify({ cpu: 80, mem: 60 }) }],
    acks: 0, // no acknowledgment
    compression: CompressionTypes.GZIP,
  });
  const latency = Date.now() - start;
  console.log(`Metrics sent (acks=0): ${latency}ms`);
  // Typical output: 1-2ms
  // Risk: message may be silently lost
}
// Configuration 2: acks=1 (Leader Acknowledgment)
async function sendUserActivity() {
  const producer = kafka.producer({
    retry: {
      retries: 3, // retry on failure
      initialRetryTime: 100,
    },
  });
  await producer.connect();
  const start = Date.now();
  await producer.send({
    topic: "user-activity",
    messages: [
      {
        key: "user-123",
        value: JSON.stringify({ action: "click", page: "/products" }),
      },
    ],
    acks: 1,        // leader acknowledgment
    timeout: 30000, // 30s ack timeout
  });
  const latency = Date.now() - start;
  console.log(`Activity sent (acks=1): ${latency}ms`);
  // Typical output: 5-10ms
  // Risk: lost if leader fails before replication
}
// Configuration 3: acks=all (Full ISR Acknowledgment)
async function sendOrder() {
  const producer = kafka.producer({
    idempotent: true,       // exactly-once per partition; requires acks=all
    maxInFlightRequests: 1, // preserve ordering
    retry: {
      retries: Number.MAX_VALUE, // retry "forever"
      initialRetryTime: 100,
      maxRetryTime: 30000,
    },
  });
  await producer.connect();
  const start = Date.now();
  try {
    await producer.send({
      topic: "orders", // topic config: min.insync.replicas=2, replication.factor=3
      messages: [
        {
          key: "order-456",
          value: JSON.stringify({
            orderId: "456",
            userId: "123",
            total: 99.99,
            items: [{ id: "product-1", qty: 2 }],
          }),
        },
      ],
      acks: -1,       // acks=all (wait for the full ISR)
      timeout: 30000,
    });
    const latency = Date.now() - start;
    console.log(`Order sent (acks=all): ${latency}ms`);
    // Typical output: 15-30ms
    // Guarantee: message on ≥2 replicas before the ack
  } catch (error) {
    if (error.type === "NOT_ENOUGH_REPLICAS") {
      // ISR < min.insync.replicas (degraded cluster)
      console.error("Cluster degraded: Not enough in-sync replicas");
      // Alert operations team; queue order for retry
    }
    throw error;
  }
}
// Demonstrating latency differences
async function benchmark() {
console.log("Benchmarking producer acknowledgments...\n");
await sendMetrics(); // ~1-2ms
await sendUserActivity(); // ~5-10ms
await sendOrder(); // ~15-30ms
// Trade-off: Latency vs Durability
// acks=0: Fastest, least safe
// acks=1: Balanced (default)
// acks=all: Slowest, safest
}
benchmark().catch(console.error);
Error Handling with acks=all
async function sendCriticalData(data) {
  const producer = kafka.producer({
    retry: {
      retries: 5,
      initialRetryTime: 300,
    },
  });
  await producer.connect();
  try {
    await producer.send({
      topic: "critical-data",
      messages: [{ value: JSON.stringify(data) }],
      acks: -1, // wait for the full ISR
    });
    console.log("Data persisted successfully (acks=all)");
  } catch (error) {
    // kafkajs surfaces broker error codes on error.type:
    if (error.type === "NOT_ENOUGH_REPLICAS") {
      // ISR < min.insync.replicas
      console.error("Not enough in-sync replicas");
      // Action: alert operations, queue for retry
    }
    if (error.type === "NOT_ENOUGH_REPLICAS_AFTER_APPEND") {
      // Message written to leader, but ISR shrank before replication
      console.error("Replication failed after append");
      // Action: retry (may duplicate; use an idempotent producer)
    }
    if (error.type === "REQUEST_TIMED_OUT") {
      // Replication took longer than the timeout
      console.error("Acknowledgment timeout");
      // Action: retry (may duplicate)
    }
    // Store in a dead letter queue for manual review
    // (storeInDLQ is an application-provided helper, not part of kafkajs)
    await storeInDLQ(data, error);
    throw error;
  }
}
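The snippet above assumes an application-provided storeInDLQ helper. One hedged sketch: build a dead-letter record capturing the payload and failure context, then publish it to a side topic with relaxed settings (e.g. acks=1) so the DLQ write cannot fail for the same reason as the original. Only the pure record builder is shown, so it stays runnable without a broker; the field layout is invented.

```javascript
// Build a dead-letter record carrying the failed payload plus failure context.
// Actually sending it (producer.send({ topic: "critical-data.dlq", ... }))
// is left to the caller.
function toDLQRecord(data, error, now = new Date()) {
  return {
    key: error.type || "UNKNOWN",
    value: JSON.stringify({
      payload: data,
      errorType: error.type || "UNKNOWN",
      errorMessage: error.message,
      failedAt: now.toISOString(),
    }),
  };
}

const err = Object.assign(new Error("ack timeout"), { type: "REQUEST_TIMED_OUT" });
const record = toDLQRecord({ orderId: "456" }, err);
console.log(record.key);                       // "REQUEST_TIMED_OUT"
console.log(JSON.parse(record.value).payload); // { orderId: '456' }
```

Keying the DLQ record by error type makes it easy to partition and triage failures later.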
Related Content
Prerequisites:
- Leader-Follower Replication - Understanding ISR
- Topic Partitioning - Kafka architecture
Related Concepts:
- Quorum - ISR is a form of quorum
- Idempotence - Idempotent producer with acks=all
- Exactly-Once Semantics - Combines idempotence + acks=all
Used In Systems:
- Kafka (producer acknowledgments)
- Pulsar (similar ack levels)
- RabbitMQ (publisher confirms)
Explained In Detail:
- Kafka Deep Dive - Producer mechanics and acknowledgments
Quick Self-Check
- Can explain acks=0/1/all in 60 seconds?
- Understand latency vs durability trade-offs?
- Know when messages can be lost for each ack level?
- Can explain min.insync.replicas and ISR?
- Understand acks=all + min.insync.replicas=2 pattern?
- Know which ack level to use for different use cases?