How databases horizontally partition data across multiple servers for scalability, using partition keys to distribute and route data efficiently
- Appears in roughly 80% of system design interviews
- Powers systems at MongoDB, Cassandra, Instagram
- Enables high query throughput at scale
- Handles thousands to billions of records
TL;DR
Sharding is a database architecture pattern that horizontally partitions data across multiple servers (shards), where each shard holds a subset of the total data. A shard key determines which shard stores each record, enabling systems to scale beyond single-server capacity while maintaining reasonable query performance.
Visual Overview
WITHOUT SHARDING (Single Database)
┌──────────────────────────────────────┐
│  Single Database Server              │
│  Users 1-1,000,000                   │
│  Storage: 500GB                      │
│  CPU: 100% (bottleneck)              │
│  Throughput: 5K queries/sec MAX      │
└──────────────────────────────────────┘
                  ↓
   Limited by single server capacity
WITH SHARDING (4 Shards)
┌───────────────────────────────────────────────────────────────┐
│                       APPLICATION LAYER                       │
│              (Routes queries based on shard key)              │
└───────────────────────────────┬───────────────────────────────┘
                                │
          ┌──────────────┬──────┴───────┬──────────────┐
          ▼              ▼              ▼              ▼
    ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐
    │  Shard 1  │  │  Shard 2  │  │  Shard 3  │  │  Shard 4  │
    │  Users    │  │  Users    │  │  Users    │  │  Users    │
    │  1-250K   │  │ 250K-500K │  │ 500K-750K │  │  750K-1M  │
    │           │  │           │  │           │  │           │
    │  125GB    │  │  125GB    │  │  125GB    │  │  125GB    │
    │  CPU: 25% │  │  CPU: 25% │  │  CPU: 25% │  │  CPU: 25% │
    └───────────┘  └───────────┘  └───────────┘  └───────────┘
Total Throughput: 20K queries/sec (4x improvement)
Total Storage: 500GB (distributed)
Horizontal Scalability: Add more shards to scale
Core Explanation
What is Sharding?
Sharding is a database partitioning technique that splits a large dataset horizontally across multiple independent databases (shards). Each shard contains a unique subset of the data, determined by a shard key (also called partition key).
Think of sharding like dividing a massive library across multiple buildings:
- Single library (no sharding): All 10 million books in one building, limited by building capacity
- Sharded library (4 buildings): Books divided alphabetically - Building A holds A-F, Building B holds G-L, etc. Each building handles fewer books and concurrent visitors
Key Characteristics:
- Horizontal partitioning: Splits data by rows, not columns
- Shard key: Determines which shard stores each record
- Independent shards: Each shard is a separate database server
- Distributed queries: Queries may hit one shard (ideal) or multiple shards
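To make the mechanics concrete, here is a minimal, illustrative sketch in which in-memory dicts stand in for independent shard databases (all names and numbers are invented for illustration):

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one shard's rows

def shard_for(user_id: int) -> int:
    # Pick a shard from the shard key; real systems usually hash the key first.
    return user_id % NUM_SHARDS

def insert_user(user_id: int, row: dict) -> None:
    shards[shard_for(user_id)][user_id] = row       # the row lives on exactly one shard

def get_user(user_id: int):
    return shards[shard_for(user_id)].get(user_id)  # single-shard read

insert_user(12345, {"name": "Ada"})
print(shard_for(12345), get_user(12345))            # -> 1 {'name': 'Ada'}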
Shard Key Selection
The shard key is the most critical design decision in sharding. A good shard key:
- Distributes data evenly: Avoids "hot shards" with disproportionate load
- Aligns with query patterns: Minimizes cross-shard queries
- Has high cardinality: Many unique values to distribute across shards
- Is immutable: Changing a shard key requires data migration
Example: E-commerce Database
OPTION 1: Shard by user_id (GOOD for user queries)
┌──────────────────────────────────────────────────────────────┐
│ Shard Key: user_id                                           │
│ Hash(user_id) % 4 = shard_number                             │
├──────────────────────────────────────────────────────────────┤
│ Shard 1: Users 1-250K      │ All data for user 12345         │
│ Shard 2: Users 250K-500K   │ All data for user 340987        │
│ Shard 3: Users 500K-750K   │ All data for user 655432        │
│ Shard 4: Users 750K-1M     │ All data for user 923456        │
└──────────────────────────────────────────────────────────────┘
✅ User profile query: Hits 1 shard (fast)
✅ User orders query: Hits 1 shard (all of the user's orders live on the same shard)
❌ Global analytics: Must query all 4 shards (slow)
OPTION 2: Shard by product_category (BAD - uneven distribution)
┌──────────────────────────────────────────────────────────────┐
│ Shard 1: Electronics (70% of traffic, HOT SHARD)             │
│ Shard 2: Books       (15% of traffic)                        │
│ Shard 3: Clothing    (10% of traffic)                        │
│ Shard 4: Home        (5% of traffic)                         │
└──────────────────────────────────────────────────────────────┘
❌ Uneven load: Shard 1 is overloaded
❌ Hot shard becomes the bottleneck
❌ Scales poorly (can't split a popular category easily)
Shard Key Strategies
1. Hash-Based Sharding (Most Common)
# Distribute data evenly using a hash function
shard_number = hash(user_id) % number_of_shards

# Example:
#   hash("user_12345") % 4 = 1  →  store on Shard 1
#   hash("user_67890") % 4 = 3  →  store on Shard 3
Pros:
- ✅ Even distribution (hash function randomizes)
- ✅ Simple to implement
- ✅ Predictable shard lookup
Cons:
- ❌ Difficult to add/remove shards (rehashing required)
- ❌ Can't do range queries on the shard key (e.g., user_id > 50000)
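One practical caveat if you sketch this in Python: the built-in hash() is randomized per process for strings, so a real router should use a stable hash. A minimal sketch (the shard count and key format are illustrative):

import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # MD5 gives the same digest on every server and every restart,
    # unlike Python's built-in hash(), which is randomized for strings.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user_12345"))  # same shard on every app server
print(shard_for("user_67890"))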
2. Range-Based Sharding
Shard 1: user_id 1-250,000
Shard 2: user_id 250,001-500,000
Shard 3: user_id 500,001-750,000
Shard 4: user_id 750,001-1,000,000
Pros:
- ✅ Supports range queries (all users with id > 500,000)
- ✅ Easy to add shards (just split ranges)
Cons:
- ❌ Risk of uneven distribution (new users may cluster in recent ranges)
- ❌ Hot shards if data is time-based (newest users get all traffic)
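Range lookup is typically a sorted-boundary search. A minimal sketch assuming the four ranges above, with boundaries as inclusive upper bounds:

import bisect

RANGE_UPPER_BOUNDS = [250_000, 500_000, 750_000, 1_000_000]  # ascending, inclusive

def shard_for(user_id: int) -> int:
    # The first boundary >= user_id identifies the owning shard (0-based index).
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError("user_id falls outside all configured ranges")
    return idx

print(shard_for(12_345))   # -> 0 (Shard 1)
print(shard_for(600_000))  # -> 2 (Shard 3)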
3. Geographic Sharding
Shard 1: Users in North America
Shard 2: Users in Europe
Shard 3: Users in Asia
Shard 4: Users in South America
Pros:
- ✅ Reduces latency (data close to users)
- ✅ Regulatory compliance (data residency requirements)
- ✅ Simple routing (IP-based or user-selected)
Cons:
- ❌ Uneven distribution (more users in some regions)
- ❌ Cross-region queries are slow
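Routing for geographic sharding is often just a static mapping from a region attribute to a shard endpoint. A minimal sketch; the region names and hostnames below are invented:

REGION_TO_SHARD = {
    "north_america": "db-na.example.internal",
    "europe":        "db-eu.example.internal",
    "asia":          "db-ap.example.internal",
    "south_america": "db-sa.example.internal",
}

def shard_for(region: str) -> str:
    # Fall back to one region's shard if the user has no region set.
    return REGION_TO_SHARD.get(region, REGION_TO_SHARD["north_america"])

print(shard_for("europe"))  # -> db-eu.example.internal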
Query Routing
When a query arrives, the application layer must determine which shard(s) to query:
Single-Shard Query (Ideal - Fast)
-- Query includes shard key
SELECT * FROM orders WHERE user_id = 12345;
Routing:
1. Extract shard key: user_id = 12345
2. Calculate shard: hash(12345) % 4 = 1
3. Query only Shard 1
4. Return results
Performance: O(1) shard lookup + single database query
Multi-Shard Query (Scatter-Gather - Slow)
-- Query does NOT include shard key
SELECT * FROM orders WHERE product_id = 'XYZ123';
Routing:
1. No shard key → Must query all shards
2. Send query to Shards 1, 2, 3, 4 in parallel
3. Merge results from all shards
4. Sort/paginate combined results
5. Return to client
Performance: O(N) where N = number of shards
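A router sketch that ties the two cases together: hit one shard when the shard key is present, otherwise scatter-gather across all shards in parallel. Here query_shard is a hypothetical helper standing in for a per-shard database call:

from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

def query_shard(shard: int, sql: str, params: dict) -> list:
    # Hypothetical helper: run `sql` against one shard and return its rows.
    # A real implementation would use a connection pool per shard.
    return []

def find_orders(sql: str, params: dict) -> list:
    if "user_id" in params:                           # shard key present -> 1 shard
        shard = hash(params["user_id"]) % NUM_SHARDS
        return query_shard(shard, sql, params)

    # No shard key -> scatter-gather: query every shard, then merge.
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        futures = [pool.submit(query_shard, s, sql, params) for s in range(NUM_SHARDS)]
        rows = []
        for future in futures:
            rows.extend(future.result())
    return rows                                       # caller still sorts/paginates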
Query Pattern Optimization:
GOOD: Include shard key in WHERE clause
SELECT * FROM users WHERE user_id = 123; → Hits 1 shard
BAD: Query without shard key
SELECT * FROM users WHERE email = 'user@example.com'; → Hits all shards
SOLUTION: Secondary index or denormalization
- Option 1: Maintain email → user_id mapping in separate lookup table
- Option 2: Duplicate user data in email-sharded collection
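Option 1 sketched below: the lookup table is itself sharded by email, so resolving an email costs two single-shard reads instead of querying every shard (all structures here are illustrative, in-memory stand-ins):

NUM_SHARDS = 4
email_index = [dict() for _ in range(NUM_SHARDS)]  # email -> user_id, sharded by email
users       = [dict() for _ in range(NUM_SHARDS)]  # user_id -> profile, sharded by user_id

def email_shard(email: str) -> int:
    return hash(email) % NUM_SHARDS    # a real system would use a stable hash

def user_shard(user_id: int) -> int:
    return user_id % NUM_SHARDS

def get_user_by_email(email: str):
    user_id = email_index[email_shard(email)].get(email)    # single-shard lookup #1
    if user_id is None:
        return None
    return users[user_shard(user_id)].get(user_id)          # single-shard lookup #2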
Rebalancing and Resharding
As data grows, you may need to add more shards. This is challenging:
The Resharding Problem:
BEFORE: 4 shards
Shard 0: hash(key) % 4 = 0
Shard 1: hash(key) % 4 = 1
Shard 2: hash(key) % 4 = 2
Shard 3: hash(key) % 4 = 3
AFTER: 8 shards (doubled)
Shard 0: hash(key) % 8 = 0 → Different assignment!
Shard 1: hash(key) % 8 = 1
...
Shard 7: hash(key) % 8 = 7
Problem: Most keys now map to different shards
Result: Must migrate roughly 50% of data when doubling from 4 to 8 shards (and closer to 80% for a non-multiple change such as 4 → 5)
Solutions:
- Consistent Hashing: Minimizes data movement when adding shards
- Virtual Shards: More shards than physical servers, easier rebalancing
- Pre-sharding: Start with more shards than needed (e.g., 256 shards on 4 servers)
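A minimal consistent-hashing sketch with virtual nodes; it also measures how little data moves when a fifth shard is added (shard names and the vnode count are arbitrary):

import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, shards, vnodes=64):
        # Each shard owns `vnodes` points on the ring for smoother balance.
        self._ring = sorted((ring_hash(f"{shard}#{i}"), shard)
                            for shard in shards for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise of the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user_{i}" for i in range(10_000)]
before = ConsistentHashRing(["s1", "s2", "s3", "s4"])
after = ConsistentHashRing(["s1", "s2", "s3", "s4", "s5"])
moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/5, vs ~80% with hash % N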
Tradeoffs
Advantages:
- ✅ Horizontal scalability: Add more servers to handle more data
- ✅ Improved throughput: Queries distributed across multiple databases
- ✅ Fault isolation: One shard failure doesn't affect others
- ✅ Reduced latency: Smaller datasets per shard = faster queries
Disadvantages:
- ❌ Increased complexity: Application must handle routing logic
- ❌ Cross-shard queries are expensive: Scatter-gather operations are slow
- ❌ Transactions across shards: Difficult or impossible (need distributed transactions)
- ❌ Rebalancing is hard: Adding/removing shards requires data migration
- ❌ Hot shards: Poor shard key choice leads to uneven load
Real Systems Using Sharding
MongoDB (Auto-Sharding)
- Implementation: Chunk-based sharding with automatic balancing
- Shard Key: Chosen by user (e.g., user_id, timestamp)
- Scale: Supports thousands of shards
- Typical Setup: Start with 3 shards, auto-split and rebalance as data grows
MongoDB Sharding Architecture:
┌───────────────────────────────────────────┐
│              mongos (Router)              │
│     Routes queries based on shard key     │
└─────────────────────┬─────────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     ┌─────────┐ ┌─────────┐ ┌─────────┐
     │ Shard 1 │ │ Shard 2 │ │ Shard 3 │
     │ Replica │ │ Replica │ │ Replica │
     │   Set   │ │   Set   │ │   Set   │
     └─────────┘ └─────────┘ └─────────┘
Cassandra (Hash Partitioning)
- Implementation: Consistent hashing with virtual nodes
- Partition Key: First part of primary key
- Scale: Designed for massive scale (Instagram uses 1000+ nodes)
- Typical Setup: 256 virtual nodes per physical server
DynamoDB (Managed Sharding)
- Implementation: Automatic partitioning by AWS
- Partition Key: Required in table schema
- Scale: Auto-scales partitions based on throughput
- Typical Setup: Transparent to user (AWS manages shards)
Instagram (Custom Sharding)
- Implementation: PostgreSQL with application-level sharding
- Shard Key: user_id
- Scale: Thousands of database servers
- Strategy: Store all user data (photos, likes, followers) on same shard for single-shard queries
When to Use Sharding
✅ Perfect Use Cases
High Write Throughput
Scenario: Social media platform with 10M+ writes/second
Challenge: Single database can't handle write load
Solution: Shard by user_id, distribute writes across 100+ shards
Result: Each shard handles 100K writes/second (manageable)
Large Dataset That Doesn't Fit on One Server
Scenario: E-commerce with 5TB product catalog + order history
Challenge: Single database server has 1TB storage
Solution: Shard by user_id or timestamp
Result: Each shard stores 500GB (5TB / 10 shards)
Read-Heavy Workload with Query Patterns
Scenario: User-facing app where 95% of queries are per-user
Solution: Shard by user_id, most queries hit single shard
Result: Linear scalability (add shards = add capacity)
❌ When NOT to Use Sharding
Small Dataset (<100GB)
Problem: Sharding adds complexity
Alternative: Vertical scaling (bigger server) or read replicas
Threshold: Consider sharding when >500GB or >10K writes/sec
Frequent Cross-Shard Queries
Problem: Analytics queries require aggregating across all shards
Example: "Total revenue for product category X across all users"
Alternative: Separate OLAP database (e.g., data warehouse)
Need for ACID Transactions Across Entities
Problem: Sharding breaks cross-shard transactions
Example: Money transfer between users on different shards
Alternative: Keep transactional data unsharded, shard read-heavy data
Interview Application
Common Interview Question 1
Q: "Design a database for Twitter. How would you shard the data?"
Strong Answer:
"I'd shard by user_id using hash-based partitioning. Here's why:
Rationale:
- Most queries are user-centric: get a user's tweets, timeline, followers
- Sharding by user_id means all user data lives on one shard
- Single-shard queries are fast and don't require cross-shard operations
Shard Key:
hash(user_id) % number_of_shards
Data Co-location:
- User profile → Shard X
- User's tweets → Shard X
- User's followers → Shard X
- User's timeline cache → Shard X
Query Patterns:
- Get user profile: Single-shard query ✅
- Get user's tweets: Single-shard query ✅
- Post new tweet: Single-shard write ✅
Cross-Shard Challenge:
- Building home timeline (tweets from followed users) requires cross-shard queries
- Solution: Pre-compute timelines using fan-out on write (write tweets to follower timelines)
Scaling Strategy:
- Start with 16 shards (over-provision)
- As users grow, add more shards using consistent hashing
- Use virtual shards (256 virtual shards, 16 physical servers initially)"
Why This Answer Works:
- Identifies appropriate shard key with reasoning
- Explains query pattern optimization
- Addresses cross-shard challenge with solution
- Discusses scaling strategy
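A toy sketch of the fan-out-on-write idea mentioned in the answer: posting a tweet pushes its id into each follower's precomputed timeline, which lives on the follower's shard, so reading a home timeline stays a single-shard lookup (all data structures and numbers are illustrative):

NUM_SHARDS = 16
timelines = [dict() for _ in range(NUM_SHARDS)]  # user_id -> list of tweet ids
followers = {42: [7, 9, 300]}                    # author -> follower ids (toy data)

def shard_for(user_id: int) -> int:
    return user_id % NUM_SHARDS

def post_tweet(author_id: int, tweet_id: int) -> None:
    for follower_id in followers.get(author_id, []):
        shard = timelines[shard_for(follower_id)]
        shard.setdefault(follower_id, []).append(tweet_id)  # single-shard write per follower

def home_timeline(user_id: int) -> list:
    return timelines[shard_for(user_id)].get(user_id, [])   # single-shard read

post_tweet(42, tweet_id=1001)
print(home_timeline(7))  # -> [1001]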
Common Interview Question 2
Q: "Your sharded database has a 'hot shard' that's getting 10x more traffic than others. How do you fix it?"
Strong Answer:
"A hot shard indicates poor shard key distribution. Here's how I'd address it:
Immediate Fix (Short-term):
- Vertical scaling: Upgrade the hot shard's hardware temporarily
- Read replicas: Add read replicas for hot shard to distribute read load
- Caching: Cache frequently accessed data from hot shard
Root Cause Analysis:
- Is it a specific celebrity user? (data skew)
- Is it timestamp-based clustering? (recent data hotspot)
- Is it a geographic region? (regional load)
Long-term Fix (Depends on cause):
If celebrity users:
- Give top 1% users dedicated shards
- Use composite shard key: (is_celebrity, user_id)
- Celebrities distributed separately
If timestamp clustering:
- Switch from range-based to hash-based sharding
- Use: hash(user_id) instead of timestamp ranges
If geographic:
- Further subdivide hot region
- E.g., Split "North America" into US-East, US-West, Canada
Rebalancing Strategy:
- Use consistent hashing to minimize data movement
- Perform migration during low-traffic hours
- Keep old shard online during migration (dual writes)
- Cutover once new shard is caught up
Prevention:
- Monitor shard metrics (CPU, throughput, latency)
- Alert when shard imbalance >20%
- Choose shard keys with high cardinality and even distribution"
Why This Answer Works:
- Immediate actions + root cause analysis
- Multiple solutions depending on scenario
- Rebalancing strategy with minimal downtime
- Preventive measures
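A sketch of the "celebrity" fix in routing terms: a small allow-list of very hot accounts maps to dedicated shards, while everyone else keeps the normal hash placement (shard names, counts, and ids below are invented):

NUM_REGULAR_SHARDS = 16
CELEBRITY_SHARDS = ["celebrity-shard-1", "celebrity-shard-2"]  # dedicated capacity
CELEBRITY_IDS = {1001, 1002, 1003}                             # hottest ~1% of accounts

def shard_for(user_id: int) -> str:
    if user_id in CELEBRITY_IDS:
        # Spread hot accounts over their own shard pool.
        return CELEBRITY_SHARDS[user_id % len(CELEBRITY_SHARDS)]
    return f"shard-{user_id % NUM_REGULAR_SHARDS}"

print(shard_for(1002))  # -> celebrity-shard-1
print(shard_for(42))    # -> shard-10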
Red Flags to Avoid
- ❌ Suggesting sharding for small datasets (<100GB)
- ❌ Not considering query patterns when choosing shard key
- ❌ Ignoring cross-shard query challenges
- ❌ Not explaining how to handle hot shards
- ❌ Forgetting about rebalancing complexity
Quick Self-Check
Before moving on, can you:
- Explain sharding in 60 seconds?
- Draw a diagram showing data distributed across shards?
- Explain 3 shard key strategies (hash, range, geographic)?
- Describe the difference between single-shard and cross-shard queries?
- Identify when to use vs NOT use sharding?
- Explain how to handle a hot shard?
Related Content
Prerequisites
None - foundational database scaling concept
Related Concepts
- Topic Partitioning - Similar concept for message streams
- Load Balancing - Distributing traffic across servers
- Consensus - Coordination in distributed sharded systems
Used In Systems
- Twitter: User-sharded database
- Instagram: Photo and user data sharded by user_id
- Uber: Trips sharded by geohash
Explained In Detail
- Scaling Databases - Comprehensive sharding strategies
Next Recommended: Consensus - Learn how shards coordinate in distributed systems