GPU memory, precision formats, quantization (INT4/INT8), and practical GPU selection for LLMs
TL;DR
GPU memory (VRAM) limits what models you can run. A 7B model needs ~14GB in FP16 or ~3.5GB in INT4. Quantization trades small quality loss for huge memory savings. Understanding these tradeoffs is essential for deploying LLMs.
Visual Overview
GPU vs CPU for ML
```
GPU VS CPU FOR ML

CPU: 8-64 powerful cores
  • Good at complex, sequential tasks
  • Each core handles different work

GPU: 1,000-10,000+ simple cores
  • Good at simple, parallel tasks
  • All cores do the same operation on different data

Matrix multiplication (the core ML operation):
  CPU: computes elements one by one (or a few at a time)
  GPU: computes thousands of elements simultaneously

Result: GPUs are 10-100x faster for ML workloads.
```
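To see the gap for yourself, here is a minimal timing sketch in PyTorch. The exact speedup depends entirely on your CPU and GPU, so treat this as an illustration rather than a benchmark.

```python
import time
import torch

def time_matmul(device: str, n: int = 2048, repeats: int = 3) -> float:
    """Average seconds per n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)              # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()    # GPU work is async; sync before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu') * 1e3:.1f} ms per 2048x2048 matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda') * 1e3:.1f} ms per 2048x2048 matmul")
```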
Key GPU specs:
| Spec | What it means | Why it matters |
|---|---|---|
| CUDA cores | Number of parallel processors | More = faster |
| VRAM | Video memory | Limits model size |
| Memory bandwidth | Data transfer speed | Limits throughput |
| Tensor cores | Specialized matrix units | 2-4x faster for ML |
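You can query most of these specs for your own hardware through PyTorch's CUDA APIs. A small sketch:

```python
import torch

# Check which GPU PyTorch sees and how much VRAM it reports.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:                 {props.name}")
    print(f"VRAM:                {props.total_memory / 1e9:.1f} GB")
    print(f"Multiprocessors:     {props.multi_processor_count}")
    print(f"Compute capability:  {props.major}.{props.minor}")  # 8.0+ (e.g. A100) supports BF16
else:
    print("No CUDA GPU detected; falling back to CPU.")
```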
VRAM (Video RAM)
GPU memory is separate from system RAM. Models must fit in VRAM.
VRAM Usage
```
WHAT USES VRAM

Training:   VRAM = Model + Activations + Gradients + Optimizer
Inference:  VRAM = Model + Activations (+ KV cache for LLMs)
```

```
TRAINING MEMORY (FP16)

7B parameter model:

  Model weights:     7B × 2 bytes = 14 GB
  Gradients:         7B × 2 bytes = 14 GB
  Optimizer (Adam):  7B × 8 bytes = 56 GB
    • m (momentum):        7B × 2 bytes
    • v (variance):        7B × 2 bytes
    • FP32 master weights: 7B × 4 bytes

  Total just for params: ~84 GB

  Activations (batch-dependent): additional 10-50+ GB

Training a 7B model needs 80-100+ GB VRAM
→ requires an A100 80GB or multi-GPU
```

```
INFERENCE MEMORY

7B parameter model:

  FP16: 7B × 2 bytes   = 14 GB
  INT8: 7B × 1 byte    = 7 GB
  INT4: 7B × 0.5 bytes = 3.5 GB

Plus KV cache (grows with context):
  Per token:   ~0.5-2 MB depending on model
  8K context:  4-16 GB additional

FP16 7B with 8K context: ~20-30 GB
INT4 7B with 8K context: ~8-12 GB
```
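The arithmetic above is easy to script. Here is a back-of-envelope estimator that mirrors those numbers; the constants are rough rules of thumb (activations and framework overhead are ignored), not exact measurements.

```python
# Bytes per parameter for each precision format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_vram_gb(n_params_b: float, precision: str, kv_cache_gb: float = 0.0) -> float:
    """Weights plus an optional KV-cache allowance, in GB."""
    return n_params_b * BYTES_PER_PARAM[precision] + kv_cache_gb

def training_vram_gb(n_params_b: float) -> float:
    """Mixed-precision training: weights (2 B) + gradients (2 B) + Adam state (8 B)
    per parameter, before activations (which add 10-50+ GB)."""
    return n_params_b * (2 + 2 + 8)

print(inference_vram_gb(7, "fp16"))                  # ~14 GB
print(inference_vram_gb(7, "int4", kv_cache_gb=8))   # ~11.5 GB with a large KV cache
print(training_vram_gb(7))                           # ~84 GB before activations
```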
Precision Formats
Different number formats trade accuracy for memory/speed.
```
PRECISION FORMATS

FP32 (32-bit float):
  • 4 bytes per parameter
  • Full precision, baseline quality
  • Slowest, most memory

FP16 (16-bit float):
  • 2 bytes per parameter
  • Slight precision loss
  • 2x faster, half the memory
  • Can overflow (limited range)

BF16 (bfloat16):
  • 2 bytes per parameter
  • Same range as FP32, less precision
  • Better for training than FP16
  • Supported on newer GPUs (A100+)

INT8 (8-bit integer):
  • 1 byte per parameter
  • Quantized (needs calibration)
  • 4x memory savings vs FP32
  • ~1% quality loss typically

INT4 (4-bit integer):
  • 0.5 bytes per parameter
  • Aggressive quantization
  • 8x memory savings vs FP32
  • 1-3% quality loss typically
```

```
MEMORY PER 1B PARAMETERS

FP32 ════════════════════════════════ 4.0 GB
FP16 ════════════════                 2.0 GB
BF16 ════════════════                 2.0 GB
INT8 ════════                         1.0 GB
INT4 ════                             0.5 GB
```
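PyTorch can report the size and numeric range of the floating-point formats directly, which makes the FP16-overflow caveat concrete. A small sketch:

```python
import torch

# Bytes per element and representable range for the float formats above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):>15}: {size} bytes/param, max ~{info.max:.3e}, eps ~{info.eps:.1e}")

# FP16 maxes out around 6.6e4 while BF16 and FP32 reach ~3.4e38,
# which is why BF16 is usually preferred for training on GPUs that support it.
```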
Quantization
Converting weights from high precision to lower precision.
```
POST-TRAINING QUANTIZATION (PTQ)

1. Train the model normally (FP16/FP32)
2. After training, quantize the weights
3. Calibrate with sample data

Pros: simple, no retraining
Cons: may lose quality with aggressive quantization
Used by: GPTQ, AWQ, most deployment tools
```

```
QUANTIZATION-AWARE TRAINING (QAT)

1. Train with simulated quantization
2. The model learns to be robust to quantization noise
3. Final weights naturally quantize well

Pros: better quality at low precision
Cons: requires training, more complex
Used when: PTQ quality is insufficient
```
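In practice, post-training quantization is often just a loading option. Here is a sketch of 4-bit loading through the Hugging Face transformers + bitsandbytes integration; the model name is a placeholder, and the exact memory savings depend on the checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-7b-model"  # placeholder; substitute the checkpoint you actually use

# NF4 4-bit weights with BF16 compute: a 7B model drops from ~14 GB (FP16) to roughly 4-5 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on available GPU(s), spilling to CPU if VRAM runs out
)
```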
```
OUT OF MEMORY (OOM)

Symptoms:
  • CUDA out of memory error
  • Training crashes

Immediate fixes:
  1. Reduce batch size
  2. Use gradient accumulation
  3. Enable gradient checkpointing
  4. Use mixed precision (FP16/BF16)

Longer-term fixes:
  1. Use quantization (INT8/INT4)
  2. Use LoRA instead of full fine-tuning
  3. Get more VRAM
```

```
MEMORY KEEPS GROWING

Symptoms:
  • Memory usage increases over time
  • Eventually OOM

Causes:
  • Not clearing cache
  • Storing too much history
  • KV cache not managed

Debug steps:
  1. torch.cuda.empty_cache() between batches
  2. del intermediate tensors
  3. For inference: limit context length or use a sliding window
```
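A minimal sketch of the "immediate fixes" combined: smaller per-step batches, gradient accumulation, and mixed precision. It assumes a Hugging Face-style model whose forward pass returns a `.loss`, plus existing `optimizer` and `dataloader` objects; it is illustrative, not a full training loop.

```python
import torch

accum_steps = 8  # effective batch size = per-step batch x accum_steps
scaler = torch.cuda.amp.GradScaler()
# model.gradient_checkpointing_enable()  # for HF models: trade extra compute for less memory

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(dataloader):
    inputs, targets = inputs.cuda(), targets.cuda()

    # Run forward/backward in FP16 where safe; the loss scaler prevents gradient underflow.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(inputs, labels=targets).loss / accum_steps

    scaler.scale(loss).backward()

    # Step the optimizer only every accum_steps mini-batches.
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```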
When This Matters
| Situation | What to know |
|---|---|
| Choosing a GPU for inference | Parameter count × bytes per parameter ≈ VRAM needed |
| Running 7B locally | INT4 quantization, ~8 GB needed |
| Training models | Need 4-6x the model size in VRAM |
| Fine-tuning on a consumer GPU | Use LoRA + INT8/INT4 |
| Getting OOM errors | Reduce batch size, use gradient accumulation |
| Understanding model cards | Check the precision (FP16, INT4, etc.) |
| Cost optimization | Lower precision = cheaper inference |
| Understanding quantization | INT4 ≈ 1-3% quality loss for 8x memory savings |
Interview Notes
- 💼 Interview relevance: ~65% of ML infrastructure interviews
- 🏭 Production impact: model deployment and cost
- ⚡ Performance: INT4 gives 8x memory savings vs FP32