
Memory & Compute

GPU memory, precision formats, quantization (INT4/INT8), and practical GPU selection for LLMs

TL;DR

GPU memory (VRAM) limits what models you can run. A 7B model needs ~14GB in FP16 or ~3.5GB in INT4. Quantization trades small quality loss for huge memory savings. Understanding these tradeoffs is essential for deploying LLMs.
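
Where do those numbers come from? Weight memory is just parameter count times bytes per parameter. A minimal sketch of that arithmetic (Python; the per-format byte counts are standard, the 7B figure is just an example):

```python
# Weight memory (GB) = parameter count * bytes per parameter / 1e9.
# KV cache, activations, and framework buffers add more on top of this.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP32", "FP16", "INT8", "INT4"):
    print(f"7B model in {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
# -> 28.0, 14.0, 7.0, 3.5 GB respectively (weights only)
```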

Visual Overview

[Diagram: GPU vs CPU for ML]

Key GPU specs:

| Spec | What it means | Why it matters |
|---|---|---|
| CUDA cores | Number of parallel processors | More = faster |
| VRAM | Video memory | Limits model size |
| Memory bandwidth | Data transfer speed | Limits throughput |
| Tensor cores | Specialized matrix units | 2-4x faster for ML |
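
For reference, a short sketch of reading these specs off the current GPU with PyTorch (assumes a CUDA device is available; PyTorch reports streaming multiprocessors rather than raw CUDA core counts):

```python
import torch

# Query the specs of the current GPU (assumes PyTorch with a CUDA device).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:   {props.name}")
    print(f"VRAM:  {props.total_memory / 1e9:.1f} GB")
    # multi_processor_count is the number of SMs, not individual CUDA cores.
    print(f"SMs:   {props.multi_processor_count}")
else:
    print("No CUDA GPU detected")
```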

VRAM (Video RAM)

GPU memory is separate from system RAM. A model's weights, plus activations and the KV cache during inference, must fit in VRAM.

[Diagram: VRAM Usage]
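
A minimal sketch, assuming PyTorch with a CUDA GPU, that compares an estimated weight footprint against currently free VRAM before loading anything:

```python
import torch

def fits_in_vram(num_params: float, bytes_per_param: float, device: int = 0) -> bool:
    """Rough check: do the model weights alone fit in currently free VRAM?"""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    needed_bytes = num_params * bytes_per_param
    print(f"free {free_bytes / 1e9:.1f} / {total_bytes / 1e9:.1f} GB, "
          f"need ~{needed_bytes / 1e9:.1f} GB for weights")
    return needed_bytes < free_bytes

# 7B parameters in FP16 -- note this leaves no headroom for KV cache or activations.
fits_in_vram(7e9, 2.0)
```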

Precision Formats

Different number formats trade accuracy for memory/speed.

[Diagram: Precision Formats]
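
A small sketch of what lower precision costs in practice: cast the same FP32 tensor to FP16 and BF16 in PyTorch and compare bytes per element and round-off error (INT8/INT4 and FP8 need extra libraries, so they are omitted here):

```python
import torch

x = torch.randn(4, dtype=torch.float32) * 1000  # large values make rounding error visible

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    y = x.to(dtype)
    # element_size() is bytes per element: 4 for FP32, 2 for FP16/BF16.
    err = (y.float() - x).abs().max().item()
    print(f"{str(dtype):>15}: {y.element_size()} bytes/element, max abs error {err:.3f}")
```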

Quantization

Quantization converts model weights from a higher-precision format (e.g., FP16) to a lower-precision one (e.g., INT8 or INT4).

[Diagram: Quantization Basics]
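
A minimal sketch of the core idea, using symmetric per-tensor INT8 quantization; real methods such as GPTQ and AWQ are more sophisticated, and the scale-from-max-abs rule here is just the simplest variant:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: scale = max|w| / 127, q = round(w / scale)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024)
q, scale = quantize_int8(w)
error = (dequantize(q, scale) - w).abs().mean()
print(f"FP32: {w.numel() * 4} bytes, INT8: {q.numel()} bytes, mean abs error: {error:.5f}")
```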

Quantization Methods

[Diagram: Quantization Methods]

Common Quantization Formats

| Format | Description | Quality | Use case |
|---|---|---|---|
| GPTQ | 4-bit, row-wise | Good | GPU inference |
| AWQ | 4-bit, activation-aware | Better | GPU inference |
| GGUF | Various bit widths, CPU-friendly | Good | CPU/Mac inference |
| bitsandbytes | 4/8-bit, dynamic | Good | Training + inference |
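
As an illustration of the bitsandbytes row, a sketch of loading a model in 4-bit NF4 through Hugging Face transformers; the model ID is a placeholder, and this assumes transformers, accelerate, and bitsandbytes are installed with a CUDA GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with FP16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs, offloading if needed
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f"~{model.get_memory_footprint() / 1e9:.1f} GB for the quantized weights")
```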

Practical GPU Selection

[Diagram: GPU Selection Guide]

Common GPUs

| GPU | VRAM | Good for |
|---|---|---|
| RTX 3090 | 24 GB | Dev, INT4 inference |
| RTX 4090 | 24 GB | Dev, LoRA fine-tuning |
| A10 | 24 GB | Cloud inference |
| L4 | 24 GB | Cloud inference (efficient) |
| A100 40GB | 40 GB | Training, large inference |
| A100 80GB | 80 GB | Large model training |
| H100 80GB | 80 GB | Fastest training |
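
A small helper that filters the table above by an estimated inference footprint; the 1.2x overhead factor for KV cache and activations is a rough assumption, not a guarantee that a given GPU will work:

```python
# VRAM per GPU from the table above (GB).
GPUS = {
    "RTX 3090": 24, "RTX 4090": 24, "A10": 24, "L4": 24,
    "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80,
}

def gpus_that_fit(num_params: float, bytes_per_param: float, overhead: float = 1.2):
    """List GPUs whose VRAM covers weights * overhead (rough heuristic only)."""
    needed_gb = num_params * bytes_per_param * overhead / 1e9
    return needed_gb, [name for name, vram in GPUS.items() if vram >= needed_gb]

needed, options = gpus_that_fit(7e9, 2.0)   # 7B in FP16 -> ~16.8 GB
print(f"~{needed:.1f} GB needed: {options}")
needed, options = gpus_that_fit(70e9, 0.5)  # 70B in INT4 -> ~42 GB
print(f"~{needed:.1f} GB needed: {options}")
```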

Debugging Memory Issues

[Diagram: Debugging Memory Issues]
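
A sketch of the PyTorch calls typically used when chasing an OOM: compare allocated vs reserved memory, catch the error, and clear the caching allocator (assumes a CUDA GPU):

```python
import torch

def print_vram_usage(tag: str = "") -> None:
    # allocated = live tensors; reserved = what the caching allocator holds from the driver.
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag} allocated={allocated:.2f} GB reserved={reserved:.2f} GB")

print_vram_usage("before")
try:
    x = torch.empty(200_000_000_000, dtype=torch.float16, device="cuda")  # deliberately huge
except torch.cuda.OutOfMemoryError:
    print("OOM -- reduce batch size, sequence length, or precision")
    torch.cuda.empty_cache()           # return cached blocks to the driver
print_vram_usage("after")
# print(torch.cuda.memory_summary())  # detailed allocator breakdown when debugging
```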

When This Matters

| Situation | What to know |
|---|---|
| Choosing a GPU for inference | Parameter count x bytes per parameter ≈ VRAM needed |
| Running 7B locally | INT4 quantization; fits on a ~8 GB GPU |
| Training models | Needs roughly 4-6x the model's weight memory (gradients + optimizer states) |
| Fine-tuning on a consumer GPU | Use LoRA + INT8/INT4 |
| Getting OOM errors | Reduce batch size, use gradient accumulation (see the sketch after this table) |
| Understanding model cards | Check the precision (FP16, INT4, etc.) |
| Cost optimization | Lower precision = cheaper inference |
| Understanding quantization | INT4 ≈ 1-3% quality loss for 8x memory savings vs FP32 |
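
A minimal gradient-accumulation sketch (the model, dataloader, optimizer, and loss function are placeholders): the effective batch is micro-batch x accumulation steps, but only one micro-batch of activations lives in VRAM at a time:

```python
accum_steps = 8  # effective batch = micro-batch size * accum_steps

def train_epoch(model, dataloader, optimizer, loss_fn):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs.cuda())
        loss = loss_fn(outputs, targets.cuda()) / accum_steps  # scale so gradients average correctly
        loss.backward()                                        # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```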

Interview Notes

Interview relevance: ~65% of ML infrastructure interviews
Production impact: model deployment and cost
Performance: INT4 gives ~8x memory savings vs FP32