GPU memory, precision formats, quantization (INT4/INT8), and practical GPU selection for LLMs
TL;DR
GPU memory (VRAM) limits what models you can run. A 7B model needs ~14GB in FP16 or ~3.5GB in INT4. Quantization trades small quality loss for huge memory savings. Understanding these tradeoffs is essential for deploying LLMs.
Visual Overview
GPU vs CPU for ML
```
GPU VS CPU FOR ML

CPU: 8-64 powerful cores
  • Good at complex, sequential tasks
  • Each core handles different work

GPU: 1,000-10,000+ simple cores
  • Good at simple, parallel tasks
  • All cores do the same operation on different data

Matrix multiplication (the core ML operation):
  CPU: computes elements one by one (or a few at a time)
  GPU: computes thousands of elements simultaneously

Result: GPUs are 10-100x faster for ML workloads.
```
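To see the gap for yourself, here is a minimal timing sketch in PyTorch. The exact speedup depends entirely on your CPU and GPU, so treat this as an illustration rather than a benchmark.

```python
import time
import torch

def time_matmul(device: str, n: int = 2048, repeats: int = 3) -> float:
    """Average seconds per n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)              # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()    # GPU work is async; sync before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu') * 1e3:.1f} ms per 2048x2048 matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda') * 1e3:.1f} ms per 2048x2048 matmul")
```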
Key GPU specs:
| Spec | What it means | Why it matters |
|---|---|---|
| CUDA cores | Number of parallel processors | More = faster |
| VRAM | Video memory | Limits model size |
| Memory bandwidth | Data transfer speed | Limits throughput |
| Tensor cores | Specialized matrix units | 2-4x faster for ML |
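You can query most of these specs for your own hardware through PyTorch's CUDA APIs. A small sketch:

```python
import torch

# Check which GPU PyTorch sees and how much VRAM it reports.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:                 {props.name}")
    print(f"VRAM:                {props.total_memory / 1e9:.1f} GB")
    print(f"Multiprocessors:     {props.multi_processor_count}")
    print(f"Compute capability:  {props.major}.{props.minor}")  # 8.0+ (e.g. A100) supports BF16
else:
    print("No CUDA GPU detected; falling back to CPU.")
```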
VRAM (Video RAM)
GPU memory is separate from system RAM. Models must fit in VRAM.
VRAM Usage
```
WHAT USES VRAM

Training:   VRAM = Model + Activations + Gradients + Optimizer
Inference:  VRAM = Model + Activations (+ KV cache for LLMs)
```

```
TRAINING MEMORY (FP16)

7B parameter model:

  Model weights:     7B × 2 bytes = 14 GB
  Gradients:         7B × 2 bytes = 14 GB
  Optimizer (Adam):  7B × 8 bytes = 56 GB
    • m (momentum):        7B × 2 bytes
    • v (variance):        7B × 2 bytes
    • FP32 master weights: 7B × 4 bytes

  Total just for params: ~84 GB

  Activations (batch-dependent): additional 10-50+ GB

Training a 7B model needs 80-100+ GB VRAM
→ requires an A100 80GB or multi-GPU
```

```
INFERENCE MEMORY

7B parameter model:

  FP16: 7B × 2 bytes   = 14 GB
  INT8: 7B × 1 byte    = 7 GB
  INT4: 7B × 0.5 bytes = 3.5 GB

Plus KV cache (grows with context):
  Per token:   ~0.5-2 MB depending on model
  8K context:  4-16 GB additional

FP16 7B with 8K context: ~20-30 GB
INT4 7B with 8K context: ~8-12 GB
```
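The arithmetic above is easy to script. Here is a back-of-envelope estimator that mirrors those numbers; the constants are rough rules of thumb (activations and framework overhead are ignored), not exact measurements.

```python
# Bytes per parameter for each precision format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_vram_gb(n_params_b: float, precision: str, kv_cache_gb: float = 0.0) -> float:
    """Weights plus an optional KV-cache allowance, in GB."""
    return n_params_b * BYTES_PER_PARAM[precision] + kv_cache_gb

def training_vram_gb(n_params_b: float) -> float:
    """Mixed-precision training: weights (2 B) + gradients (2 B) + Adam state (8 B)
    per parameter, before activations (which add 10-50+ GB)."""
    return n_params_b * (2 + 2 + 8)

print(inference_vram_gb(7, "fp16"))                  # ~14 GB
print(inference_vram_gb(7, "int4", kv_cache_gb=8))   # ~11.5 GB with a large KV cache
print(training_vram_gb(7))                           # ~84 GB before activations
```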
Precision Formats
Different number formats trade accuracy for memory/speed.
```
PRECISION FORMATS

FP32 (32-bit float):
  • 4 bytes per parameter
  • Full precision, baseline quality
  • Slowest, most memory

FP16 (16-bit float):
  • 2 bytes per parameter
  • Slight precision loss
  • 2x faster, half the memory
  • Can overflow (limited range)

BF16 (bfloat16):
  • 2 bytes per parameter
  • Same range as FP32, less precision
  • Better for training than FP16
  • Supported on newer GPUs (A100+)

INT8 (8-bit integer):
  • 1 byte per parameter
  • Quantized (needs calibration)
  • 4x memory savings vs FP32
  • ~1% quality loss typically

INT4 (4-bit integer):
  • 0.5 bytes per parameter
  • Aggressive quantization
  • 8x memory savings vs FP32
  • 1-3% quality loss typically
```

```
MEMORY PER 1B PARAMETERS

FP32 ════════════════════════════════ 4.0 GB
FP16 ════════════════                 2.0 GB
BF16 ════════════════                 2.0 GB
INT8 ════════                         1.0 GB
INT4 ════                             0.5 GB
```
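PyTorch can report the size and numeric range of the floating-point formats directly, which makes the FP16-overflow caveat concrete. A small sketch:

```python
import torch

# Bytes per element and representable range for the float formats above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):>15}: {size} bytes/param, max ~{info.max:.3e}, eps ~{info.eps:.1e}")

# FP16 maxes out around 6.6e4 while BF16 and FP32 reach ~3.4e38,
# which is why BF16 is usually preferred for training on GPUs that support it.
```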
Quantization
Converting weights from high precision to lower precision.
```
POST-TRAINING QUANTIZATION (PTQ)

1. Train the model normally (FP16/FP32)
2. After training, quantize the weights
3. Calibrate with sample data

Pros: simple, no retraining
Cons: may lose quality with aggressive quantization
Used by: GPTQ, AWQ, most deployment tools
```

```
QUANTIZATION-AWARE TRAINING (QAT)

1. Train with simulated quantization
2. The model learns to be robust to quantization noise
3. Final weights naturally quantize well

Pros: better quality at low precision
Cons: requires training, more complex
Used when: PTQ quality is insufficient
```
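In practice, post-training quantization is often just a loading option. Here is a sketch of 4-bit loading through the Hugging Face transformers + bitsandbytes integration; the model name is a placeholder, and the exact memory savings depend on the checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-7b-model"  # placeholder; substitute the checkpoint you actually use

# NF4 4-bit weights with BF16 compute: a 7B model drops from ~14 GB (FP16) to roughly 4-5 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on available GPU(s), spilling to CPU if VRAM runs out
)
```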
```
OUT OF MEMORY (OOM)

Symptoms:
  • CUDA out of memory error
  • Training crashes

Immediate fixes:
  1. Reduce batch size
  2. Use gradient accumulation
  3. Enable gradient checkpointing
  4. Use mixed precision (FP16/BF16)

Longer-term fixes:
  1. Use quantization (INT8/INT4)
  2. Use LoRA instead of full fine-tuning
  3. Get more VRAM
```

```
MEMORY KEEPS GROWING

Symptoms:
  • Memory usage increases over time
  • Eventually OOM

Causes:
  • Not clearing cache
  • Storing too much history
  • KV cache not managed

Debug steps:
  1. torch.cuda.empty_cache() between batches
  2. del intermediate tensors
  3. For inference: limit context length or use a sliding window
```
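A minimal sketch of the "immediate fixes" combined: smaller per-step batches, gradient accumulation, and mixed precision. It assumes a Hugging Face-style model whose forward pass returns a `.loss`, plus existing `optimizer` and `dataloader` objects; it is illustrative, not a full training loop.

```python
import torch

accum_steps = 8  # effective batch size = per-step batch x accum_steps
scaler = torch.cuda.amp.GradScaler()
# model.gradient_checkpointing_enable()  # for HF models: trade extra compute for less memory

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(dataloader):
    inputs, targets = inputs.cuda(), targets.cuda()

    # Run forward/backward in FP16 where safe; the loss scaler prevents gradient underflow.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(inputs, labels=targets).loss / accum_steps

    scaler.scale(loss).backward()

    # Step the optimizer only every accum_steps mini-batches.
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```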
When This Matters
| Situation | What to know |
|---|---|
| Choosing a GPU for inference | Parameter count × bytes per parameter ≈ VRAM needed |
| Running 7B locally | INT4 quantization, ~8 GB needed |
| Training models | Need 4-6x the model size in VRAM |
| Fine-tuning on a consumer GPU | Use LoRA + INT8/INT4 |
| Getting OOM errors | Reduce batch size, use gradient accumulation |
| Understanding model cards | Check the precision (FP16, INT4, etc.) |
| Cost optimization | Lower precision = cheaper inference |
| Understanding quantization | INT4 ≈ 1-3% quality loss for 8x memory savings |
Interview Notes
- 💼 Interview relevance: ~65% of ML infrastructure interviews
- 🏭 Production impact: model deployment and cost
- ⚡ Performance: INT4 gives 8x memory savings vs FP32