Foundation vocabulary for machine learning: parameters, weights, logits, training vs inference, and why neural networks work
Why This Matters
- Comes up in roughly 80% of ML interviews
- Underpins every production ML system
- Foundation for all AI work
TL;DR
A machine learning model is a mathematical function that maps inputs to outputs, with learnable parameters that are adjusted during training. Understanding parameters, logits, training vs inference, and the bias-variance tradeoff is essential vocabulary for any AI engineering work.
Visual Overview
MODEL AS FUNCTION
+-----------------------------------------------------------+
| |
| Input (X) --> Model --> Output (Y) |
| ------------------------------------------------- |
| "Is this spam?" f(x) 0.92 (yes) |
| [image pixels] f(x) "cat" |
| "Translate this" f(x) "Bonjour" |
| |
| The function has PARAMETERS--numbers that determine |
| how inputs map to outputs. Training adjusts them. |
| |
+-----------------------------------------------------------+
UNTRAINED VS TRAINED
+-----------------------------------------------------------+
| |
| Untrained model: random parameters --> garbage output |
| Trained model: learned parameters --> useful output |
| |
+-----------------------------------------------------------+
Parameters, Weights, and Biases
Parameters are the learnable values inside a model.
SIMPLE LINEAR MODEL
+-----------------------------------------------------------+
| |
| y = w * x + b |
| |
| w = weight (how much x matters) |
| b = bias (baseline offset) |
| |
| These are parameters. Training finds good values. |
| |
+-----------------------------------------------------------+
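In code, that model is just two numbers plus a formula. A minimal sketch (the values of w and b here are made up for illustration):

```python
# A tiny "model": the prediction is completely determined by w and b.
w = 2.0   # weight: how much the input matters
b = 0.5   # bias: baseline offset

def model(x):
    return w * x + b

print(model(3.0))   # 6.5 -- change w or b and the same input maps somewhere else
```

Training is nothing more than nudging w and b until the outputs match the data.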
MODERN LLM SCALE
+-----------------------------------------------------------+
| |
| GPT-2 (small): 124 million parameters |
| GPT-3: 175 billion parameters |
| LLaMA 70B: 70 billion parameters |
| Claude: undisclosed, but similar scale |
| |
| Each parameter is a floating-point number. |
| More parameters = more capacity to learn patterns. |
| |
+-----------------------------------------------------------+
Logits
Logits are the raw, unnormalized scores a model outputs before they are converted to probabilities.
LOGITS TO PROBABILITIES
+-----------------------------------------------------------+
| |
| Model outputs logits: |
| "cat": 4.2 |
| "dog": 2.1 |
| "car": -1.3 |
| |
| These are arbitrary numbers. Higher = more likely. |
| |
| Apply softmax to convert to probabilities: |
| "cat": 0.89 |
| "dog": 0.10 |
| "car": 0.01 |
| |
| Now they sum to 1 and represent confidence. |
| |
+-----------------------------------------------------------+
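A minimal NumPy sketch of that conversion (the labels and logit values are the ones from the box above; the rounded probabilities come out close to the ones shown):

```python
import numpy as np

logits = {"cat": 4.2, "dog": 2.1, "car": -1.3}

scores = np.array(list(logits.values()))
exps = np.exp(scores - scores.max())   # subtract the max for numerical stability
probs = exps / exps.sum()              # softmax: non-negative and sums to 1

for label, p in zip(logits, probs):
    print(f"{label}: {p:.2f}")         # roughly 0.89 / 0.11 / 0.00
print(probs.sum())                     # 1.0 (up to floating-point rounding)
```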
Why logits matter:
- LLMs output one logit per vocabulary token (often 50,000+ scores per step)
- Temperature and sampling operate on logits (see the sketch after this list)
- Understanding logits helps debug generation issues
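Temperature is worth a quick sketch of its own: sampling code divides the logits by the temperature before applying softmax, so low temperatures sharpen the distribution and high temperatures flatten it. This is a simplified view; real decoders add top-k/top-p filtering and other details.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.2, 2.1, -1.3])    # same logits as above

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 2))
# Low temperature -> probability mass piles onto the top logit;
# high temperature -> the distribution spreads out (more varied sampling).
```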
Deterministic vs Probabilistic vs Statistical
THREE TYPES OF SYSTEMS
+-----------------------------------------------------------+
| |
| DETERMINISTIC: Same input --> same output, always |
| |
| def deterministic(x): |
| return x * 2 + 5 |
| |
| deterministic(3) # Always 11 |
| |
+-----------------------------------------------------------+
| |
| PROBABILISTIC: Output includes randomness |
| |
| import random |
| |
| def probabilistic(x): |
| return x * 2 + 5 + random.gauss(0, 1) |
| |
| probabilistic(3) # 11.23 one time, 10.87 next |
| |
+-----------------------------------------------------------+
| |
| STATISTICAL: Learns patterns from data |
| |
| from sklearn.linear_model import LinearRegression |
| |
| model = LinearRegression() |
| model.fit(X_train, y_train) # Learns from data |
| model.predict(X_new) # Generalizes |
| |
+-----------------------------------------------------------+
ML models are statistical systems that often use probabilistic methods:
- They learn from data (statistical)
- They may sample from distributions (probabilistic)
- Given same input + same random seed, they’re deterministic
LLMs sampled with temperature > 0 are probabilistic. With temperature = 0 (greedy decoding), they are effectively deterministic.
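A quick sketch of that point, reusing the probabilistic function from the box above: fixing the random seed makes the "random" output reproducible.

```python
import random

def probabilistic(x):
    return x * 2 + 5 + random.gauss(0, 1)

random.seed(42)
first = probabilistic(3)

random.seed(42)              # same seed -> same sequence of random draws
second = probabilistic(3)

print(first == second)       # True: the randomness is repeatable
```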
Training vs Inference
TRAINING
+-----------------------------------------------------------+
| |
| Training loop: |
| 1. Forward pass: compute prediction |
| 2. Compute loss: how wrong is it? |
| 3. Backward pass: compute gradients |
| 4. Update parameters: reduce error |
| 5. Repeat millions of times |
| |
| Training is: |
| - Expensive (weeks on GPU clusters) |
| - Done once (or periodically) |
| - Requires large amounts of training data |
| |
+-----------------------------------------------------------+
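A toy version of that loop, fitting the earlier y = w * x + b model to synthetic data with plain gradient descent in NumPy. This is a sketch of the four steps, not how large models are actually trained (they use automatic differentiation, minibatches, and fancier optimizers):

```python
import numpy as np

# Synthetic data generated from a known rule: y = 3x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 1 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0      # start from arbitrary parameter values
lr = 0.1             # learning rate

for step in range(500):
    y_pred = w * x + b                        # 1. forward pass
    loss = np.mean((y_pred - y) ** 2)         # 2. loss: mean squared error
    grad_w = np.mean(2 * (y_pred - y) * x)    # 3. gradients of the loss
    grad_b = np.mean(2 * (y_pred - y))
    w -= lr * grad_w                          # 4. update parameters
    b -= lr * grad_b                          # 5. repeat

print(round(w, 2), round(b, 2))               # close to the true values 3 and 1
```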
INFERENCE
+-----------------------------------------------------------+
| |
| Inference: |
| 1. Load trained parameters |
| 2. Forward pass only |
| 3. Return prediction |
| |
| Inference is: |
| - Cheap (milliseconds per prediction) |
| - Done continuously in production |
| - No parameter updates |
| |
+-----------------------------------------------------------+
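Inference with that same toy model is just the forward pass, using whatever parameter values training produced (the numbers below stand in for the learned values from the training sketch above):

```python
# Parameters are loaded once (from training or from disk) and never change.
w, b = 3.0, 1.0

def predict(x):
    return w * x + b      # forward pass only: no loss, no gradients, no updates

print(predict(2.0))       # 7.0
```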
You will mostly do inference. Training LLMs requires massive compute. Fine-tuning is more accessible but still expensive.
Why Neural Networks Work
Neural networks are function approximators. Given enough parameters, they can represent essentially any pattern in data; training is the process of actually finding parameter values that capture it.
UNIVERSAL APPROXIMATION
+-----------------------------------------------------------+
| |
| Theorem (informal): a feed-forward network with a |
| single hidden layer and enough neurons can approximate |
| any continuous function on a bounded input range to |
| arbitrary precision. |
| |
| Translation: If a pattern exists in your data, |
| a big enough network can learn it. |
| |
+-----------------------------------------------------------+
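Here is a tiny, hand-built (not trained) illustration: a one-hidden-layer ReLU network whose weights are chosen so that it reproduces the piecewise-linear interpolation of f(x) = x^2 on [0, 1]. All names are illustrative; the point is only that a sum of simple units can approximate a curve, and adding more units makes the approximation arbitrarily good.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Knots where the approximation will touch the target function f(x) = x^2.
knots = np.linspace(0.0, 1.0, 11)      # 11 knots -> 10 hidden units
targets = knots ** 2

# Slope of each linear piece, and how much the slope changes at each knot.
slopes = np.diff(targets) / np.diff(knots)
slope_changes = np.diff(slopes, prepend=0.0)

def tiny_network(x):
    # Hidden unit i computes relu(x - knots[i]); the output layer weights
    # each unit by the slope change at its knot.
    hidden = relu(x[:, None] - knots[:-1][None, :])
    return targets[0] + hidden @ slope_changes

x = np.linspace(0.0, 1.0, 5)
print(np.round(tiny_network(x), 3))    # close to x**2 at every point
print(np.round(x ** 2, 3))
```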
WHY DEPTH MATTERS
+-----------------------------------------------------------+
| |
| Shallow network (1-2 layers): |
| - Can approximate functions |
| - May need exponentially many neurons for some functions |
| |
| Deep network (many layers): |
| - Learns hierarchical features |
| - Layer 1: edges |
| - Layer 2: shapes |
| - Layer 3: objects |
| - Layer 4: scenes |
| |
| Each layer builds on previous. Composition is powerful. |
| |
+-----------------------------------------------------------+
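A depth sketch in the same spirit (the layer sizes and random weights are arbitrary): a deep network is just repeated composition of "linear transform + nonlinearity", with each layer consuming the previous layer's features.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    return np.maximum(0.0, x @ w + b)   # one layer: linear map + ReLU

# Layer sizes: 8 input features -> 16 -> 16 -> 4 outputs.
sizes = [8, 16, 16, 4]
params = [(0.1 * rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(1, 8))             # one example with 8 features
for w, b in params:
    x = layer(x, w, b)                  # composition: layer(layer(layer(x)))

print(x.shape)                          # (1, 4)
```

(A real network usually skips the nonlinearity on its final layer so the output can take any value, for example raw logits.)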
The Bias-Variance Tradeoff
Two failure modes when learning from data:
UNDERFITTING (HIGH BIAS)
+-----------------------------------------------------------+
| |
| Training error: High |
| Test error: High |
| Problem: Model too simple, misses patterns |
| Solution: More capacity (bigger model, more features) |
| |
+-----------------------------------------------------------+
OVERFITTING (HIGH VARIANCE)
+-----------------------------------------------------------+
| |
| Training error: Low (perfect!) |
| Test error: High (fails on new data) |
| Problem: Model memorized noise, not signal |
| Solution: Regularization, more data, simpler model |
| |
+-----------------------------------------------------------+
THE SWEET SPOT
+-----------------------------------------------------------+
| |
|   Error |
|     |  \ |
|     |   \   Test error |
|     |    \                               ____/ |
|     |     \                         ____/ |
|     |      \_____              ____/ |
|     |            \____________/    <-- Sweet spot |
|     |  \ |
|     |   \____   Training error |
|     |        \___________________________________ |
|     +---------------------------------------------> |
|                     Model Complexity |
| |
+-----------------------------------------------------------+
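A runnable illustration of the whole tradeoff: fitting polynomials of increasing degree to noisy data with NumPy. The degrees, noise level, and seed are arbitrary choices, and NumPy may warn that the highest-degree fit is poorly conditioned, which is itself a symptom of too much capacity for the data.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=n)   # signal + noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)            # "training"
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typical pattern: degree 1 underfits (both errors high), a middle degree sits
# near the sweet spot, and degree 15 pushes training error down while test
# error climbs back up -- the curve in the box above.
```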
Vocabulary Reference
| Term | Definition |
|---|---|
| Model | Mathematical function mapping inputs to outputs |
| Parameters | Learnable values inside the model (weights + biases) |
| Weights | Parameters in connections between neurons |
| Bias | Offset parameter added to weighted sum |
| Features | Measurable properties of input data |
| Logits | Raw, unnormalized output scores |
| Softmax | Converts logits to probabilities (sum to 1) |
| Training | Adjusting parameters to minimize error |
| Inference | Using trained model to make predictions |
| Loss | Measure of how wrong predictions are |
| Gradient | Direction to adjust parameters to reduce loss |
| Epoch | One pass through entire training dataset |
When This Matters
| Situation | What to know |
|---|---|
| Discussing model size | Parameters = capacity, larger = more memory |
| Debugging generation | Temperature affects logit sampling |
| Understanding training | Forward pass -> loss -> backward pass -> update |
| Production deployment | Inference only, no training overhead |
| Model selection | Bias-variance tradeoff guides complexity choice |