Attention to Generation - Producing Text Token by Token
Deep dive into text generation: the generation pipeline, temperature and sampling, decoding strategies, and why deterministic generation doesn't exist
Building On Previous Knowledge
In the previous progression, you learned how attention lets tokens incorporate information from context. The transformer processes the entire sequence, and each position ends up with a rich representation that “knows about” the whole input.
But there’s still a gap: how does a rich vector become the next word?
This progression bridges that gap. After all the attention and feedforward layers, the model outputs a probability distribution over the entire vocabulary. Generation is the process of sampling from that distribution, one token at a time.
What Goes Wrong Without This:
Symptom: Your LLM-powered app gives different answers every time.
Cause: Temperature > 0 introduces randomness. This is a feature,
not a bug—but you might not want it for your use case.
Symptom: Model outputs are repetitive and boring.
Cause: You set temperature = 0 (greedy decoding).
Model always picks highest probability = mode collapse.
Symptom: LLM generates coherent first paragraphs, then rambles.
Cause: Autoregressive generation accumulates errors.
Each token conditions on previous (possibly wrong) tokens.
The Generation Pipeline
Text generation is autoregressive: generate one token, append it, generate the next.
+------------------------------------------------------------------+
| GENERATION PIPELINE |
+------------------------------------------------------------------+
| |
| Input: "The capital of France is" |
| |
| Step 1: Tokenize |
| → [464, 3139, 286, 4881, 318] |
| |
| Step 2: Forward pass through transformer |
| Input embeddings → Attention layers → FFN layers |
| → Final hidden states for each position |
| |
| Step 3: Project to vocabulary (LM head) |
| Last position's hidden state → (vocab_size,) logits |
| hidden_state @ W_vocab → [2.3, -1.1, 0.5, ..., 1.8] |
| ↑ |
| 50,257 values (one per token) |
| |
| Step 4: Convert to probabilities |
| softmax(logits / temperature) → probabilities |
| [0.001, 0.0002, 0.3, ..., 0.001] |
| ↑ |
| Sum = 1.0, each value ∈ [0,1] |
| |
| Step 5: Sample next token |
| Based on probabilities → token "Paris" (ID 6342) |
| |
| Step 6: Repeat from Step 2 with extended sequence |
| New input: "The capital of France is Paris" |
| → Generate next token... |
| |
+------------------------------------------------------------------+
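The loop below is a minimal sketch of this pipeline, assuming PyTorch and the Hugging Face transformers library, with GPT-2 as a stand-in model (the token IDs in the diagram are GPT-2's). Each iteration runs Steps 2 through 6: forward pass, softmax over the last position, sample, append.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(5):  # generate 5 tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1, :]              # Step 3: last position only
    probs = torch.softmax(next_token_logits, dim=-1)  # Step 4: logits -> probabilities
    next_token = torch.multinomial(probs, num_samples=1)               # Step 5: sample
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1) # Step 6: append

print(tokenizer.decode(input_ids[0]))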
Logits to Probabilities
The model outputs raw scores (logits), not probabilities:
Logits: [2.3, -1.1, 0.5, 4.1, -0.3, ...]
↑ ↑
"the" "Paris"
Softmax converts to probabilities:
P(token_i) = exp(logit_i) / Σ exp(logit_j)
After softmax: [0.02, 0.001, 0.004, 0.85, 0.002, ...]
↑
"Paris" = 85% probability
The token with highest logit gets highest probability.
But it's not 100%—other tokens have non-zero chance.
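A quick numeric check with a made-up five-token vocabulary (a real model has roughly 50,000 entries, so the exact numbers differ slightly from the example above):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.3, -1.1, 0.5, 4.1, -0.3])  # illustrative values
probs = F.softmax(logits, dim=-1)

print(probs)           # roughly [0.14, 0.005, 0.02, 0.83, 0.01]
print(probs.sum())     # 1.0 -- a proper probability distribution
print(probs.argmax())  # index 3: highest logit wins, but never with certainty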
Temperature: Controlling Randomness
Temperature scales logits before softmax, controlling distribution “sharpness”:
adjusted_logits = logits / temperature
+------------------------------------------------------------------+
| Temperature Effects |
+------------------------------------------------------------------+
| |
| T = 0.0 (or very small): |
|       As T → 0, softmax collapses onto the highest-logit token   |
| Always pick highest probability token (greedy) |
| Output: deterministic, conservative, may be repetitive |
| |
| T = 1.0 (default): |
| logits unchanged |
| Sample according to trained distribution |
| Output: balanced creativity and coherence |
| |
| T = 2.0 (high): |
| logits / 2 → flatter distribution |
| Low probability tokens become more likely |
| Output: more random, creative, potentially incoherent |
| |
+------------------------------------------------------------------+
Visual:
                        T=0.1    T=1.0    T=2.0
  Token A (logit 4):    ~100%      84%      63%
  Token B (logit 2):      ~0%      11%      23%
  Token C (logit 1):      ~0%       4%      14%
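These percentages can be reproduced directly by applying the temperature-scaled softmax to the three logits in the table (values rounded):

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 1.0])  # tokens A, B, C

for T in (0.1, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 2) for p in probs.tolist()]}")

# T=0.1: [1.0, 0.0, 0.0]     -- essentially one-hot (greedy-like)
# T=1.0: [0.84, 0.11, 0.04]  -- the distribution as trained
# T=2.0: [0.63, 0.23, 0.14]  -- flatter, low-probability tokens gain ground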
Practical guidance:
+------------------+---------------+-------------------------------+
| Temperature | Use Case | Behavior |
+------------------+---------------+-------------------------------+
| 0.0 - 0.3 | Factual Q&A | Conservative, deterministic |
| 0.5 - 0.7 | Code gen | Balanced, mostly predictable |
| 0.7 - 1.0 | Creative | More varied, still coherent |
| 1.0 - 1.5 | Brainstorm | High variety, some wild |
+------------------+---------------+-------------------------------+
Decoding Strategies
Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.
Greedy Decoding
Always pick highest probability token:
P = [0.02, 0.001, 0.85, 0.004, ...]
↓
Select: token_2 (0.85)
Pros: Deterministic, fast (no sampling)
Cons: Repetitive, no exploration, misses better paths
"The best the best the best the best..." ← mode collapse
Top-K Sampling
Only consider the K most likely tokens:
Original: [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
└── many tiny probabilities
top_k=3: [0.4, 0.3, 0.15, 0, 0, 0, 0, ...]
└─────────────┘
Renormalize these to sum to 1.0
After renormalization: [0.47, 0.35, 0.18, 0, 0, ...]
└── others impossible
Sample from reduced distribution.
Benefit: Prevents very unlikely tokens from being chosen.
Risk: K is fixed, but vocabulary distribution varies.
Sometimes top-5 is enough. Sometimes top-50 is needed.
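A small sketch of the top-k filter on the probabilities above; the renormalized values match the worked example:

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])

k = 3
topk_probs, topk_indices = torch.topk(probs, k)  # keep only the 3 largest
renormalized = topk_probs / topk_probs.sum()     # rescale so they sum to 1.0

print(renormalized)  # roughly [0.47, 0.35, 0.18]
next_token = topk_indices[torch.multinomial(renormalized, num_samples=1)]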
Top-P (Nucleus) Sampling
Include tokens until cumulative probability reaches P:
Sorted probs: [0.4, 0.3, 0.15, 0.08, 0.04, ...]
Cumulative: [0.4, 0.7, 0.85, 0.93, 0.97, ...]
↑
Top-p=0.9 → include up to 0.93
Adaptive: includes more tokens when distribution is flat,
fewer when one token dominates.
This is often better than top-k because it adapts to context.
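A matching sketch of the nucleus cutoff on the same (already sorted) probabilities:

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])  # sorted, descending
p = 0.9

cumulative = torch.cumsum(probs, dim=0)          # [0.40, 0.70, 0.85, 0.93, ...]
cutoff = int((cumulative < p).sum()) + 1         # smallest set whose mass reaches p
nucleus = probs[:cutoff] / probs[:cutoff].sum()  # renormalize the kept tokens

print(cutoff)   # 4 -- cumulative mass 0.93 >= 0.9, matching the example above
print(nucleus)  # distribution to sample from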
Combining Strategies
Real systems often combine:
1. Apply temperature
2. Apply top-k (e.g., k=50)
3. Apply top-p (e.g., p=0.9)
4. Sample from result
Each filter removes tokens that "shouldn't" be generated.
Order matters: temperature reshapes the probabilities, so it changes which tokens clear the top-p cutoff (top-k is unaffected, since scaling preserves the ranking).
Practical Generation Settings
Recommended settings for common use cases:
Factual/Deterministic
{
"temperature": 0,
"top_p": 1,
"max_tokens": 256
}
Or with slight randomness:
{
"temperature": 0.2,
"top_p": 0.95,
"max_tokens": 256
}
Code Generation
{
"temperature": 0.3,
"top_p": 0.95,
"max_tokens": 1024
}
Creative Writing
{
"temperature": 0.9,
"top_p": 0.95,
"max_tokens": 2048
}
When in Doubt
{
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 512
}
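As a rough illustration, here is how the "when in doubt" settings map onto Hugging Face's generate() API. The model choice and prompt are placeholders, and hosted APIs name things slightly differently (max_tokens rather than max_new_tokens):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Write a short note about autumn:", return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    do_sample=True,       # sampling must be enabled for temperature/top_p to apply
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))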
The Autoregressive Problem
Generation has a fundamental limitation:
Each token conditions only on PREVIOUS tokens.
The model can't "look ahead" and fix mistakes.
Step 1: "The answer is"
Step 2: "The answer is definitely"
Step 3: "The answer is definitely 42" ← committed to "definitely"
Step 4: "The answer is definitely 42."
What if "42" was wrong?
Model already said "definitely"—can't take it back.
Error propagation through the sequence.
Consequences:
1. Early mistakes compound
Wrong direction at step 10 affects all subsequent tokens.
2. Hallucination momentum
Once model starts hallucinating, it continues the pattern.
"The author of Hamlet was Francis Bacon..." continues confidently.
3. No self-correction without explicit mechanisms
Model doesn't naturally "notice" it's wrong.
Chain-of-thought helps but doesn't eliminate the problem.
Why “Deterministic” Generation Doesn’t Exist
Even with temperature = 0, outputs can vary:
+------------------------------------------------------------------+
| Sources of Non-Determinism |
+------------------------------------------------------------------+
| |
| 1. FLOATING-POINT PRECISION |
| Different GPUs/CPUs compute slightly differently |
| exp(12.345) on GPU A ≠ exp(12.345) on GPU B (last bits) |
| When tokens have similar probabilities, winner can change |
| |
| 2. BATCHING EFFECTS |
| Same prompt in different batch positions → different padding |
| Attention patterns slightly affected |
| |
| 3. API VERSION CHANGES |
| Provider updates model weights, quantization, infrastructure |
| "Same model" may not be same computation |
| |
| 4. PARALLEL COMPUTATION ORDER |
| Operations aren't strictly ordered in parallel execution |
| (a + b) + c vs a + (b + c) → floating point differs |
| |
+------------------------------------------------------------------+
Practical implications:
- Don't assume same prompt → same output, ever
- If you need reproducibility, cache outputs
- Test with multiple runs, not just one
- Use seed parameter if available (helps but doesn't guarantee)
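Source 4 is easy to see even with plain Python floats; the same last-bit drift, spread across billions of GPU operations, is what occasionally flips near-tied logits:

a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False -- addition order changes the result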
Stopping Generation
How does the model know when to stop?
+------------------------------------------------------------------+
| STOPPING CONDITIONS |
+------------------------------------------------------------------+
| |
| 1. EOS TOKEN |
| Model generates <|endoftext|> or equivalent |
| Trained to output this when "done" |
| |
| 2. MAX TOKENS |
| Hit the limit you specified (max_tokens=256) |
| May cut off mid-sentence |
| |
| 3. STOP SEQUENCES |
| Custom strings that terminate generation |
| stop=["\n", "Human:", "```"] |
| |
| 4. TIMEOUT |
| API or system timeout (less common) |
| |
+------------------------------------------------------------------+
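The sketch below layers the first three conditions onto a simple greedy loop (timeouts live outside the loop). It assumes model and tokenizer are loaded as in the earlier snippet, and the stop strings are illustrative:

import torch

def generate_with_stops(model, tokenizer, prompt, max_tokens=256, stop=("\n\n",)):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = input_ids.shape[1]
    for _ in range(max_tokens):                            # 2. MAX TOKENS
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1, :]
        next_token = logits.argmax()                       # greedy, for brevity
        if next_token.item() == tokenizer.eos_token_id:    # 1. EOS TOKEN
            break
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)
        text = tokenizer.decode(input_ids[0, prompt_len:])
        for s in stop:                                     # 3. STOP SEQUENCES
            if s in text:
                return text.split(s)[0]
    return tokenizer.decode(input_ids[0, prompt_len:])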
Code Example
Sampling from a probability distribution:
import torch
import torch.nn.functional as F

def sample_next_token(
    logits: torch.Tensor,       # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
) -> int:
    """
    Sample the next token using temperature, top-k, and top-p.
    """
    # Step 1: Apply temperature
    if temperature > 0:
        # Division creates a copy, so the caller's logits are left untouched
        logits = logits / temperature
    else:
        # Greedy: return argmax
        return logits.argmax().item()

    # Step 2: Apply top-k (mask everything below the k-th largest logit)
    if top_k > 0:
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = float('-inf')

    # Step 3: Apply top-p (nucleus)
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Mask tokens whose cumulative probability exceeds the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the mask right so the token that crosses the threshold stays
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        # Always keep at least the most likely token
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = float('-inf')

    # Step 4: Convert to probabilities and sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    return next_token.item()

# Example usage
vocab_size = 50257
logits = torch.randn(vocab_size)

# Different settings for different use cases
factual_token = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative_token = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy_token = sample_next_token(logits, temperature=0)  # always argmax
Key Takeaways
1. Generation is autoregressive: one token at a time, each conditioned on previous
2. Model outputs logits → softmax → probabilities over vocabulary
3. Temperature controls distribution sharpness:
- T=0: deterministic (greedy)
- T=1: as trained
- T>1: more random
4. Top-k and top-p filter the token distribution:
- Top-k: only consider k most likely
- Top-p: consider tokens until cumulative probability reaches p
5. Autoregressive generation can't look ahead—early errors propagate
6. True determinism doesn't exist due to floating-point and infrastructure variations
Verify Your Understanding
Before proceeding, you should be able to:
Explain what happens when you set temperature to 0 vs 1 vs 2 — What changes mathematically? What changes practically?
Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, what tokens can be selected? Calculate the renormalized probabilities.
Why does autoregressive generation sometimes produce repetitive text? What’s the mechanism, and how do sampling strategies help?
Your application needs consistent outputs for the same input. What can you do? What can’t you guarantee?
Explain the difference between top-k and top-p. When might top-p be better?
What’s Next
After this, you can:
- Continue → Generation → Retrieval — grounding generation in external facts
- Apply → Build LLM applications with appropriate generation settings