
AI Engineering Series

Attention to Generation - Producing Text Token by Token

Deep dive into text generation: the generation pipeline, temperature and sampling, decoding strategies, and why deterministic generation doesn't exist

Building On Previous Knowledge

In the previous progression, you learned how attention lets tokens incorporate information from context. The transformer processes the entire sequence, and each position ends up with a rich representation that “knows about” the whole input.

But there’s still a gap: how does a rich vector become the next word?

This progression bridges that gap. After all the attention and feedforward layers, the model outputs a probability distribution over the entire vocabulary. Generation is the process of sampling from that distribution, one token at a time.

What Goes Wrong Without This:

Symptom: Your LLM-powered app gives different answers every time.
Cause: Temperature > 0 introduces randomness. This is a feature,
       not a bug—but you might not want it for your use case.

Symptom: Model outputs are repetitive and boring.
Cause: You set temperature = 0 (greedy decoding).
       Model always picks highest probability = mode collapse.

Symptom: LLM generates coherent first paragraphs, then rambles.
Cause: Autoregressive generation accumulates errors.
       Each token conditions on previous (possibly wrong) tokens.

The Generation Pipeline

Text generation is autoregressive: generate one token, append it, generate the next.

+------------------------------------------------------------------+
|                    GENERATION PIPELINE                            |
+------------------------------------------------------------------+
|                                                                   |
|  Input: "The capital of France is"                                |
|                                                                   |
|  Step 1: Tokenize                                                 |
|     [464, 3139, 286, 4881, 318]                                  |
|                                                                   |
|  Step 2: Forward pass through transformer                         |
|    Input embeddings → Attention layers → FFN layers               |
|    → Final hidden states for each position                        |
|                                                                   |
|  Step 3: Project to vocabulary (LM head)                          |
|    Last position's hidden state → (vocab_size,) logits            |
|    hidden_state @ W_vocab → [2.3, -1.1, 0.5, ..., 1.8]            |
|                         ↑ 50,257 values (one per token)           |
|                                                                   |
|  Step 4: Convert to probabilities                                 |
|    softmax(logits / temperature) → probabilities                  |
|    [0.001, 0.0002, 0.3, ..., 0.001]                               |
|                         ↑ sum = 1.0, each value ∈ [0, 1]          |
|                                                                   |
|  Step 5: Sample next token                                        |
|    Based on probabilities → token "Paris" (ID 6342)               |
|                                                                   |
|  Step 6: Repeat from Step 2 with extended sequence                |
|    New input: "The capital of France is Paris"                    |
|    → Generate next token...                                       |
|                                                                   |
+------------------------------------------------------------------+
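
The same loop in code. This is a minimal sketch, assuming the Hugging Face transformers library and the small "gpt2" checkpoint, neither of which this article otherwise depends on; swap in any causal LM:

# Sketch only: assumes `pip install transformers torch` and the "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids  # Step 1

for _ in range(10):                                         # generate up to 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits                    # Steps 2-3: (1, seq_len, vocab)
    next_logits = logits[0, -1]                             # last position's logits
    probs = torch.softmax(next_logits / 0.7, dim=-1)        # Step 4: temperature + softmax
    next_id = torch.multinomial(probs, num_samples=1)       # Step 5: sample
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)  # Step 6: append, repeat
    if next_id.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))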

Logits to Probabilities

The model outputs raw scores (logits), not probabilities:

Logits: [2.3, -1.1, 0.5, 4.1, -0.3, ...]
         ↑               ↑
       "the"           "Paris"

Softmax converts to probabilities:
  P(token_i) = exp(logit_i) / Σ exp(logit_j)

After softmax: [0.02, 0.001, 0.004, 0.85, 0.002, ...]
                                    ↑
                              "Paris" = 85% probability

The token with highest logit gets highest probability.
But it's not 100%—other tokens have non-zero chance.
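
A quick sketch of the conversion in PyTorch, using made-up logits:

import torch

logits = torch.tensor([2.3, -1.1, 0.5, 4.1, -0.3])   # raw scores from the LM head
probs = torch.softmax(logits, dim=-1)                 # P(token_i) = exp(l_i) / sum_j exp(l_j)

print(probs)                # ≈ tensor([0.137, 0.005, 0.023, 0.826, 0.010])
print(probs.sum())          # 1.0 -- softmax always normalizes
print(int(probs.argmax()))  # 3: the highest logit wins, but not with 100% probability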

Temperature: Controlling Randomness

Temperature scales logits before softmax, controlling distribution “sharpness”:

adjusted_logits = logits / temperature

+------------------------------------------------------------------+
|  Temperature Effects                                              |
+------------------------------------------------------------------+
|                                                                   |
|  T = 0.0 (or very small):                                         |
|    As T → 0, softmax collapses onto the highest-logit token       |
|    Always pick highest probability token (greedy)                 |
|    Output: deterministic, conservative, may be repetitive         |
|                                                                   |
|  T = 1.0 (default):                                               |
|    logits unchanged                                               |
|    Sample according to trained distribution                       |
|    Output: balanced creativity and coherence                      |
|                                                                   |
|  T = 2.0 (high):                                                  |
|    logits / 2 → flatter distribution                              |
|    Low probability tokens become more likely                      |
|    Output: more random, creative, potentially incoherent          |
|                                                                   |
+------------------------------------------------------------------+

Visual (softmax of logits 4, 2, 1):
                     T=0.1        T=1.0        T=2.0
  Token A (logit 4):  >99%         84%          63%
  Token B (logit 2):   <1%         11%          23%
  Token C (logit 1):   <1%          4%          14%
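
You can verify the table with a few lines of PyTorch (a quick sketch using the same three logits):

import torch

logits = torch.tensor([4.0, 2.0, 1.0])   # tokens A, B, C from the table

for temperature in (0.1, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# 0.1 -> [1.0, 0.0, 0.0]        essentially greedy
# 1.0 -> [0.844, 0.114, 0.042]  the distribution as trained
# 2.0 -> [0.629, 0.231, 0.14]   noticeably flatter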

Practical guidance:

+------------------+---------------+-------------------------------+
|  Temperature     |  Use Case     |  Behavior                     |
+------------------+---------------+-------------------------------+
|  0.0 - 0.3       |  Factual Q&A  |  Conservative, deterministic  |
|  0.5 - 0.7       |  Code gen     |  Balanced, mostly predictable |
|  0.7 - 1.0       |  Creative     |  More varied, still coherent  |
|  1.0 - 1.5       |  Brainstorm   |  High variety, some wild      |
+------------------+---------------+-------------------------------+

Decoding Strategies

Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.

Greedy Decoding

Always pick highest probability token:

P = [0.02, 0.001, 0.85, 0.004, ...]
                  ↑
Select: token_2 (0.85)

Pros: Deterministic, fast (no sampling)
Cons: Repetitive, no exploration, misses better paths

"The best the best the best the best..."   mode collapse

Top-K Sampling

Only consider the K most likely tokens:

Original: [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
                                              └── many tiny probabilities

top_k=3: [0.4, 0.3, 0.15, 0, 0, 0, 0, ...]
         └─────────────┘
         Renormalize these to sum to 1.0

After renormalization: [0.47, 0.35, 0.18, 0, 0, ...]
                                           └── others impossible

Sample from reduced distribution.

Benefit: Prevents very unlikely tokens from being chosen.
Risk: K is fixed, but vocabulary distribution varies.
  Sometimes top-5 is enough. Sometimes top-50 is needed.
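
The filtering and renormalization above are mechanical; a minimal sketch with the same numbers:

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
k = 3

top_vals, top_idx = torch.topk(probs, k)    # the k most likely tokens
filtered = torch.zeros_like(probs)
filtered[top_idx] = top_vals
filtered = filtered / filtered.sum()        # renormalize to sum to 1.0

print(filtered)   # ≈ tensor([0.47, 0.35, 0.18, 0.00, 0.00, 0.00, 0.00])
next_token = torch.multinomial(filtered, num_samples=1)   # sample from the reduced set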

Top-P (Nucleus) Sampling

Include tokens until cumulative probability reaches P:

Sorted probs:  [0.4, 0.3, 0.15, 0.08, 0.04, ...]
Cumulative:    [0.4, 0.7, 0.85, 0.93, 0.97, ...]
                                ↑
               Top-p=0.9 → include up to cumulative 0.93

Adaptive: includes more tokens when distribution is flat,
          fewer when one token dominates.

This is often better than top-k because it adapts to context.
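
A minimal sketch of the nucleus filter on the same numbers, using the same shift-and-mask trick as the full code example later in this article:

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.03])  # sorted, sums to 1.0
top_p = 0.9

cumulative = torch.cumsum(probs, dim=-1)   # [0.40, 0.70, 0.85, 0.93, 0.97, 1.00]
remove = cumulative > top_p                # [F, F, F, T, T, T]
remove[1:] = remove[:-1].clone()           # shift right: keep the token that crosses p
remove[0] = False

nucleus = probs.clone()
nucleus[remove] = 0.0
nucleus = nucleus / nucleus.sum()          # renormalize the surviving tokens
print(nucleus)   # ≈ tensor([0.4301, 0.3226, 0.1613, 0.0860, 0.0000, 0.0000])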

Combining Strategies

Real systems often combine:

  1. Apply temperature
  2. Apply top-k (e.g., k=50)
  3. Apply top-p (e.g., p=0.9)
  4. Sample from result

Each filter removes tokens that "shouldn't" be generated.
Order matters: temperature affects which tokens pass top-k.

Practical Generation Settings

Recommended settings for common use cases:

Factual/Deterministic

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 256
}

Or with slight randomness:

{
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 256
}

Code Generation

{
  "temperature": 0.3,
  "top_p": 0.95,
  "max_tokens": 1024
}

Creative Writing

{
  "temperature": 0.9,
  "top_p": 0.95,
  "max_tokens": 2048
}

When in Doubt

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512
}
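
These JSON blocks map directly onto request parameters in most hosted LLM APIs. As an illustration only (assuming the OpenAI Python client here; this article is not tied to any provider, and other APIs expose near-identical temperature, top_p, and max_tokens parameters), the "when in doubt" settings might be passed like this:

# Illustration only: assumes `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",                     # hypothetical model choice
    messages=[{"role": "user", "content": "Summarize the rules of chess."}],
    temperature=0.7,                         # the "when in doubt" settings above
    top_p=0.9,
    max_tokens=512,
)
print(response.choices[0].message.content)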

The Autoregressive Problem

Generation has a fundamental limitation:

Each token conditions only on PREVIOUS tokens.
The model can't "look ahead" and fix mistakes.

Step 1: "The answer is"
Step 2: "The answer is definitely"
Step 3: "The answer is definitely 42"  ← committed to "definitely"
Step 4: "The answer is definitely 42."

What if "42" was wrong?
Model already said "definitely"—can't take it back.
Error propagation through the sequence.

Consequences:

1. Early mistakes compound
   Wrong direction at step 10 affects all subsequent tokens.

2. Hallucination momentum
   Once model starts hallucinating, it continues the pattern.
   "The author of Hamlet was Francis Bacon..." continues confidently.

3. No self-correction without explicit mechanisms
   Model doesn't naturally "notice" it's wrong.
   Chain-of-thought helps but doesn't eliminate the problem.

Why “Deterministic” Generation Doesn’t Exist

Even with temperature = 0, outputs can vary:

+------------------------------------------------------------------+
|  Sources of Non-Determinism                                       |
+------------------------------------------------------------------+
|                                                                   |
|  1. FLOATING-POINT PRECISION                                      |
|     Different GPUs/CPUs compute slightly differently              |
|     exp(12.345) on GPU A ≠ exp(12.345) on GPU B (last bits)       |
|     When tokens have similar probabilities, winner can change     |
|                                                                   |
|  2. BATCHING EFFECTS                                              |
|     Same prompt in different batch positions → different padding  |
|     Attention patterns slightly affected                          |
|                                                                   |
|  3. API VERSION CHANGES                                           |
|     Provider updates model weights, quantization, infrastructure  |
|     "Same model" may not be same computation                      |
|                                                                   |
|  4. PARALLEL COMPUTATION ORDER                                    |
|     Operations aren't strictly ordered in parallel execution      |
|     (a + b) + c vs a + (b + c) → floating-point results differ    |
|                                                                   |
+------------------------------------------------------------------+
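
Point 4 is easy to demonstrate on any machine, no GPU required (a quick sketch in plain Python):

# Floating-point addition is not associative: summing the same numbers in a
# different order gives a (very slightly) different result.
import random

values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))

print(forward == backward)       # usually False
print(abs(forward - backward))   # tiny but non-zero -- enough to flip a near-tie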

Practical implications:

- Don't assume same prompt → same output, ever
- If you need reproducibility, cache outputs
- Test with multiple runs, not just one
- Use seed parameter if available (helps but doesn't guarantee)
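
Seeding does make sampling repeatable on a single machine and software stack; the caveats above are about crossing hardware, batching, and provider boundaries. A quick local sketch in PyTorch:

import torch

probs = torch.tensor([0.7, 0.2, 0.1])

torch.manual_seed(42)
a = torch.multinomial(probs, num_samples=5, replacement=True)

torch.manual_seed(42)
b = torch.multinomial(probs, num_samples=5, replacement=True)

print(torch.equal(a, b))   # True: same seed, same stack -> same samples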

Stopping Generation

How does the model know when to stop?

+------------------------------------------------------------------+
|  STOPPING CONDITIONS                                              |
+------------------------------------------------------------------+
|                                                                   |
|  1. EOS TOKEN                                                     |
|     Model generates <|endoftext|> or equivalent                   |
|     Trained to output this when "done"                            |
|                                                                   |
|  2. MAX TOKENS                                                    |
|     Hit the limit you specified (max_tokens=256)                  |
|     May cut off mid-sentence                                      |
|                                                                   |
|  3. STOP SEQUENCES                                                |
|     Custom strings that terminate generation                      |
|     stop=["\n", "Human:", "```"]                                  |
|                                                                   |
|  4. TIMEOUT                                                       |
|     API or system timeout (less common)                           |
|                                                                   |
+------------------------------------------------------------------+
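
A sketch of how the first three conditions are typically checked in a hand-rolled loop. The step function, decode function, and tiny vocabulary below are hypothetical stand-ins so the example runs on its own; a real loop would call the model and a real tokenizer instead:

# Sketch of stopping checks around a single-token generator.
EOS_ID = 50256  # GPT-2's <|endoftext|> id

def generate(step, prompt_ids, decode, max_new_tokens=256, stop_sequences=()):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):                      # 2. max tokens cap
        tok = step(ids)
        if tok == EOS_ID:                                # 1. EOS token
            break
        ids.append(tok)
        new_text = decode(ids[len(prompt_ids):])
        if any(s in new_text for s in stop_sequences):   # 3. stop sequences
            break
    return ids

# Toy "model" and "tokenizer" for illustration only.
script = iter([3666, 3280, 318, 6342, 13, EOS_ID])
toy_vocab = {3666: "My", 3280: " answer", 318: " is", 6342: " Paris", 13: "."}

result = generate(
    step=lambda ids: next(script),
    prompt_ids=[464, 3139],
    decode=lambda ids: "".join(toy_vocab.get(i, "?") for i in ids),
    stop_sequences=["."],
)
print(result)   # [464, 3139, 3666, 3280, 318, 6342, 13] -- stopped at "."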

Code Example

Sampling from a probability distribution:

import torch
import torch.nn.functional as F

def sample_next_token(
    logits: torch.Tensor,  # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
) -> int:
    """
    Sample next token using temperature, top-k, and top-p.
    """
    # Step 1: Apply temperature
    if temperature > 0:
        logits = logits / temperature
    else:
        # Greedy: return argmax
        return logits.argmax().item()

    # Step 2: Apply top-k
    if top_k > 0:
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = float('-inf')

    # Step 3: Apply top-p (nucleus)
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Keep at least one token
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = float('-inf')

    # Step 4: Convert to probabilities and sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)

    return next_token.item()

# Example usage
vocab_size = 50257
logits = torch.randn(vocab_size)

# Different settings for different use cases
factual_token = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative_token = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy_token = sample_next_token(logits, temperature=0)  # always argmax

Key Takeaways

1. Generation is autoregressive: one token at a time, each conditioned on previous

2. Model outputs logits → softmax → probabilities over the vocabulary

3. Temperature controls distribution sharpness:
   - T=0: deterministic (greedy)
   - T=1: as trained
   - T>1: more random

4. Top-k and top-p filter the token distribution:
   - Top-k: only consider k most likely
   - Top-p: consider tokens until cumulative probability reaches p

5. Autoregressive generation can't look ahead—early errors propagate

6. True determinism doesn't exist due to floating-point and infrastructure variations

Verify Your Understanding

Before proceeding, you should be able to:

Explain what happens when you set temperature to 0 vs 1 vs 2 — What changes mathematically? What changes practically?

Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, what tokens can be selected? Calculate the renormalized probabilities.

Why does autoregressive generation sometimes produce repetitive text? What’s the mechanism, and how do sampling strategies help?

Your application needs consistent outputs for the same input. What can you do? What can’t you guarantee?

Explain the difference between top-k and top-p. When might top-p be better?


What’s Next

After this, you can:

  • Continue → Generation → Retrieval — grounding generation in external facts
  • Apply → Build LLM applications with appropriate generation settings