The Token Sampling Pipeline

Read this as: which tokens are still eligible right before the draw?

Failure Trap: Treating temperature, top-k, and top-p as independent knobs instead of an ordered filter chain.
Decision Rule: Apply temperature first, filter the tail, renormalize, then sample or argmax deliberately.

Start with Logits

The model does not output probabilities. Its final vocabulary projection produces logits: raw scores, one for each possible next token.

Higher logit means the token is more favored by the model
The numbers can be any real value, not a 0-to-1 probability
The sampling pipeline turns this score vector into one token
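
To make the distinction concrete, here is a minimal Python sketch. The five-token vocabulary and the logit values are invented for illustration, and `softmax` is a hand-written helper, not a library function. It shows that logits can be negative or larger than 1, while a plain softmax over them yields values that sum to 1.

```python
import math

# Hypothetical 5-token vocabulary with made-up logits (illustration only).
vocab = ["the", "a", "cat", "dog", "zebra"]
logits = [2.0, 1.0, 0.5, -1.0, -3.0]


def softmax(scores):
    """Convert raw scores into probabilities, subtracting the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# The most favored token is the one with the highest logit.
best = max(range(len(logits)), key=lambda i: logits[i])

for token, logit, prob in zip(vocab, logits, probs):
    print(f"{token:>6}  logit={logit:+.1f}  prob={prob:.3f}")
```

The ranking of tokens is the same before and after softmax; only the scale changes.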

Divide by Temperature

Temperature rescales the logits before probability conversion. Lower temperatures sharpen the distribution; higher temperatures flatten it.

T < 1 makes likely tokens dominate more strongly
T > 1 gives lower-ranked tokens more chance
T = 0 cannot be used as a divisor, so it is handled as a special greedy argmax branch, not sampling
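
The effect of temperature can be sketched in a few lines of Python. The function names `apply_temperature` and `greedy` and the example logits are assumptions made for this illustration. The sketch divides every logit by T before softmax and takes a separate argmax branch for the T = 0 case.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities, subtracting the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (must be strictly positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy() for T = 0")
    return [x / temperature for x in logits]


def greedy(logits):
    """T = 0 branch: return the index of the highest logit, with no randomness."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
sharp = softmax(apply_temperature(logits, 0.5))  # T < 1
base = softmax(apply_temperature(logits, 1.0))   # T = 1
flat = softmax(apply_temperature(logits, 2.0))   # T > 1
```

With these values the top token's probability shrinks as T grows, and the last token's probability rises.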

Clip with Top-k

Top-k applies a fixed cap: keep only the highest-scoring k tokens and remove the rest from consideration.

It cuts off the long tail of weak candidates
The cap is fixed even when the distribution shape changes
Production setups often use it as a coarse guard against very unlikely tokens
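
A minimal sketch of the top-k step, assuming dropped tokens are masked with negative infinity so that a later softmax gives them zero probability. The name `top_k_filter` and the convention that `k <= 0` disables the filter are choices made for this example.

```python
import math


def top_k_filter(logits, k):
    """Keep the k highest logits and mask the rest with -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # nothing to cut
    # Indices ranked by score, highest first; ties go to the lower index
    # because Python's sort is stable.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


filtered = top_k_filter([2.0, 1.0, 0.5, -1.0], k=2)
```

Here exactly two scores survive whatever the shape of the distribution, which is the fixed-cap behavior described above.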

Clip with Top-p

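Top-p, or nucleus sampling, ranks candidates from most to least probable and keeps the smallest leading set whose cumulative probability reaches the chosen threshold p. Because it works on probabilities rather than raw scores, it needs a softmax over the current scores before the cut.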

It keeps fewer tokens when one candidate dominates
It keeps more tokens when the model is uncertain
This adaptive behavior is why top-p is a common default
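
The adaptive behavior can be sketched as follows. This is one straightforward way to write nucleus filtering, not a particular library's implementation. The name `top_p_filter`, the example logits, and the choice to include the token that crosses the threshold are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities, subtracting the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(logits, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)  # the token that crosses the threshold is kept too
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def survivors(filtered):
    """Count how many tokens were not masked out."""
    return sum(1 for x in filtered if x != -math.inf)


peaked = top_p_filter([8.0, 1.0, 0.5, 0.0], p=0.9)  # one dominant token
spread = top_p_filter([1.0, 0.9, 0.8, 0.7], p=0.9)  # near-uniform scores
```

With the same threshold, the peaked example keeps a single token and the near-uniform one keeps all four.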

Softmax Renormalizes

After clipping, softmax converts the surviving scores into probabilities and renormalizes them so the remaining options sum to 1.

Dropped tokens get no probability mass
Survivors split the full probability budget
The result is a valid categorical distribution
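
A short sketch of the renormalization, assuming the earlier filters marked dropped tokens with negative infinity. The scores are made up, and the helper assumes at least one token survived.

```python
import math


def softmax(scores):
    """Softmax over scores that may contain -inf for dropped tokens.

    Assumes at least one finite score remains. math.exp(-inf) is 0.0,
    so masked tokens receive exactly zero probability.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


full = softmax([2.0, 1.0, 0.5, -1.0])                # before clipping
clipped = softmax([2.0, 1.0, -math.inf, -math.inf])  # after keeping two tokens
```

The two survivors keep their relative odds, but each gets a larger share than it had in the unclipped distribution, because the dropped tokens' mass is redistributed.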

Draw One Token

A multinomial draw samples one token from the final distribution. The highest-probability token is the most likely outcome, but it is not guaranteed unless the distribution has collapsed onto a single token.

The selected token is appended to the context
The model runs again to produce the next logit vector
This repeats one token at a time for the whole response
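
The draw and the token-by-token loop can be sketched with Python's standard library. Here `toy_next_probs` is a hypothetical stand-in for the model plus the filtering steps. It returns a fixed distribution so the example runs on its own, whereas a real model would compute new logits from the context at each step. The names `draw_token` and `generate` are also assumptions of this sketch.

```python
import random


def draw_token(probs, rng):
    """Sample one token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_probs, context, steps, rng):
    """Autoregressive loop: get a distribution, draw, append, repeat."""
    tokens = list(context)
    for _ in range(steps):
        probs = next_probs(tokens)             # model + sampling pipeline
        tokens.append(draw_token(probs, rng))  # selected token joins the context
    return tokens


def toy_next_probs(tokens):
    """Hypothetical stand-in: ignores the context; token 3 was filtered out."""
    return [0.7, 0.2, 0.1, 0.0]


rng = random.Random(0)  # seeded so runs are repeatable
output = generate(toy_next_probs, context=[0], steps=5, rng=rng)
```

Token 0 is the most likely pick at every step, though tokens 1 and 2 can still appear. Token 3 has zero probability and is never drawn.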