Read this as: which tokens are still eligible right before the draw?
- Failure Trap: treating temperature, top-k, and top-p as independent knobs instead of an ordered filter chain.
- Decision Rule: apply temperature first, filter the tail, renormalize, then sample or argmax deliberately.
Start with Logits
The model does not output probabilities. Its final vocabulary projection produces logits: raw scores, one for each possible next token.
- Higher logit means the token is more favored by the model
- The numbers can be any real value, not a 0-to-1 probability
- The sampling pipeline turns this score vector into one token
Divide by Temperature
Temperature rescales the logits before probability conversion: every logit is divided by the temperature T. Lower temperatures sharpen the distribution; higher temperatures flatten it.
- T < 1 makes likely tokens dominate more strongly
- T > 1 gives lower-ranked tokens more chance
- T = 0 is a special greedy argmax branch, not sampling
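The effect of temperature can be checked numerically. The following is a minimal sketch in plain Python, not any particular library's implementation; the helper names `softmax` and `apply_temperature` and the three example logits are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide every logit by T. T = 0 cannot be divided by, so it is rejected here
    and would be handled as a separate greedy argmax branch."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T = 0 as greedy argmax")
    return [x / temperature for x in logits]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
base = softmax(logits)                             # T = 1
sharp = softmax(apply_temperature(logits, 0.5))    # T < 1: top token gains mass
flat = softmax(apply_temperature(logits, 2.0))     # T > 1: tail tokens gain mass
```

With these example logits, the top token's probability is largest at T = 0.5, smaller at T = 1, and smallest at T = 2, while the ranking of the tokens never changes.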
Clip with Top-k
Top-k applies a fixed cap: keep only the highest-scoring k tokens and remove the rest from consideration.
- It cuts off the long tail of weak candidates
- The cap is fixed even when the distribution shape changes
- Production examples often use it as a coarse safety filter
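One common way to express the cap is to set every logit outside the top k to negative infinity, so a later softmax assigns it zero probability. The sketch below assumes that masking convention; `top_k_filter` and the example scores are hypothetical names and values.

```python
def top_k_filter(logits, k):
    """Keep the k highest logits; mask the rest with -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # nothing to cut
    # Indices of the k largest logits (ties go to the lower index).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

logits = [1.5, 3.0, -0.5, 2.0]  # hypothetical scores for four tokens
filtered = top_k_filter(logits, 2)
```

Here only the tokens at positions 1 and 3 survive, regardless of how close or far apart the scores are, which is the fixed-cap behavior described above.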
Clip with Top-p
Top-p, or nucleus sampling, sorts candidates from most to least probable and keeps the smallest set of tokens whose cumulative probability reaches the chosen threshold p.
- It keeps fewer tokens when one candidate dominates
- It keeps more tokens when the model is uncertain
- This adaptive behavior is why top-p is a strong default
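The adaptive behavior can be seen by running the same threshold over a peaked and a nearly flat score vector. This is a minimal sketch under the assumption that the token crossing the threshold is itself kept; `top_p_filter` and both example vectors are illustrative, not taken from a specific library.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(logits, p):
    """Keep the smallest high-probability set whose mass reaches p; mask the rest."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)              # the token that crosses the threshold is kept
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

def survivors(filtered):
    """Count tokens that were not masked out."""
    return sum(1 for x in filtered if x != float("-inf"))

peaked = [5.0, 1.0, 0.0, -1.0]     # one candidate dominates
uncertain = [1.0, 0.9, 0.8, 0.7]   # nearly flat distribution
n_peaked = survivors(top_p_filter(peaked, 0.9))
n_uncertain = survivors(top_p_filter(uncertain, 0.9))
```

With p = 0.9, the peaked vector keeps a single token because it alone holds more than 90% of the mass, while the nearly flat vector needs all four tokens to reach the threshold.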
Softmax Renormalizes
After clipping, softmax converts the surviving scores into probabilities and renormalizes them so the remaining options sum to 1.
- Dropped tokens get no probability mass
- Survivors split the full probability budget
- The result is a valid categorical distribution
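The renormalization step can be illustrated with a score vector in which two tokens have already been masked out. The sketch assumes dropped tokens are represented as negative infinity; the example values are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability).
    A score of -inf maps to exp(-inf) = 0.0, so masked tokens get zero mass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits after clipping: tokens 0 and 2 were dropped.
filtered = [float("-inf"), 3.0, float("-inf"), 2.0]
probs = softmax(filtered)
```

The two survivors split the whole budget, roughly 0.73 and 0.27, and the dropped tokens sit at exactly zero, so the list is a valid categorical distribution.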
Draw One Token
A multinomial draw samples one token from the final distribution. The highest-probability token is most likely, but it is not guaranteed unless the distribution has collapsed to argmax.
- The selected token is appended to the context
- The model runs again to produce the next logit vector
- This repeats one token at a time for the whole response
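The draw and the token-by-token loop can be sketched as follows. `toy_model` is a hypothetical stand-in that returns a fixed distribution over three tokens; a real model would compute new logits from the growing context at every step. The weighted draw uses `random.choices` from the Python standard library.

```python
import random

def pick_token(probs, greedy=False, rng=random):
    """Return a token index: argmax if greedy, otherwise one weighted random draw."""
    if greedy:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def toy_model(context):
    """Hypothetical stand-in for a model: ignores context, returns fixed probabilities."""
    return [0.7, 0.2, 0.1]

def generate(context, steps, greedy=False, seed=0):
    """Append one chosen token per step, feeding the longer context back in."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    context = list(context)
    for _ in range(steps):
        probs = toy_model(context)
        context.append(pick_token(probs, greedy=greedy, rng=rng))
    return context

greedy_out = generate([0], steps=5, greedy=True)
sampled_out = generate([0], steps=5, greedy=False)
```

In greedy mode every step picks token 0, the argmax. In sampling mode token 0 is only the most likely outcome, so tokens 1 and 2 can also appear.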