The Forward Pass: Tokens to Next Token

Follow one transformer inference pass: token IDs become vectors, positions are added, hidden states move through repeated blocks, and the final logits are handed to sampling.

Read this as: what path produces the logit vector that sampling consumes?
Failure Trap
Treating sampling as the model, when sampling only consumes the final forward-pass scores.
Decision Rule
Separate representation building in the transformer from the decoding policy after logits.
Figure: Transformer forward pass from token IDs to next-token sampling. Six frames show the canonical inference flow: (1) Token IDs become vectors, (2) Add where each token sits, (3) The stream crosses N blocks, (4) Inside one transformer block (attention, MLP, residual add), (5) Project hidden state to logits through the LM head (unembedding), (6) Sampling picks the next token (example logits: cat 1.2, sat 3.8, mat 5.1; next token "mat").

Token IDs Become Vectors

A model does not receive words directly. The tokenizer has already turned text into integer token IDs, and the first model operation is an embedding lookup: each ID selects a learned vector from the embedding table.

  • Input at this point is a sequence of token IDs.
  • Embedding lookup maps each ID to a dense vector.
  • The result is one vector per token position.
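
The lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the vocabulary size, hidden width, and random table values are assumptions made up for the example, and a trained model would hold learned values in the table instead.

```python
import random

VOCAB_SIZE = 8   # hypothetical toy vocabulary size
D_MODEL = 4      # hypothetical hidden width

rng = random.Random(0)
# Embedding table: one row per token ID. Random stand-ins for learned values.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Return one vector per position by indexing rows of the table."""
    return [list(embedding_table[t]) for t in token_ids]

token_ids = [3, 5, 3]        # the tokenizer's output, not raw text
vectors = embed(token_ids)   # 3 positions, each a D_MODEL-wide vector
```

Note that the two occurrences of ID 3 receive identical vectors at this stage; nothing yet distinguishes their positions.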

Position Is Added

The same token can mean different things depending on where it appears. Before the transformer blocks, positional information is added to the token embeddings so the hidden stream carries both identity and order.

  • Token vectors carry "what token is this?" information.
  • Position signals carry "where is it?" information.
  • The combined vector becomes the input hidden state.
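
As a rough sketch of the addition step, the example below assumes fixed sinusoidal position vectors. That is only one of several schemes (learned position tables are another), and the source does not say which one a given model uses; the function names and the toy width are hypothetical.

```python
import math

D_MODEL = 4  # hypothetical hidden width

def sinusoidal_position(pos, d_model=D_MODEL):
    """Fixed sinusoidal position vector (one common choice, assumed here)."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        # Even feature indices use sine, odd indices use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(token_vectors):
    """Element-wise sum of each token vector with its position vector."""
    return [[t + p for t, p in zip(vec, sinusoidal_position(pos, len(vec)))]
            for pos, vec in enumerate(token_vectors)]

same_token = [0.5, -0.2, 0.1, 0.9]
# The same token vector placed at positions 0 and 1.
h = add_positions([same_token, same_token])
```

After the addition, `h[0]` and `h[1]` differ even though both started from the same token vector, which is the sense in which the hidden state carries both identity and order.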

Hidden States Cross N Blocks

A transformer is a stack of repeated blocks. Each block receives the whole sequence of hidden states and returns a rewritten sequence with more context mixed into each token position.

  • The sequence shape stays aligned to the token positions.
  • Later blocks build on the representations from earlier blocks.
  • The last block outputs the final hidden state.
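
The stacking pattern can be illustrated with a deliberately crude stand-in for a block. The `toy_block` function below is hypothetical and is not attention or an MLP; it only demonstrates the two properties listed above: each block consumes the previous block's output, and the sequence shape is preserved.

```python
def toy_block(hidden, scale):
    """Placeholder block: rewrites every position and keeps the shape.

    It adds a scaled mean over positions as a crude stand-in for
    context mixing. A real block uses attention and an MLP.
    """
    seq_len = len(hidden)
    width = len(hidden[0])
    mean = [sum(h[j] for h in hidden) / seq_len for j in range(width)]
    return [[h[j] + scale * mean[j] for j in range(width)] for h in hidden]

def run_stack(hidden, n_blocks):
    """Feed the hidden states through n_blocks blocks in order."""
    for _ in range(n_blocks):
        hidden = toy_block(hidden, scale=0.1)  # output becomes the next input
    return hidden

h0 = [[1.0, 2.0], [3.0, 4.0]]          # 2 positions, width 2
h_final = run_stack(h0, n_blocks=3)    # still 2 positions, width 2
```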

One Block Has Attention and an MLP

Inside a block, attention lets token positions exchange information, while the MLP updates the features at each position. Residual connections carry the stream through these transformations.

  • Attention mixes information across token positions.
  • The MLP transforms each position's feature vector.
  • Residual adds preserve a path for the running hidden stream.
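
A simplified block might look like the sketch below. Several assumptions are made to keep it short: attention is single-head with identity query, key, and value projections, a causal mask is assumed (each position attends only to itself and earlier positions), normalization layers that real blocks typically include are omitted, and the MLP weights are arbitrary illustrative values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(hidden):
    """Single-head causal self-attention with identity Q/K/V (a simplification)."""
    d = len(hidden[0])
    out = []
    for i, q in enumerate(hidden):
        # Position i scores itself and every earlier position.
        scores = [sum(a * b for a, b in zip(q, hidden[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted average of the attended vectors.
        out.append([sum(w * hidden[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

def mlp(vec, w_in, w_out):
    """Two-layer MLP with ReLU, applied to a single position's vector."""
    inner = [max(0.0, sum(v * w for v, w in zip(vec, row))) for row in w_in]
    return [sum(h * w for h, w in zip(inner, row)) for row in w_out]

def block(hidden, w_in, w_out):
    # Residual add around attention: mixes information across positions.
    attn = attention(hidden)
    hidden = [[h + a for h, a in zip(hv, av)] for hv, av in zip(hidden, attn)]
    # Residual add around the MLP: each position is transformed independently.
    return [[h + m for h, m in zip(hv, mlp(hv, w_in, w_out))] for hv in hidden]

hidden = [[1.0, 0.0], [0.0, 1.0]]                 # 2 positions, width 2
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # width 2 -> inner width 3
w_out = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]        # inner width 3 -> width 2
rewritten = block(hidden, w_in, w_out)
```

Because both sub-layers add their output to the incoming stream rather than replacing it, the original hidden state always has a direct path through the block.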

The LM Head Produces Logits

After the final block, the model projects the last hidden state through the unembedding, often called the language-model head. This produces logits: raw scores over the vocabulary for the next token.

  • The LM head maps hidden features back to vocabulary-sized scores.
  • Each possible next token gets one logit.
  • Logits are unnormalized scores, not probabilities; a softmax converts them into a probability distribution.
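
The projection amounts to one dot product per vocabulary entry. In the sketch below, the unembedding matrix, the three-token vocabulary, and the hidden values are all made up for illustration.

```python
def lm_head(final_hidden, unembed):
    """Score every vocabulary entry against the last position's hidden vector."""
    last = final_hidden[-1]   # the last position predicts the next token
    return [sum(h * w for h, w in zip(last, row)) for row in unembed]

final_hidden = [[0.2, -0.1], [1.0, 2.0]]          # 2 positions, width 2
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical 3-token vocabulary
logits = lm_head(final_hidden, unembed)           # one score per vocabulary entry
```

The resulting scores do not sum to 1 and can be negative or arbitrarily large, which is why they need a further step before being treated as probabilities.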

Sampling Chooses the Next Token

The forward pass itself ends by handing logits to the decoding or sampling pipeline. Temperature, top-k, top-p, or greedy decoding then decide which token is appended before the next forward pass begins.

  • The transformer computes logits.
  • The sampler turns logits into a token choice.
  • Generation repeats this pass for each new token.
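
One way the decoding step and the surrounding loop could be written is sketched below. The function names are hypothetical, the order in which top-k and top-p are applied is an assumption (implementations differ), and `forward` is a stub that stands in for the whole transformer; the logits reuse the cat/sat/mat values from the figure.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])
    # Rank candidates from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]          # keep only the k best candidates
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:                # keep the smallest prefix reaching top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    weights = [probs[i] for i in order]
    # random.choices treats weights as relative, so no renormalization is needed.
    return rng.choices(order, weights=weights, k=1)[0]

def generate(forward, token_ids, n_new, **sampling):
    """Run one forward pass per new token and append each sampled token."""
    ids = list(token_ids)
    for _ in range(n_new):
        logits = forward(ids)
        ids.append(sample_next(logits, **sampling))
    return ids

vocab = ["cat", "sat", "mat"]
logits = [1.2, 3.8, 5.1]                                  # values from the figure
greedy_token = vocab[sample_next(logits, temperature=0)]  # highest logit wins
```

The sampler never changes the logits the transformer produced; it only decides how to turn them into a single choice, which keeps the decoding policy separate from the model.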