The Forward Pass: Tokens to Next Token

Read this as: what path produces the logit vector that sampling consumes?

Failure Trap: Treating sampling as the model, when sampling only consumes the final forward-pass scores.
Decision Rule: Separate representation building in the transformer from the decoding policy after logits.


Token IDs Become Vectors

A model does not receive words directly. The tokenizer has already turned text into integer token IDs, and the first model operation is an embedding lookup: each ID selects a learned vector from the embedding table.

Input at this point is a sequence of token IDs.
Embedding lookup maps each ID to a dense vector.
The result is one vector per token position.
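
The lookup described above can be sketched as plain row indexing. This is a minimal illustration, assuming NumPy; the sizes, the random table, and the token IDs are made up, and a real model would load trained values instead.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model = 10, 8

rng = np.random.default_rng(0)
# Stand-in for a learned embedding table: one row per token ID.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Token IDs as a tokenizer might produce them (made-up values).
token_ids = np.array([3, 7, 3, 1])

# Embedding lookup is row indexing: each ID selects its row.
token_vectors = embedding_table[token_ids]

print(token_vectors.shape)  # (4, 8): one d_model-sized vector per position
```

Note that positions 0 and 2 hold the same ID and therefore receive identical vectors at this stage; nothing yet distinguishes where a token sits.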

Position Is Added

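The same token can mean different things depending on where it appears. In the classic design, positional information is added to the token embeddings before the transformer blocks, so the hidden stream carries both identity and order. Some architectures inject position inside attention instead, but the purpose is the same.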

Token vectors carry "what token is this?" information.
Position signals carry "where is it?" information.
The combined vector becomes the input hidden state.
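
One way to picture the combination is a second lookup table indexed by position, added elementwise to the token vectors. The sketch below assumes learned absolute position embeddings, which is only one of several schemes in use; all names, sizes, and values are illustrative.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size, max_positions, d_model = 10, 16, 8

rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, d_model))        # "what token is this?"
position_table = rng.normal(size=(max_positions, d_model))  # "where is it?"

token_ids = np.array([3, 7, 3, 1])       # made-up IDs; token 3 appears twice
positions = np.arange(len(token_ids))    # 0, 1, 2, 3

# Assumption: position is injected by simple addition before the first block.
hidden = token_table[token_ids] + position_table[positions]

# The repeated token now has different vectors at positions 0 and 2.
same_vector = np.allclose(hidden[0], hidden[2])
print(hidden.shape, same_vector)
```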

Hidden States Cross N Blocks

A transformer is a stack of repeated blocks. Each block receives the whole sequence of hidden states and returns a rewritten sequence with more context mixed into each token position.

The sequence shape stays aligned to the token positions.
Later blocks build on the representations from earlier blocks.
The last block outputs the final hidden state.
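
The stacking itself is just a loop in which each block's output becomes the next block's input. In this sketch, `toy_block` is a hypothetical placeholder that only demonstrates the shape contract; unlike a real block, it does not mix information across positions.

```python
import numpy as np

def toy_block(hidden, weight):
    # Placeholder for a real transformer block: a residual update
    # that returns a sequence with the same shape it received.
    return hidden + np.tanh(hidden @ weight)

# Hypothetical sizes, chosen only for illustration.
seq_len, d_model, n_blocks = 4, 8, 3

rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq_len, d_model))  # input hidden states
block_weights = [rng.normal(size=(d_model, d_model)) * 0.1
                 for _ in range(n_blocks)]

# Each block rewrites the whole sequence; later blocks build on earlier ones.
for weight in block_weights:
    hidden = toy_block(hidden, weight)

final_hidden = hidden
print(final_hidden.shape)  # (4, 8): still one vector per token position
```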

One Block Has Attention and an MLP

Inside a block, attention lets token positions exchange information, while the MLP updates the features at each position. Residual connections carry the stream through these transformations.

Attention mixes information across token positions.
The MLP transforms each position's feature vector.
Residual adds preserve a path for the running hidden stream.
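
A single block can be sketched as two residual updates, attention followed by an MLP. This version makes several simplifying assumptions that are not stated in the text above: one attention head, a causal mask (typical for next-token models), a ReLU MLP, and no layer normalization. All weights are random stand-ins.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(h, w_q, w_k, w_v, w_o):
    seq_len, d_model = h.shape
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = (q @ k.T) / np.sqrt(d_model)
    # Assumed causal mask: position i may only read positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ w_o       # mixes information across positions

def mlp(h, w_in, w_out):
    # Applied to each position independently (ReLU chosen for simplicity).
    return np.maximum(h @ w_in, 0.0) @ w_out

def block(h, p):
    h = h + causal_self_attention(h, p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    h = h + mlp(h, p["w_in"], p["w_out"])  # both updates add to the residual stream
    return h

# Hypothetical sizes, chosen only for illustration.
seq_len, d_model, d_hidden = 4, 8, 32
rng = np.random.default_rng(0)
shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
          "w_v": (d_model, d_model), "w_o": (d_model, d_model),
          "w_in": (d_model, d_hidden), "w_out": (d_hidden, d_model)}
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

h = rng.normal(size=(seq_len, d_model))
out = block(h, params)
print(out.shape)  # (4, 8)
```

Because the MLP acts per position and the mask blocks future positions, changing the last token's input leaves the outputs at earlier positions unchanged in this sketch.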

The LM Head Produces Logits

After the final block, the model projects the final hidden state at the last token position through the unembedding matrix, often called the language-model head. This produces logits: raw scores over the vocabulary for the next token.

The LM head maps hidden features back to vocabulary-sized scores.
Each possible next token gets one logit.
Logits are scores, not probabilities yet.
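
The projection can be sketched as one matrix multiply from hidden size to vocabulary size. The sizes and the random `w_unembed` matrix are illustrative assumptions; real models often apply a final normalization first, which is omitted here. The softmax at the end is included only to show the difference between logits and probabilities.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
seq_len, d_model, vocab_size = 4, 8, 10

rng = np.random.default_rng(0)
final_hidden = rng.normal(size=(seq_len, d_model))   # output of the last block
w_unembed = rng.normal(size=(d_model, vocab_size))   # stand-in LM head weights

# Every position gets a row of scores; next-token prediction reads the last row.
all_logits = final_hidden @ w_unembed     # shape (seq_len, vocab_size)
next_token_logits = all_logits[-1]        # shape (vocab_size,)

# Logits are unnormalized. A softmax turns them into probabilities.
shifted = next_token_logits - next_token_logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

print(next_token_logits.shape, probs.sum())
```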

Sampling Chooses the Next Token

The forward pass itself ends by handing logits to the decoding or sampling pipeline. Temperature, top-k, top-p, or greedy decoding then decide which token is appended before the next forward pass begins.

The transformer computes logits.
The sampler turns logits into a token choice.
Generation repeats this pass for each new token.
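
The split between model and decoding policy can be sketched as a sampler function that sees only logits, plus a loop that calls the forward pass once per new token. Here `toy_forward` is a hypothetical stand-in for the whole transformer, and the sampler covers only greedy, temperature, and top-k; top-p is left out to keep the sketch short.

```python
import numpy as np

def sample_next_token(logits, rng, temperature=1.0, top_k=None):
    if temperature == 0.0:
        return int(np.argmax(logits))      # greedy: take the highest score
    scaled = logits / temperature          # <1 sharpens, >1 flattens
    if top_k is not None:
        # Keep the top_k highest scores (ties at the cutoff are also kept).
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < cutoff, -np.inf, scaled)
    scaled = scaled - scaled.max()         # stabilize before exponentiating
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

def toy_forward(token_ids, vocab_size=10):
    # Hypothetical stand-in for a full forward pass: returns pseudo-logits
    # that depend only on the most recent token.
    local = np.random.default_rng(int(token_ids[-1]))
    return local.normal(size=vocab_size)

rng = np.random.default_rng(0)
tokens = [3, 7]                            # made-up prompt token IDs
for _ in range(5):
    logits = toy_forward(tokens)           # the model computes logits
    next_id = sample_next_token(logits, rng, temperature=0.8, top_k=3)
    tokens.append(next_id)                 # the sampler's choice is appended

print(tokens)
```

Swapping `temperature` or `top_k` changes which token is chosen without touching `toy_forward`, which is the separation the decision rule at the top asks for.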