Inside One Transformer Block | Explainers

Read this as: How does one block update the residual stream?

Failure Trap: Drawing attention as the whole transformer and hiding the MLP and residual updates.
Decision Rule: Read each block as two updates to the same stream: attention update, then MLP update.

The residual stream carries the state

A block receives one vector per token from the previous layer. Together these vectors form the residual stream, and each sublayer's update is added back into it.

The stream keeps a stable representation path.
Sublayers compute updates to that path.
Residual connections give information and gradients a direct path around each sublayer.
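As a minimal sketch of the idea above, the snippet below treats the stream as an array with one row per token and adds a sublayer's update to it. The sizes and the `toy_sublayer` function are hypothetical placeholders chosen for illustration, not part of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes (assumption)

# One vector per token: the residual stream entering the block.
stream = rng.normal(size=(seq_len, d_model))

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the MLP: returns an update."""
    return 0.1 * np.tanh(x)

update = toy_sublayer(stream)
new_stream = stream + update  # the update is added to the stream, not substituted for it

# The shape is unchanged, and subtracting the update recovers the old stream.
same_shape = new_stream.shape == stream.shape
recovered = np.allclose(new_stream - update, stream)
```

Because the sublayer only contributes a sum term, the incoming representation is still present in the result even if the update is small or uninformative.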

Attention mixes token context

Self-attention lets each position read from other positions and produce a context update.

Queries compare against keys.
Weights mix values from other tokens.
The update depends on the whole context.
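The three points above can be sketched as single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: one head, no causal mask, no biases, and random weights named `w_q`, `w_k`, `w_v`, `w_o` that are introduced here only for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v, w_o):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # queries compare against keys
    weights = softmax(scores)                # one row of mixing weights per position
    mixed = weights @ v                      # weights mix values from all tokens
    return mixed @ w_o, weights              # project back to the stream width

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # illustrative sizes (assumption)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(d_head, d_model)) * 0.1

update, weights = self_attention(x, w_q, w_k, w_v, w_o)
```

Each row of `weights` sums to 1, so every position's update is a weighted average of value vectors drawn from the whole context.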

Add and normalize stabilizes the update

The attention output is added back to the residual stream, then normalized for the next sublayer.

The add step preserves the stream path.
Normalization controls scale.
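Architectures vary: post-norm normalizes after the add, as described here, while pre-norm normalizes the sublayer's input and leaves the stream itself unnormalized.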
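To make the two orderings concrete, here is a rough sketch of one add-and-normalize step in each style. The `layer_norm` below omits the learned gain and bias that real layers usually carry, and `toy_sublayer` is a hypothetical stand-in for attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    # Post-norm: add the update to the stream, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_norm_step(x, sublayer):
    # Pre-norm: normalize the sublayer input, then add to the raw stream.
    return x + sublayer(layer_norm(x))

def toy_sublayer(x):
    """Hypothetical stand-in for the attention update."""
    return 0.5 * x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 10.0  # deliberately large scale

post = post_norm_step(x, toy_sublayer)
pre = pre_norm_step(x, toy_sublayer)
```

In the post-norm result every token vector comes out with roughly zero mean and unit variance regardless of the input scale, which is the "controls scale" point above. In the pre-norm result the stream keeps its original scale and only the sublayer sees normalized input.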

The feed-forward network transforms each position

The MLP, or FFN, applies learned nonlinear transformations independently at each token position.

It does not mix positions directly.
It expands and compresses features.
It gives the block per-token computation, separate from the cross-token mixing that attention does.
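A small sketch can show both the expand-and-compress shape and the position independence. The ReLU activation, the 4x hidden width, and the weight names are assumptions made for the example; real models differ in these choices.

```python
import numpy as np

def feed_forward(x, w_in, b_in, w_out, b_out):
    hidden = np.maximum(0.0, x @ w_in + b_in)  # expand to d_hidden, then ReLU
    return hidden @ w_out + b_out              # compress back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32  # illustrative sizes (assumption)
w_in = rng.normal(size=(d_model, d_hidden)) * 0.1
b_in = np.zeros(d_hidden)
w_out = rng.normal(size=(d_hidden, d_model)) * 0.1
b_out = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = feed_forward(x, w_in, b_in, w_out, b_out)

# Perturb only the token at position 2 and rerun the same network.
x_changed = x.copy()
x_changed[2] += 1.0
out_changed = feed_forward(x_changed, w_in, b_in, w_out, b_out)

# Every other position's output is unaffected: no mixing across positions.
other_rows_same = np.allclose(
    np.delete(out, 2, axis=0), np.delete(out_changed, 2, axis=0)
)
```

The same weights are applied to every row, so changing one token changes only that token's output. Running the same perturbation through attention would change the other rows as well.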

Add and normalize closes the block

The MLP update is added back and normalized, producing the block output for the next layer.

Both sublayers update the same stream.
The output shape matches the input shape.
One vector per token continues onward.
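Putting the pieces together, one block can be sketched as two add-and-normalize steps over the same array. This follows the post-norm ordering described in this walkthrough and assumes a single attention head, no mask, no biases, a ReLU MLP, and random weights; all function and parameter names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization; learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_update(x, p):
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["w_o"]

def mlp_update(x, p):
    hidden = np.maximum(0.0, x @ p["w_in"])
    return hidden @ p["w_out"]

def transformer_block(x, p):
    x = layer_norm(x + attention_update(x, p))  # first add and normalize
    x = layer_norm(x + mlp_update(x, p))        # second add and normalize
    return x

def init_params(rng, d_model, d_hidden):
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_in": (d_model, d_hidden), "w_out": (d_hidden, d_model),
    }
    return {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32  # illustrative sizes (assumption)
params = init_params(rng, d_model, d_hidden)

x = rng.normal(size=(seq_len, d_model))
y = transformer_block(x, params)
```

`y` has the same shape as `x`, one vector per token, which is what lets the output feed directly into another block.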

Stacks repeat the pattern many times

A transformer is many blocks in sequence. Each block refines the representation before the final vocabulary projection.

Early layers often capture local patterns.
Later layers combine broader abstractions.
Because every block takes and returns the same shape, depth scales by stacking more blocks.
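The stacking pattern can be sketched as a loop over blocks followed by a projection to vocabulary logits. The compact `make_block` helper, the layer count, and the vocabulary size are hypothetical choices for illustration. Token embeddings, positional information, and masking are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def make_block(rng, d_model, d_hidden):
    """Build one simplified post-norm block with its own random weights."""
    shapes = {
        "q": (d_model, d_model), "k": (d_model, d_model),
        "v": (d_model, d_model), "o": (d_model, d_model),
        "in": (d_model, d_hidden), "out": (d_hidden, d_model),
    }
    w = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

    def block(x):
        q, k, v = x @ w["q"], x @ w["k"], x @ w["v"]
        attn = softmax(q @ k.T / np.sqrt(d_model)) @ v @ w["o"]
        x = layer_norm(x + attn)                 # attention update
        hidden = np.maximum(0.0, x @ w["in"])
        return layer_norm(x + hidden @ w["out"])  # MLP update

    return block

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32  # illustrative sizes (assumption)
vocab_size, n_layers = 50, 3           # illustrative sizes (assumption)

blocks = [make_block(rng, d_model, d_hidden) for _ in range(n_layers)]
w_vocab = rng.normal(size=(d_model, vocab_size)) * 0.1

x = rng.normal(size=(seq_len, d_model))
shapes_seen = []
for block in blocks:
    x = block(x)                 # same shape in, same shape out
    shapes_seen.append(x.shape)

logits = x @ w_vocab             # final vocabulary projection, one row per token
```

Changing `n_layers` is the only edit needed to make the stack deeper, since no block depends on where it sits in the sequence.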