Read this as: How does one block update the residual stream?
- Failure trap: drawing attention as the whole transformer and hiding the MLP and residual updates.
- Decision rule: read each block as two updates to the same stream, an attention update followed by an MLP update.
The residual stream carries the state
A block receives one vector per token from the previous layer. The residual stream is the path that carries those vectors through the block: each sublayer reads from it and adds its update back into it.
- The stream keeps a stable representation path.
- Sublayers compute updates to that path.
- Residual connections give information and gradients a direct path around each sublayer.
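The "compute an update, then add it back" pattern can be sketched in a few lines of NumPy. This is only an illustration: `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes, not from the text

# The residual stream: one vector per token.
stream = rng.normal(size=(seq_len, d_model))

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP:
    # any function that returns an update with the same shape as x.
    return 0.1 * np.tanh(x)

update = toy_sublayer(stream)   # the sublayer computes an update
new_stream = stream + update    # the update is added back into the stream
```

Because the update is added rather than substituted, the original vectors remain recoverable as `new_stream - update`.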
Attention mixes token context
Self-attention lets each position read from other positions and produce a context update.
- Queries compare against keys.
- Softmax-normalized weights mix values across positions, including the token's own.
- The update can depend on every position the token is allowed to attend to.
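As a rough sketch of the query, key, and value steps above, here is single-head scaled dot-product self-attention in NumPy. It makes simplifying assumptions: one head, no causal mask, no output projection, and random matrices (`w_q`, `w_k`, `w_v`) standing in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project each token vector
    scores = q @ k.T / np.sqrt(k.shape[-1])   # queries compare against keys
    weights = softmax(scores, axis=-1)        # one weight row per query position
    return weights @ v, weights               # weights mix the values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

update, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, so every position's update is a weighted average of value vectors drawn from the whole sequence.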
Add and normalize stabilizes the update
The attention output is added back to the residual stream, then normalized for the next sublayer.
- The add step preserves the stream path.
- Normalization controls scale.
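- Architectures vary: post-norm normalizes after the add, while pre-norm normalizes the sublayer's input and leaves the stream itself unnormalized.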
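The two orderings differ only in where the normalization sits relative to the add. A minimal sketch, assuming a layer norm without learned scale and bias and a hypothetical `sublayer` callable:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and (roughly) unit variance.
    # Learned scale and bias parameters are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    # Add first, then normalize the result.
    return layer_norm(x + sublayer(x))

def pre_norm_step(x, sublayer):
    # Normalize the sublayer input; the stream itself is left unnormalized.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

def toy(h):
    # Hypothetical stand-in for attention or the MLP.
    return 0.5 * h

post = post_norm_step(x, toy)
pre = pre_norm_step(x, toy)
```

In the post-norm variant every output vector is rescaled; in the pre-norm variant the stream is only ever added to.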
The feed-forward network transforms each position
The MLP, or FFN, applies learned nonlinear transformations independently at each token position.
- It does not mix positions directly.
- It expands features into a wider hidden layer, then projects them back to the model width.
- It gives the block local computation power.
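A small sketch of a position-wise feed-forward network follows. The ReLU nonlinearity and the 4x hidden width are common choices assumed here for illustration, not details taken from the text; the weights are random placeholders.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)  # expand to d_ff, apply ReLU
    return hidden @ w2 + b2                # compress back to d_model

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model  # assumed expansion factor

x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
b2 = np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)

# Running a single position on its own gives the same result as
# running it inside the full sequence: no mixing across positions.
single = feed_forward(x[1:2], w1, b1, w2, b2)
```

Comparing `single` with `out[1:2]` is a quick way to confirm that the same weights are applied to each token independently.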
Add and normalize closes the block
The MLP update is added back and normalized, producing the block output for the next layer.
- Both sublayers update the same stream.
- The output shape matches the input shape.
- One vector per token continues onward.
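Putting the pieces together, one block can be sketched as two add-and-normalize steps on the same stream. This follows the post-norm ordering described in the text and assumes single-head attention, a ReLU MLP, no biases, and random placeholder weights; `init_params` and the parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention_update(x, p):
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["w_o"]          # mixes information across tokens

def mlp_update(x, p):
    return np.maximum(0.0, x @ p["w_in"]) @ p["w_out"]  # per-token transform

def block(x, p):
    x = layer_norm(x + attention_update(x, p))  # first update to the stream
    x = layer_norm(x + mlp_update(x, p))        # second update to the same stream
    return x

def init_params(rng, d_model, d_ff):
    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    p = {name: mat(d_model, d_model) for name in ("w_q", "w_k", "w_v", "w_o")}
    p["w_in"] = mat(d_model, d_ff)
    p["w_out"] = mat(d_ff, d_model)
    return p

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # one vector per token going in
params = init_params(rng, d_model=8, d_ff=32)
y = block(x, params)                 # one vector per token coming out
```

The output `y` has the same shape as `x`, which is what allows the next block to consume it unchanged.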
Stacks repeat the pattern many times
A transformer is many blocks in sequence. Each block refines the representation before the final vocabulary projection.
- Early layers often capture local patterns.
- Later layers combine broader abstractions.
- Because every block maps a sequence of vectors to a sequence of the same shape, depth scales by stacking more blocks.
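The stacking pattern itself is short to express. In the sketch below, `make_toy_block` is a hypothetical shape-preserving stand-in for a full transformer block, and the vocabulary size and layer count are arbitrary illustrative values.

```python
import numpy as np

def make_toy_block(rng, d_model):
    # Hypothetical stand-in for a real block: a shape-preserving residual update.
    w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    def block(x):
        return x + 0.1 * np.tanh(x @ w)
    return block

def run_stack(x, blocks):
    # Each block refines the representation produced by the previous one.
    for block in blocks:
        x = block(x)
    return x

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size, n_layers = 4, 8, 50, 6  # illustrative sizes

blocks = [make_toy_block(rng, d_model) for _ in range(n_layers)]
x = rng.normal(size=(seq_len, d_model))

h = run_stack(x, blocks)                                   # still (seq_len, d_model)
w_vocab = rng.normal(size=(d_model, vocab_size)) / np.sqrt(d_model)
logits = h @ w_vocab                                       # one score per vocabulary entry
```

Changing `n_layers` requires no other change to the code, because every block accepts and returns the same shape.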