Read this as: which experts actually run for this token?
- Failure Trap
- Reading MoE as every expert voting on every token instead of sparse routing.
- Decision Rule
- Let the router score experts, activate only top-k, then combine their weighted outputs.
A token reaches the MoE layer
Sparse MoE starts with one token representation that needs a feed-forward update.
- The token is routed independently.
- Different tokens can choose different experts.
- The routing happens inside the layer.
The router scores experts
A small gating network gives each expert a score for this token.
- Scores are token-specific.
- The router is learned with the model.
- After normalization, typically a softmax, the scores become routing weights.
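The scoring step can be sketched in a few lines. This is a minimal illustration, assuming the router is a single linear map followed by a softmax; the names `router_scores` and `router_weights` and all numeric values are hypothetical.

```python
import math

def router_scores(token, router_weights):
    """One logit per expert: dot product of the token with that expert's router row."""
    return [sum(w * x for w, x in zip(row, token)) for row in router_weights]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: a 3-dimensional token and 4 experts.
token = [1.0, 0.0, -1.0]
router_weights = [
    [0.5, 0.1, -0.5],   # expert 0
    [-0.2, 0.3, 0.2],   # expert 1
    [1.0, 0.0, 0.0],    # expert 2
    [0.0, 0.0, 0.0],    # expert 3
]

logits = router_scores(token, router_weights)  # token-specific scores
probs = softmax(logits)                        # scores turned into routing weights
```

A different token would produce different logits from the same `router_weights`, which is what makes the scores token-specific.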
Top-k experts are selected
The layer keeps only the highest-scoring experts, commonly the top one or two, and leaves the rest dormant.
- Sparse activation saves compute.
- Inactive experts do not run for this token.
- Load balancing, often an auxiliary training loss, keeps the router from collapsing onto a few experts.
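Top-k selection can be sketched as follows. This assumes the kept weights are renormalized to sum to one, which is one common choice; some implementations instead apply the softmax only over the selected logits. The function name `top_k_route` and the probabilities are hypothetical.

```python
def top_k_route(probs, k=2):
    """Return (expert_index, weight) pairs for the k highest-probability experts."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: renormalize the kept weights so they sum to 1.
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]

# Hypothetical routing weights for one token over 4 experts.
probs = [0.1, 0.6, 0.05, 0.25]
routes = top_k_route(probs, k=2)  # experts 1 and 3 are kept; 0 and 2 stay dormant
```

Only the indices in `routes` need any expert computation for this token.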
Selected experts activate
Only the chosen expert MLPs process the token representation.
- Each expert has its own parameters.
- The selected experts run independently.
- The token still needs one output vector.
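To make "each expert has its own parameters" concrete, here is a toy expert. It is a simplified sketch, assuming a two-layer MLP with a ReLU and no biases; the `Expert` class and its weights are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

class Expert:
    """A toy expert MLP holding its own weights (biases omitted for brevity)."""
    def __init__(self, w_in, w_out):
        self.w_in = w_in
        self.w_out = w_out

    def __call__(self, token):
        return matvec(self.w_out, relu(matvec(self.w_in, token)))

# Two experts with different parameters, applied to the same token.
expert_a = Expert(w_in=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                  w_out=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
expert_b = Expert(w_in=[[-1.0, 0.0], [0.0, -1.0], [1.0, -1.0]],
                  w_out=[[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

token = [1.0, -2.0]
out_a = expert_a(token)
out_b = expert_b(token)
```

The two calls share no state, so they can run independently, and each returns a vector with the same dimension as the token.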
Experts run in parallel
The selected expert computations can be batched and parallelized, but routing adds systems work: tokens must be grouped by expert and dispatched, often across devices.
- Throughput depends on routing balance.
- Communication can dominate at scale.
- Total parameter count is larger than the parameters active for any one token.
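The batching concern can be illustrated by grouping tokens by their assigned expert. This is a sketch under assumed top-2 assignments; `group_by_expert` and the assignment data are hypothetical.

```python
from collections import defaultdict

def group_by_expert(assignments):
    """Map each expert index to the list of token indices routed to it."""
    groups = defaultdict(list)
    for token_idx, experts in enumerate(assignments):
        for expert_idx in experts:
            groups[expert_idx].append(token_idx)
    return dict(groups)

# Hypothetical top-2 assignments for 4 tokens across 4 experts.
assignments = [(0, 1), (1, 2), (1, 3), (0, 1)]
num_experts = 4

groups = group_by_expert(assignments)
# Tokens per expert: uneven loads mean some experts wait while others are busy.
loads = [len(groups.get(e, [])) for e in range(num_experts)]
```

Here expert 1 receives every token while experts 2 and 3 receive one each, the kind of imbalance that limits throughput when experts run on separate devices.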
Weighted outputs combine
The final token update is the sum of the selected experts' outputs, each weighted by its router weight.
- The router weights determine the mix.
- The output shape matches a normal MLP.
- Downstream layers see one vector.
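The combine step can be sketched as a weighted sum. This is a minimal illustration, assuming the routes are already chosen and normalized; `combine`, the lambda stand-ins for expert MLPs, and all values are hypothetical.

```python
def combine(token, routes, experts):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    out = [0.0] * len(token)
    for expert_idx, weight in routes:
        expert_out = experts[expert_idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Hypothetical stand-ins for expert MLPs (real experts would be learned networks).
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0
    lambda x: [v + 1.0 for v in x],   # expert 1
    lambda x: [-v for v in x],        # expert 2, not selected below
    lambda x: [0.0 for _ in x],       # expert 3, not selected below
]

token = [1.0, 2.0]
routes = [(0, 0.75), (1, 0.25)]       # (expert index, router weight) for top-2
output = combine(token, routes, experts)
```

The result has the same length as `token`, so the layers after the MoE block cannot tell how many experts contributed.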