Mixture of Experts: Router Picks Two | Explainers

Read this as: Which experts actually run for this token?

Failure Trap: Reading MoE as every expert voting on every token instead of sparse routing.
Decision Rule: Let the router score experts, activate only top-k, then combine their weighted outputs.


A token reaches the MoE layer

Sparse MoE starts with one token representation that needs a feed-forward update.

The token is routed independently.
Different tokens can choose different experts.
The routing happens inside the layer.

The router scores experts

A small gating network gives each expert a score for this token.

Scores are token-specific.
The router is learned with the model.
The scores are normalized, typically with a softmax, into routing weights.
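
As a rough illustration of the scoring step, the sketch below assumes the router is a single linear map (one hypothetical gating row per expert) followed by a softmax. The names `router_weights`, `token`, and `gate` are made up for this example, and real routers may add noise, biases, or a different normalization.

```python
import math

def router_weights(token, gate):
    """Score every expert for one token, then softmax the scores.

    token: list of d floats.
    gate:  list of E rows with d floats each (assumed linear router).
    """
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate]
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dimensional token and a 4-expert gate.
token = [1.0, 0.0, -1.0]
gate = [
    [0.5, 0.1, 0.0],   # expert 0
    [0.0, 0.2, -1.0],  # expert 1
    [-0.3, 0.0, 0.4],  # expert 2
    [1.0, 0.0, 0.5],   # expert 3
]
weights = router_weights(token, gate)
best = weights.index(max(weights))
```

A different token would produce different logits against the same `gate`, which is what makes the scores token-specific.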

Top-k experts are selected

The layer keeps only the highest-scoring experts, commonly top one or top two, and leaves the rest dormant.

Sparse activation saves compute.
Inactive experts do not run for this token.
Load balancing, often an auxiliary training loss, keeps tokens from piling onto a few experts.
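
The selection step might look like the sketch below, which keeps the k largest routing weights per token and counts how many tokens land on each expert. Renormalizing the kept weights so they sum to 1 is one common convention, assumed here rather than taken from the text; `top_k`, `expert_load`, and the sample weights are hypothetical.

```python
def top_k(weights, k=2):
    """Return (expert_index, renormalized_weight) pairs for the k largest weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen = order[:k]
    kept = sum(weights[i] for i in chosen)
    # Assumption: rescale the kept weights so they sum to 1.
    return [(i, weights[i] / kept) for i in chosen]

def expert_load(routes, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = [0] * num_experts
    for chosen in routes:
        for expert_index, _ in chosen:
            counts[expert_index] += 1
    return counts

# Hypothetical routing weights for three tokens over four experts.
batch_weights = [
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.40, 0.10],
    [0.05, 0.70, 0.05, 0.20],
]
routes = [top_k(w, k=2) for w in batch_weights]
load = expert_load(routes, num_experts=4)
```

A very uneven `load` is the situation load balancing is meant to prevent: a few experts become bottlenecks while the others sit idle.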

Selected experts activate

Only the chosen expert MLPs process the token representation.

Each expert has its own parameters.
The selected experts run independently.
The token still needs one output vector.

Experts run in parallel

The selected expert computations can be batched and parallelized, but routing adds systems work: tokens must be grouped by expert, dispatched, and gathered back.

Throughput depends on routing balance.
Communication can dominate at scale.
Model capacity (total parameters) is larger than the compute active for any one token.
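
To make the capacity-versus-compute gap concrete, here is a small back-of-the-envelope count. The layer sizes are hypothetical, each expert is assumed to be a two-matrix MLP with biases ignored, and the router is assumed to be one linear map.

```python
# Hypothetical sizes, chosen only for illustration.
d_model, d_ff = 1024, 4096
num_experts, k = 8, 2

per_expert = 2 * d_model * d_ff    # up-projection plus down-projection
router = d_model * num_experts     # one gating row per expert

total_params = num_experts * per_expert + router   # stored in the layer
active_params = k * per_expert + router            # touched by one token
ratio = total_params / active_params
```

Under these assumptions the layer stores roughly four times as many parameters as any single token uses, which is close to `num_experts / k`.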

Weighted outputs combine

The final token update is the weighted combination of selected expert outputs, using router weights.

The router weights determine the mix.
The output shape matches a normal MLP.
Downstream layers see one vector.
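
Putting the steps together, a minimal single-token sketch could look like the following. It assumes a linear router with softmax, top-2 selection with renormalized weights, and ReLU MLP experts. All names and toy parameters are hypothetical, and a production layer would batch tokens per expert rather than loop.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def expert_mlp(params, x):
    """Two-layer MLP with ReLU: W2 @ relu(W1 @ x)."""
    w1, w2 = params
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)

def moe_layer(x, gate, experts, k=2):
    """Route one token: score, keep top-k, run only those experts, mix outputs."""
    weights = softmax(matvec(gate, x))
    order = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)
    chosen = order[:k]
    kept = sum(weights[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:  # unselected experts are never called
        y = expert_mlp(experts[i], x)
        out = [o + (weights[i] / kept) * v for o, v in zip(out, y)]
    return out, chosen

# Toy setup: expert i scales its input by (i + 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[s, 0.0], [0.0, s]], identity) for s in (1.0, 2.0, 3.0)]
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 2.0]
out, chosen = moe_layer(x, gate, experts, k=2)
```

The result `out` has the same length as `x`, so the rest of the network can treat the layer like an ordinary MLP block.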