Mixture of Experts: Router Picks Two

See sparse expert routing: a token is scored, top experts activate, and their weighted outputs are combined.

Read this as: "Which experts actually run for this token?"
Failure Trap
Reading MoE as every expert voting on every token instead of sparse routing.
Decision Rule
Let the router score experts, activate only top-k, then combine their weighted outputs.
[Diagram: Token in → Router → Pick top-k → Experts run → Parallel → Combine]

A token reaches the MoE layer

Sparse MoE starts with one token representation that needs a feed-forward update.

  • The token is routed independently.
  • Different tokens can choose different experts.
  • The routing happens inside the layer.

The router scores experts

A small gating network gives each expert a score for this token.

  • Scores are token-specific.
  • The router is learned with the model.
  • After normalization (typically a softmax), the scores become routing weights.
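
The scoring step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the router is a single linear layer followed by a softmax; the names router_probs and gate_weights and all numbers are hypothetical, not taken from any particular model.

```python
import math

def router_probs(token, gate_weights):
    """Score every expert for one token, then normalize with a softmax."""
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(logit - largest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy sizes: a 2-dimensional token and 4 experts.
token = [1.0, 0.5]
gate_weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
probs = router_probs(token, gate_weights)  # one probability per expert
```

Because the gate weights are ordinary parameters, they would be trained together with the rest of the model; a different token produces different logits and therefore a different distribution over experts.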

Top-k experts are selected

The layer keeps only the k highest-scoring experts, commonly k = 1 or k = 2, and leaves the rest dormant for this token.

  • Sparse activation saves compute.
  • Inactive experts do not run for this token.
  • Load balancing, often encouraged by an auxiliary training loss, keeps the router from sending most tokens to a few experts.
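
Top-k selection can be sketched as follows. The example assumes the router probabilities are already computed and that the kept weights are renormalized to sum to one, which is one common convention rather than the only one; top_k_route and the sample values are hypothetical.

```python
def top_k_route(probs, k=2):
    """Keep the k highest-scoring experts and renormalize their weights."""
    # Expert indices ordered from highest to lowest score; keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: renormalize so the kept weights sum to 1.
    kept_total = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept_total for i in chosen}

# Hypothetical router output for one token over 4 experts.
probs = [0.5, 0.1, 0.3, 0.1]
routing = top_k_route(probs, k=2)  # roughly {0: 0.625, 2: 0.375}
```

Experts 1 and 3 do not appear in the result at all, which is what makes the layer sparse: they are never evaluated for this token.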

Selected experts activate

Only the chosen expert MLPs process the token representation.

  • Each expert has its own parameters.
  • The selected experts run independently.
  • The token still needs one output vector.
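
The sketch below illustrates that only the selected experts do any work. It assumes four toy experts named A to D, each a stand-in for a real MLP (here just a scaled ReLU), and a call log to make the idle experts visible; all names and values are hypothetical.

```python
calls = []  # records which experts actually ran

def make_expert(name, scale):
    """Build a toy expert: a scaled ReLU standing in for a full MLP."""
    def expert(x):
        calls.append(name)
        return [scale * max(0.0, xi) for xi in x]
    return expert

# Each expert has its own parameter (here, just its scale).
experts = [
    make_expert("A", 1.0),
    make_expert("B", 2.0),
    make_expert("C", 3.0),
    make_expert("D", 4.0),
]

token = [1.0, -2.0]
selected = [0, 2]  # indices chosen by the router (assumed given)

# Only the selected experts are called; B and D stay idle for this token.
outputs = {i: experts[i](token) for i in selected}
```

After this runs, calls contains only "A" and "C", and outputs holds two vectors for the same token that still have to be merged into one.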

Experts run in parallel

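The selected expert computations can be batched and parallelized, but routing adds systems overhead: tokens must be grouped by expert and, in distributed setups, sent to the devices that hold those experts.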

  • Throughput depends on routing balance.
  • Communication can dominate at scale.
  • The model's total parameter count (its capacity) is larger than the parameters active for any one token.
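
To make the batching and balance points concrete, here is a small sketch that groups tokens by the expert they were routed to. It assumes top-2 routing over four experts and a hand-written assignment list; group_by_expert and the numbers are hypothetical.

```python
from collections import defaultdict

def group_by_expert(assignments):
    """Map each expert index to the list of token indices routed to it."""
    batches = defaultdict(list)
    for token_idx, expert_ids in enumerate(assignments):
        for expert_id in expert_ids:
            batches[expert_id].append(token_idx)
    return dict(batches)

# Hypothetical top-2 choices for 4 tokens: every token picked expert 0.
assignments = [(0, 2), (0, 1), (0, 2), (3, 0)]
batches = group_by_expert(assignments)

# Tokens per expert: an uneven load means some experts wait on others.
load = {expert_id: len(tokens) for expert_id, tokens in batches.items()}

# Share of expert parameters used per token with k of num_experts active.
num_experts, k = 4, 2
active_fraction = k / num_experts
```

In this toy case expert 0 receives all four tokens while experts 1 and 3 receive one each, so the slowest batch sets the pace. Each token touches only half of the expert parameters, which is the gap between capacity and active compute.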

Weighted outputs combine

The final token update is the weighted combination of selected expert outputs, using router weights.

  • The router weights determine the mix.
  • The output shape matches a normal MLP.
  • Downstream layers see one vector.
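
The combine step is a weighted sum of the selected experts' outputs. A minimal sketch, assuming two experts were selected and their router weights already sum to one; combine and the sample vectors are hypothetical.

```python
def combine(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for expert_id, vec in expert_outputs.items():
        for d in range(dim):
            out[d] += weights[expert_id] * vec[d]
    return out

# Outputs of the two selected experts for one token (assumed given).
expert_outputs = {0: [1.0, 0.0], 2: [3.0, 4.0]}
# Router weights for those same experts.
weights = {0: 0.75, 2: 0.25}

combined = combine(expert_outputs, weights)  # one vector, same size as each expert output
```

The result has the same dimensionality as a single expert's output, so the rest of the network cannot tell how many experts contributed.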