How do Mixture-of-Experts (MoE) models work?
Mixture-of-Experts (MoE) models are a way to make very large AI models more powerful without requiring all of their parameters to be used for every input.
The Core Idea
Instead of having one giant neural network process every token, an MoE model contains:
- A router (or gating network)
- Multiple expert networks
- A mechanism to combine the results
When a token enters the model:
- The router examines the token.
- It decides which expert(s) are most suitable.
- Only those selected experts perform computation.
- Their outputs are combined and passed to the next layer.
Think of it like a company:
- A receptionist (router) receives a request.
- The request is forwarded to the most qualified specialists (experts).
- Only those specialists work on it.
- Their results are returned.
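The routing steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular model implements it: the linear router, the `moe_layer` function, and the four "experts" that merely scale their input are all hypothetical stand-ins.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """Route one token vector x through its top-k experts.

    x: list of floats (the token representation)
    router_weights: one weight vector per expert (hypothetical linear router)
    experts: list of callables, each mapping a vector to a vector
    """
    # 1. The router scores every expert for this token.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    probs = softmax(logits)

    # 2. Keep only the k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]

    # 3. Rescale the chosen experts' weights so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    gates = {i: probs[i] / norm for i in chosen}

    # 4. Run only the chosen experts and mix their outputs.
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        out = [o + gates[i] * y_j for o, y_j in zip(out, y)]
    return out, gates

# Toy setup: 4 experts that just scale their input by different factors.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, gates = moe_layer([2.0, 1.0], router_weights, experts, k=2)
print(sorted(gates))  # indices of the experts that actually ran: [0, 1]
```

Experts 2 and 3 are never called for this token, which is the whole point: their parameters exist but cost nothing here.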
Dense Models vs MoE Models
Dense Model
Every parameter participates in every forward pass.
Input
↓
All neurons compute
↓
Output
If a model has 100 billion parameters, all 100 billion are involved in every token's computation.
MoE Model
Only a subset of parameters is activated.
Input
↓
Router
↓
Experts 3 and 7 selected
↓
Only those experts compute
↓
Output
A model may contain 100 billion total parameters, but perhaps only 10–20 billion are active for a given token.
This is often called sparse activation.
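To make the arithmetic concrete, here is a minimal sketch that splits a model into parameters every token uses and parameters that sit inside experts. The function name and the specific sizes are assumptions chosen only to land on the 100 billion figure used above; they do not describe a real model.

```python
def moe_param_counts(shared, per_expert, n_experts, k):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared: parameters every token uses (attention, embeddings, ...)
    per_expert: parameters in one expert, summed over all layers
    n_experts: experts available; k: experts used per token
    """
    total = shared + n_experts * per_expert
    active = shared + k * per_expert
    return total, active

# Hypothetical sizes chosen only to match the 100B example above.
total, active = moe_param_counts(shared=4e9, per_expert=6e9, n_experts=16, k=2)
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
# total: 100B, active per token: 16B
```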
A Simple Example
Suppose we have 8 experts:
- Expert 1: mathematics
- Expert 2: programming
- Expert 3: biology
- Expert 4: law
- Expert 5: creative writing
- Expert 6: translation
- Expert 7: reasoning
- Expert 8: general language
When the model sees:
"Write Python code to sort a list"
the router may assign:
- 70% weight to programming expert
- 30% weight to reasoning expert
The final output is a weighted combination of those experts' outputs.
In practice, experts are not assigned human-readable specialties like these. Whatever specialization emerges is learned automatically during training, and it often does not line up with neat subject areas.
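The weighted combination in this example can be written compactly. The notation is an assumption introduced here for illustration: $E_i(x)$ is the output of expert $i$ for input $x$, $g_i$ is the weight the router gives that expert, and $\mathcal{T}$ is the set of selected experts.

$$
y = \sum_{i \in \mathcal{T}} g_i \, E_i(x), \qquad \sum_{i \in \mathcal{T}} g_i = 1
$$

For the prompt above, with the programming expert (Expert 2) and the reasoning expert (Expert 7) selected, this becomes $y = 0.7\,E_2(x) + 0.3\,E_7(x)$.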
Top-k Routing
Most modern MoE systems use Top-k routing.
The router scores all experts:
| Expert | Score |
|---|---|
| E1 | 0.05 |
| E2 | 0.60 |
| E3 | 0.02 |
| E4 | 0.01 |
| E5 | 0.10 |
| E6 | 0.03 |
| E7 | 0.15 |
| E8 | 0.04 |
If k = 2, the router selects:
- E2
- E7
Only these experts run.
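With k = 2 out of 8 experts, only a quarter of the expert parameters in the layer do any work for this token, which is where the compute savings come from.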
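The selection in the table can be reproduced with a short sketch. The `top_k` helper is hypothetical, and rescaling the winners' scores so they sum to 1 is one common choice rather than the only one.

```python
scores = {"E1": 0.05, "E2": 0.60, "E3": 0.02, "E4": 0.01,
          "E5": 0.10, "E6": 0.03, "E7": 0.15, "E8": 0.04}

def top_k(scores, k):
    """Pick the k best-scoring experts and rescale their scores to sum to 1."""
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    norm = sum(scores[name] for name in chosen)
    return {name: scores[name] / norm for name in chosen}

gates = top_k(scores, k=2)
print({name: round(g, 2) for name, g in gates.items()})
# {'E2': 0.8, 'E7': 0.2}
```

After rescaling, E2 contributes 80% of the layer's output for this token and E7 the remaining 20%.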
Why Use MoE?
1. More Parameters
A dense model might be limited to:
70B parameters
An MoE model might have:
- 500B total parameters
- only 50B active at once
This allows much larger capacity.
2. Lower Compute Cost
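Since only a few experts run, each token needs far less computation than it would in a dense model with the same total parameter count. Memory is a different story: every expert must still be loaded, because any of them may be selected.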
3. Specialization
Different experts tend to become better at different patterns, for example:
- code
- reasoning
- languages
- mathematics
- scientific text
The router learns when to use each one.
Typical MoE Layer
Modern transformer MoE models usually keep:
- Attention layers dense
- Feed-forward layers replaced with expert pools
A standard transformer block:
Attention
↓
Feed Forward Network
becomes:
Attention
↓
Router
↓
Selected Experts
↓
Combined Output
This means only part of the transformer becomes sparse.
Training Challenges
Expert Collapse
The router may start sending most tokens to only a few experts.
Example:
Expert 1: 70%
Expert 2: 20%
Experts 3–8: almost unused
The neglected experts then receive almost no training signal and never learn anything useful, so their parameters are wasted.
To prevent this, training includes load balancing losses that encourage experts to receive roughly equal traffic.
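One widely used form of load balancing loss, in the style of the Switch Transformer, multiplies each expert's share of the traffic by the router's average probability for that expert and sums over experts. The sketch below is a plain-Python illustration under that assumption; the function name and inputs are hypothetical, and real training code would compute this on tensors and scale it by a small coefficient.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary loss in the style of the Switch Transformer.

    router_probs: per token, a list of router probabilities over experts
    assignments: per token, the index of the expert it was sent to
    Returns n_experts * sum_i(f_i * p_i), which equals 1.0 when both the
    traffic and the router probabilities are perfectly uniform.
    """
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(n_experts):
        f_i = sum(1 for a in assignments if a == i) / n_tokens  # traffic share
        p_i = sum(p[i] for p in router_probs) / n_tokens         # mean probability
        loss += f_i * p_i
    return n_experts * loss

# Balanced: 4 tokens spread evenly over 4 experts.
uniform = [[0.25] * 4 for _ in range(4)]
print(load_balance_loss(uniform, [0, 1, 2, 3], 4))  # 1.0

# Collapsed: every token goes to expert 0 with high confidence.
skewed = [[0.625, 0.125, 0.125, 0.125] for _ in range(4)]
print(load_balance_loss(skewed, [0, 0, 0, 0], 4))   # 2.5
```

The collapsed router scores higher, so adding this term to the training loss pushes the router back toward spreading tokens out.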
Capacity Limits
Each expert can process only a limited number of tokens per batch.
If too many tokens choose the same expert:
- some tokens may be dropped
- or rerouted
This must be managed carefully.
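A common way to set the limit is a capacity factor: each expert accepts its fair share of the batch times some slack. The sketch below assumes that scheme and simply drops overflow tokens in arrival order; the function name and the default factor of 1.25 are illustrative assumptions, not values from a specific system.

```python
import math

def dispatch_with_capacity(assignments, n_experts, capacity_factor=1.25):
    """Assign tokens to experts, dropping any that overflow an expert's capacity.

    assignments: the router's chosen expert index for each token, in order
    Returns (kept, dropped) where kept maps expert -> list of token positions.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / n_experts)
    kept = {i: [] for i in range(n_experts)}
    dropped = []
    for pos, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(pos)
        else:
            dropped.append(pos)  # overflow: this token gets no expert computation
    return kept, dropped

# 8 tokens, 4 experts, capacity = ceil(1.25 * 8 / 4) = 3 tokens per expert.
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 0, 1, 2, 3], n_experts=4)
print(kept[0], dropped)  # [0, 1, 2] [3, 4]
```

Five tokens wanted expert 0 but only three fit, so two are dropped. A larger capacity factor drops fewer tokens at the cost of more reserved compute and memory per expert.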
Communication Overhead
In distributed training:
- Expert 1 may live on GPU A
- Expert 2 on GPU B
- Expert 3 on GPU C
Tokens must be transferred between GPUs according to router decisions.
This "all-to-all communication" can become a major bottleneck.
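The bookkeeping behind that transfer can be sketched without any GPU code. The example below only plans which tokens would have to travel to which device; the function name and the one-expert-per-GPU placement are hypothetical, and a real system would perform the exchange with a collective communication library.

```python
from collections import defaultdict

def plan_dispatch(token_experts, expert_to_device, source_device):
    """Group token positions by the device that hosts their chosen expert.

    Returns (buckets, n_remote): buckets maps device -> token positions, and
    n_remote counts tokens that must leave source_device.
    """
    buckets = defaultdict(list)
    for pos, expert in enumerate(token_experts):
        buckets[expert_to_device[expert]].append(pos)
    n_remote = sum(len(p) for d, p in buckets.items() if d != source_device)
    return dict(buckets), n_remote

# Hypothetical placement: one expert per GPU, all tokens start on GPU A.
placement = {1: "A", 2: "B", 3: "C"}
buckets, n_remote = plan_dispatch([1, 2, 2, 3, 1, 3], placement, "A")
print(buckets)   # {'A': [0, 4], 'B': [1, 2], 'C': [3, 5]}
print(n_remote)  # 4
```

Here four of six tokens must cross to another GPU and their results must come back afterward, and every device is doing the same thing at once.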
Real-World MoE Models
Examples include:
- Switch Transformer
- GShard
- Mixtral 8x7B
- DeepSeek-V3
- DeepSeek-R1
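For example, "8×7B" in Mixtral does not mean 56B parameters are active, or even that the model has 56B parameters in total. Each transformer layer has eight expert feed-forward networks, while the attention layers and embeddings are shared, so the total comes to roughly 47B parameters. Only two experts are used per token in each layer, which leaves roughly 13B parameters active per token.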
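A rough parameter count shows why the name is not simple multiplication. The sketch below assumes the hyperparameters published for Mixtral 8x7B (model width 4096, 32 layers, expert hidden size 14336, 32 query heads and 8 key/value heads of size 128, a 32000-token vocabulary, 8 experts, 2 used per token). Treat these values as assumptions to check against the official release, and the tally as an approximation.

```python
# Assumed hyperparameters, as published for Mixtral 8x7B.
dim, n_layers, hidden_dim = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab_size, n_experts, top_k = 32000, 8, 2

# Attention (shared by every token): query/output projections use all heads,
# key/value projections use the smaller number of KV heads.
attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
# One expert is a gated feed-forward network with three weight matrices.
expert = 3 * dim * hidden_dim
router = dim * n_experts  # linear router: one score per expert
norms = 2 * dim           # two normalization layers per block

# Shared parameters: per-layer pieces, input and output embeddings, final norm.
shared = n_layers * (attn + router + norms) + 2 * vocab_size * dim + dim
total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
# total: 46.7B, active per token: 12.9B
```

Almost all of the parameters sit in the experts, which is why using two of eight per layer cuts the active count so sharply.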
An Intuitive Analogy
Imagine a hospital with:
- 100 doctors total
- only 2–3 doctors assigned to each patient
The hospital has enormous expertise overall, but any individual case only uses the specialists relevant to it.
A dense model is like asking all 100 doctors to examine every patient.
An MoE model is like using a triage system to select the right specialists.
The Key Insight
The innovation behind MoE is:
Increase model capacity dramatically while keeping the amount of computation per token relatively small.
In rough terms:
Total parameters → very large
Active parameters per token → much smaller
This allows modern AI systems to scale to hundreds of billions or even trillions of parameters without requiring proportional computation for every token.