Agent Readout
# Mixture of Experts at Inference Time
How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
- Author: General Compute
- Published: 2026-05-03
- Tags: mixture of experts, moe, inference, routing, sparse models, serving
A 671B-parameter model that runs at the speed of a 37B-parameter model. That is roughly the pitch of DeepSeek V3, and Mixtral 8x22B, and Llama 4 Maverick, and most of the other large models that have shown up in the last year. They are all Mixture of Experts (MoE) architectures, and the trick they share is that only a small fraction of the parameters fire on any given token. The rest sit in memory unused for that step.

This makes MoE attractive for inference: you get the quality of a much larger model without paying the per-token compute cost. The trade-offs show up in different places: in memory bandwidth, in routing overhead, and in how you shard the model across GPUs. This post walks through what MoE actually does at inference time, how the routing decision works, and what the shape of an MoE serving deployment looks like compared to a dense model of similar quality.

## The basic shape of an MoE layer

A standard transformer block has self-attention followed by a feed-forward network (FFN, usually called the MLP). In a dense model, every token goes through the same FFN, which is a pair of large linear projections with an activation in between. An MoE block replaces that single FFN with a set of N FFNs, called experts, plus a small router network that decides which experts each token uses.

For each token, the router picks the top-k experts (commonly k=1 or k=2), runs the token through only those experts, and combines the outputs. The other N-k experts are not touched for that token. So if you have 8 experts and pick the top 2, you activate 2/8 = 25% of the FFN parameters per token. If you have 256 experts and pick the top 8 (DeepSeek V3's setup), you activate roughly 3% of the FFN parameters per token.

The attention layers remain dense, so the savings only apply to the FFN portion of the model, but in modern LLMs the FFN is the bulk of the parameter count. This is why DeepSeek V3 has 671B total parameters but only 37B activated per token. The 37B is what you actually compute on; the 671B is what has to be in memory.

## How the router actually decides

The router is a small network, usually a single linear layer that maps the token's hidden state to a logit per expert. Take the top-k logits, apply softmax to them, and you have a set of routing weights for that token's chosen experts. In torch-style pseudocode, for a single token with hidden state `x`:

```python
gate_logits = router(x)                                     # shape: [num_experts]
top_k_logits, top_k_indices = torch.topk(gate_logits, k=2)  # values and indices of the top k
top_k_weights = torch.softmax(top_k_logits, dim=-1)
```

Then the output is a weighted sum of the chosen experts' outputs:

```python
output = sum(top_k_weights[i] * experts[top_k_indices[i]](x) for i in range(k))
```

Each token in a batch can route to a different combination of experts. Token 0 might go to experts 3 and 7, token 1 to experts 1 and 4, token 2 back to expert 3 paired with expert 0. There is no shared routing across the batch.

This per-token routing is what makes MoE serving more complicated than dense serving. The work is no longer a uniform matmul over the batch. It is a scatter-gather: send each token to its chosen experts, run the experts, gather the outputs back in the original order.

## Why MoE is faster per token

For the FFN computation, an MoE model with k=2 active experts out of 8 does roughly 1/4 of the FLOPs of a dense model with the same total FFN parameter count. The compute savings are linear in the activation ratio. This matters a lot at decode time.
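To make the dispatch concrete before getting into decode and prefill, here is the whole layer as a deliberately naive PyTorch sketch. The sizes are picked for illustration rather than taken from any real model, and the plain loop over experts stands in for what production kernels do with fused permutation and grouped matmuls; the data flow is the same.

```python
import torch
import torch.nn as nn

class NaiveMoE(nn.Module):
    """Top-k routed FFN: each token passes through only its k chosen experts."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: [num_tokens, d_model]
        logits = self.router(x)                              # [num_tokens, num_experts]
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)              # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = torch.where(indices == e)       # which tokens picked expert e
            if token_idx.numel() == 0:
                continue                                      # this expert sees no tokens in this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

moe = NaiveMoE()
tokens = torch.randn(16, 1024)        # a batch of 16 token hidden states
print(moe(tokens).shape)              # torch.Size([16, 1024])
```

The loop makes the serving problem visible: each token only ever touches its top-k expert FFNs, but how much work each expert gets depends entirely on how the router splits the batch.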
Decode is mostly a memory-bandwidth-bound operation for large models, but the FFN matmuls still take real wall-clock time. Cutting them by 4x or 30x (depending on the activation ratio) is a meaningful speedup.

For prefill, the savings are similar but the picture changes. Prefill processes many tokens at once, so the matmuls are larger and more compute-bound. The router has to dispatch each prefill token to its experts, which gives you very irregular work per expert. Some experts get many tokens, some get few. This load imbalance is where most of the implementation difficulty in MoE serving lives.

## Memory does not get smaller

Here is the catch that surprises people. The compute is sparse. The memory footprint is not.

To serve an MoE model, all the experts have to be in GPU memory, ready to be called on. You do not know in advance which experts a token will route to, so you cannot leave any of them on disk or in CPU memory without paying a load latency penalty. A 671B-parameter MoE model in FP8 takes about 671 GB of GPU memory for weights, the same as a 671B dense model would.

This means MoE models are large in memory but small in per-token compute. The arithmetic intensity (FLOPs per byte of weights read) goes down, because each expert's weights have to be read even though only a fraction of the batch's tokens are computed against them. For decode, where you are already memory-bandwidth-bound, this can hurt: you might be reading the routed experts' weights at full bandwidth and not getting any speedup from the sparsity, because the bottleneck moved.

In practice, MoE models still serve faster than dense models of comparable quality, because the dense equivalent would need many more parameters to match performance. A 37B-active MoE often matches or beats a 70B dense model. So you compare 671B memory at 37B compute against 70B memory at 70B compute, and the MoE wins on per-token speed even though it loses on total memory.

## Expert parallelism

When the model is too large for one node, you have to shard the experts across GPUs. The natural way to do this is expert parallelism (EP): each GPU holds a subset of the experts. With 64 experts across 8 GPUs, each GPU holds 8 experts.

Now routing becomes a network operation. Each token has to be sent to whichever GPU holds its chosen expert, run through the expert there, and the result has to come back. This is an all-to-all communication: every GPU has tokens going to every other GPU's experts.

The all-to-all is the dominant cost in expert-parallel MoE serving. On NVLink within a node, it is fast. Across nodes over InfiniBand, it is much slower, and tuning the all-to-all becomes one of the main things separating a fast MoE serving stack from a slow one. Libraries like DeepEP and the all-to-all kernels in MegaBlocks exist specifically to make this efficient.

EP combines with tensor parallelism (TP) and pipeline parallelism (PP) for very large models. A typical shape for a 671B MoE on 16 H100s might be EP=8, TP=2, PP=1: each pair of GPUs runs TP across the dense parts (attention, router), and each group of 8 holds the experts split across them.

## Load balancing and the imbalance problem

The router is trained to spread tokens roughly evenly across experts, but at inference time the distribution is whatever the router picks for the current input. If most tokens in a batch route to expert 3, then GPU 0 (which holds expert 3) is doing most of the work and the other GPUs are idle. The all-to-all bandwidth is also imbalanced, because all the tokens are flowing toward one GPU.
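One way to see that imbalance on a concrete batch is to histogram the router's choices. This is a standalone diagnostic sketch, not code from any serving framework; with random logits the counts come out roughly uniform, but with a real model and a real workload they often do not.

```python
import torch

def expert_load(gate_logits: torch.Tensor, top_k: int, num_experts: int) -> torch.Tensor:
    """Count how many tokens in a batch are routed to each expert."""
    _, indices = torch.topk(gate_logits, top_k, dim=-1)      # [num_tokens, top_k]
    return torch.bincount(indices.flatten(), minlength=num_experts)

gate_logits = torch.randn(4096, 64)                           # 4096 tokens, 64 experts (illustrative)
counts = expert_load(gate_logits, top_k=2, num_experts=64)
print(counts.max().item(), counts.min().item())               # busiest vs. idlest expert
print((counts.float().std() / counts.float().mean()).item())  # relative spread across experts
```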
Two common mitigations:

- **Capacity factor**: cap the number of tokens per expert at some multiple of the average. If too many tokens want expert 3, the lowest-priority ones get bumped to their second-choice expert. This caps the worst-case latency at the cost of some quality.
- **Drop and reroute**: similar idea, but the dropped tokens skip the expert layer entirely (a no-op replaces their FFN computation). Easier to implement, slightly worse for quality.

For inference, neither is great. Both add complexity and slightly degrade output quality. The current best practice is to make the all-to-all kernel fast enough that imbalance does not matter much, and to use a router with explicit balance-aware logic at training time.

## Shared experts and fine-grained MoE

DeepSeek's architecture introduced a wrinkle that has been adopted by several follow-up models: shared experts that are always active, plus routed experts that are picked per token. So instead of the FFN being entirely a routing decision, a portion of it is always computed (the shared expert handles common patterns) and the routed experts add specialization on top. This stabilizes training and makes the routing decisions less load-bearing.

From an inference perspective, the shared expert is a normal dense FFN computation, and the routed experts add the MoE machinery on top. Total compute per token goes up slightly compared to pure top-k routing, but the quality per FLOP improves.

DeepSeek V3 also uses fine-grained MoE: instead of 8 large experts, it has 256 small experts and routes to 8 of them. Each individual expert is smaller, so the routing decision is more granular, and the activation ratio drops. The network is doing more bookkeeping per token, but each piece of bookkeeping touches less compute.

Fine-grained routing puts more pressure on the all-to-all. With 256 experts, each token's eight chosen experts are spread across more GPUs in a typical EP layout, so the communication pattern is denser. The DeepSeek paper spent significant effort on the kernel implementations to make this work at production speeds.

## What changes for the inference stack

Compared to serving a dense model, an MoE serving stack has to handle:

- A routing decision per token, per layer. The router is small, but it runs on every token and has to be efficient.
- Token dispatch and gather kernels. The fused permutation kernels (like the ones in vLLM, SGLang, and TensorRT-LLM) are critical, because the naive scatter-gather is slow.
- All-to-all communication when expert parallelism is used. This needs to overlap with compute as much as possible.
- Variable per-expert workloads. Some experts get more tokens than others within a batch, and the kernel has to handle that without serializing.
- Memory layout choices for the expert weights. Some implementations store them as one big tensor with strided access, others as separate per-expert tensors, with different cache and bandwidth implications.

For most users, this is hidden behind the serving framework. You ask vLLM or TensorRT-LLM to serve Mixtral or DeepSeek V3, and the framework handles the routing. But if you are debugging slow MoE serving, the usual suspects are the all-to-all (when EP is used), the dispatch kernel (when batch sizes are awkward), and load imbalance (when the router routes badly for your workload).

## Where MoE fits

For serving, MoE is increasingly the default for very large models. The compute savings make it possible to run a model with the quality of a 100B dense model at the per-token speed of a 30B-active one.
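To put rough numbers on the memory-versus-compute trade, here is the kind of back-of-the-envelope arithmetic involved. The sizes are the illustrative ones used earlier in the post (671B total / 37B active against a 70B dense comparator), not exact published figures, and the FLOP count uses the usual rough estimate of ~2 FLOPs per active parameter per token.

```python
def footprint(total_params_b: float, active_params_b: float, bytes_per_param: float = 1.0):
    """Weight memory in GB (FP8 ~ 1 byte/param) and rough per-token forward GFLOPs."""
    weight_mem_gb = total_params_b * bytes_per_param
    gflops_per_token = 2 * active_params_b      # params are in billions, so this is GFLOPs
    return weight_mem_gb, gflops_per_token

moe_mem, moe_flops = footprint(total_params_b=671, active_params_b=37)
dense_mem, dense_flops = footprint(total_params_b=70, active_params_b=70)
print(f"MoE:   ~{moe_mem:.0f} GB of weights, ~{moe_flops:.0f} GFLOPs per token")
print(f"Dense: ~{dense_mem:.0f} GB of weights, ~{dense_flops:.0f} GFLOPs per token")
```

The MoE holds nearly ten times the weights but does about half the per-token compute, which is the same memory-versus-speed comparison made earlier.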
The memory cost is real but mostly affordable on the multi-GPU nodes that are needed for these models anyway.

For smaller models (under 30B total), dense is usually still the right answer. The routing overhead and the implementation complexity are not worth it when the dense model already fits in one or two GPUs and runs quickly.

The interesting middle ground is models in the 50B to 200B parameter range, where MoE versus dense is a genuine architecture choice. Here the trade-off depends on your workload: latency-sensitive serving with small batches favors dense (no routing overhead, no all-to-all); throughput-oriented serving with large batches and many concurrent requests favors MoE (the all-to-all amortizes well over batch size, and the per-token compute savings stack).

## Closing

The fundamentals are straightforward. MoE replaces a single FFN with a routed set of FFNs, only a few of which fire per token. You save compute, you spend memory, and you take on some new infrastructure complexity around routing and all-to-all communication. For models large enough that compute would otherwise be the binding constraint, the trade is worth it. For smaller models, dense is simpler and just as fast.

If you want to serve MoE models without setting up the EP topology, the dispatch kernels, and the all-to-all tuning yourself, General Compute runs models like DeepSeek V3 and Llama 4 Maverick on inference hardware where the routing infrastructure is already in place. Same OpenAI-compatible API as any other model. Try it at [generalcompute.com](https://generalcompute.com).