
Tensor Parallelism vs Pipeline Parallelism for Model Serving

General Compute

Once a model stops fitting on a single GPU, you have to split it across several. There are a handful of ways to do that, but for inference, the two that matter are tensor parallelism and pipeline parallelism. They look superficially similar (both shard a large model across multiple devices) but they have very different performance profiles, and picking the wrong one for your workload can cost you a factor of two or more on either latency or throughput.

This post walks through what each one does mechanically, how communication patterns change the bandwidth requirements, and how to decide between them (or, more often, how to combine them) for a real serving deployment.

Why a single GPU is sometimes not enough

A 70B-parameter model in FP16 is 140 GB of weights. An H100 has 80 GB of HBM. The arithmetic does not work, even before you account for the KV cache, activations, and the workspace memory the kernels need. You either quantize aggressively, or you split the model.
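For a quick sanity check, the arithmetic is simple enough to sketch. The overhead factor below is illustrative, not measured; real deployments size KV cache and workspace from profiling.

```python
import math

def min_gpus_for_weights(n_params, bytes_per_param=2, gpu_hbm_gb=80, overhead=1.2):
    """Back-of-the-envelope GPU count needed just to hold the weights.

    `overhead` is a rough, illustrative fudge factor for KV cache, activations,
    and kernel workspace; it is not a measured number.
    """
    weight_gb = n_params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb * overhead / gpu_hbm_gb)

weight_gb, n_gpus = min_gpus_for_weights(70e9)  # 70B params, FP16, 80 GB HBM per GPU
print(f"{weight_gb:.0f} GB of weights -> at least {n_gpus} GPUs")  # 140 GB -> 3 GPUs
```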

Splitting also helps when you have memory headroom but not enough compute. A 13B model fits on one GPU, but if you want to serve it with a 1,000 ms time-to-first-token budget on 32K-token prompts, a single GPU might not have the FLOPs. Spreading the work across several GPUs can pull latency down even when memory is not the binding constraint.

Tensor parallelism and pipeline parallelism are the two main answers. They are not exclusive. Most large-model deployments use both at once.

What tensor parallelism actually does

Tensor parallelism (TP) splits the work inside each layer across GPUs. Take a linear projection that maps a hidden vector of size H to an output of size O. The weight matrix is H by O. With TP across N GPUs, you cut the matrix along the output dimension, so each GPU holds H by O/N weights and computes its own slice of the output. After the projection, you either gather the slices (for a column split like this one) or all-reduce partial sums (for a split along the input dimension) to recover the full result.

Megatron-LM popularized a particular pattern for transformer blocks. The QKV projection is sharded along the head dimension, so each GPU owns a subset of attention heads. The attention computation runs locally on those heads. The output projection is sharded along the input dimension, which means each GPU produces a partial sum, and an all-reduce at the end collapses those partials into the final output. The MLP follows the same column-then-row pattern: the first linear is sharded column-wise (no communication needed before the activation), the second is sharded row-wise (one all-reduce after).
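A minimal sketch of the MLP half of that pattern, written against torch.distributed (process-group setup is omitted and the class name is mine; Megatron-LM and the serving engines implement this with fused kernels and careful weight initialization):

```python
import torch
import torch.distributed as dist

class ShardedMLP(torch.nn.Module):
    """Column-parallel first linear, row-parallel second linear, one all-reduce.

    Each rank holds a 1/tp_size slice of both weight matrices. No communication
    is needed between the two matmuls; the partial outputs are summed at the end.
    """
    def __init__(self, hidden: int, ffn: int, tp_size: int):
        super().__init__()
        self.up = torch.nn.Linear(hidden, ffn // tp_size, bias=False)    # column shard
        self.down = torch.nn.Linear(ffn // tp_size, hidden, bias=False)  # row shard

    def forward(self, x):
        h = torch.nn.functional.gelu(self.up(x))        # local: this rank's slice of the FFN
        partial = self.down(h)                          # local partial sum over the sharded dim
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # one of the block's two all-reduces
        return partial
```

The attention half follows the same shape: the QKV projection is the column-parallel piece (sharded by head), the output projection is the row-parallel piece, and the block's second all-reduce happens there.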

Two all-reduces per transformer block. That is the cost of tensor parallelism, and it is paid on every forward pass, every layer, every token.

Because the all-reduces happen inside the block, they are on the critical path for that token's compute. You cannot hide them behind other work the way pipeline parallelism can. The interconnect between GPUs has to be fast enough that the all-reduce does not stall the matmuls.

This is why TP almost always lives within a single node. NVLink between H100s in the same DGX gives each GPU about 900 GB/s of aggregate bandwidth (roughly 450 GB/s per direction). PCIe 5.0 x16 gives you 64 GB/s on a good day. If you try to do TP across PCIe or, worse, across an InfiniBand fabric between nodes, the all-reduce latency dominates and you lose more than you gain.
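A rough way to see the gap: a ring all-reduce moves about 2·(N−1)/N times the message size over the slowest link, so the per-layer cost scales with activation size divided by link bandwidth. A minimal sketch, using nominal peak bandwidths rather than measured effective ones, and ignoring the latency terms that dominate at small decode batches:

```python
def allreduce_time_us(batch_tokens, hidden, tp, link_gbps, bytes_per_elem=2):
    """Approximate one activation all-reduce, in microseconds (ring algorithm)."""
    msg_bytes = batch_tokens * hidden * bytes_per_elem
    wire_bytes = 2 * (tp - 1) / tp * msg_bytes   # ring all-reduce traffic per GPU
    return wire_bytes / (link_gbps * 1e9) * 1e6

# 4K-token prefill, hidden size 8192, TP=8: two of these per transformer layer.
print(allreduce_time_us(4096, 8192, 8, link_gbps=450))  # NVLink-class link: ~260 us
print(allreduce_time_us(4096, 8192, 8, link_gbps=64))   # PCIe-class link:  ~1800 us
```

Multiply by two all-reduces per layer and 80 layers, and the PCIe-class number rivals or exceeds the matmul time itself.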

The practical limit on TP is the number of GPUs in one NVLink domain. On a standard 8-way H100 node, that is 8. On systems with NVLink switches and larger NVL domains, it can go higher. Beyond that, you usually run out of useful interconnect.

What pipeline parallelism actually does

Pipeline parallelism (PP) splits the model across layers, not within them. Suppose you have an 80-layer model and 4 GPUs. Pipeline parallelism puts layers 1 to 20 on GPU 0, 21 to 40 on GPU 1, 41 to 60 on GPU 2, and 61 to 80 on GPU 3. A request flows through the GPUs in sequence: GPU 0 processes its 20 layers, sends the activation to GPU 1, GPU 1 processes its 20 layers, and so on.
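A minimal sketch of the stage assignment, assuming a uniform split; real systems often rebalance stages by measured per-layer cost and by which stage also hosts the embedding and LM head:

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Assign a contiguous block of layers to each pipeline stage."""
    per_stage, rem = divmod(n_layers, n_stages)
    stages, start = [], 0
    for s in range(n_stages):
        size = per_stage + (1 if s < rem else 0)  # spread any remainder over the first stages
        stages.append(range(start, start + size))
        start += size
    return stages

print(split_layers(80, 4))  # [range(0, 20), range(20, 40), range(40, 60), range(60, 80)]
```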

The communication between stages is small compared to TP. You only send the activation tensor for the boundary between stages, which is a single hidden state per token. On a 4K-token prompt with hidden size 8192 in FP16, that is 64 MB per stage boundary. That is a single point-to-point send, not a collective, and it can run over a slower interconnect without much penalty.
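At a stage boundary this is just a matched send and receive, not a collective. A minimal sketch with blocking torch.distributed calls for clarity (real schedulers use async ops, pre-allocate receive buffers, and overlap the transfer with compute):

```python
import torch.distributed as dist

def run_stage(layers, x, stage, n_stages):
    """Run one pipeline stage's layers, then hand the activation to the next stage.

    On stages after the first, `x` is a pre-allocated buffer with the right shape
    and dtype; recv fills it in place.
    """
    if stage > 0:
        dist.recv(x, src=stage - 1)      # activation arriving from the previous stage
    for layer in layers:
        x = layer(x)
    if stage < n_stages - 1:
        dist.send(x, dst=stage + 1)      # e.g. 4096 x 8192 in FP16 is ~64 MB
    return x
```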

The catch with pipeline parallelism is the bubble. If you only have one request in flight, GPU 0 is busy for the first chunk of time, then idles while GPU 1 works, then GPU 2, then GPU 3. Three out of four GPUs are doing nothing at any given moment. That is terrible utilization.

The standard fix is microbatching. Split a batch of requests into microbatches and pipeline them. While GPU 1 processes microbatch 1, GPU 0 starts on microbatch 2. With enough microbatches in flight, all the GPUs stay busy most of the time. There is still a startup bubble at the front of the pipeline and a drain bubble at the end, but the steady state is high utilization.
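For a simple fill-and-drain (GPipe-style) schedule, the idle fraction works out to roughly (stages − 1) / (microbatches + stages − 1), so more microbatches amortize the bubble:

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle fraction of a fill-and-drain pipeline schedule (ignores uneven stages)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

for m in (1, 4, 16, 64):
    print(m, f"{bubble_fraction(4, m):.0%}")  # 1 -> 75%, 4 -> 43%, 16 -> 16%, 64 -> 4%
```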

For training, this is well understood. For inference, it is more subtle, because requests come in at different times and have different lengths.

Why pipeline parallelism is awkward for low-latency inference

In a training step, you decide on a batch and run it through the pipeline. There is no real-time constraint. The bubble matters for throughput, but every microbatch eventually completes.

In serving, two things complicate pipeline parallelism.

First, time-to-first-token includes the full pipeline depth. A request has to traverse every stage before the first token comes out. If each stage takes 50 ms on its share of prefill, a 4-stage pipeline gives you a 200 ms TTFT just from pipeline traversal. You do not get the speedup that tensor parallelism gives, where every GPU contributes to the same prefill in parallel.

Second, decode is sequential by nature. Each generated token depends on the previous one. So during decode, the pipeline runs one token at a time through the whole pipeline before the next token can start. A 4-stage pipeline is not 4 times faster at decode; it is roughly the same speed as a single GPU (assuming the same per-stage compute), because each token waits for the full pipeline traversal.

The fix during decode is to have many concurrent requests, so the pipeline stays full of work even though each individual request only sees one token at a time. Continuous batching helps a lot here. As soon as one request's decode step finishes at a stage, another request's step starts there, so the pipeline stays full of decode work from different requests at different stages.

The result: pipeline parallelism is good for throughput when you have many concurrent requests, and bad for latency when you do not. Tensor parallelism is the opposite: latency stays low even at small batch sizes, but it scales poorly past one node.

Memory and weights, not just compute

The split also affects how the model fits in memory. With TP across 8 GPUs, each GPU holds 1/8 of every weight matrix. The model is uniformly distributed. If you want to add another transformer layer, every GPU has to find space for its share.

With PP across 4 stages, each stage holds a quarter of the layers. The split is by layer, not by tensor. Adding more layers means putting them on whichever stage has room. This is sometimes useful for unbalanced models or for fitting in heterogeneous hardware, where you have, say, one 80 GB GPU and three 40 GB GPUs.

KV cache memory works differently in the two regimes. With TP, the KV cache is also sharded across the heads, so each GPU stores 1/N of the per-token KV. With PP, each GPU stores the full KV for its layers. For long-context workloads, this matters. A 128K-context request with 80 layers needs to keep KV for all 80 layers somewhere. PP spreads that across stages naturally; TP keeps each layer's KV on the same set of GPUs.
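To put numbers on it, here is a sketch of the KV footprint for one long-context request, using a hypothetical 80-layer model with 8 grouped-query KV heads of dimension 128 (full multi-head attention without GQA would be several times larger):

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache for one request: keys + values, across all layers."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

total = kv_cache_gb(tokens=128 * 1024, layers=80, kv_heads=8, head_dim=128)
print(f"{total:.0f} GB total for one request")   # ~43 GB
print(f"{total / 8:.1f} GB per GPU under TP=8")  # every GPU holds 1/8 of every layer's KV
print(f"{total / 4:.1f} GB per GPU under PP=4")  # each stage holds full KV for its 20 layers
```

The per-GPU share is similar when the split is uniform; what differs is which axis is sharded, and therefore where the cache lands as you add layers, heads, or context.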

Combining them: 2D parallelism

Most serving deployments above one node use both. A common shape is TP=8 within a node, PP=N across nodes. The 8 GPUs in a node share NVLink and run tensor parallelism over the high-bandwidth fabric. The pipeline stages run across the slower inter-node InfiniBand links, where the small point-to-point activation transfers do not stall.

This 2D parallelism gives you the latency benefits of TP for the compute inside each pipeline stage, and the scalability benefits of PP for going beyond one node's worth of memory. The bubble cost of PP is manageable because you only have a few stages, and continuous batching keeps the pipeline full.

For a 405B-parameter model, you might run TP=8, PP=2 on two 8-GPU nodes, totaling 16 GPUs. For a 1T-parameter MoE model, TP=8, PP=4 across four nodes is a common shape. The exact numbers depend on context length, batch size, and what you are optimizing for, but the pattern is consistent: TP within nodes, PP across them.
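In practice these shapes are just two knobs in the serving engine. As an illustration, vLLM's offline API exposes them roughly like this; the argument names follow vLLM's tensor_parallel_size and pipeline_parallel_size options, the model name is illustrative, and multi-node runs also need a distributed launcher such as Ray, so check the docs for your version:

```python
from vllm import LLM, SamplingParams

# TP=8 within each node, PP=2 across two 8-GPU nodes: 16 GPUs total.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
```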

Practical decision rules

A few rough heuristics that hold up in production (sketched in code after the list):

  • If your model fits on one GPU and you want lower latency, you do not need either. Just serve it on one GPU. Multi-GPU inference always has overhead.
  • If the model fits in one node but not one GPU, use TP across the GPUs in the node. Skip pipeline parallelism; it adds latency without benefit.
  • If the model is too big for one node, use TP within nodes and PP across them. Set TP equal to the number of GPUs per node, and pick PP based on the total weights and KV memory you need.
  • If your workload is heavily latency-sensitive at low batch sizes (voice agents, coding agents with short prompts), favor more TP and less PP. The bubble cost dominates at low batch.
  • If your workload is throughput-oriented at high batch sizes (offline batch jobs, bulk RAG), more PP is fine, sometimes preferable, because the pipeline stays full and the per-request latency can be amortized.
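Folded into code, the sizing part of those rules looks roughly like this. The headroom multiplier is illustrative; real sizing should come from measured KV and activation footprints for your context lengths and batch sizes.

```python
import math

def choose_parallelism(weight_gb, gpu_hbm_gb=80, gpus_per_node=8, headroom=1.5):
    """Very rough TP/PP shape picker following the heuristics above."""
    needed_gb = weight_gb * headroom              # weights plus room for KV cache and activations
    node_gb = gpu_hbm_gb * gpus_per_node
    if needed_gb <= gpu_hbm_gb:
        return {"tp": 1, "pp": 1}                 # fits on one GPU: no parallelism
    if needed_gb <= node_gb:
        return {"tp": gpus_per_node, "pp": 1}     # fits in one node: TP only
    return {"tp": gpus_per_node,                  # too big for a node: TP in-node,
            "pp": math.ceil(needed_gb / node_gb)} # PP across enough nodes to fit

print(choose_parallelism(140))  # 70B in FP16  -> {'tp': 8, 'pp': 1}
print(choose_parallelism(810))  # 405B in FP16 -> {'tp': 8, 'pp': 2}
```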

The wrong answer is usually doing PP across PCIe within a node, or TP across InfiniBand between nodes. Both mismatch the chosen scheme's communication pattern with the available bandwidth, and you lose throughput, latency, or both.

Sequence parallelism and other variants

There are extensions to this two-axis picture. Sequence parallelism splits along the sequence dimension to reduce activation memory inside TP. Expert parallelism, used in MoE models, shards the experts across GPUs in a way that overlaps with TP and PP. Context parallelism (with implementations such as DeepSpeed Ulysses and Ring Attention) shards the attention computation across the sequence axis, which is critical for very long contexts.

For most workloads under 128K context with dense models, you do not need to think about these. Plain TP within a node and PP across nodes is enough. When you start serving 1M-token contexts or trillion-parameter MoE models, the picture gets more complicated, and the trade-offs shift.

Where this fits in a serving stack

vLLM, TGI, TensorRT-LLM, and SGLang all support TP out of the box. PP support is more uneven, and the interaction with continuous batching is where implementations differ the most. If you are choosing a stack for multi-node inference, the quality of the pipeline scheduler matters more than the quality of the kernels, because the pipeline scheduler is what determines whether the bubble eats your throughput.

For most users, this is invisible. You set TP=8 in a config, the runtime handles the rest, and you get the model served. But when something is slow, knowing which axis of parallelism is paying for what helps you debug it. A slow TTFT is often a TP problem. Low GPU utilization with high latency is often a PP scheduling problem. Memory pressure on one GPU but not others usually means an uneven pipeline split.

Closing

Tensor parallelism and pipeline parallelism solve overlapping problems with different trade-offs. TP gives you low latency and good utilization, but it needs fast interconnect and stops scaling past one node. PP scales as far as you have nodes and bandwidth, but it adds latency and only works well with enough concurrent requests to keep the pipeline full.

The combination, TP inside nodes and PP between them, is the default for serving anything bigger than a single node can hold. The remaining work is tuning the exact shape to your workload, which mostly comes down to whether you are optimizing for tail latency on small batches or throughput on large ones.

If you want to skip the parallelism tuning entirely, General Compute serves these models on custom inference hardware where the parallelism strategy is already chosen for you. Same OpenAI-compatible API, no config files to tune. Try it at generalcompute.com.
