inference · compilers · deep-dive

Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA

General Compute

If you have ever profiled a transformer forward pass, you have probably noticed that the model spends a surprising amount of time doing nothing useful. A small reshape here, a kernel launch there, a memory copy because two operators disagree on layout. The math is fine. The problem is everything around the math.

Compilers exist to fix this. TorchInductor, Triton, and XLA all sit between high-level model code and the hardware, and all three try to remove the same kind of waste: unnecessary launches, unnecessary memory traffic, and unnecessary precision. They take different paths to get there. This post walks through what each one does, where they overlap, and what an inference engineer should actually expect when they enable them.

What "compiler" means in this context

There are two compilers in any deep learning stack. There is the one that ships with the GPU vendor (nvcc for CUDA, ROCm's compiler for AMD), which turns C++ kernel code into machine instructions. Then there is the higher-level ML compiler, which turns a graph of operators into a sequence of those kernels. When people say "compile your model," they almost always mean the second one.

The ML compiler has three jobs:

  1. Trace the graph. Capture the operations the model is doing, including their shapes and dtypes, into a representation it can manipulate.
  2. Rewrite the graph. Fuse operators together, eliminate dead code, pick layouts, and choose algorithms that match the hardware.
  3. Generate kernels. Emit code that the GPU vendor's compiler can compile down to actual instructions.

Where the three frameworks differ is in how aggressively they do each step, and how much escape hatch they give you when their defaults are wrong.

TorchInductor: the default that ships with torch.compile

TorchInductor is the backend behind torch.compile in PyTorch 2.x. When you write model = torch.compile(model), Inductor is what runs.

Its design choice is to lean on Triton for kernel generation rather than reinventing CUDA codegen. Inductor takes a TorchDynamo-traced FX graph, lowers it through its own intermediate representation, applies a long list of fusion and simplification passes, and then emits Triton kernels for GPU and C++ kernels for CPU. The Triton kernels handle the parts that benefit from autotuning. The C++ side handles glue code and reductions that are easier to express in scalar form.
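
To make that concrete, here is a minimal sketch of what the entry point looks like in user code. The function and shapes are made up for illustration; the point is that all of the fusion decisions happen behind the single torch.compile call.

import torch
import torch.nn.functional as F

def mlp_block(x, w, residual):
    # A matmul followed by pointwise work that Inductor can fuse into one kernel.
    return F.silu(x @ w) + residual

compiled_block = torch.compile(mlp_block)  # Inductor is the default backend

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
out = compiled_block(x, w, x)  # first call triggers tracing, lowering, and codegen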

For inference, the wins come from a few places:

  • Pointwise fusion. Activations like SiLU after a matmul, residual adds, layernorm scaling: all collapse into a single kernel. A typical decoder block might go from 15 launches to 4 or 5.
  • Reduction fusion. Softmax, layernorm, and RMSNorm fuse with whatever pointwise operations sit on either side, which means the intermediate tensors never leave registers.
  • Buffer reuse. The IR tracks which tensors are still needed, and reuses memory aggressively. For long context inference where activations are huge, this matters.
  • Autotuning. For matmul-shaped operations on supported configs, Inductor will benchmark a handful of Triton configurations at compile time and pick the best one for your shape.

The catch is the compile time itself. The first call into a compiled model can take 30 seconds to several minutes, especially with autotuning enabled. For batch inference servers with stable shapes this is fine, since it amortizes. For agent loops with variable input lengths it can be painful, because shape changes trigger recompilation. The cure is dynamic=True, which tells Inductor to specialize on a few size buckets rather than every concrete shape, but you give up some peak throughput in exchange.
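
A minimal sketch of both knobs, assuming a hypothetical build_decoder() model whose sequence length varies between requests:

import torch

model = build_decoder().cuda().eval()   # hypothetical model constructor

# dynamic=True asks Inductor to compile against symbolic sizes rather than
# specializing on every concrete shape it encounters.
compiled = torch.compile(model, dynamic=True)

# A finer-grained option is to mark only the dimension that actually varies,
# here the sequence dimension of the token ids.
tokens = torch.randint(0, 32000, (1, 517), device="cuda")
torch._dynamo.mark_dynamic(tokens, 1)
out = compiled(tokens)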

The other catch is that Inductor still relies on the operator library underneath it. If you are calling FlashAttention through torch.nn.functional.scaled_dot_product_attention, Inductor does not generate the attention kernel itself. It dispatches to the FlashAttention implementation that PyTorch ships, and your speedup from compilation comes from everything around the attention call, not the call itself.
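
A sketch of that boundary, with illustrative shapes and names; the speedup from compiling this comes from the pointwise work, not from the attention call itself:

import torch
import torch.nn.functional as F

def attention_block(q, k, v, residual):
    # Dispatches to the fused attention kernel PyTorch ships; Inductor
    # does not generate or rewrite this kernel.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # The surrounding pointwise work is what Inductor gets to fuse.
    return F.silu(out + residual)

compiled = torch.compile(attention_block)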

Triton: the kernel language that everyone is building on

Triton is the layer below Inductor, but it is also a language people use directly. It was designed to be the middle ground between writing CUDA by hand and waiting for a compiler to generate something good. You write Python that looks like NumPy but operates on blocks of values. Triton's compiler turns those blocks into the warp-level scheduling, shared memory layouts, and load patterns that a CUDA programmer would otherwise tune by hand.

A simple Triton kernel looks like this:

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
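
And a host-side wrapper to launch it, continuing from the kernel above. The block size of 1024 is just a reasonable default here, not a tuned choice:

import torch

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Assumes contiguous tensors so flat pointer arithmetic is valid.
    out = torch.empty_like(x)
    n = x.numel()
    BLOCK = 1024
    # One program instance per BLOCK-sized chunk of the flattened tensors.
    grid = (triton.cdiv(n, BLOCK),)
    add_kernel[grid](x, y, out, n, BLOCK=BLOCK)
    return out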

That looks trivial, and for elementwise ops it is. The reason Triton matters is that the same style scales up to attention kernels, fused MoE routing, quantized matmuls, and custom paged-KV operations. FlashAttention 2, vLLM's paged attention kernel, and a large fraction of the custom kernels in modern inference servers are written in Triton, not CUDA.

For inference work, Triton hits a particular sweet spot. You can prototype a fused kernel in a day, autotune it across a handful of block sizes, and ship something that gets within 10 to 20% of a hand-tuned CUDA kernel. That gap matters at frontier scale, but for most teams the speed of iteration is more valuable than the last 10 to 20% of throughput.
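
Autotuning is a decorator in Triton. A sketch of what it looks like on the add kernel from earlier, with the configs chosen arbitrarily; Triton benchmarks each one the first time it sees a new value of n and caches the winner:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
        triton.Config({"BLOCK": 4096}, num_warps=8),
    ],
    key=["n"],  # re-tune whenever the problem size changes
)
@triton.jit
def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)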

Triton has limits. It does not support the full range of warp-specialized features that very recent NVIDIA hardware exposes (TMA, async copies in their newest forms, certain Hopper features), and that gap reopens whenever a new GPU generation lands. The Triton team usually catches up within a few months, but if you need day-zero performance on a brand new accelerator, you are still writing CUDA.

XLA: the graph-first approach

XLA started inside TensorFlow and is now the compiler underneath JAX, the TPU stack, and several other projects. It takes a different philosophy from Inductor. Instead of fusing operators opportunistically based on local patterns, XLA wants the entire computation handed to it as a static graph, and then it does whole-program optimization.

The XLA pipeline goes:

  1. HLO (High Level Operations). A small set of well-defined operations like dot, reduce, gather, dynamic-slice. The frontend lowers your model into HLO.
  2. Optimization passes. Algebraic simplification, layout assignment, loop fusion, memory scheduling, sharding propagation.
  3. Code generation. For TPU, an optimized backend that knows the chip's matrix unit, scratchpad layout, and async DMA patterns. For GPU, a backend that emits LLVM IR which is then compiled to PTX.
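
In JAX terms, handing the whole computation to XLA is what jax.jit does. A minimal sketch with an illustrative two-layer MLP; the shapes are made up:

import jax
import jax.numpy as jnp

@jax.jit
def mlp(x, w1, w2):
    # The whole function is traced once, lowered to HLO, and compiled as a
    # single program; fusion and layout are decided globally by XLA.
    return jax.nn.silu(x @ w1) @ w2

x = jnp.ones((8, 4096))
w1 = jnp.ones((4096, 16384))
w2 = jnp.ones((16384, 4096))
out = mlp(x, w1, w2)   # first call compiles, later calls reuse the binary
# mlp.lower(x, w1, w2).as_text() returns the HLO text if you want to read it.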

What you get from XLA on the right workload is striking. On TPUs, where there is no real alternative, XLA is the only path to performance. On GPU, JAX-with-XLA can match or beat eager PyTorch by a wide margin for workloads where the graph is fully traceable. We have seen 2x to 4x improvements on dense decoder models when the input shapes are static and batched.

The trade-off is rigidity. XLA assumes the graph is known up front. Variable shapes turn into recompilations. Control flow has to be expressed through lax.cond or lax.scan, not Python if statements. Dynamic KV caches, which are everywhere in inference, force you into either padding to a maximum length or using dynamic-update-slice carefully to avoid blowing up the compiled program size. JAX's jit machinery handles a lot of this for you, but the rough edges show up the first time you try to serve a model with variable batch and variable sequence length.
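
A sketch of the padded-cache pattern under those constraints. Names and sizes are made up; the point is that the compiled program only ever sees one fixed cache shape, and the write position is the only thing that varies:

import jax
import jax.numpy as jnp

@jax.jit
def append_kv(cache, new_k, pos):
    # Writes one new key vector into a fixed-size, pre-padded cache.
    # The cache shape never changes, so this does not trigger a recompile.
    return jax.lax.dynamic_update_slice(cache, new_k, (pos, 0))

cache = jnp.zeros((2048, 128))   # padded to the maximum sequence length
new_k = jnp.ones((1, 128))
cache = append_kv(cache, new_k, 17)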

For pure inference servers that handle one model with predictable batch sizes, XLA on GPU is competitive. For interactive workloads where requests show up with arbitrary lengths and the server has to handle them efficiently, the dynamic compilation cost usually pushes teams toward Inductor or a custom Triton-based stack.

Where these compilers actually overlap

In practice, modern inference stacks are not built around one compiler. They use whichever one is best for each layer:

  • Operator-level kernels. Triton, almost always. FlashAttention, paged attention, fused MoE, quantized matmuls.
  • Graph-level fusion and scheduling. Inductor for PyTorch deployments, XLA for JAX or TPU deployments.
  • Vendor primitives. cuBLAS, cuDNN, and CUTLASS still handle the heavy matmuls when their kernels beat what the compiler generates. Inductor and XLA both know how to call into them.

The interesting part is that the stack between your model code and the metal is no longer a single tool. It is a chain. PyTorch traces the graph with Dynamo, Inductor lowers it and decides which parts to fuse, Triton generates the fused parts, and the GPU vendor compiles Triton's output. Each link in the chain has its own performance characteristics, and a regression in any of them shows up at the top.

What this means for an inference engineer

A few practical takeaways from running these compilers in production:

  • Always measure. torch.compile does not always make things faster. For very small models or models where attention dominates and is already calling FlashAttention, the speedup can be small or negative. Profile before and after.
  • Watch for recompilations. Both Inductor and XLA recompile on shape changes by default. A serving loop that sees ten different sequence lengths can spend more time compiling than running. Use shape buckets or dynamic shapes deliberately.
  • Triton is a power tool. When the compiler does not fuse the way you want, writing the kernel yourself in Triton is no longer exotic. The barrier to entry is much lower than CUDA, and the resulting kernels integrate cleanly with Inductor and PyTorch.
  • XLA shines for static workloads. If you are running batch inference at fixed shapes, JAX plus XLA is genuinely fast and worth evaluating. For online serving, the dynamic shape story is harder.
  • Compiler choice is not free. Each stack has its own debugging story, its own version churn, and its own failure modes. Picking one means investing in tooling and people who can read its IR when things go wrong.
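
On the first of those points, a small sketch of how to compare eager and compiled runs with torch.profiler; the model and inputs are placeholders:

import torch
from torch.profiler import ProfilerActivity, profile

def kernel_report(fn, *args, steps=10):
    # Warm up so compile time and autotuning do not pollute the measurement.
    for _ in range(3):
        fn(*args)
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(steps):
            fn(*args)
        torch.cuda.synchronize()
    # Kernel count and CUDA time are the numbers that should move after compile.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# kernel_report(model, x)                    # eager baseline
# kernel_report(torch.compile(model), x)     # compiled: fewer, fatter kernels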

The headline result is that compilers have closed a real gap between naive Python model code and hand-tuned CUDA. They are not magic. Stacks that lean on them still spend a lot of engineering time profiling, writing custom kernels, and tracking compiler updates. The difference is that the floor has moved up. A team that runs torch.compile on a well-structured model today gets performance that took a kernel specialist a quarter of work to achieve three years ago.

If you are building inference infrastructure and want to compare your stack against something that has spent a lot of time on this exact problem, General Compute's API runs every step of the chain we just described, on hardware tuned for it. Try it in a few lines of code and see how the numbers compare.
