Infrastructure Deep-Dives

Speculative decoding, KV cache, tensor parallelism, batching strategies, and the systems that serve LLMs at scale.

Quantization Explained: INT4, GGUF, GPTQ and What They Mean for Your Model

A practical guide to LLM quantization: what INT4, GGUF, and GPTQ actually do, how much quality you lose, and how to quantize a model yourself with llama.cpp and AutoGPTQ.

General Compute·June 1, 2026

inference · compilers · deep-dive

Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA

How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.

General Compute·May 6, 2026

inference · speculative-decoding · deep-dive

Draft Model Selection for Speculative Decoding

Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.

General Compute·May 5, 2026

tensor parallelism · pipeline parallelism · inference · distributed · gpu · serving

Tensor Parallelism vs Pipeline Parallelism for Model Serving

How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.

General Compute·May 2, 2026

prefix caching · kv cache · inference · vllm · sglang · production

Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice

How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.

General Compute·May 1, 2026

inference · batching · serving · scheduling · throughput

Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level

Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.

General Compute·April 25, 2026

inference · papers · serving · scheduling · tail-latency · fairness

S3: Scheduling for Straggler Mitigation in LLM Serving

In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.

General Compute·April 23, 2026

inference · papers · serving · prefill · decode · scheduling · sarathi

Chunked Prefill: Overlapping Compute and Communication

Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.

General Compute·April 22, 2026

inference · papers · routing · cascades · frugalgpt · llm

Cascade Inference: Using Small Models to Route to Big Ones

FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.

General Compute·April 21, 2026

inference · papers · decoding · speculative-decoding · lookahead · llm

Lookahead Decoding: Parallel Token Generation Without Draft Models

Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.

General Compute·April 20, 2026

inference · papers · serving · prefill · decode · gpu

Disaggregated Prefill and Decode (Splitwise / DistServe)

Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.

General Compute·April 19, 2026

inference · papers · kv-cache · attention · deepseek

KV Cache Compression: MLA and Beyond

DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.

General Compute·April 18, 2026

inference · papers · deep-dive

Multi-Query and Grouped-Query Attention: Shrinking the KV Cache

MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.

General Compute·March 25, 2026

inference · papers · deep-dive

Continuous Batching: The Orca Paper That Changed LLM Serving

Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this, with the paper reporting up to a 36x throughput improvement over its baseline at the same latency.

General Compute·March 24, 2026

inference · papers · deep-dive

Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding

The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.

General Compute·March 24, 2026

inference · papers · deep-dive

SGLang and RadixAttention: Smarter KV Cache Reuse

SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.

General Compute·March 24, 2026

inference · papers · deep-dive

Speculative Decoding: Getting 3x Speedups Without Changing the Model

Speculative decoding uses a small draft model to predict multiple tokens ahead, then has the target model verify them all in a single pass. The output distribution is mathematically identical to the target model's, and generation runs 2-3x faster.

General Compute·March 23, 2026