Agent Readout

Blog directory

Plain list of posts with summaries for quick parsing.

Total posts
11

Entries

  • Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8
    Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
  • Multi-Query and Grouped-Query Attention: Shrinking the KV Cache
    MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
  • Continuous Batching: The Orca Paper That Changed LLM Serving
    Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
  • Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding
    The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
  • SGLang and RadixAttention: Smarter KV Cache Reuse
    SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
  • Speculative Decoding: Getting 3x Speedups Without Changing the Model
    Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
  • PagedAttention and vLLM: Virtual Memory for LLM Serving
    The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
  • FlashAttention: How Tri Dao Made Attention 4x Faster
    FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
  • Build a Real-Time Voice AI Agent with General Compute
    A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
  • How Coding Agents Depend on Inference Speed
    Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
  • Why Inference Speed is the New Moat
    Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.