Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8
Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
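The core idea behind all of these techniques can be shown in a few lines. Below is a minimal sketch of symmetric round-to-nearest int8 quantization (the naive baseline that GPTQ, AWQ, and SmoothQuant all improve on, not those methods themselves); the function names are illustrative, and it assumes the weight tensor is not all zeros.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map weights into [-127, 127].
    Sketch only -- real schemes use per-channel or per-group scales."""
    scale = float(np.abs(w).max()) / 127.0   # assumes w has a nonzero entry
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

# int8 storage is 2x smaller than fp16; round-to-nearest bounds the
# per-weight error by half a quantization step (scale / 2).
```

The memory saving comes purely from the narrower dtype; the accuracy cost is the rounding error, which the fancier methods reduce by choosing scales and rounding order more carefully.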
MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
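The sharing mechanism is simple enough to sketch directly. This is a toy grouped-query attention in NumPy (causal masking omitted for brevity, and the function name is illustrative): with `n_kv_heads` KV heads serving `n_heads` query heads, the KV cache shrinks by a factor of `n_heads / n_kv_heads`.

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads reads the same
    shared KV head -- only k and v need to be cached per token."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]   # shared KV head for this group
        scores = q[h] @ kh.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ vh
    return out
```

Setting `n_kv_heads = 1` gives MQA; `n_kv_heads = n_heads` recovers standard multi-head attention.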
Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
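The prefix-sharing idea can be sketched with a plain trie (the real radix tree compresses chains of single-child nodes, and stores KV tensor handles rather than nothing; class and method names here are illustrative): any request whose tokens extend a cached prefix skips recomputing KV for the matched positions.

```python
class PrefixCache:
    """Toy prefix cache: cached token sequences live in a trie so a new
    request can reuse the KV entries of its longest cached prefix."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a served sequence so later requests can share it."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`; KV for
        those positions does not need to be recomputed."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n
```

Multi-turn chat is the best case: every turn re-sends the whole conversation, so the matched prefix grows with each request.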
Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
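For greedy decoding, the draft-then-verify loop fits in a short sketch. This assumes hypothetical `target` and `draft` callables that map a token sequence to the next greedy token; a real system would verify all draft positions in one batched forward pass instead of a Python loop.

```python
def speculative_decode_greedy(target, draft, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target checks them, and the accepted prefix is exactly what the
    target would have generated on its own."""
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # Draft proposes k tokens autoregressively (the cheap model).
        ctx, proposal = list(seq), []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposals (one batched pass in practice).
        ctx, n_accept = list(seq), 0
        for t in proposal:
            if target(ctx) != t:
                break
            n_accept += 1
            ctx.append(t)
        seq.extend(proposal[:n_accept])
        # Whether or not the draft was right, the target's own next
        # token comes "free" from the verification pass.
        seq.append(target(seq))
    return seq[:len(prompt) + max_new]
```

Even a draft that is often wrong never changes the output, only the speedup: every emitted token is one the target itself would have chosen.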
The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
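The operating-systems idea is demand paging: instead of pre-reserving a max-length KV region per request, the cache is carved into fixed-size blocks handed out from a free list as tokens arrive. A minimal sketch of the bookkeeping (names and the block size are illustrative, and the actual tensors are omitted):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: each sequence owns a block
    table mapping its logical KV positions to physical blocks, so waste
    is at most one partially filled block per sequence."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list, like OS page frames
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a new block only
        when the current one is full (demand paging, no pre-allocation)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full, or none yet
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Finished sequences return their blocks to the pool at once."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks return to the pool the moment a request finishes, the server can keep far more concurrent sequences resident than contiguous pre-allocation ever allowed.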
FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency accumulates across those steps, making latency the single biggest infrastructure bottleneck for AI-powered developer tools.
Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.