# General Compute

> The world's fastest inference provider. Purpose-built ASICs, 1,000+ tokens
> per second, up to 7x faster than GPU-based competitors. OpenAI-compatible API.

Developer resources for General Compute live on this domain: API documentation,
OpenAPI specification, authentication, webhooks, MCP server, and SDKs. Every
page is also available as markdown: request it with `Accept: text/markdown` or
append `.md` to the path. A full-text dump lives at https://www.generalcompute.com/llms-full.txt.

## Key facts

- API base URL: `https://api.generalcompute.com/v1`
- Protocol: OpenAI-compatible HTTPS
- Agent signup (no human-only gate): https://docs.generalcompute.com/agent-signup
- Docs: https://docs.generalcompute.com
- Contact: founders@generalcompute.com

## Developer resources

- [General Compute API documentation](https://docs.generalcompute.com)
- [General Compute API reference](https://www.generalcompute.com/api-reference)
- [General Compute OpenAPI specification](https://docs.generalcompute.com/openapi.json)
- [General Compute authentication & API keys](https://www.generalcompute.com/auth)
- [General Compute webhooks](https://www.generalcompute.com/webhooks)
- [General Compute MCP server](https://www.generalcompute.com/mcp)
- [General Compute SDKs](https://www.generalcompute.com/sdks)
- [General Compute agent signup](https://docs.generalcompute.com/agent-signup)
- [General Compute MCP descriptor](https://www.generalcompute.com/.well-known/mcp)
- [General Compute API catalog (RFC 9727)](https://www.generalcompute.com/.well-known/api-catalog)
- [General Compute agent skills](https://www.generalcompute.com/.well-known/agent-skills/index.json)

## Pages

- [Home](https://www.generalcompute.com/.md): What General Compute is, performance numbers, and a quick-start snippet.
- [Agents portal](https://www.generalcompute.com/agents.md): Entry point for automated consumers — discovery resources and machine rules.
- [General Compute developer resources](https://www.generalcompute.com/developers.md): Hub for General Compute API docs, OpenAPI spec, authentication, webhooks, MCP server, and SDKs.
- [General Compute API reference](https://www.generalcompute.com/api-reference.md): OpenAI-compatible API base URL, models, auth header, and a curl example for General Compute.
- [General Compute OpenAPI specification](https://www.generalcompute.com/openapi.md): Where to fetch the General Compute OpenAPI JSON and how to consume it.
- [General Compute authentication](https://www.generalcompute.com/auth.md): Bearer-token auth, key rotation, and the agent signup flow for General Compute.
- [General Compute webhooks](https://www.generalcompute.com/webhooks.md): Webhook event categories, HMAC signing, and retry behaviour for General Compute.
- [General Compute MCP server](https://www.generalcompute.com/mcp.md): Model Context Protocol endpoint and discovery descriptor for the General Compute MCP server.
- [General Compute SDKs](https://www.generalcompute.com/sdks.md): Python and Node clients for General Compute, with install and usage snippets.
- [OpenClaw integration](https://www.generalcompute.com/openclaw.md): Swap an OpenAI-compatible coding agent to General Compute.
- [Benchmarks](https://www.generalcompute.com/benchmarks.md): Head-to-head latency and throughput vs other inference providers.
- [Infrastructure](https://www.generalcompute.com/infrastructure.md): Purpose-built ASIC clusters, energy economics, and architecture.
- [Coding agents](https://www.generalcompute.com/use-cases/coding-agents.md): Workload profile and recommended settings for autonomous coding agents.
- [Voice AI](https://www.generalcompute.com/use-cases/voice-ai.md): Latency profile and streaming notes for real-time voice workloads.
- [Roadmap](https://www.generalcompute.com/roadmap.md): Current site, expansion plan, and deployment playbook.
- [Team](https://www.generalcompute.com/team.md): Operators to contact for rate-limit changes or escalation.
- [Demo](https://www.generalcompute.com/demo.md): Pointer to the interactive demo and programmatic benchmarking.
- [Blog](https://www.generalcompute.com/blog.md): Posts on inference performance, model serving, and agent infrastructure.
- [Terms of service](https://www.generalcompute.com/terms.md): Agent-readable summary of the platform terms.
- [Privacy](https://www.generalcompute.com/privacy.md): Agent-readable summary of the privacy policy.

## Blog

- [Streaming for Agents: Why Partial Results Change the UX](https://www.generalcompute.com/blog/streaming-for-agents-why-partial-results-change-the-ux.md): Streaming in agentic pipelines is not the same as streaming chat tokens. Partial tool calls, pipelined steps, and early cancellation change what the user experiences.
- [Parallel Tool Execution: How Fast Inference Enables Concurrent Agent Actions](https://www.generalcompute.com/blog/parallel-tool-execution-how-fast-inference-enables-concurrent-agent-actions.md): Why running multiple tool calls in parallel changes the latency math of an agent, and how inference speed determines whether the parallelism is worth doing.
- [Agent Memory Systems: Balancing Context Length vs Retrieval Latency](https://www.generalcompute.com/blog/agent-memory-systems-balancing-context-length-vs-retrieval-latency.md): How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.
- [Building a Code Agent: Why Each Step Needs Sub-Second Inference](https://www.generalcompute.com/blog/building-a-code-agent-why-each-step-needs-sub-second-inference.md): A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
- [ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns](https://www.generalcompute.com/blog/react-reflexion-and-chain-of-thought-the-inference-cost-of-reasoning-patterns.md): Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.
- [Multi-Agent Architectures and the Inference Cost Explosion](https://www.generalcompute.com/blog/multi-agent-architectures-and-the-inference-cost-explosion.md): Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.
- [Tool Calling Latency: The Bottleneck No One Talks About](https://www.generalcompute.com/blog/tool-calling-latency-the-bottleneck-no-one-talks-about.md): Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.
- [The Agentic Inference Tax: Why Agents Need 10x Faster Models](https://www.generalcompute.com/blog/the-agentic-inference-tax.md): Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.
- [Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA](https://www.generalcompute.com/blog/compiler-level-optimizations-for-inference.md): How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
- [Draft Model Selection for Speculative Decoding](https://www.generalcompute.com/blog/draft-model-selection-for-speculative-decoding.md): Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.
- [The Attention Sink Phenomenon: Why the First Token Matters](https://www.generalcompute.com/blog/the-attention-sink-phenomenon-why-the-first-token-matters.md): How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
- [Mixture of Experts at Inference Time](https://www.generalcompute.com/blog/mixture-of-experts-at-inference-time.md): How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.
- [Tensor Parallelism vs Pipeline Parallelism for Model Serving](https://www.generalcompute.com/blog/tensor-parallelism-vs-pipeline-parallelism-for-model-serving.md): How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them.
- [Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice](https://www.generalcompute.com/blog/prefix-caching-why-repeated-prompts-shouldnt-cost-you-twice.md): How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
- [Distillation for Inference: How Smaller Models Learn From Larger Ones](https://www.generalcompute.com/blog/distillation-for-inference-how-smaller-models-learn-from-larger-ones.md): A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.
- [FP8 Training and Inference: The Precision Sweet Spot](https://www.generalcompute.com/blog/fp8-training-and-inference-the-precision-sweet-spot.md): Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
- [Activation-Aware Quantization (AWQ) Deep Dive](https://www.generalcompute.com/blog/activation-aware-quantization-awq-deep-dive.md): A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.
- [Mamba and State Space Models: Inference Without Attention](https://www.generalcompute.com/blog/mamba-and-state-space-models-inference-without-attention.md): How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving.
- [RWKV and Linear Attention: Recurrent Models as an Inference Shortcut](https://www.generalcompute.com/blog/rwkv-and-linear-attention-recurrent-models-as-an-inference-shortcut.md): How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
- [Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level](https://www.generalcompute.com/blog/dynamic-batching-strategies-from-naive-to-continuous-to-iteration-level.md): Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
- [Token Merging and Token Pruning for Faster Transformers](https://www.generalcompute.com/blog/token-merging-and-token-pruning-for-faster-transformers.md): Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.
- [S3: Scheduling for Straggler Mitigation in LLM Serving](https://www.generalcompute.com/blog/s3-scheduling-for-straggler-mitigation-in-llm-serving.md): In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.
- [Chunked Prefill: Overlapping Compute and Communication](https://www.generalcompute.com/blog/chunked-prefill-overlapping-compute-and-communication.md): Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.
- [Cascade Inference: Using Small Models to Route to Big Ones](https://www.generalcompute.com/blog/cascade-inference-using-small-models-to-route-to-big-ones.md): FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
- [Lookahead Decoding: Parallel Token Generation Without Draft Models](https://www.generalcompute.com/blog/lookahead-decoding-parallel-token-generation-without-draft-models.md): Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.
- [Disaggregated Prefill and Decode (Splitwise / DistServe)](https://www.generalcompute.com/blog/disaggregated-prefill-and-decode.md): Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.
- [KV Cache Compression: MLA and Beyond](https://www.generalcompute.com/blog/kv-cache-compression-mla-and-beyond.md): DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
- [Ring Attention: Scaling Context to Millions of Tokens](https://www.generalcompute.com/blog/ring-attention-scaling-context-to-millions-of-tokens.md): Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.
- [Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8](https://www.generalcompute.com/blog/quantization-for-inference-gptq-awq-smoothquant-fp8.md): Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
- [Multi-Query and Grouped-Query Attention: Shrinking the KV Cache](https://www.generalcompute.com/blog/multi-query-grouped-query-attention.md): MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster.
- [Continuous Batching: The Orca Paper That Changed LLM Serving](https://www.generalcompute.com/blog/continuous-batching-the-orca-paper.md): Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement.
- [Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding](https://www.generalcompute.com/blog/medusa-eagle-sequoia-next-gen-speculative-decoding.md): The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model.
- [SGLang and RadixAttention: Smarter KV Cache Reuse](https://www.generalcompute.com/blog/sglang-and-radix-attention.md): SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.
- [Speculative Decoding: Getting 3x Speedups Without Changing the Model](https://www.generalcompute.com/blog/speculative-decoding-3x-speedups-without-changing-the-model.md): Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster.
- [PagedAttention and vLLM: Virtual Memory for LLM Serving](https://www.generalcompute.com/blog/paged-attention-and-vllm.md): The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.
- [FlashAttention: How Tri Dao Made Attention 4x Faster](https://www.generalcompute.com/blog/flash-attention-how-tri-dao-made-attention-4x-faster.md): FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters.
- [Build a Real-Time Voice AI Agent with General Compute](https://www.generalcompute.com/blog/build-a-real-time-voice-ai-agent.md): A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline.
- [How Coding Agents Depend on Inference Speed](https://www.generalcompute.com/blog/how-coding-agents-depend-on-inference-speed.md): Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
- [Why Inference Speed is the New Moat](https://www.generalcompute.com/blog/why-inference-speed-is-the-new-moat.md): Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.

## Machine-readable resources

- [Agent skills](https://www.generalcompute.com/.well-known/agent-skills/index.json): machine-readable skill catalog
- [API catalog](https://www.generalcompute.com/.well-known/api-catalog): RFC 9727 linkset
- [MCP descriptor](https://www.generalcompute.com/.well-known/mcp): MCP discovery JSON
- [Sitemap](https://www.generalcompute.com/sitemap.xml): full URL list