# LPU vs GPU vs CPU: Which Processor Wins for AI Inference?
A head-to-head look at LPUs, GPUs, and CPUs for AI inference, with the architecture reasons behind their performance, real-world latency and throughput trade-offs, cost-effectiveness, and which one fits each kind of workload.
- Author: General Compute
- Published: 2026-05-30
- Tags: lpu, gpu, cpu, inference, hardware
If you are choosing hardware to run a language model in production, you will run into three categories of processor: CPUs, GPUs, and the newer class of purpose-built inference chips that vendors call LPUs or inference accelerators. They are not interchangeable, and the marketing around each tends to flatten the differences into a single "faster" claim that does not survive contact with a real workload.

This guide compares the three on the dimensions that actually decide which one you should use: how their architectures differ, where each one is fast and where it falls over, what they cost per token, and which workloads map cleanly onto each.

The short version is that the right answer depends almost entirely on what you are running and how you are running it. A batch job that scores a million records overnight has different constraints than a voice agent that needs to respond before the user notices a pause. Let us go through why.

## What an LPU actually is

LPU stands for Language Processing Unit, a term popularized by Groq for its inference accelerator. It is worth being precise here, because LPU is a product category name rather than a standardized class of hardware the way CPU and GPU are. Different vendors build inference-specific chips with different internal designs, and they market them under different names: LPU, IPU, NPU, or just "inference accelerator." What they share is a design goal. They are built to run already-trained models, especially transformers, as fast and as efficiently as possible, and they give up the general-purpose flexibility that CPUs and GPUs keep.

The defining architectural choice in most of these chips is deterministic, software-scheduled execution backed by large amounts of on-chip SRAM. A GPU spends a lot of silicon and energy on dynamic scheduling: deciding at runtime which work goes to which execution unit, managing caches, and hiding memory latency by juggling thousands of threads.
An LPU-style design pushes that scheduling into the compiler ahead of time. The chip knows exactly which operation runs in which cycle, so it can drop most of the control-logic overhead. The result is very low and very predictable latency, which is the property that matters most for single-stream, latency-sensitive generation.

The trade-off is that this approach leans hard on keeping data on-chip. SRAM is fast but small. To hold a large model, you often have to spread it across many chips connected by a fast interconnect, which raises the cost and complexity of a deployment. That constraint shapes where these chips win and where they do not.

## Why the architecture differences matter for inference

LLM inference, specifically the decode phase where the model generates tokens one at a time, is memory-bandwidth-bound rather than compute-bound. For every single token, the hardware has to read the entire set of model weights it needs out of memory. The arithmetic per token is modest; the bottleneck is moving the weights. This single fact explains most of the performance gaps between these three processor types.

A CPU has the least memory bandwidth of the three and a small number of powerful cores optimized for sequential, branchy code. That is great for running an operating system or a web server. It is poorly matched to streaming billions of weight values per token through a matrix multiply.

A GPU has thousands of simpler cores and high-bandwidth memory (HBM) attached, giving it an order of magnitude more memory bandwidth than a CPU. It is built for the dense, parallel linear algebra that neural networks are made of. For most teams, a GPU is the default and for good reason.

An LPU-style accelerator attacks the bandwidth problem from a different angle: it keeps weights in on-chip SRAM, which has far higher bandwidth than even HBM, and it removes the runtime scheduling overhead that adds variance to GPU latency.
For a single request streaming tokens out as fast as possible, this is the design that produces the lowest time-per-token. The catch, again, is capacity, because SRAM is expensive per gigabyte and you need a lot of chips to hold a large model.

## Head-to-head: latency, throughput, and where each one wins

It helps to separate two performance metrics that often get conflated.

**Latency** is how quickly a single request completes, usually measured as time-to-first-token (TTFT) and time-per-output-token (or its inverse, tokens per second for one stream). This is what a user feels when they are waiting on a response.

**Throughput** is how many tokens the system produces across all concurrent requests, usually measured in total tokens per second at a given batch size. This is what determines your cost per token when you are serving many users at once.

These two often trade against each other. Here is roughly how the three processors land.

### CPU

CPUs win when the model is small, the request volume is low, or you simply do not have a GPU available. Running a quantized 1B-to-3B model on a modern server CPU is entirely workable for occasional requests, background jobs, or on-device scenarios. Frameworks like llama.cpp have made CPU inference for small models genuinely usable, and INT4/INT8 quantization narrows the bandwidth gap considerably.

Where CPUs fall over is large models and any latency target tighter than a couple of seconds. A 70B model on a CPU will technically run, but token generation crawls, and you will not be happy with the interactive experience.

CPUs also remain essential as the host for GPU and accelerator deployments. They handle tokenization, request routing, the API layer, and orchestration. So this is rarely a question of CPU versus GPU as an either-or; the CPU is almost always in the picture doing the supporting work.

### GPU

GPUs are the generalists that win the widest range of workloads.
They handle large models, they batch many concurrent requests efficiently for high throughput, and they run essentially every model architecture and framework without special porting work. With continuous batching (serving stacks process requests at the iteration level rather than waiting for a full batch), a single high-end GPU can serve a large model to many users at a strong cost per token.

The areas where GPUs are not the obvious winner are single-stream latency and latency consistency. Because GPUs hide memory latency by running many threads and scheduling work dynamically, the time for any individual token can vary, and an unbatched single request does not use the hardware efficiently. If your priority is the fastest possible response for one user at a time, a GPU running one request leaves a lot of silicon idle.

### LPU and inference accelerators

These chips win on single-stream latency and latency predictability. For interactive workloads where the metric that matters is how fast tokens come back to one user, the deterministic SRAM-backed design produces token rates and TTFT that are hard to match with a GPU. This is the regime where you most often see the headline "fastest inference" numbers, and for that specific workload the numbers are real.

The cost of that advantage is flexibility and capacity economics. You generally run the models the vendor supports, on the vendor's stack, and very large models require spreading across many chips. If your workload is throughput-bound batch processing rather than latency-bound interaction, the accelerator's advantage shrinks and the cost math can favor a well-batched GPU.

## A rough cost-effectiveness comparison

Cost per token, not cost per chip, is the number that matters, and it depends heavily on utilization. A few principles hold up across deployments:

- For **high-volume batch work** where you can fill large batches and you do not care about per-request latency, GPUs usually deliver the best tokens-per-dollar.
Continuous batching keeps the hardware busy, and you amortize the cost of the chip across many simultaneous requests.
- For **latency-critical interactive work** at meaningful volume, inference accelerators can win on cost per token despite higher hardware cost, because they hit latency targets that a GPU can only match by under-batching (and under-batching a GPU wrecks its cost efficiency). If you are forced to run a GPU at batch size 1 to meet a latency SLA, you are paying for a lot of idle silicon.
- For **low-volume or small-model work**, a CPU you already own is often the cheapest option, because the marginal cost is near zero and you avoid provisioning an accelerator that sits idle most of the time.

The recurring theme is utilization. A fast chip running at 10 percent utilization can cost more per token than a slower chip running at 90 percent. Whatever hardware you pick, the cost story is mostly about keeping it busy.

## A decision framework

Rather than asking which processor is fastest in the abstract, work backward from your workload.

1. **What is your latency requirement?** If you need sub-100ms responsiveness for an interactive product (voice agents, live coding assistants, anything conversational), latency is your binding constraint, and an inference accelerator or a lightly loaded GPU is where to look. If you can tolerate seconds, your options widen considerably.
2. **What is your request pattern?** Steady high concurrency favors GPUs with continuous batching, because you can fill batches and drive up throughput. Spiky or low-volume traffic favors either an accelerator (for the latency wins without batching) or a CPU (if the model is small enough), because you are not paying for idle capacity.
3. **How big is your model?** Small models (under ~7B, especially quantized) run acceptably on CPUs and comfortably on a single GPU. Large models need GPUs with enough HBM or accelerators with enough aggregate SRAM, which raises the floor on deployment cost.
4. **How much flexibility do you need?** If you are constantly swapping models, fine-tuning, or running unusual architectures, GPUs give you the broadest software support. If you have a stable set of well-supported models and you want maximum speed, a specialized accelerator is worth evaluating.

For most teams starting out, the honest default is a GPU. It handles the widest range of cases, the tooling is mature, and you will not paint yourself into a corner. You reach for CPUs at the small-model, low-volume edge, and you reach for inference accelerators when single-stream latency becomes the thing your product lives or dies on.

## Where this leaves the "which wins" question

There is no single winner, because the three processors are optimized for different points on the latency-throughput-flexibility surface. CPUs are the flexible, low-cost option for small models and supporting work. GPUs are the generalists that handle nearly everything and deliver the best throughput economics when well utilized. LPU-style accelerators are the specialists that win when the fastest possible per-user response is the priority.

What has changed over the last few years is that the latency-critical category has grown. Voice interfaces, agentic systems that chain many model calls, and real-time coding tools all multiply the cost of slow inference, because the latency compounds across steps. That is the workload class where purpose-built inference hardware earns its place, and it is a big part of why the category exists at all.

General Compute runs custom inference infrastructure built for exactly that latency-critical regime, with an OpenAI-compatible API so you can point an existing application at it without rewriting your stack. If your workload is the kind where per-token latency is the constraint, it is worth benchmarking against your current setup.
You can read more in the [docs](https://generalcompute.com) and run your own numbers against the workload you actually serve, which is the only benchmark that ends up mattering.
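
As a starting point for running those numbers, here is a minimal Python sketch of the two calculations this guide leans on: single-stream latency metrics (TTFT and time per output token) and cost per million tokens at a given utilization. The function names, the `StreamMetrics` type, and every dollar and throughput figure in the demo are hypothetical and chosen only for illustration; none of them come from a vendor or a measured benchmark. The token timestamps are assumed to be collected by whatever streaming client you already use, for example by calling `time.perf_counter()` as each token arrives.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # mean time per output token after the first
    tokens_per_s: float  # single-stream decode rate (inverse of tpot_s)


def stream_metrics(request_sent_s: float,
                   token_times_s: Sequence[float]) -> StreamMetrics:
    """Latency metrics for one streamed response.

    request_sent_s: timestamp taken when the request was sent.
    token_times_s: arrival timestamp of each streamed token, in order.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure decode rate")
    ttft = token_times_s[0] - request_sent_s
    # Decode time is spread over the tokens that follow the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    rate = 1.0 / tpot if tpot > 0 else float("inf")
    return StreamMetrics(ttft_s=ttft, tpot_s=tpot, tokens_per_s=rate)


def cost_per_million_tokens(hourly_cost_usd: float,
                            peak_tokens_per_s: float,
                            utilization: float) -> float:
    """Cost per million tokens when the hardware is busy only part of the time.

    utilization is the fraction of each hour spent producing tokens at
    peak_tokens_per_s (total throughput across all concurrent requests).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_s * utilization * 3600.0
    return hourly_cost_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical timestamps: request sent at t=0, four tokens streamed back.
    m = stream_metrics(0.0, [0.20, 0.25, 0.30, 0.35])
    print(f"TTFT {m.ttft_s:.3f}s, {m.tokens_per_s:.1f} tokens/s per stream")

    # Hypothetical prices and throughputs, purely to illustrate utilization.
    fast_idle = cost_per_million_tokens(4.0, 2000.0, 0.10)
    slow_busy = cost_per_million_tokens(2.0, 1000.0, 0.90)
    print(f"fast chip at 10% utilization: ${fast_idle:.2f} per million tokens")
    print(f"slow chip at 90% utilization: ${slow_busy:.2f} per million tokens")
```

With these made-up inputs the faster chip comes out several times more expensive per token than the slower one, which is the utilization effect described in the cost section. Substituting timestamps and prices from your own deployment is what makes the comparison meaningful.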