Agent Readout
RWKV and Linear Attention: Recurrent Models as an Inference Shortcut
How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads.
- Author
- General Compute
- Published
- 2026-04-26
- Tags
- rwkv, linear-attention, inference, architecture, long-context
The dominant cost of running a transformer at inference time is not the matrix multiplications inside each layer. It is the attention mechanism itself, which has to look back at every previous token before producing the next one. Generate the 10,000th token and the model touches 10,000 keys and 10,000 values. Generate the 100,000th and it touches 100,000. The KV cache grows linearly with sequence length, and the per-token compute does the same. Long contexts get expensive in both memory and time, and the cost is not amortized: every new token pays the full price.
Linear attention and recurrent architectures like RWKV try to flip that. Instead of carrying around the full history of keys and values, they compress everything seen so far into a fixed-size state. Generation becomes a constant-time update of that state. No matter how long the context, producing the next token costs the same. That is the inference shortcut.
This post walks through why standard attention scales the way it does, what linear attention changes mathematically, how RWKV adapts the idea into something that trains and runs well in practice, and where the trade-offs land when you actually deploy these models.
## Why Standard Attention Is O(n) Per Token
A transformer decoder layer does roughly the following at each generation step. The new token's query vector is compared against the keys of every prior token, the resulting scores are softmaxed, and those weights are applied to the corresponding values. Mathematically, for a query q at position t and keys K and values V from positions 1 through t:
```
output_t = softmax(q_t @ K^T / sqrt(d)) @ V
```
The softmax over q_t @ K^T is what makes attention non-linear. It is also what forces the model to keep all of K and V around: the weight on every past position depends on the current query, and the softmax normalizer is a sum over all t scores, so the history cannot be folded into a running summary ahead of time. Each new query has to re-score everything.
The KV cache is the standard optimization: store K and V for the prompt and all generated tokens so you do not recompute them, then append a new row each step. Memory grows with sequence length, and each decode step still does an O(t) dot product across the cache. For a 128K-token context, every new token reads 128,000 key vectors and 128,000 value vectors out of HBM. That bandwidth is the bottleneck on most modern accelerators, not the floating-point math.
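To make that concrete, here is a minimal numpy sketch of a single decode step against a toy cache. The shapes (one head, d = 64, 10,000 cached tokens) are illustrative, not taken from any particular model:
```python
import numpy as np

d, t = 64, 10_000                    # head dim and cached length (illustrative)
K_cache = np.random.randn(t, d)      # keys for all prior tokens
V_cache = np.random.randn(t, d)      # values for all prior tokens
q_t = np.random.randn(d)             # query for the token being generated

scores = K_cache @ q_t / np.sqrt(d)  # reads every cached key: O(t)
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax over all t positions
output_t = weights @ V_cache         # reads every cached value: O(t)
```
Both matrix products scale with t, and in a real server K_cache and V_cache live in HBM, which is why the reads dominate.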
This is fine when contexts are short. When you start running agents that maintain long histories, voice systems that keep transcripts, or document workflows on full books, the cache pressure becomes the dominant concern.
## The Linear Attention Reformulation
Linear attention starts from a small algebraic trick. The softmax in standard attention is the only thing that prevents you from rearranging the computation. If you replace it with something that factors, you can rewrite attention as a recurrence.
Specifically, write attention as a sum of similarities:
```
output_t = sum_{i<=t} sim(q_t, k_i) * v_i / sum_{i<=t} sim(q_t, k_i)
```
In standard attention, sim is `exp(q . k / sqrt(d))`. The exponential does not factor across q and k, so you cannot pull q out of the sum. But if you pick sim(q, k) = phi(q) . phi(k) for some feature map phi (for instance, the elu+1 function from the original linear attention paper), then by associativity:
```
sum_i phi(q_t) . phi(k_i) * v_i = phi(q_t) . (sum_i phi(k_i) * v_i^T)
```
The right-hand side is a vector-matrix product where the matrix only depends on the history, not on q_t. Call that matrix S_t. Now S_t can be updated incrementally:
```
S_t = S_{t-1} + phi(k_t) * v_t^T
```
And generation becomes:
```
output_t = phi(q_t) . S_t / (phi(q_t) . z_t)
```
where z_t = z_{t-1} + phi(k_t) is the matching running sum of feature-mapped keys, used as the normalizer. The state S_t has shape (d_key x d_value), constant in t. Each token does O(d^2) work to update the state and produce the output, independent of how many tokens came before.
That is the shortcut. Generation is now O(1) per token in sequence length, and the memory footprint is one fixed-size matrix per layer per head, not a growing cache.
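Here is a minimal numpy sketch of the whole recurrent decode path, using the elu+1 feature map. The dimensions and the eps guard are illustrative, and a real implementation would be per-head and batched:
```python
import numpy as np

def phi(x):
    # elu(x) + 1, the feature map from the original linear attention paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def decode_step(S, z, q_t, k_t, v_t, eps=1e-6):
    # One generation step: O(d_k * d_v) work, independent of position t
    S = S + np.outer(phi(k_t), v_t)   # S_t = S_{t-1} + phi(k_t) v_t^T
    z = z + phi(k_t)                  # z_t = z_{t-1} + phi(k_t)
    out = (phi(q_t) @ S) / (phi(q_t) @ z + eps)
    return S, z, out

d_k, d_v = 64, 64
S, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(1_000):                # cost per step never grows
    q, k, v = np.random.randn(d_k), np.random.randn(d_k), np.random.randn(d_v)
    S, z, out = decode_step(S, z, q, k, v)
```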
The catch is quality. Linear attention with a fixed feature map underperforms full softmax attention on most language tasks. The feature maps studied in the original work are simple, and they cannot represent the sharp, content-dependent attention patterns that softmax produces. You get speed, you lose expressiveness.
## RWKV: A Practical Recurrent Hybrid
RWKV (Receptance Weighted Key Value) is the most prominent attempt to take this idea and make it work at scale. The architecture lineage now spans several versions (RWKV-4, 5, 6, 7), and each release has narrowed the quality gap with transformers while keeping the constant-time inference property.
The core idea in RWKV is to combine linear-attention-style state updates with a learned time-mixing mechanism. Instead of a pure exponential decay or a fixed feature map, RWKV uses time-decay weights that the model learns per channel. The state update looks roughly like:
```
state_t  = exp(-w) * state_{t-1} + k_t * v_t
norm_t   = exp(-w) * norm_{t-1} + k_t
output_t = receptance_t * (state_t / norm_t)
```
where w is a learnable channel-wise decay and receptance is a sigmoid gate that decides how much of the state to expose at each step. Different channels can decay at different rates, so some attend long-range and others act more locally. The receptance gate gives the model a way to suppress or amplify the state contribution token by token.
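A numpy sketch of that update, following the simplified pseudocode above rather than any specific RWKV release (real RWKV also token-shifts its inputs, gives the current token a bonus weight, and in RWKV-4 exponentiates k so the normalizer stays positive):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_step(state, norm, k_t, v_t, r_t, w):
    # One simplified time-mixing step, matching the pseudocode above
    decay = np.exp(-w)                    # per-channel decay in (0, 1)
    state = decay * state + k_t * v_t     # elementwise: O(d) per token
    norm = decay * norm + k_t
    out = r_t * (state / (norm + 1e-6))   # receptance gates the readout
    return state, norm, out

d = 64
w = np.abs(np.random.randn(d))            # channel-wise decays (learned in practice)
state, norm = np.zeros(d), np.zeros(d)
for _ in range(1_000):                    # same fixed cost at any position
    k, v = np.random.randn(d), np.random.randn(d)
    r = sigmoid(np.random.randn(d))       # stand-in for the receptance gate
    state, norm, out = rwkv_step(state, norm, k, v, r, w)
```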
RWKV-5 and RWKV-6 added matrix-valued states (similar to multi-head linear attention) and data-dependent decays, where the decay weights are themselves a function of the input rather than a fixed learned parameter. RWKV-7 went further with delta-rule-style updates that allow the state to overwrite as well as accumulate. Each step pulls the architecture closer to what attention can express, while keeping the recurrent form.
The training story is the part that makes RWKV interesting beyond the pure linear attention papers. At inference, RWKV runs as a recurrent network, but the time-decay structure also lets the same computation be expressed in a parallel form over the whole sequence. You unroll the recurrence, run it through a CUDA kernel that exploits the structure, and get something close to transformer training throughput. That dual representation, recurrent at inference and parallel at training, is what lets the architecture compete on both axes.
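The dual representation is easy to see on a single decayed channel. The recurrence s_t = a * s_{t-1} + u_t unrolls to s_t = sum_{i<=t} a^(t-i) * u_i, and the right-hand side can be computed for all positions at once. A toy numpy demonstration (real kernels use chunked scans instead of materializing a full T x T decay matrix):
```python
import numpy as np

def recurrent(u, a):
    # Inference view: one sequential state update per token
    s, out = 0.0, []
    for u_t in u:
        s = a * s + u_t
        out.append(s)
    return np.array(out)

def parallel(u, a):
    # Training view: s_t = sum_{i<=t} a^(t-i) * u_i for every t at once,
    # as a causal, decay-weighted matrix product over the whole sequence
    t = np.arange(len(u))
    decay = np.tril(a ** (t[:, None] - t[None, :]))
    return decay @ u

u = np.random.randn(512)
assert np.allclose(recurrent(u, 0.9), parallel(u, 0.9))
```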
## What Constant-Time Inference Actually Buys
The headline benefit is obvious: long contexts are cheap. Generating at position 1,000,000 with a transformer means an enormous KV cache and prohibitive bandwidth per token. With an RWKV model, the per-token cost at position 1,000,000 is the same as the cost at position 100. Memory per layer is fixed, so VRAM usage does not blow up.
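Some back-of-envelope arithmetic makes the gap concrete. The configuration below (32 layers, 32 heads, head dimension 128, fp16) is a hypothetical stand-in, not a specific checkpoint:
```python
layers, heads, d_head, bytes_per = 32, 32, 128, 2   # hypothetical fp16 model

def kv_cache_bytes(seq_len):
    # keys and values for every layer, head, and position
    return 2 * layers * heads * d_head * seq_len * bytes_per

def rwkv_state_bytes():
    # one (d_head x d_head) matrix-valued state per head per layer,
    # as in RWKV-5 and later; independent of sequence length
    return layers * heads * d_head * d_head * bytes_per

print(f"KV cache at 1M tokens: {kv_cache_bytes(1_000_000) / 1e9:.0f} GB")  # ~524 GB
print(f"RWKV-style state:      {rwkv_state_bytes() / 1e6:.0f} MB")         # ~34 MB
```
Roughly half a terabyte of cache against a few tens of megabytes of state, at the same nominal width.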
That changes a few things in practice.
**Streaming workloads become natural.** A voice agent or transcription system that runs for hours can keep accumulating state without a cache management strategy. There is no need to evict old tokens, summarize history, or chunk the context. The state is the history, compressed.
**Edge and on-device inference gets easier.** Constant memory means you can ship a small RWKV model to a device and let it run indefinitely without worrying about OOMs from a growing cache. This is part of why RWKV has shown up in mobile and embedded AI projects.
**Batching is more predictable.** With transformers, mixing requests of different lengths in a batch creates ragged compute and complicated scheduling. With RWKV, every request does the same fixed amount of work per step regardless of how long it has been running, which makes scheduling and capacity planning simpler.
**Cache management goes away.** Prefix caching, paged attention, sliding windows, and similar techniques exist because KV caches are awkward shared resources. None of them are needed for a recurrent model. The state is just per-stream local memory.
## The Trade-offs
Linear and recurrent models do not match transformers on every benchmark, and the gap is real if subtle. A few things to keep in mind.
**Recall over very long contexts is harder.** Compressing all of history into a fixed-size state means information has to be aggressively summarized as it passes through. Standard attention can pull any token from the past with full fidelity. Recurrent models cannot. This shows up most clearly in needle-in-a-haystack tests and exact-recall tasks, where transformer architectures still tend to win on raw accuracy. Recent RWKV versions and other state-space models have closed a lot of this gap, but it is still a real consideration for tasks that require pinpoint retrieval from long histories.
**Training a competitive recurrent model requires care.** The parallel training kernels for RWKV are non-trivial, and getting the time-decay parameterization right has taken multiple architecture revisions. This is a less mature ecosystem than the standard transformer one, which means fewer pretrained checkpoints, fewer mature serving stacks, and more rough edges in tooling.
**Sensitivity to history is different.** Because the state is a lossy, learned compression, two different histories can converge to similar states, and a small change early in a session can shift everything downstream more than it would in an attention model, where every past token is stored exactly. This is mostly a curiosity, but it matters for applications that need to reason carefully about how the early tokens of a long session influence its later behavior.
**Hybrid architectures are gaining ground.** A growing class of models (Jamba, Zamba, the various Mamba+attention hybrids) interleave a small number of full-attention layers with many state-space or linear-attention layers. These hybrids try to keep the cheap recurrent compute for most of the model while preserving exact-recall capability where it matters. For many production workloads, this is probably where things end up: not pure RWKV, not pure transformer, but a careful mix.
## When to Reach for a Recurrent Model
If your workload involves long sequences, streaming inference, or strict memory constraints, a linear-attention or RWKV-style model is worth a hard look. Voice agents, document workers that scan large corpora, on-device assistants, and any application where per-step cost matters more than pinpoint retrieval accuracy are good candidates.
If you are running a chat application with bounded context or a coding agent where the working set fits comfortably in a 32K window, a standard transformer is probably still the right choice. The quality margin matters more than the asymptotic compute savings at those scales, and the tooling is more mature.
The interesting case is when you have flexibility in the architecture choice for a new product. The constant-time property changes what is feasible: workloads that were uneconomical with full attention become routine with a recurrent backbone. That is worth thinking about before you commit to scaling out a transformer-only stack.
If you want to test fast inference for these architectures or compare them against transformer baselines on your own workloads, the General Compute API supports a range of open models and is built specifically for the kinds of latency-bound applications where the choice of architecture starts to matter. Documentation and a sandbox are at generalcompute.com.