mamba · state-space-models · inference · architecture · long-context

Mamba and State Space Models: Inference Without Attention

General Compute

If you have spent any time profiling a transformer at long context, you already know where the time goes. The attention operator scales linearly with sequence length per token, the KV cache grows without bound, and HBM bandwidth becomes the wall you keep hitting. Most of the optimization work in inference systems over the last few years has been about chipping away at this: paged attention, KV compression, sliding windows, prefix caching. Each helps. None of them change the underlying scaling.

State space models take a different route. Instead of carrying the full history of keys and values, they compress everything seen so far into a fixed-size hidden state and update it recurrently. Mamba is the most prominent example, and it has shown that an SSM-based architecture can match transformers on a wide range of language benchmarks while keeping per-token inference cost constant in sequence length. This post walks through what an SSM actually is, what Mamba changes, and why the math shakes out the way it does at serving time.

What a State Space Model Is

A state space model is the standard form for any system whose evolution can be described by a hidden state that updates over time. In continuous time it looks like this:

h'(t) = A h(t) + B x(t)
y(t) = C h(t) + D x(t)

Here h(t) is the hidden state, x(t) is the input signal, y(t) is the output, and A, B, C, D are matrices that parameterize the dynamics. Control engineers and signal processing people have been using this form for decades. The interesting thing for deep learning is that you can discretize it and treat it as a sequence model.

Discretization gives you the recurrent form:

h_t = A_bar h_{t-1} + B_bar x_t
y_t = C h_t

where A_bar and B_bar are the discretized versions of A and B (typically via a zero-order hold or bilinear transform with a step-size parameter), and the D term from the continuous form is usually treated as a simple skip connection rather than folded into the recurrence. At inference, this is just a recurrence: store h, get x_t, compute h_t and y_t, move on. The state is fixed size. The per-token compute does not depend on how many tokens came before.
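To make the recurrence concrete, here is a minimal sketch for a single channel with a diagonal A (the structured form S4D and Mamba use), with a zero-order-hold discretization. The function names, shapes, and plain-numpy style are illustrative assumptions, not code from any particular library:

import numpy as np

def discretize_zoh(A, B, delta):
    # Zero-order-hold discretization for a diagonal A.
    # A: (n,) diagonal transition, B: (n,) input matrix, delta: scalar step size.
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_step(h_prev, x_t, A_bar, B_bar, C):
    # One recurrent update: a fixed-size state goes in, a fixed-size state comes out.
    h_t = A_bar * h_prev + B_bar * x_t   # elementwise, size n
    y_t = C @ h_t                        # scalar output for this channel
    return h_t, y_t

Running ssm_step in a loop over tokens is the entire inference procedure; nothing in it grows with sequence length.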

The S4 paper from Albert Gu and collaborators showed that if you parameterize A carefully (using a structured form derived from HiPPO theory), an SSM can capture long-range dependencies better than transformers on synthetic long-range benchmarks such as Long Range Arena, while running with linear complexity in sequence length. S4 was the first credible signal that this whole line of work could actually compete on real tasks.

Why S4 Was Not Enough

S4 and its successors (S4D, S5, GSS) demonstrated the asymptotic benefits but had a clear weakness on language. The dynamics in those models are linear time-invariant: A_bar and B_bar do not depend on the input. The same recurrence runs regardless of what token shows up. This is fine for signals where the relevant structure is roughly stationary, but language is not. The model needs to be able to ignore some tokens and pay close attention to others, depending on context. A linear time-invariant SSM cannot do that.

You can see this concretely in tasks like selective copying, where the model has to remember a few specific tokens from a long sequence and ignore the rest. S4 struggles. A transformer with attention can just put weight on the relevant positions. The SSM, with its fixed dynamics, has no mechanism to selectively attend.
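A schematic instance of the task makes the problem obvious. The tokens below are purely illustrative:

# The model sees a long stream in which a few marked tokens must be remembered
# and reproduced at the end; everything else is noise to be ignored.
inputs  = ["n", "n", "A", "n", "n", "n", "B", "n", "C", "n", "n", "<copy>"]
targets = ["A", "B", "C"]

A time-invariant SSM applies the same update at every position, so the marked tokens get written into the state with exactly the same weight as the noise.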

Linear attention has the same problem in a different form. The state update is content-independent in the sense that every token contributes the same way to the running summary. The model has no gate to say "remember this" or "forget that". This is part of why fixed-feature linear attention has historically lagged on language benchmarks.

What Mamba Changes

Mamba (built around the selective SSM the paper calls S6, introduced by Gu and Dao in late 2023) makes the SSM parameters input-dependent. Specifically, B, C, and the discretization step Delta become functions of the current input token:

B_t = Linear_B(x_t)
C_t = Linear_C(x_t)
Delta_t = softplus(Linear_Delta(x_t))

The transition matrix A stays as a structured per-channel parameter, but the way each input gets written into the state and the way the state gets read out are now content-aware. This is the selectivity that S4 was missing. The model can effectively decide, per token and per channel, how much to update the state and how much of it to expose.
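A sketch of that parameterization for one token, with hypothetical projection matrices standing in for the learned linear layers (in the actual Mamba block, Delta goes through a low-rank projection and a broadcast, which is elided here):

import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_params(x_t, W_B, W_C, W_delta, b_delta):
    # x_t: (d,) token representation; W_B, W_C: (n, d); W_delta: (d, d); b_delta: (d,)
    B_t = W_B @ x_t                              # (n,) how this token writes into the state
    C_t = W_C @ x_t                              # (n,) how the state is read out at this position
    delta_t = softplus(W_delta @ x_t + b_delta)  # (d,) per-channel step size, always positive
    return B_t, C_t, delta_t

A large delta on a channel lets the incoming token overwrite that channel's state; a delta near zero leaves the state mostly untouched and ignores the input. That is the "remember this, forget that" gate the time-invariant models lack.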

The trade-off is that input dependence breaks the convolutional view that S4 and friends used for fast parallel training. With time-invariant SSMs, you can express the entire sequence as a long convolution and use FFT-based kernels to train efficiently. Once B, C, and Delta depend on x, the convolutional form is gone. You are back to a recurrence, which on GPUs is bad news unless you are very careful about how you implement it.

The Mamba paper's key engineering contribution is a hardware-aware parallel scan kernel. The recurrence

h_t = A_bar_t h_{t-1} + B_bar_t x_t

is associative in a particular sense: you can compute it with a parallel scan (a generalization of prefix sum) in O(log n) depth on a parallel machine. The kernel keeps the state in SRAM, fuses the discretization and the scan, and avoids materializing the full sequence of states in HBM. This is structurally similar to what FlashAttention does for attention: the win is not new math, it is hardware-aware execution. With this kernel, Mamba trains at throughput comparable to a similarly-sized transformer.
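To see why the recurrence is scan-friendly, note that each step is an affine map h -> a*h + b (with a_t = A_bar_t and b_t = B_bar_t x_t), and composing two affine maps yields another affine map. That composition is the associative operator a scan needs. A minimal reference version follows; it is sequential for clarity, whereas the real kernel applies the same operator in a tree with O(log n) depth, fused with discretization and kept in SRAM:

import numpy as np

def combine(left, right):
    # Compose two affine maps h -> a*h + b, with `left` applied first.
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def scan_states(a, b, h0=0.0):
    # Reference prefix scan: returns h_1..h_T for h_t = a_t * h_{t-1} + b_t.
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))  # identity map
    states = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        states.append(acc[0] * h0 + acc[1])
    return states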

Why Inference Is Fast

At inference, you do not need the parallel scan. Generation is one token at a time, so the recurrence runs sequentially: take the current state, compute B_t, C_t, Delta_t from the new token, do one update, emit one output. The cost per token is fixed.

Concretely, for a Mamba layer with hidden dimension d and SSM state dimension n, each step does roughly:

  • Project x_t to get the input-dependent parameters: O(d^2) or so depending on the parameterization.
  • Discretize: cheap, element-wise.
  • Update the state: h_t = A_bar_t * h_{t-1} + B_bar_t * x_t is a per-channel operation of size d * n.
  • Read out: y_t = C_t * h_t is another d * n operation.

Total per-token compute is O(d * n) for the SSM part, plus the standard projection costs that any layer has. Memory per layer is d * n floats for the state. None of this depends on sequence length.
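Putting the pieces together, one decode step in the simplified per-channel form sketched above might look like this; the weight shapes and the Euler-style discretization of B are assumptions for illustration, not the exact layout of the reference implementation:

import numpy as np

def decode_step(h_prev, x_t, A, W_B, W_C, W_delta, b_delta):
    # h_prev: (d, n) per-channel state; x_t: (d,) current token; A: (d, n).
    # Nothing below depends on how many tokens have already been generated.
    B_t = W_B @ x_t                                       # (n,)
    C_t = W_C @ x_t                                       # (n,)
    delta_t = np.log1p(np.exp(W_delta @ x_t + b_delta))   # (d,) softplus
    A_bar = np.exp(delta_t[:, None] * A)                  # (d, n) discretized transition
    B_bar = delta_t[:, None] * B_t[None, :]               # (d, n) simplified input discretization
    h_t = A_bar * h_prev + B_bar * x_t[:, None]           # (d, n) state update, O(d*n)
    y_t = h_t @ C_t                                       # (d,) read-out, O(d*n)
    return h_t, y_t

The state h_t is the only thing carried between tokens, and its size is d * n regardless of position.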

Compare this to a transformer at position t in a long generation. Each layer reads t keys and t values out of HBM to compute attention. The compute is O(d * t) per layer per token, and it grows. At long contexts, you are bandwidth-bound on the KV cache and the entire decoder stalls waiting for HBM. The Mamba layer just reads the fixed state and moves on.

This is the inference shortcut, and it is qualitatively similar to what RWKV achieves with linear attention. The difference is that Mamba's selectivity gives it a way to compete on language benchmarks where pure linear attention has historically come up short.

Memory and Throughput

The constant-state-size property has practical consequences beyond just per-token cost.

A 7B-parameter transformer running at 128K context can easily need tens of gigabytes of KV cache per request. Serving multiple long-context requests in parallel becomes a memory packing problem. Paged attention, prefix sharing, and aggressive eviction strategies exist because the cache is the dominant resource.

A Mamba model of similar size at the same context has a fixed state per layer per request, on the order of a few megabytes total. You can pack many more concurrent long-context requests into the same GPU. Throughput on long-context workloads ends up being substantially better, not just because per-token compute is lower, but because you stop being memory-pressured.
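The arithmetic is worth doing once. The dimensions below are assumptions chosen to be representative of a 7B-class model, not measurements of any specific checkpoint:

ctx = 128 * 1024                 # context length in tokens
bytes_fp16 = 2

# Transformer KV cache: 2 (K and V) * d_model elements per token per layer.
# Assumes full multi-head attention in fp16, no grouped-query attention.
layers_tf, d_model = 32, 4096
kv_bytes = 2 * d_model * bytes_fp16 * layers_tf * ctx
print(f"KV cache at 128K: {kv_bytes / 2**30:.0f} GiB")           # ~64 GiB per request

# Mamba recurrent state: d_inner * state_dim per layer, independent of context.
layers_ssm, d_inner, state_dim = 64, 4096, 16
ssm_bytes = d_inner * state_dim * bytes_fp16 * layers_ssm
print(f"SSM state at any context: {ssm_bytes / 2**20:.0f} MiB")  # ~8 MiB per request

Grouped-query attention and cache quantization shrink the first number, but not its dependence on context length; the second number simply does not move.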

This also matters for streaming. A voice agent or transcription system that runs for hours needs a way to keep up without context management heroics. Mamba's state simply is the history, compressed. There is no eviction policy to design, no chunking strategy, no summarization step. The state keeps absorbing new input indefinitely, and per-token latency stays flat.

Where SSMs Still Lag

Mamba is competitive with transformers on perplexity at the scales it has been trained at, and it generally wins on long-range tasks where the asymptotic advantage shows up. It is not a strict superset of attention, though, and the gap is real on a few specific things.

Exact recall over long contexts is harder. Compressing the entire history into a fixed-size state forces lossy summarization. Standard attention can pull any prior token verbatim, because the KV cache stores them explicitly. Mamba cannot, in general. Needle-in-a-haystack tests and tasks that require pinpoint retrieval from a long passage are where this shows up most clearly. Recent SSM variants and hybrid architectures have made progress here, but the underlying tension is structural.

In-context learning patterns can be different. Some of the tricks that work well with attention (looking up exemplars, copying spans, doing precise multi-hop reasoning across a prompt) lean on the same lookup capability. Mamba can simulate these to a degree, but the inductive biases are different and prompts that were tuned for transformers do not always port cleanly.

Tooling is less mature. The transformer ecosystem has years of optimized serving stacks, quantization recipes, fine-tuning libraries, and adapter frameworks. SSMs are catching up, but if you want to run a Mamba model in production today, expect to do more work than you would with a comparable Llama checkpoint.

Hybrids Are Probably the Right Answer

A growing line of work interleaves a small number of attention layers with many SSM or linear-attention layers. Jamba, Zamba, Samba, and the various Mamba+attention designs all share this idea. The intuition is that attention is good at exact recall and selective lookup, SSMs are good at cheap long-range mixing, and you want a small dose of the first inside a stack that is otherwise the second.

Empirically, these hybrids tend to keep most of the inference speed advantage of pure SSMs while closing the recall gap that pure SSMs leave. They also fit well with existing serving infrastructure, since the attention layers can use standard KV cache machinery while the SSM layers ride alongside with their fixed states. For production workloads at long context, the ratio of attention to SSM layers becomes a real tuning knob, and the right ratio depends on what your application actually needs.
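As a trivial illustration of that knob, a hybrid stack is often described by an interleaving pattern like the one below; the function and the one-in-eight ratio are hypothetical, not the layout of any named model:

def layer_pattern(n_layers, attn_every=8):
    # One attention layer per `attn_every` layers, SSM everywhere else.
    return ["attention" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

print(layer_pattern(32))   # 28 SSM layers, 4 attention layers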

When to Reach for an SSM

The case for a state space backbone is clearest when sequence length matters more than peak benchmark accuracy on retrieval-heavy tasks. Voice agents, real-time transcription, document processing pipelines, and on-device assistants are all natural fits. The constant memory and constant per-token compute change what is feasible: workloads that are economically painful with full attention become routine.

For chat applications with bounded context windows, the math is less compelling. A 32K-token coding session is not where the asymptotic advantage shows up, and the transformer ecosystem is more mature. The interesting decisions are for new products where context length is in the design space, or for serving infrastructure that needs to handle very long requests at scale.

If you want to benchmark SSM models against transformer baselines on your own workloads, or test fast inference for either architecture, the General Compute API supports a range of open models and is built for the latency-bound applications where the architecture choice actually matters. Documentation and a sandbox are at generalcompute.com.
