Agent Readout
DeepSeek R1: What It Is, How It Works, and Why It Matters
DeepSeek R1 is an open-weight reasoning model trained mostly through reinforcement learning. Here is how its architecture and training work, how it compares to GPT-4 class models, Claude, and Llama, and what its reasoning style means for inference.
- Author
- General Compute
- Published
- 2026-06-04
- Tags
- deepseek-r1, reasoning-models, reinforcement-learning, open-source-llm
Markdown body
When DeepSeek released R1, the thing that got everyone's attention was not a single benchmark number. It was the combination: a model that reasons through hard math and coding problems at roughly the level of the best closed reasoning models, released with open weights, trained for a fraction of the budget people assumed was required. R1 made the case that strong reasoning ability could come out of reinforcement learning applied directly to a base model, with far less hand-labeled data than the field had been treating as mandatory. This post explains what DeepSeek R1 actually is, how its architecture and training pipeline work, how it stacks up against GPT-4 class models, Claude, and Llama, and what its long reasoning traces mean when you actually serve it in production. ## What DeepSeek R1 Is DeepSeek R1 is a large language model built for reasoning. That means it does not answer immediately. It generates a long internal chain of thought, often hundreds or thousands of tokens, before it produces a final answer. You can think of it as the open-weight counterpart to the reasoning models that OpenAI and others ship behind an API, except you can download the weights and run them yourself. R1 is built on top of DeepSeek's V3 base model, which is a Mixture of Experts (MoE) architecture. The full model has around 671 billion total parameters, but only about 37 billion are active for any given token because the MoE router selects a small subset of experts per token. This is the central trick that makes R1 economical: it has the knowledge capacity of a very large dense model but the per-token compute cost of a much smaller one. DeepSeek released the model in a few forms. There is R1-Zero, an experiment trained purely with reinforcement learning and no supervised fine-tuning. There is the full R1, which adds some supervised data to fix the rough edges. And there is a set of distilled models, smaller dense networks (Qwen and Llama based, from 1.5B up to 70B) that were trained to imitate R1's reasoning traces. The distilled versions matter a lot in practice because most teams cannot serve a 671B MoE model, but they can run a 14B or 32B dense distill on a single GPU. ## The Architecture R1 inherits its backbone from DeepSeek V3, so understanding R1 means understanding a few choices in that base model. The first is Mixture of Experts. Instead of one large feed-forward network in each transformer layer, the model has many smaller expert networks plus a router that decides which experts handle each token. DeepSeek V3 uses fine-grained experts with a shared expert that is always active, which helps the model keep general knowledge in one place while specialized experts handle narrower patterns. Because only a fraction of experts fire per token, the active parameter count stays low. The second is Multi-Head Latent Attention (MLA). Standard attention stores a key and value vector per token per head in the KV cache, and that cache is what eats memory during long-context inference. MLA compresses the keys and values into a smaller latent vector and reconstructs them on the fly, which shrinks the KV cache substantially. For a model that generates very long reasoning chains, the KV cache size is a real constraint, so MLA is not a minor detail. It directly affects how much context you can hold and how many requests you can batch. The third is that V3 was trained with FP8 in large parts of the pipeline, which cut training cost and is part of why the reported budget was so low. That is a training-side decision more than an inference one, but it is part of why R1 exists at the price it does. ## How R1 Was Trained The training story is the most interesting part of R1, because it pushed on an assumption the field had been making. The usual recipe for a reasoning model looks like this: take a base model, do supervised fine-tuning on a large pile of human-written reasoning examples, then apply reinforcement learning from human feedback to polish it. The supervised reasoning data is expensive because someone has to write or curate step-by-step solutions. DeepSeek's R1-Zero experiment skipped the supervised step entirely. They took the V3 base model and applied reinforcement learning directly, using a rule-based reward. For math and coding problems where the answer can be checked automatically, you do not need a human to grade the output. You just check whether the final answer is correct or whether the code passes the tests, and reward the model accordingly. They also rewarded the model for putting its reasoning inside specific tags and producing a clean final answer. The reinforcement learning algorithm they used is called Group Relative Policy Optimization (GRPO). The short version is that for each problem, the model generates a group of candidate answers, and the advantage for each answer is computed relative to the average score of the group rather than against a separately trained value network. Dropping the value network saves a lot of memory and compute, which matters when your rollouts are thousands of tokens long. What came out of R1-Zero was striking. With no supervised reasoning examples at all, the model learned on its own to generate longer chains of thought over the course of training, to check its own work, and to backtrack when a line of reasoning was not working. There is a now-famous moment in the technical report where the model writes something like "wait, let me reconsider" mid-solution. Nobody taught it to do that. It emerged because longer, more careful reasoning produced correct answers more often, and the reward favored correct answers. R1-Zero had problems, though. Its outputs were sometimes hard to read, mixed languages, and were not well suited to general chat. So the full R1 adds a small amount of high-quality supervised data as a cold start before the reinforcement learning, plus a later stage that brings in general helpfulness and safety. The result keeps the strong reasoning from the pure-RL approach but behaves like a usable assistant. ## The Distilled Models The distilled models deserve their own mention because they are what most people will run. DeepSeek took the reasoning traces R1 produces and used them as supervised training data for smaller dense models like Qwen 2.5 and Llama 3. The smaller models learn to imitate the long reasoning style. The interesting finding was that distillation beat running reinforcement learning directly on the small models. A 32B model distilled from R1 outperformed the same 32B model trained with its own RL run. The reasoning ability discovered by the big model transfers down better than it can be rediscovered from scratch at small scale. For a team that wants reasoning on a budget, the practical answer is usually a distilled model rather than the full R1. ## How R1 Compares Here is the rough picture, keeping in mind that benchmark numbers move and you should test on your own workload. On math-heavy benchmarks like AIME and MATH, R1 lands in the same neighborhood as OpenAI's o1, which was the strongest reasoning model when R1 launched. On competition coding benchmarks and on reasoning-oriented evaluations, R1 is competitive with o1 and clearly ahead of non-reasoning models like the original GPT-4, Claude's standard (non-extended-thinking) responses, and Llama's instruct models. The reasoning training is what closes that gap. Against GPT-4 class general models, the comparison is more about task type than a single winner. For problems that benefit from step-by-step deduction, math, algorithmic coding, logic puzzles, R1 tends to do better because it spends tokens thinking. For tasks where a fast direct answer is fine, such as straightforward writing, summarization, or simple lookups, a non-reasoning model is often just as good and far cheaper, because it does not burn hundreds of tokens on a chain of thought you never read. Against Claude and Llama specifically: Claude's strongest reasoning shows up when you enable its extended thinking mode, and at that point the two are in similar territory on hard reasoning, with each having strengths depending on the domain. Llama's open models are strong general-purpose models but were not trained with the same reasoning-focused RL, so on the hardest math and coding problems R1 and its distills generally pull ahead. The piece that made R1 matter beyond benchmarks is that it is open weight and was trained cheaply. It showed that the reasoning capability was not a moat that required an enormous closed budget, and that changed the conversation about who can build these models. ## What R1 Means for Inference If you decide to serve R1 or one of its distills, the reasoning style changes how you think about cost and latency. The main thing is token count. A reasoning model does not answer in 50 tokens. It might generate 2,000 tokens of internal reasoning before a 100-token answer. You pay for all of those tokens, and more importantly, the user waits for all of them. Time to first token is the same as any model, but time to a useful answer is much longer because the useful answer comes at the end of a long generation. That has a few consequences: - **Throughput per request drops** because each request occupies the GPU for far longer. Continuous batching helps you keep utilization high across many concurrent requests, but a single reasoning request is expensive. - **The KV cache grows large** because of the long sequences. This is exactly why MLA in the base model is useful, and it is why you want a serving stack with paged KV cache and prefix caching so shared system prompts are not recomputed. - **Speed compounds.** If you put a reasoning model inside an agent that calls it many times in a loop, the long generations stack up. A model that is twice as fast per token turns a 40-second reasoning step into 20 seconds, and across a multi-step agent that difference is the gap between usable and not. This is the part where fast inference stops being a nice-to-have. Reasoning models trade tokens for accuracy, and tokens are latency, so the value of running them on hardware tuned for high token throughput is larger here than for ordinary chat models. Serving a distilled R1 on infrastructure built for low latency and high throughput is what makes the reasoning practical inside a real product rather than a demo you wait on. ## When to Reach for R1 R1 and its distills are a good fit when the task genuinely needs reasoning: hard math, algorithmic or competitive coding, multi-step logic, anything where a model that checks its own work beats one that answers from the hip. They are a poor fit when you need a fast, cheap response and the problem does not require deliberation, because you will pay for reasoning tokens that add nothing. A common pattern is to route. Send easy requests to a fast non-reasoning model and only escalate the hard ones to a reasoning model. That keeps your average cost and latency low while still having the deep reasoning available when a query needs it. ## Trying It The distilled R1 models are the easiest entry point. A 14B or 32B distill runs on a single modern GPU and gives you most of the reasoning behavior without the operational weight of a 671B MoE model. If you want to test R1's reasoning behind an OpenAI-compatible API without managing the serving stack yourself, you can point your existing client at General Compute's endpoint and swap the model name. The long reasoning traces are exactly the kind of workload that benefits from inference tuned for token throughput, so it is a reasonable place to see how the model behaves under real latency budgets. Whatever path you take, the thing to internalize about R1 is the shift it represents: strong reasoning came out of reinforcement learning against checkable rewards, the weights are open, and the cost to serve it is dominated by how many tokens it thinks through. Plan your infrastructure around that and the model earns its keep.