# Open-Source LLM Landscape 2025: Top Models Compared
A practical map of the open-source LLM ecosystem in 2025: the leading model families, how they stack up by size and task, what the licenses actually let you do, and how to pick one for production.
- Author: General Compute
- Published: 2026-06-10
- Tags: open source llm, model comparison, llama, deepseek, qwen

The open-source LLM world moves fast enough that a comparison written six months ago is mostly wrong today. New checkpoints land every few weeks, benchmark numbers get chased and beaten, and the gap between open weights and the best closed models keeps narrowing. If you are trying to choose a model to build on, the hard part is not finding options. It is figuring out which of the dozens of releases actually matter for what you are doing.

This post is a map of where things stand in 2025. It covers the model families worth knowing, how they group by size, what their licenses really allow, and how to think about picking one. The goal is to give you a mental model that survives the next few releases, not a leaderboard that goes stale by next quarter.

## What "Open" Actually Means

Before comparing models, it helps to be precise about the word "open," because it covers a wide range. A few distinctions matter in practice.

Open weights means you can download the model parameters and run them yourself. This is the part most people care about. Almost every model in this post is open in this sense.

Open license is separate, and it is where the differences get sharp. Some models ship under Apache 2.0 or MIT, which let you do essentially anything, including commercial use, modification, and redistribution, with no strings. Others ship under custom licenses with conditions. Llama's license, for example, has an acceptable use policy and a clause that applies only to companies with more than 700 million monthly active users. For most teams that clause never applies, but you should read it rather than assume.

Open training data and open training code are rarer still. Most "open" models release weights but not the dataset or the full recipe. A smaller set of projects, like the OLMo line from Allen AI, release everything, which matters if you are doing research or need full provenance for compliance reasons.

When someone says a model is open, ask which of these they mean.
For a production deployment, the license is usually the thing that decides whether you can ship.

## The Major Families

A handful of families produce most of the models worth using. Knowing the families is more useful than memorizing individual checkpoints, because each family has a consistent philosophy that carries across releases.

### Llama (Meta)

Llama is the model most of the ecosystem is built around. The tooling, the fine-tuning guides, the quantization formats, the inference servers: they all assume Llama first and add others later. Llama 4 introduced a mixture-of-experts design with variants like Scout and Maverick, moving away from the dense models of the Llama 3 line. The license is permissive for almost everyone but is not technically open source by the strict OSI definition, because of the use policy and the large-user clause.

If you want the safest bet for ecosystem support and the largest pool of community fine-tunes, Llama is usually it.

### Qwen (Alibaba)

Qwen has quietly become one of the strongest families, especially for multilingual work and coding. The Qwen 2.5 line covers an unusually wide range of sizes, from 0.5B up to 72B, and the Qwen3-Coder variants are competitive with anything open for code generation. Most Qwen models ship under Apache 2.0, which makes them attractive when you want a clean commercial license without conditions. A few checkpoints, such as Qwen 2.5 3B and 72B, use Qwen's own license instead, so check the model card for the size you pick.

Qwen is the family to look at when you need strong performance across many languages or you want a fully permissive license.

### DeepSeek

DeepSeek made its name with aggressive efficiency. DeepSeek V3 is a large mixture-of-experts model that performs near the top of the open field while being cheaper to serve than its raw parameter count suggests, because only a fraction of the experts activate per token. DeepSeek R1 took the family into explicit reasoning, producing long chains of thought before answering, and rivals closed reasoning models on math and logic benchmarks.
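
To put rough numbers on the "only a fraction of the experts activate per token" point, here is a back-of-the-envelope sketch. It assumes DeepSeek V3's published figures of roughly 671B total and 37B activated parameters per token, plus the common approximation of about 2 FLOPs per active parameter per generated token. The dense 70B comparison and the function names are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions


def active_fraction(m: ModelShape) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return m.active_params_b / m.total_params_b


def forward_gflops_per_token(m: ModelShape) -> float:
    """Rough compute per generated token: ~2 FLOPs per active parameter."""
    return 2.0 * m.active_params_b  # billions of params -> GFLOPs


# Assumed figures: ~671B total / ~37B active for DeepSeek V3.
deepseek_v3 = ModelShape("DeepSeek V3 (MoE)", total_params_b=671, active_params_b=37)
# A generic dense model, where every parameter is active for every token.
dense_70b = ModelShape("Dense 70B (for comparison)", total_params_b=70, active_params_b=70)

for m in (deepseek_v3, dense_70b):
    print(f"{m.name}: {active_fraction(m):.1%} active, "
          f"~{forward_gflops_per_token(m):.0f} GFLOPs/token")
```

The catch is memory: all of the parameters still have to be resident to serve the model, so the saving is in compute per token, not in how much hardware you need to hold the weights.
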
The MIT license on much of the DeepSeek lineup is about as permissive as it gets. DeepSeek is the family to watch when you care about reasoning quality or serving cost per token at the high end.

### Mistral

Mistral built its reputation on small models that punch above their weight. Mistral 7B was, for a long time, the default 7B model, and the Mixtral mixture-of-experts releases showed how sparse models could match much larger dense ones at a fraction of the active compute. Mistral continues to ship both open and commercial models, with the open ones generally under Apache 2.0.

Mistral is a strong choice when you want efficient models in the small-to-mid range with a permissive license.

### Gemma (Google)

Gemma is Google's open-weights family, derived from the same research as the Gemini models. The Gemma 2 line, in sizes like 2B and 9B, is well regarded for quality at small sizes and integrates cleanly with the broader Google tooling. The Gemma license is custom but generally permissive for commercial use.

Gemma is worth a look for small, high-quality models, particularly if you are already in the Google ecosystem.

### The Research-Open Set

Beyond the big labs, projects like OLMo (Allen AI) and the various fully open community efforts release not just weights but data and training code. These rarely top the benchmarks, but they are valuable when you need full transparency, want to study training dynamics, or have compliance requirements that demand knowing exactly what the model was trained on.

## Comparing by Size Bucket

Raw benchmark scores matter less than picking the right size for your constraints. Models cluster into a few buckets, and the right question is which bucket fits your latency budget, your hardware, and your accuracy needs.

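
One quick way to check the hardware part of that question is to estimate how much memory the weights alone need: parameter count times bytes per parameter. The sketch below is a rough lower bound under that assumption. It ignores the KV cache, activations, and runtime overhead, and the function name and precision table are illustrative.

```python
# Approximate bytes per parameter for common weight precisions.
# Quantized formats carry a little extra metadata that is ignored here.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


for size_b in (0.5, 3, 8, 70):
    row = ", ".join(
        f"{p}: {weight_memory_gb(size_b, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{size_b}B -> {row}")
```

By this estimate an 8B model at fp16 needs about 16 GB for weights, which is roughly why that size is comfortable on a single 24 GB GPU, while a 70B model at the same precision needs about 140 GB and has to be quantized or split across devices.
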

| Size bucket | Representative models | Typical use |
| --- | --- | --- |
| Under 1B | Qwen 2.5 0.5B, TinyLlama | Edge, mobile, classification, draft models |
| 2B to 4B | Gemma 2 2B, Llama 3.2 3B, Phi-3.5 Mini | On-device assistants, cheap high-volume tasks |
| 7B to 9B | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Gemma 2 9B | General-purpose workhorse, most production apps |
| 32B to 72B | Qwen 2.5 72B, Llama 3.1 70B, QwQ-32B | Harder reasoning, higher-quality generation |
| Frontier MoE | DeepSeek V3, Llama 4 Maverick | Top-end quality, complex multi-step work |

A few patterns hold across the buckets.

The 7B-to-9B bucket is where most production applications should start. These models are good enough for a wide range of tasks, cheap to serve, fast enough for interactive use, and supported by every inference stack. You move off this bucket only when you have evidence that you need to, either down for cost and latency or up for accuracy.

The sub-4B bucket has improved dramatically. A 3B model in 2025 does things that needed a 13B model two years ago. For high-volume, well-scoped tasks like classification, extraction, or routing, a small model fine-tuned on your data often beats a large general model and costs a fraction as much to run.

The frontier mixture-of-experts models change the cost math. A model like DeepSeek V3 has a huge total parameter count but activates only a small slice per token, so it serves more like a mid-size dense model in compute terms while delivering near-top-tier quality. This is why MoE has become the dominant design at the high end.

## Comparing by Task

Size is one axis. The task is the other, and some families have clear strengths.

For coding, the Qwen3-Coder variants and DeepSeek Coder are the open models to beat. They are tuned specifically on code and consistently rank at or near the top among open models on coding benchmarks like HumanEval, MBPP, and SWE-bench. A general model of the same size will usually trail a code-specialized one on real coding work.

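
HumanEval and MBPP results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. As a minimal sketch, the standard unbiased estimator from n samples with c passing is 1 - C(n-c, k) / C(n, k). The function name and the sample numbers below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, 5 of them pass.
print(f"pass@1  = {pass_at_k(20, 5, 1):.3f}")
print(f"pass@10 = {pass_at_k(20, 5, 10):.3f}")
```

The benchmark score is the mean of this value over all problems. When comparing a general model against a code-specialized one, make sure both numbers use the same k and the same sampling settings.
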

For reasoning and math, the explicit reasoning models, DeepSeek R1 and Qwen's QwQ-32B, are the standouts. They spend extra tokens thinking before they answer, which costs more per request but lifts accuracy on problems that need multi-step logic. For a customer-support bot this is overkill. For an agent solving math or doing careful planning, it is the difference between usable and not.

For multilingual work, Qwen and the larger Llama models lead. If your users are not all writing English, check a model's multilingual benchmark results rather than assuming a model trained mostly on English will transfer.

For general chat and instruction following, almost any of the 7B-to-9B instruct models will do a competent job. This is the most commoditized part of the landscape, which is good news: you have many interchangeable options and can choose on license, speed, and serving cost rather than fighting over small quality differences.

## How to Actually Choose

Given all this, here is a process that works more often than chasing the top of a leaderboard.

Start with the constraints, not the model. Write down your latency budget, your per-request cost ceiling, the languages you need, and whether the task is reasoning-heavy or straightforward. These narrow the field faster than any benchmark.

Default to a 7B-to-9B model with a permissive license. Qwen 2.5 7B, Llama 3.1 8B, and Mistral 7B are all reasonable starting points. Build your prototype on one of them.

Move only with evidence. If your evals show the model is not accurate enough, step up a size bucket or to a task-specialized model before you reach for a frontier model. If cost or latency is the problem, step down and consider fine-tuning a smaller model on your specific task.

Check the license against your business. A model that is technically excellent but ships under a license your legal team will not approve is not an option.
Apache 2.0 and MIT models, common across Qwen, Mistral, and much of DeepSeek, avoid most of these conversations.

Run your own evals. Public benchmarks are a starting filter, not a decision. The model that wins on MMLU may lose on your actual prompts. A small evaluation set drawn from your real traffic will tell you more than any leaderboard.

## Where Inference Speed Fits

Picking a model is only half the decision. The same open weights can feel completely different depending on how they are served. A 7B model that returns its first token in 200 milliseconds is a different product from the same model that takes two seconds, even though the output is identical. For interactive applications, voice agents, and coding assistants, serving speed often matters more than squeezing out the last few points of benchmark accuracy.

This is the part of the stack that is easy to underestimate. You can choose the right model and still ship a slow product if the inference layer is not tuned for it. The open-source landscape gives you the weights; what you do with them at serving time decides how the application feels.

General Compute runs the leading open models on infrastructure built specifically for low-latency, high-throughput inference, behind an OpenAI-compatible API. If you have narrowed down a model from the landscape above and want to see how it performs when speed is the priority, you can point your existing code at our endpoint by changing the base URL and start measuring. Browse the [docs](https://generalcompute.com) to see which models are available and how to get started.

The ecosystem will keep moving. New families will appear, the benchmark numbers will keep climbing, and some of the specific checkpoints named here will be superseded. The structure, though, tends to hold: known families with consistent philosophies, clear size buckets, task specialization that beats general models, and licenses that decide what you can ship.
Keep that map in your head and the next wave of releases is much easier to read.
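
To make the serving-speed comparison from the inference section measurable, the sketch below times any stream of generated chunks and reports time to first token and total time. It is a minimal illustration: `measure_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for a real streaming response from whichever OpenAI-compatible endpoint you are testing.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamTiming:
    first_token_s: float  # time until the first chunk arrived
    total_s: float        # time until the stream finished
    chunks: int           # number of chunks received


def measure_stream(stream: Iterable[str]) -> StreamTiming:
    """Consume a stream of text chunks and record latency figures."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream never produced a first token; report total time instead.
    return StreamTiming(first if first is not None else total, total, count)


def fake_stream(first_delay_s: float, gap_s: float, n: int) -> Iterator[str]:
    """Stand-in for a model response: a slow first chunk, then steady chunks."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i} "


timing = measure_stream(fake_stream(first_delay_s=0.05, gap_s=0.005, n=10))
print(f"first token: {timing.first_token_s * 1000:.0f} ms, "
      f"total: {timing.total_s * 1000:.0f} ms, chunks: {timing.chunks}")
```

With a real client, pass an iterator over the streamed text deltas in place of `fake_stream`, and repeat the measurement several times per endpoint, since a single run is noisy.
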