
Llama 4 on GeneralCompute: Getting Started Guide

General Compute

Llama 4 is Meta's latest open-weight model family, and it is the first Llama generation built around a Mixture of Experts design from the start. That changes a few things about how you pick a variant, how much hardware you need, and how you should think about cost per token. This guide walks through the variants, the practical hardware requirements, and how to get a working request running against GeneralCompute in a few minutes. By the end you should be able to make your first call, stream responses, and reason about which variant fits your workload.

If you have run Llama 3 before, most of your code will carry over without changes. The API is OpenAI compatible, so the main decisions are which variant to call and how to set a few parameters that matter more for an MoE model than they did for the older dense models.

The Llama 4 Variants

Llama 4 ships as a small family rather than a single model, and the variants differ in ways that actually matter for serving. The two you will reach for most are Llama 4 Scout and Llama 4 Maverick.

Llama 4 Scout is the smaller of the two. It uses a Mixture of Experts layout with a relatively modest number of active parameters per token, which keeps the per-token compute cost low. Scout is the variant to start with for chat, summarization, classification, and most retrieval-augmented generation work. It is also the one to use when you care about time to first token and want to keep cost down on high-volume traffic. Scout supports a very long context window, which makes it a good fit for document-heavy workloads where you stuff a lot of retrieved text into the prompt.

Llama 4 Maverick is the larger general-purpose variant. It has many more total parameters spread across more experts, so it holds more knowledge and tends to do better on harder reasoning, coding, and multi-step instruction following. Because it is an MoE model, only a fraction of those parameters fire on any given token, so the per-token cost is far lower than a dense model of the same total size would be. Maverick is the variant to choose when Scout's answers are not strong enough and you are willing to trade some latency and cost for quality.

Both variants are natively multimodal, meaning they accept image inputs alongside text. If your application needs to read screenshots, charts, or photographs, you do not need a separate vision model.

There is a third, much larger model in the family aimed at the frontier of open-weight quality. It is built for the hardest tasks and for use as a teacher model when distilling smaller models. Most teams will not serve it directly because of its size, but it is worth knowing it exists if you are benchmarking the absolute ceiling of the family.

A Note on Mixture of Experts

The reason Llama 4 behaves differently from Llama 3 at serving time comes down to the MoE design. In a dense model, every parameter participates in computing every token. In an MoE model, each transformer layer has many expert feed-forward networks plus a router that picks a small subset of them per token. The model has the knowledge capacity of a very large network but the per-token compute cost of a much smaller one.
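To make the routing step concrete, here is a toy sketch in plain Python. It is not Llama 4's actual router: the four scalar "experts", the router scores, and the top_k value of 2 are all assumptions chosen for readability. The point is only that the router scores every expert but runs just a few of them.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, experts, token_value, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs.

    router_scores: one raw score per expert (a real router computes these
                   from the token's hidden state).
    experts:       callables standing in for feed-forward networks.
    """
    probs = softmax(router_scores)
    # Indices of the top_k experts by router probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the chosen probabilities so the mixing weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = sum((probs[i] / norm) * experts[i](token_value) for i in chosen)
    return output, sorted(chosen)

# Four hypothetical experts; each just scales its input by a different factor.
experts = [lambda x, k=k: x * k for k in (1.0, 2.0, 3.0, 4.0)]
output, used = route_token([0.1, 2.0, 0.3, 2.0], experts, token_value=1.0)
print(used)    # indices of the experts that actually ran
print(output)  # weighted mix of just those experts' outputs
```

All four experts exist in memory, but only two of them do any work for this token, which is the trade-off the next section describes.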

This matters for you in two practical ways. First, throughput and cost per token are better than the total parameter count would suggest, because only the active experts run. Second, memory requirements are driven by the total parameter count, not the active count, because all the experts have to be resident in memory even though only a few fire per token. So an MoE model can be cheap to run per token while still demanding a lot of GPU memory to host.
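The split between compute and memory can be put into rough numbers. The sketch below assumes about 2 FLOPs per active parameter for a forward pass, which is a common rule of thumb, and uses approximate parameter counts from Meta's public announcement rather than anything measured for this guide. Confirm them against the model cards before relying on them.

```python
def serving_profile(total_params_b, active_params_b, bytes_per_param=2):
    """Return (weight memory in GB, approx forward GFLOPs per token, active share).

    Parameter counts are in billions, so billions * bytes gives gigabytes directly.
    """
    weight_gb = total_params_b * bytes_per_param   # driven by TOTAL parameters
    gflops_per_token = 2 * active_params_b         # driven by ACTIVE parameters
    active_share = active_params_b / total_params_b
    return weight_gb, gflops_per_token, active_share

# Approximate (total, active) parameter counts in billions. Assumptions, not
# figures from this guide.
MODELS = {
    "llama-4-scout": (109, 17),
    "llama-4-maverick": (400, 17),
}

for name, (total_b, active_b) in MODELS.items():
    gb, gflops, share = serving_profile(total_b, active_b)
    print(f"{name}: ~{gb} GB of 16-bit weights, ~{gflops} GFLOPs/token, "
          f"{share:.0%} of parameters active")
```

Under these assumptions the memory bill scales with the first number in each pair and the per-token compute with the second, which is why the two can diverge so sharply.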

On GeneralCompute the hosting and memory side is handled for you, so the part you feel directly is the favorable cost per token. But understanding the trade-off helps explain why Maverick can be both large and affordable at the same time.

Hardware Requirements

If you are calling Llama 4 through the GeneralCompute API, you do not provision any hardware. You send requests and we serve them on infrastructure tuned for these models. This is the path most teams should take, especially early on, because you skip the work of sizing GPUs, loading weights, and keeping a serving stack healthy.

It is still useful to know the rough requirements, both so you can reason about cost and so you can decide whether self-hosting ever makes sense.

For self-hosting, the binding constraint is GPU memory, and it depends on the variant and the precision you run at. As a rough guide:

  • Llama 4 Scout at 8-bit precision fits on a single high-memory data center GPU, and at 4-bit quantization it fits comfortably with room for a reasonable batch and KV cache.
  • Llama 4 Maverick needs multiple high-memory GPUs even at reduced precision, because its total parameter count is large despite the low active count. You will be looking at a multi-GPU node with a fast interconnect.
  • The largest variant requires a full multi-GPU server or more, and is generally out of reach for teams without serious infrastructure.

Quantization changes these numbers a lot. Running at 4-bit instead of 16-bit roughly quarters the memory needed for the weights, at some cost to quality. The long context windows Llama 4 supports also consume memory through the KV cache, so if you plan to use very long prompts you need to budget memory for the cache on top of the weights.
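As a back-of-the-envelope check on those two effects, the sketch below estimates weight memory at different precisions and the KV cache for a single long sequence. The 109-billion total parameter count is an approximate public figure for Scout, and the layer count, KV head count, and head dimension are hypothetical placeholders, so treat the output as an order-of-magnitude guide only. Real serving stacks also add overhead for activations and may use attention variants that shrink the cache.

```python
def weight_memory_gb(total_params_b, bits):
    """Weight memory in GB for a model with total_params_b billion parameters."""
    return total_params_b * bits / 8

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache in GB for one sequence: a key and a value vector per layer, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

# Assumed total parameter count (billions) for the smaller variant.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(109, bits):.1f} GB")

# Hypothetical attention shape, for illustration only.
cache = kv_cache_gb(131_072, layers=48, kv_heads=8, head_dim=128)
print(f"KV cache for one 128k-token sequence: ~{cache:.1f} GB")
```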

The short version: Scout is approachable to self-host, Maverick is a real infrastructure commitment, and the API removes the question entirely if you would rather not manage any of it.

Your First Request

The GeneralCompute API is OpenAI compatible, so if you have used the OpenAI SDK before, this will look familiar. You only need to change the base URL and the API key, then point the request at a Llama 4 model.

Here is a minimal call in Python:

    from openai import OpenAI

    # Point the standard OpenAI client at GeneralCompute.
    client = OpenAI(
        base_url="https://api.generalcompute.com/v1",
        api_key="YOUR_GC_API_KEY",
    )

    response = client.chat.completions.create(
        model="llama-4-scout",
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Explain what a Mixture of Experts model is in two sentences."},
        ],
    )

    print(response.choices[0].message.content)

The same request in Node.js looks like this:

    import OpenAI from "openai";

    // Point the standard OpenAI client at GeneralCompute.
    const client = new OpenAI({
      baseURL: "https://api.generalcompute.com/v1",
      apiKey: process.env.GC_API_KEY,
    });

    const response = await client.chat.completions.create({
      model: "llama-4-scout",
      messages: [
        { role: "system", content: "You are a concise technical assistant." },
        { role: "user", content: "Explain what a Mixture of Experts model is in two sentences." },
      ],
    });

    console.log(response.choices[0].message.content);

To switch to the larger variant, change the model name to llama-4-maverick. Nothing else in your code has to change.

Streaming Responses

For anything user-facing, you almost always want to stream tokens as they are generated rather than waiting for the whole response. Streaming makes the application feel responsive because the first words appear quickly, and it is the natural fit for chat interfaces and voice applications.

Streaming with the Python SDK takes one extra argument, stream=True, and a loop that reads the chunks as they arrive:

    stream = client.chat.completions.create(
        model="llama-4-scout",
        messages=[
            {"role": "user", "content": "Write a haiku about fast inference."},
        ],
        stream=True,
    )

    # Print each piece of text as soon as it arrives.
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)

Each chunk carries a small piece of the response in delta.content. You append the pieces as they arrive. For a voice agent or a coding assistant, this is what lets you start speaking or rendering before the model has finished generating.
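If you also need the full reply once the stream ends, or want to measure time to first token, a small helper can do both. This is a minimal sketch, not part of any SDK: collect_stream and fake_chunk are hypothetical names, and the fake chunks only imitate the shape of streamed chunks so the example runs without a network call.

```python
import time
from types import SimpleNamespace

def collect_stream(stream, clock=time.perf_counter):
    """Join streamed deltas into the full reply and record time to first token."""
    started = clock()
    first_token_at = None
    pieces = []
    for chunk in stream:
        if not chunk.choices:      # defensively skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = clock() - started
            pieces.append(delta)
    return "".join(pieces), first_token_at

def fake_chunk(text):
    """Build an object shaped like a streamed chunk, for offline testing."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

text, ttft = collect_stream([fake_chunk("Fast "), fake_chunk(None), fake_chunk("tokens")])
print(text)
```

With a real stream you would pass the object returned by the streaming call in place of the list of fake chunks.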

Sending Images

Both Scout and Maverick accept images in the same message format the OpenAI vision API uses. You pass an image as part of the user message content, either as a URL or as a base64 data URL:

    response = client.chat.completions.create(
        model="llama-4-scout",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What does this chart show?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                ],
            },
        ],
    )

    print(response.choices[0].message.content)

This is handy for document understanding, screenshot-driven agents, and any workflow where the input is visual rather than text.
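For the base64 route, the image bytes are wrapped in a data URL and placed where the https URL would otherwise go. The helpers below are a sketch with hypothetical names (image_data_url, image_message). The four bytes used in the demo are only the start of a PNG header, not a real image, so swap in the contents of an actual file before sending.

```python
import base64

def image_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def image_message(question, image_bytes, mime_type="image/png"):
    """Build a user message that pairs a text question with an inline image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": image_data_url(image_bytes, mime_type)}},
        ],
    }

# In practice you would read a file, e.g. open("chart.png", "rb").read()
# (the file name is a placeholder).
message = image_message("What does this chart show?", b"\x89PNG")
print(message["content"][1]["image_url"]["url"])
```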

Parameters Worth Tuning

A few request parameters have an outsized effect on quality, speed, and cost. These are the ones to set deliberately rather than leaving at defaults.

max_tokens caps the length of the response. Set it to something close to what you actually expect, because a high cap can let the model ramble and you pay for every generated token. For classification or extraction tasks, a small cap also serves as a guardrail.

temperature controls randomness. For factual or code tasks, keep it low (around 0 to 0.3) so the output is stable and repeatable. For creative writing or brainstorming, raise it. The default is a middle value that is rarely the best choice for either extreme.

top_p is an alternative way to control diversity: the model samples only from the smallest set of most probable tokens whose combined probability reaches top_p. Most people tune temperature or top_p, not both at once.
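To see what these two knobs do to the next-token distribution, here is a toy illustration over four made-up token scores. It follows the textbook definitions of temperature scaling and nucleus (top_p) filtering and is not a description of how any particular serving stack implements sampling.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution.

    temperature must be greater than 0 in this sketch.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += probs[i]
        if running >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four candidate tokens
for t in (0.2, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
print(top_p_filter(apply_temperature(logits, 1.0), top_p=0.9))
```

At the lower temperature nearly all the probability lands on the top token, while the top_p filter drops the unlikely tail entirely and leaves the rest in proportion.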

stop lets you give the model strings that end generation early. If you are generating structured output or a single line, a well-chosen stop sequence saves tokens and avoids trailing junk.
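One way to keep these settings deliberate is to define them once per task type instead of scattering them across call sites. The presets below are illustrative starting points, not recommended values, and request_kwargs is a hypothetical helper rather than part of the SDK.

```python
# Illustrative per-task settings; tune them against your own workload.
PRESETS = {
    "classification": {"max_tokens": 8, "temperature": 0.0, "stop": ["\n"]},
    "code": {"max_tokens": 1024, "temperature": 0.2},
    "brainstorm": {"max_tokens": 512, "temperature": 0.9},
}

def request_kwargs(task, model="llama-4-scout", **overrides):
    """Merge a task preset with per-call overrides into request arguments."""
    if task not in PRESETS:
        raise ValueError(f"unknown task: {task}")
    return {"model": model, **PRESETS[task], **overrides}

print(request_kwargs("classification"))

# Usage with the client from the first request:
# client.chat.completions.create(
#     messages=messages, **request_kwargs("code", max_tokens=2048)
# )
```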

For long-context work, remember that the prompt counts toward both cost and latency. A very long retrieved context will raise your time to first token because the model has to read all of it before it can start generating. If latency matters, trim the context to what is actually relevant rather than sending everything you have.
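A simple way to trim is to rank retrieved chunks by relevance and keep only what fits a token budget. The sketch below assumes your retriever returns a relevance score per chunk and uses a crude estimate of four characters per token. Both are assumptions, and a real tokenizer will give different counts.

```python
def estimate_tokens(text):
    """Very rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)

def trim_context(scored_chunks, token_budget):
    """Keep the highest-scoring chunks that fit within token_budget.

    scored_chunks: list of (relevance_score, text) pairs from your retriever.
    Returns the kept texts in their original order.
    """
    ranked = sorted(enumerate(scored_chunks), key=lambda item: item[1][0], reverse=True)
    kept, used = [], 0
    for position, (_, text) in ranked:
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            kept.append((position, text))
            used += cost
    # Restore document order so the prompt still reads naturally.
    return [text for _, text in sorted(kept)]

chunks = [(0.2, "a" * 400), (0.9, "b" * 400), (0.7, "c" * 400)]
print([text[0] for text in trim_context(chunks, token_budget=200)])
```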

Choosing Between Scout and Maverick

A simple way to decide: start with Scout, measure quality on your real tasks, and only move to Maverick if Scout falls short. Scout is faster and cheaper, and for a large share of workloads (chat, summarization, routing, retrieval answers) it is good enough. Reserve Maverick for the cases where you have measured a real quality gap, such as harder reasoning, multi-file code changes, or tricky instruction following.
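Measuring that gap does not need much tooling. The sketch below runs the same test cases against both variants and averages a score you define. compare_models, exact_match, and fake_ask are hypothetical names, and the canned answers in the stub are invented so the example runs offline. They say nothing about how either model actually performs.

```python
def compare_models(cases, ask, score, models=("llama-4-scout", "llama-4-maverick")):
    """Average a task-specific score per model over the same test cases.

    cases: list of (prompt, expected) pairs drawn from your own workload.
    ask:   callable (model, prompt) -> answer text; wrap your API call here.
    score: callable (answer, expected) -> float between 0 and 1.
    """
    results = {}
    for model in models:
        scores = [score(ask(model, prompt), expected) for prompt, expected in cases]
        results[model] = sum(scores) / len(scores)
    return results

def exact_match(answer, expected):
    """Score 1.0 when the answers match, ignoring case and surrounding whitespace."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0

def fake_ask(model, prompt):
    """Stub standing in for a real API call; the answers are made up."""
    canned = {
        "llama-4-scout": {"2+2?": "4", "capital of France?": "Lyon"},
        "llama-4-maverick": {"2+2?": "4", "capital of France?": "Paris"},
    }
    return canned[model][prompt]

cases = [("2+2?", "4"), ("capital of France?", "Paris")]
print(compare_models(cases, fake_ask, exact_match))
```

In a real run, replace fake_ask with a function that sends the prompt to the API and returns the reply text, and use a score that reflects what "good" means for your task.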

If you are running an agent that makes many model calls in a loop, the speed difference compounds. A coding agent or a voice agent makes the latency of each individual call very visible to the user, so the faster variant often wins on user experience even when the slower one scores slightly higher on a benchmark. This is exactly the kind of workload GeneralCompute is built for, where the gating factor is how quickly each step completes.
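The compounding is easy to see with a little arithmetic. The sketch below models each call as time to first token plus generation time and multiplies by the number of sequential steps. Every figure in it is hypothetical and chosen only to show the shape of the effect, so substitute your own measurements.

```python
def call_latency_s(ttft_s, output_tokens, tokens_per_s):
    """Latency of one call: wait for the first token, then generate the rest."""
    return ttft_s + output_tokens / tokens_per_s

def loop_latency_s(steps, ttft_s, output_tokens, tokens_per_s):
    """Total latency of an agent that makes `steps` sequential calls."""
    return steps * call_latency_s(ttft_s, output_tokens, tokens_per_s)

# Hypothetical figures for a 20-step agent; not measurements of any model.
fast = loop_latency_s(steps=20, ttft_s=0.2, output_tokens=200, tokens_per_s=400)
slow = loop_latency_s(steps=20, ttft_s=0.5, output_tokens=200, tokens_per_s=100)
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")
```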

Where to Go Next

Once you have a request working, the natural next steps are to add streaming everywhere it helps, tune max_tokens and temperature for your tasks, and benchmark Scout against Maverick on your own data rather than on generic benchmarks. If you are migrating an existing app, the OpenAI-compatible API means you can usually point your current code at GeneralCompute by changing the base URL and the model name, then validate the outputs.

You can find the full API reference, the current list of available model names, and pricing in the GeneralCompute docs. If you are building something latency-sensitive like a voice agent or a coding assistant, that is where Llama 4 on fast inference infrastructure earns its keep, and it is worth running your own numbers to see the difference. Grab an API key, send the first request above, and start measuring.
