How to Fine-Tune Llama 4: Step-by-Step Guide with Code

A practical walkthrough for fine-tuning Llama 4: when to do it, how to prepare data, and working LoRA, QLoRA, and full fine-tune code, plus evaluation and deployment.

Author: General Compute
Published: 2026-06-07
Tags: llama4, fine-tuning, lora, qlora, how-to


Fine-tuning Llama 4 is the process of taking the open-weight model Meta released and continuing its training on your own data so it behaves the way you need. Most teams reach for it when prompting alone stops being enough: the model keeps drifting from a format you require, it does not know your domain vocabulary, or you want to bake a long system prompt into the weights so every request gets cheaper and faster. This guide covers when fine-tuning is the right call, how to prepare data, and three concrete approaches with code you can run. The focus is on getting a working pipeline rather than chasing a leaderboard score.

Llama 4 is a Mixture of Experts family, which changes a few practical details compared to the dense Llama 3 models, but the core fine-tuning workflow is the same one you would use for any modern transformer. If you have fine-tuned before, most of this will be familiar and you can skip to the code.

## Decide Whether You Actually Need to Fine-Tune

Fine-tuning is not free. It costs GPU time, it adds a model artifact you have to host and version, and it can make the model worse at general tasks if you overdo it. Before you start, rule out the cheaper options.

Try these first:

- **Better prompting.** A clear system prompt with two or three examples solves a surprising number of formatting and tone problems.
- **Retrieval (RAG).** If the problem is that the model does not know facts about your business, retrieval is usually a better fit than fine-tuning. Fine-tuning teaches behavior and style; it is a poor way to store knowledge that changes often.
- **Structured output modes.** If you only need reliable JSON, use a JSON or schema-constrained decoding mode instead of training.

Fine-tune when you need consistent behavior that prompting cannot hold: a specific output structure across thousands of calls, a domain style, a classification head, or a way to shrink a giant system prompt into the weights. Once you have decided it is worth it, the next decision is which method to use.

## Three Approaches, Briefly

There are three common ways to fine-tune Llama 4, and they trade memory for fidelity.

**Full fine-tuning** updates every weight in the model. It gives you the most control and the best ceiling on quality, but for Llama 4 it needs serious hardware: multiple high-memory GPUs even for the smaller Scout variant, because you have to hold the weights, the gradients, and the optimizer states all at once. Most teams do not need this.

**LoRA** (Low-Rank Adaptation) freezes the original weights and trains small adapter matrices that sit alongside selected layers. You end up training well under one percent of the parameters, so memory drops sharply and the adapter file is small (often tens to a few hundred megabytes, depending on the rank and how many layers you target). Quality is close to full fine-tuning for most tasks.

**QLoRA** is LoRA on top of a 4-bit quantized base model. The frozen weights are stored in 4-bit, which cuts memory again, and the LoRA adapters train in higher precision. This is what lets people fine-tune large models on a single GPU. It is the default choice for most projects, so we will spend the most time on it.

The rule of thumb: start with QLoRA, move to LoRA if you have memory to spare and want a small quality bump, and reach for full fine-tuning only when you have measured that the adapters are leaving real quality on the table.
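
To make the memory trade-off concrete, here is a back-of-envelope sketch. Every constant in it is an assumption for illustration: bf16 weights and gradients at 2 bytes each, two fp32 Adam moments at 8 bytes per trainable parameter, 4-bit base weights at roughly 0.5 bytes per parameter, and adapters covering about 0.5 percent of the parameters. It ignores activations, quantization metadata, and framework overhead, so treat the output as a lower bound. The `109e9` figure is Scout's approximate total parameter count.

```python
GIB = 1024 ** 3

# Assumed bytes per parameter for the base weights under each method.
BASE_WEIGHT_BYTES = {"full": 2.0, "lora": 2.0, "qlora": 0.5}


def estimate_memory_gib(n_params: float, method: str,
                        trainable_fraction: float = 0.005) -> float:
    """Rough memory for weights, gradients, and Adam states, in GiB."""
    if method not in BASE_WEIGHT_BYTES:
        raise ValueError(f"unknown method: {method}")
    base = n_params * BASE_WEIGHT_BYTES[method]
    if method == "full":
        trainable = n_params
        adapter_weights = 0.0
    else:
        trainable = n_params * trainable_fraction
        adapter_weights = trainable * 2.0  # bf16 adapter matrices
    grads = trainable * 2.0       # bf16 gradients
    optimizer = trainable * 8.0   # two fp32 Adam moments
    return (base + adapter_weights + grads + optimizer) / GIB


for method in ("full", "lora", "qlora"):
    print(f"{method:>5}: ~{estimate_memory_gib(109e9, method):,.0f} GiB")
```

The ordering matters more than the exact numbers: full fine-tuning pays for gradients and optimizer states on every parameter, while the adapter methods pay for them only on the small trainable slice.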

## Step 1: Prepare Your Dataset

Data quality matters more than method. A clean dataset of 1,000 examples will usually beat a noisy dataset of 50,000. For instruction fine-tuning, the standard format is a list of chat-style messages.

Use the same chat structure the model expects at inference. A single training example looks like this:

```json
{
  "messages": [
    {"role": "system", "content": "You are a support agent for Acme Cloud."},
    {"role": "user", "content": "How do I rotate my API key?"},
    {"role": "assistant", "content": "Open Settings, then API Keys, and click Rotate. Your old key stays valid for 24 hours."}
  ]
}
```
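
Before training, it helps to check that every record follows this shape. The validator below is a minimal sketch using only the standard library; the function name and the specific checks (known roles, non-empty content, assistant reply last) are our own choices for illustration, not a required schema.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant reply")
    return problems
```

Run it over each line of the file and fix or drop any record that returns a non-empty list.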

Store the dataset as JSON Lines, one object per line. A few practical rules that save pain later:

- **Match production.** If your live prompts include a system message, include it in training too. The model learns the mapping you show it.
- **Balance length.** A dataset where every answer is one sentence will teach the model to be terse everywhere. Mix in the response lengths you actually want.
- **Hold out a test split.** Set aside 5 to 10 percent of examples the model never sees during training so you can measure real improvement.
- **Deduplicate.** Near-duplicate examples inflate your count without adding signal and can cause overfitting.
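
The deduplication and hold-out rules can be sketched with the standard library. This version only catches duplicates that become identical after lowercasing and collapsing whitespace; fuzzier near-duplicate detection would need a similarity measure. The helper names and the 10 percent default are assumptions for the example.

```python
import hashlib
import json
import random


def normalize(record: dict) -> str:
    """Canonical text for duplicate detection: lowercase, collapsed whitespace."""
    parts = [f"{m['role']}:{' '.join(m['content'].lower().split())}"
             for m in record["messages"]]
    return "\n".join(parts)


def dedupe_and_split(records: list[dict], test_fraction: float = 0.1,
                     seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Drop duplicates after normalization, then hold out a test split."""
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    random.Random(seed).shuffle(unique)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


def write_jsonl(path: str, records: list[dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Deduplicating before splitting is deliberate: it keeps a copy of a training example from leaking into the test split.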

Load it with the `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "test": "data/test.jsonl",
})
```

## Step 2: QLoRA Fine-Tune (the Default Path)

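This is the approach most teams should start with. For the Scout variant it targets a single high-memory GPU: Scout has roughly 109B total parameters (17B active per token), so the 4-bit weights alone come to about 55 GB before adapters, activations, and overhead. The code uses Hugging Face `transformers`, `peft` for the LoRA adapters, `bitsandbytes` for 4-bit quantization, and `trl` for the training loop.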

Install the dependencies:

```bash
pip install "transformers>=4.51" peft bitsandbytes trl datasets accelerate
```

Load the model in 4-bit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

Define the LoRA configuration. The `target_modules` list tells PEFT which layers get adapters. Targeting the attention projections and the feed-forward projections is a reliable default:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```

A note on `r`, the rank: it controls how much capacity the adapter has. A rank of 8 to 16 is enough for most tasks. Going higher adds parameters and rarely helps unless you are teaching the model something genuinely new and complex. Keep `lora_alpha` at roughly twice the rank as a starting point.
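
To see how rank translates into capacity, the sketch below counts the parameters one adapter adds to a single weight matrix and computes the standard LoRA scaling factor `lora_alpha / r`. The hidden size of 5120 is an illustrative assumption, not a value read from the model config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the adapter update in standard LoRA."""
    return lora_alpha / r


hidden = 5120  # illustrative hidden size, an assumption for this example
full = hidden * hidden
for r in (8, 16, 64):
    added = lora_param_count(hidden, hidden, r)
    print(f"r={r:>2}: {added:,} adapter params "
          f"({added / full:.2%} of the {full:,} in the frozen matrix)")
```

Adapter size grows linearly with `r`, and keeping `lora_alpha` at twice the rank holds the scaling factor at 2.0 as you change the rank.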

Now set up the trainer and run it:

```python
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./llama4-scout-qlora",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    eval_strategy="epoch",  # evaluate on the test split each epoch
    save_strategy="epoch",
    max_seq_length=2048,  # recent TRL releases name this `max_length`
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
)

trainer.train()
trainer.save_model("./llama4-scout-qlora/final")
```

A few parameters are worth understanding rather than copying blindly:

- **Effective batch size** is `per_device_train_batch_size` times `gradient_accumulation_steps` times the number of GPUs, so on a single GPU the config above trains as if the batch size were 16 while only holding 4 examples in memory at once.
- **Learning rate** around `2e-4` is typical for LoRA. It is higher than you would use for full fine-tuning because you are training far fewer parameters.
- **Epochs** should stay low. Two is a good start. If your test loss keeps dropping you can add more, but watch for the training loss falling while test loss rises, which is overfitting.
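
The arithmetic behind these settings is easy to check by hand. The helpers below are a small sketch with names of our own choosing: one computes the effective batch size, one approximates the number of optimizer steps in a run, and one flags the overfitting pattern described above from logged loss values.

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Examples contributing to each weight update."""
    return per_device * accumulation_steps * num_gpus


def optimizer_steps(num_examples: int, epochs: int, batch: int) -> int:
    """Approximate number of weight updates over the whole run."""
    return math.ceil(num_examples / batch) * epochs


def looks_overfit(train_losses: list[float], eval_losses: list[float]) -> bool:
    """True if the latest eval loss rose while the train loss kept falling."""
    if len(train_losses) < 2 or len(eval_losses) < 2:
        return False
    return train_losses[-1] < train_losses[-2] and eval_losses[-1] > eval_losses[-2]


batch = effective_batch_size(per_device=4, accumulation_steps=4)
print(batch, optimizer_steps(num_examples=1000, epochs=2, batch=batch))
```

With 1,000 examples and the config above, the run makes only about a hundred weight updates, which is one reason a few noisy examples can have an outsized effect.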

## Step 3: Plain LoRA and Full Fine-Tuning

If you have the memory, plain LoRA is a one-line change from the QLoRA code: drop the `quantization_config` when loading the model and load it in `bfloat16` instead.

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

Everything else, including the `LoraConfig` and the trainer, stays the same. You trade more GPU memory for slightly higher quality and faster training steps, since there is no dequantization overhead on each forward pass.

Full fine-tuning means removing the `peft_config` entirely and letting the trainer update all weights. For Llama 4 this needs a multi-GPU setup with a sharding strategy like FSDP or DeepSpeed ZeRO-3, because with an Adam-style optimizer the optimizer states alone are several times the size of the bf16 weights. Lower the learning rate to around `1e-5` and use a small batch with gradient accumulation. Reach for this only after you have confirmed with evaluation that LoRA adapters are limiting you.

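One Llama 4 specific note: because it is a Mixture of Experts model, most of the feed-forward capacity lives in the experts. The `gate_proj`, `up_proj`, and `down_proj` names refer to feed-forward projections, but depending on your `transformers` version the routed experts may be stored as fused tensors rather than separate linear layers, in which case those names match only the shared expert. Inspect `model.named_modules()`, or call `print_trainable_parameters()` on the PEFT-wrapped model, to confirm what actually received adapters. If your task is narrow, you can often target only the attention projections and still get most of the benefit with fewer trained parameters.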
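
To preview which modules a `target_modules` list would touch, you can filter module names by suffix. The sketch below mirrors, to our understanding, how PEFT matches a list of names (an exact match, or the name ending in a dot followed by the target). The module names are hypothetical; on a real model they would come from `[name for name, _ in model.named_modules()]`.

```python
def matched_modules(module_names: list[str], targets: list[str]) -> list[str]:
    """Modules whose name equals a target or ends with '.<target>'."""
    return [name for name in module_names
            if any(name == t or name.endswith("." + t) for t in targets)]


# Hypothetical names for illustration only.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.feed_forward.gate_proj",
    "model.layers.0.feed_forward.up_proj",
    "model.layers.0.feed_forward.down_proj",
]
attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
print(matched_modules(names, attention_only))
```

Passing `attention_only` as `target_modules` in the `LoraConfig` is the narrower setup described above.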

## Step 4: Evaluate Before You Trust It

Training loss going down is not proof your model improved at the task you care about. You need to test on held-out examples and, ideally, on a metric tied to your real use case.

Start with a quick qualitative check. Load the adapter and run a few prompts by hand:

```python
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "./llama4-scout-qlora/final")

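messages = [
    # Include the same system message you trained with.
    {"role": "system", "content": "You are a support agent for Acme Cloud."},
    {"role": "user", "content": "How do I rotate my API key?"},
]
# Tokenize through the chat template directly so special tokens are not added twice.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))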
```

For a real evaluation, build a small scoring loop over your test split. The right metric depends on the task: exact match or F1 for classification, a JSON-validity check for structured output, or a rubric scored by a stronger model for open-ended generation. The key discipline is comparing the fine-tuned model against the base model on the same test set, so you know the training actually helped rather than just assuming it did.
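
A scoring loop along these lines might look like the sketch below. It assumes you have already generated predictions from both models for the same test prompts; the function names and the two metrics (exact match and JSON validity) are illustrative choices, and you would swap in whatever metric fits your task.

```python
import json


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def score(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Aggregate exact-match and JSON-validity rates over a test split."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need equal-length, non-empty lists")
    n = len(predictions)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in zip(predictions, references)) / n,
        "json_valid": sum(is_valid_json(p) for p in predictions) / n,
    }


def compare(base_preds, tuned_preds, references) -> dict[str, float]:
    """Per-metric gain of the fine-tuned model over the base model."""
    base, tuned = score(base_preds, references), score(tuned_preds, references)
    return {metric: tuned[metric] - base[metric] for metric in base}
```

A positive value from `compare` means the fine-tuned model improved on that metric; a value near zero means the training run did not buy you much on this test set.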

Watch for two common failure modes. Overfitting shows up as great test answers that look suspiciously like training examples and poor performance on anything slightly different. Catastrophic forgetting shows up as the model getting worse at general tasks it used to handle, which usually means you trained too long or your data was too narrow.
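
One rough way to spot the first failure mode is to measure how closely each test output resembles a training answer. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.9 threshold is an arbitrary starting point, and the comparison is quadratic in dataset size, so sample the training answers if the set is large.

```python
from difflib import SequenceMatcher


def max_train_similarity(output: str, train_answers: list[str]) -> float:
    """Highest character-level similarity between an output and any training answer."""
    return max((SequenceMatcher(None, output, t).ratio() for t in train_answers),
               default=0.0)


def flag_near_copies(outputs: list[str], train_answers: list[str],
                     threshold: float = 0.9) -> list[int]:
    """Indices of outputs that nearly reproduce a training answer."""
    return [i for i, out in enumerate(outputs)
            if max_train_similarity(out, train_answers) >= threshold]
```

A high share of flagged outputs on prompts that differ from the training prompts suggests the model is reciting rather than generalizing.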

## Step 5: Merge and Deploy

A LoRA adapter is a separate file you load on top of the base model. For deployment you can either keep them separate, which lets you swap adapters cheaply, or merge the adapter into the base weights for a single self-contained model:

```python
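# `model` is the PeftModel from Step 4, loaded on the bfloat16 base.
merged = model.merge_and_unload()  # fold the adapter into the base weights
merged.save_pretrained("./llama4-scout-merged")
tokenizer.save_pretrained("./llama4-scout-merged")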
```

Merging removes the small per-token overhead of applying the adapter at inference time, so it is worth doing for production unless you are serving many adapters from one base model.
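
What merging does is simple arithmetic: in standard LoRA, each adapted weight matrix `W` becomes `W + (lora_alpha / r) * B @ A`. The toy example below, written with plain Python lists and made-up numbers, checks that the merged matrix gives the same output as applying the base weights and the adapter separately.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the merged weight matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is 2x3 (d_out x d_in), rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # r x d_in
B = [[0.5], [1.0]]         # d_out x r
x = [[1.0], [1.0], [1.0]]  # one input column vector

scale = 2 / 1  # lora_alpha / r
# Unmerged path: base output plus the scaled adapter output.
separate = [[base[0] + scale * upd[0]]
            for base, upd in zip(matmul(W, x), matmul(B, matmul(A, x)))]
# Merged path: a single matrix multiply.
merged_out = matmul(merge_lora(W, A, B, lora_alpha=2, r=1), x)
print(separate, merged_out)
```

The two paths agree, which is why a merged model behaves the same as base plus adapter while skipping the extra multiplications at inference time.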

From here you serve the merged model the same way you would serve any Llama 4 checkpoint, behind an inference engine that handles batching and KV caching. If you would rather not manage GPUs and serving infrastructure, General Compute runs Llama 4 behind an OpenAI-compatible API, so once your fine-tune is validated you can point your existing client code at it with a base URL change. Check the [docs](https://generalcompute.com) for the supported variants and how to bring a custom checkpoint.
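
As a sketch of what "a base URL change" means in practice, the helper below builds an OpenAI-style chat completions request using only the standard library. The base URL, API key, and model name are placeholders we made up for illustration; take the real values from your provider's documentation.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="https://api.example.com/v1",   # placeholder, not a real endpoint
    api_key="YOUR_API_KEY",                   # placeholder
    model="llama4-scout-merged",              # hypothetical model name
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],
)
# Send with urllib.request.urlopen(request) once the placeholders are real.
```

Because only `base_url` and `model` change between providers, the rest of your client code can stay as it is.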

## A Realistic Workflow

Putting it together, a sane first pass looks like this: prepare a clean dataset of a few hundred to a few thousand examples that match production, run QLoRA for two epochs on the Scout variant, evaluate against the base model on a held-out split, and merge the adapter if the numbers hold up. If the results are close but not good enough, move to plain LoRA or push the rank higher before considering full fine-tuning. Most of your time should go into the data, not the hyperparameters, because that is where the quality actually comes from.

Fine-tuning rewards iteration. Your first run will teach you more about your data than any tutorial can, so get a small pipeline working end to end before you scale it up.