How to finetune LLMs

It's 2024, and finetuning Large Language Models (LLMs) has become a worthwhile, surprisingly efficient, and accessible process. Finetuning not only makes models more accurate and relevant for specific tasks, it can also let them surpass many closed-source models in use-case-specific performance.

In short, finetuning allows you to adapt a pre-trained LLM to your particular application, improving its performance on targeted tasks without training a model from scratch. This process can lead to significant improvements in accuracy, relevance, and efficiency for specialized applications.

In this blog post we introduce the finetuning process, using a Llama 3 model as the example. We'll cover everything from the basics of what finetuning entails, through setting up the required environment and creating your datasets, to executing the finetuning process and finally using your newly created model variant for text inference.

What is finetuning and why do we want to finetune LLMs?

Model finetuning is a process in machine learning where a pre-trained model is further trained on a specific dataset or task to improve its performance in a particular domain. In the context of Large Language Models, finetuning allows us to adapt these powerful, general-purpose models to specialized applications without the need to train a new AI model from scratch.

This process adjusts the model's parameters to better fit the new data and task requirements. The key idea is to leverage the knowledge and patterns learned by the model during its initial training on vast amounts of data, and then refine this knowledge for a more focused application.

Why is Finetuning Important?

Improved Performance: Finetuning can significantly enhance a model's performance on specific tasks or domains, often surpassing both the original pre-trained model and smaller models trained from scratch on the specific task.
Resource Efficiency: Training LLMs from scratch requires enormous computational resources and vast amounts of data. Finetuning allows us to achieve excellent results with much less data and computational power.
Customization: Finetuning enables the adaptation of general-purpose models to niche domains or specialized tasks, making them more relevant and accurate for specific use cases.
Faster Development: Instead of spending months or years developing and training a new model, finetuning allows researchers and developers to create high-performing, task-specific models in a matter of days or weeks.
Overcoming Limitations: Finetuning can help address some limitations of pre-trained models, such as biases or outdated information, by introducing new, carefully curated data.
Continuous Learning: As new data becomes available or requirements change, models can be periodically finetuned to stay up-to-date and relevant.

When to consider finetuning your own AI model

When you have a specific task or domain that differs from the general use case of the pre-trained model.
When you have a dataset that represents your specific use case or contains information not present in the original training data.
When you need to improve the model's performance on particular types of inputs or outputs.
When you want to adapt the model's style, tone, or domain-specific language.

Now that we understand the importance and benefits of finetuning LLMs, let's dive into the practical aspects of LLM finetuning.

Figure: Finetuning concept

Setting up your environment for finetuning LLMs

Before engaging in the finetuning process itself, let's get our system ready. This chapter outlines our hardware as well as software requirements and guides you step by step through the installation process. This ensures you have all the necessary tools and resources to efficiently finetune Large Language Models.

This chapter may sound extensive, but rest assured that it's not. We only need the following tools and accounts:

A Hugging Face account to download the base models and (optionally) some training datasets.
A handful of Python packages to run the training process. These are torch (PyTorch), transformers, datasets, accelerate, peft, huggingface_hub, bitsandbytes, and trl.
Hardware: While it's theoretically possible to finetune LLMs on CPUs, at the time of writing a CUDA-enabled NVIDIA GPU is highly recommended. This will speed up the training process significantly.

It's possible to use AMD, Intel, or even Apple GPUs for model finetuning. However, due to NVIDIA's dominance in the field of AI, most tools are optimized for NVIDIA hardware, so our guide assumes NVIDIA GPUs.

Hardware Requirements

The main requirement for finetuning LLMs is the GPU, and there the limiting factor is the amount of VRAM. The model itself needs to fit in GPU memory, with some headroom left for the adapters and the training process.
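As a rough back-of-the-envelope illustration of why VRAM is the limiting factor, the sketch below estimates the memory needed for the model weights alone at different precisions. The figure of 8 billion parameters is an assumed round number for an 8B-class model, and the helper name is hypothetical. Activations, adapter weights, and optimizer states come on top of these numbers.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the raw weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a round 8 billion parameters for an 8B-class model.
N_PARAMS = 8e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Treat the result as a lower bound: the real footprint during training is noticeably higher.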

If you have a GPU with the Ampere architecture or later, like the NVIDIA RTX 4090, RTX 3090 or A10, you can also benefit from Flash Attention. Flash Attention is a technique that both reduces the memory footprint of the training process and drastically speeds it up. (In short: it rearranges the attention computation to better utilize the GPU's architecture. Read more here.)
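The "Ampere or later" check can be expressed as a small helper. This is a minimal sketch: the function name is hypothetical, and it relies on the fact that Ampere GPUs report CUDA compute capability 8.x. With PyTorch installed, the major version can be read from torch.cuda.get_device_capability(); the example below uses hard-coded values so it runs anywhere.

```python
def pick_attn_implementation(compute_capability_major: int) -> str:
    """Return the value to pass as attn_implementation when loading the model.

    Ampere GPUs report CUDA compute capability 8.x, so a major version
    of 8 or higher is treated as "Ampere or later".
    """
    return "flash_attention_2" if compute_capability_major >= 8 else "eager"


# With PyTorch and a GPU available, the major version would come from:
#   major, minor = torch.cuda.get_device_capability()
print(pick_attn_implementation(8))  # e.g. RTX 3090 or A10 (capability 8.6)
print(pick_attn_implementation(7))  # e.g. an older GPU with capability 7.x
```

Note that "flash_attention_2" additionally needs the flash-attn package to be installed.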

If you are not one of the lucky people with your own GPUs, use one of the many GPU cloud providers. Two we use regularly and can recommend are runpod.io and latitude.sh.

Setting up your Hugging Face account

Navigate to the Hugging Face sign up page.
Use the form to create a new account.
Once logged in to your new account, navigate to Settings -> Access Tokens.
Create a new access token. Note the token, as we'll need it in a minute.

Installing the python packages

The final preparation step is to install the Python packages. The following steps assume Python 3.11.

First, make sure you have a virtual environment set up, using either venv or Anaconda.

We regularly use the latter and therefore recommend it.

Activate your virtual environment
Install the packages:

pip install \
  "torch==2.2.2" \
  tensorboard \
  "transformers==4.40.0" \
  "datasets==2.18.0" \
  "accelerate==0.29.3" \
  "bitsandbytes==0.43.1" \
  "huggingface_hub==0.22.2" \
  "trl==0.8.6" \
  "peft==0.10.0"

Note: Make sure to pin the library versions, as the Python ecosystem is known for introducing incompatibilities even between minor versions. Better safe than sorry...
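If you want to double-check that your environment actually matches the pins, the standard library can report installed versions. The following is a minimal sketch using importlib.metadata; the helper name is hypothetical, and local build suffixes such as "+cu121" are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# The pinned versions from the install command.
EXPECTED = {
    "torch": "2.2.2",
    "transformers": "4.40.0",
    "datasets": "2.18.0",
    "accelerate": "0.29.3",
    "bitsandbytes": "0.43.1",
    "huggingface_hub": "0.22.2",
    "trl": "0.8.6",
    "peft": "0.10.0",
}


def find_mismatches(expected: dict) -> dict:
    """Map each package whose installed version differs to what was found.

    A value of None means the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            # Drop local build tags such as "+cu121" before comparing.
            found = version(name).split("+")[0]
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = found
    return mismatches


for name, found in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {EXPECTED[name]}, found {found}")
```

An empty output means every pinned package is installed in the expected version.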

As a last step, let's initialize our Hugging Face CLI:

huggingface-cli login --token "<your-token>"

That's it, we're done.

What did we actually install?

Let's quickly introduce the libraries we installed - to get a little more familiar with the tools we use:

torch: PyTorch is an open-source machine learning library that provides a flexible and efficient framework for building and training all sorts of machine learning and deep learning models.
tensorboard: TensorBoard is a visualization toolkit for machine learning experimentation. It enables tracking and visualizing metrics such as loss and accuracy, model architecture, and more.
transformers: The Hugging Face Transformers library provides a wide range of pre-trained models for natural language processing tasks, as well as tools for finetuning and deploying these models.
datasets: The Hugging Face Datasets library provides a collection of data for natural language processing tasks, as well as tools for loading, processing, and interacting with these.
accelerate: The Hugging Face Accelerate library provides tools for distributed training and mixed-precision training of deep learning models.
bitsandbytes: The bitsandbytes library provides 8-bit CUDA optimizers and 8-bit and 4-bit quantization functions for PyTorch.
huggingface_hub: The Hugging Face Hub library provides tools for sharing, discovering, and using models and datasets from the Hugging Face hub.
trl: The Hugging Face TRL library (Transformer Reinforcement Learning) provides tools for training language models with supervised finetuning and reinforcement-learning-based methods. Its SFTTrainer also provides a convenient interface for finetuning.
peft: The Hugging Face PEFT library (Parameter-Efficient Fine-Tuning) provides methods such as LoRA for adapting large pretrained models to various downstream applications without finetuning all of a model’s parameters.

How to create a dataset for finetuning Large Language Models?

The main question many beginners in the finetuning space have is: what data do I need to finetune my model? And how specifically do I need to prepare the dataset?

While it's hard, if not impossible, to give a general answer on exactly what data you need and especially how much of it, some general guidelines apply:

Domain-specific data: If you want to finetune a model for a specific domain, you should gather data that is relevant to that domain. For example, if you are finetuning a model for legal text, you should use legal documents. If you are finetuning a model for medical text, you should use medical records.
Task-specific data: If you are finetuning a model for a specific task, you should gather data that is relevant to that task. For example, if you are finetuning a model for sentiment analysis, you should use a dataset with examples of the sentiments you want to find.
Diverse examples: It's important to have a diverse set of examples in your dataset. This point is especially important, because you want the model to learn to generalize to new examples.
Quality before quantity: It's better to have a smaller dataset with high-quality, diverse examples than a large dataset with potentially wrong or repetitive examples. LLMs have been shown to fit quickly to specific examples. This means: a. models are very quickly "destroyed" if they see wrong examples (as they might "remember" them forever), but also b. models learn very quickly from examples. We created successful finetunes with just 100 rows of high-quality data.
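One cheap quality check that follows from these guidelines is to drop rows with empty answers and exact duplicates before training. The following is a minimal sketch, assuming each row is a dict with "prompt" and "completion" keys. The function name and the normalization (case and whitespace) are illustrative choices of ours, not something a library prescribes.

```python
def clean_rows(rows):
    """Drop rows with an empty prompt or completion, and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        prompt = row.get("prompt", "").strip()
        completion = row.get("completion", "").strip()
        if not prompt or not completion:
            continue  # incomplete example
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = (" ".join(prompt.lower().split()), " ".join(completion.lower().split()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned


rows = [
    {"prompt": "What is LoRA?", "completion": "A low-rank adaptation method."},
    {"prompt": "what is  lora?", "completion": "A low-rank adaptation method."},
    {"prompt": "What is QLoRA?", "completion": ""},
]
print(clean_rows(rows))
```

Checks like this only catch mechanical problems. Factually wrong examples still need a human review.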

Now, how do you get this data?

Create it manually: Still one of the best approaches, leading to the highest quality. But also the most time-consuming.
Use LLMs to generate synthetic finetuning data: Many LLMs are nowadays trained on synthetic data. LLMs can generate quite good datasets when instructed well. This can be useful when you want to extend a small manually created dataset, or when you want to finetune your model for a field in which another model is already good (e.g. using GPT-4 to make smaller models better).

Another approach for synthetic data generation is to use your internal documentation, emails, or other text sources in the synthetic generation prompt. This way, the LLM that generates your dataset grounds its output in your very own language, domain, and truth.
Use existing, public datasets: There are hundreds of thousands of datasets on the Hugging Face dataset hub. Using one of them might be a good starting point, especially for very general-purpose applications like text-to-SQL. There is not much sense in creating your own when established ones are already available.

Format of a finetuning dataset

With the introduction of the remarkable trl library, formatting finetuning datasets got much easier. Before showing the formats, let's consider one last time what we want to do: We want to teach a model to understand a specific domain or task and generate correct answers based on our input question or prompt.

This gives a hint on how to structure the dataset: We need a "question" and we need an "answer".

Specifically, with trl there are two ways to implement this format:

The conversational format:

{"messages": [{"role": "system", "content": "Your system message"}, {"role": "user", "content": "Your first prompt"}, {"role": "assistant", "content": "Your first answer"}]}
{"messages": [{"role": "system", "content": "Your system message"}, {"role": "user", "content": "Your second prompt"}, {"role": "assistant", "content": "Your second answer"}]}
{....}

The instruction format:

{"prompt": "Your first prompt", "completion": "Your first answer"}
{"prompt": "Your second prompt", "completion": "Your second answer"}
{....}

As you might notice, these formats are very close to what LLM APIs expect during the subsequent inference.

In the first format, we assume a chatbot-like application and basically create artificial conversations. In the second format, we assume an instruction -> answer type of system and therefore simply create question/answer pairs.

In summary, choose one of the two formats and make sure each "row" of your dataset is formatted as defined above.
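If your data already exists as question/answer pairs but you prefer the conversational format, the conversion is mechanical. Here is a minimal sketch. The helper name and the optional system message parameter are our own illustration and not part of trl.

```python
import json


def to_conversational(row, system_message=None):
    """Convert a {"prompt", "completion"} row into the "messages" format."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["completion"]})
    return {"messages": messages}


row = {"prompt": "Your first prompt", "completion": "Your first answer"}
converted = to_conversational(row, system_message="Your system message")

# One JSON object per line is what makes a file "JSONL".
print(json.dumps(converted))
```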

What about test data?

An important question is how to split training and test data. While again this depends on the use-case, it's best to have 10-20% of your data dedicated for testing/validation. Either manually attribute the rows to training or testing (make sure to "uniformly" distribute the data) or simply use a random split. The latter works well for larger datasets, while the smaller the dataset gets the more you want to manually create the split.

It's best to save both the test and training datasets as separate files on your file system. Especially for larger datasets, keeping them only in memory might get on your nerves, as you might need to restart your Python environment while setting up the training process.
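A random split with separate output files can be sketched with the standard library alone. The 10% test fraction, the fixed seed, and the file name dataset_test.json are assumptions for illustration. The demo writes into a temporary directory so it has no side effects.

```python
import json
import random
import tempfile
from pathlib import Path


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and split off a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Placeholder data; replace with your real examples.
rows = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(100)]
train_rows, test_rows = split_rows(rows, test_fraction=0.1)

# A temporary directory keeps this demo side-effect free; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp:
    write_jsonl(Path(tmp) / "dataset_training.json", train_rows)
    write_jsonl(Path(tmp) / "dataset_test.json", test_rows)
    print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fixing the seed makes the split reproducible, which helps when you compare several finetuning runs against the same test set.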

Hands-on: Execute the finetuning process with QLoRA

We're almost there, but we have to talk about one theoretical aspect first: most models are just enormous. ENORMOUS. Finetuning them in full would require hundreds of GBs of VRAM. However, thanks to the enormous efforts of the open-source community, we have some options here. And, as experience shows, these options don't significantly reduce the quality of the finetuned result!

QLoRA: QLoRA was first introduced in May 2023 and was a major breakthrough in the field of finetuning. Quoting from the paper: QLoRA is an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes.

As we'll see in a second, while QLoRA provides outstanding results, Hugging Face did a similarly great job in integrating QLoRA into their transformers library. Therefore, we can use it without much hassle.

For more information on QLoRA, check the Hugging Face intro.
Model quantization with bitsandbytes: While QLoRA reduces the memory usage, the models themselves are still as big as they are. Using bitsandbytes, we can quantize the model to a lower floating-point precision.

What does this mean? In short: we reduce the model's size by reducing the number of bits used to represent the model's weights. This reduces the model's memory footprint and can speed up the training process. It can theoretically also reduce the model's performance; in practice, however, this is surprisingly rare.
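To make "fewer bits per weight" concrete, here is a toy sketch of simple absmax rounding to 4-bit integers. This is deliberately not the NF4 scheme that QLoRA and bitsandbytes use. It only illustrates the trade-off between storage size and rounding error, and the weights are made-up values.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / max_level
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # made-up example weights
quantized, scale = quantize_absmax(weights, bits=4)
restored = dequantize(quantized, scale)

for w, q, r in zip(weights, quantized, restored):
    print(f"{w:+.2f} -> {q:+d} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Each weight now needs 4 bits plus one shared scale, at the cost of a rounding error of at most half a quantization step.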

Floating Point formats (from https://huggingface.co/blog/4bit-transformers-bitsandbyte)

Read more about both of the optimizations we are using in this phenomenal blog post from Hugging Face.

Now that we have this out of the way, let's demonstrate the finetuning process.

First, import the required modules.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import setup_chat_format, SFTTrainer

Then load the training dataset, and make sure it's in the JSONL format discussed above.

train_data = load_dataset("json", data_files="dataset_training.json", split="train")
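Before handing a file to the trainer, it can save time to verify that every line really follows one of the two formats. The validator below is a minimal sketch of our own and uses only the standard library. The rule that a conversation should end with an assistant message is an assumption that fits supervised finetuning data.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_row(row):
    """Return a list of problems found in one dataset row (empty if fine)."""
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    if "messages" in row:
        messages = row["messages"]
        if not isinstance(messages, list) or not messages:
            return ["'messages' must be a non-empty list"]
        problems = []
        for i, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"message {i}: not a JSON object")
                continue
            if message.get("role") not in VALID_ROLES:
                problems.append(f"message {i}: unknown role {message.get('role')!r}")
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append(f"message {i}: empty content")
        last = messages[-1]
        if not (isinstance(last, dict) and last.get("role") == "assistant"):
            problems.append("last message should come from the assistant")
        return problems
    if "prompt" in row and "completion" in row:
        return []
    return ["row has neither 'messages' nor 'prompt'/'completion'"]


def validate_jsonl_lines(lines):
    """Validate an iterable of JSONL lines; returns {line_number: problems}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_row(row)
        if problems:
            report[number] = problems
    return report


sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    '{"question": "wrong keys"}',
]
report = validate_jsonl_lines(sample)
print(report)
```

To check a real file, pass an open file handle (for example open("dataset_training.json", encoding="utf-8")) instead of the sample list.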

Next, let's load the model in 4-bit quantization, using bitsandbytes:

model_id = "meta-llama/Meta-Llama-3-8B"  # The model to finetune, Hugging Face id

quantization = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization,
    attn_implementation="flash_attention_2",  # Needs the flash-attn package; use 'eager' if not NVIDIA Ampere or later
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"

If you are using the "conversational format" for your training data (which is most likely the case), you need to add special tokens that the base LLM is usually not yet aware of. These tokens mark the start and end of a conversation, as well as the messages of the different roles (system, user, assistant). trl provides a convenient function for that:

model, tokenizer = setup_chat_format(model, tokenizer)
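To give a feel for what these special tokens do, the sketch below renders a conversation in the ChatML layout, which, as far as we know, is the default format that setup_chat_format applies. The exact template lives in the tokenizer after the setup call, so treat this hand-written renderer as an approximation for illustration only. In real code, call tokenizer.apply_chat_template instead.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Render messages roughly the way a ChatML chat template would."""
    text = ""
    for message in messages:
        text += f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
    if add_generation_prompt:
        # Leaves the assistant turn open so the model continues from here.
        text += f"{IM_START}assistant\n"
    return text


messages = [
    {"role": "system", "content": "Your system message"},
    {"role": "user", "content": "Your first prompt"},
]
rendered = render_chatml(messages, add_generation_prompt=True)
print(rendered)
```

Each role's message is wrapped in a start and an end marker, which is how the model learns where one turn stops and the next begins.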

Next, we define our QLoRA configuration.

Please check the tips in the line comments. The LoRA config is not the same for every model and even dataset. This needs to be experimented on. Try different values as indicated in the comments and compare the model results.

qlora = LoraConfig(
    lora_alpha=8,       # LoRA scaling factor. Set to either 8, 16, 32, 64 or 128
    r=16,               # Rank, set to 16, 32, 64, 128 or 256. 16 seems best for most applications
    lora_dropout=0.05,  # From QLoRA paper. Keep at 0.05
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

The LoRA parameters explained

lora_alpha: Defines the scaling of the low-rank matrices. The adapter update is scaled by lora_alpha / r, so this value basically tells how much influence the low-rank matrices have in the finetuning process. Theoretically, the higher this value, the more "new" knowledge will influence the model. The lower the value, the more the already existing knowledge will prevail.
r: Defines the mathematical rank of the low-rank matrices, which determines their size. The higher the number, the bigger the matrices and the longer finetuning takes. But bigger numbers also mean that the low-rank adapters have more influence on the final model. Therefore, this is again a parameter where you want to balance efficiency with model performance. In general, our experience shows that a rank of 16 produces good results for most current-gen models.
lora_dropout: During finetuning, a random portion of the adapter inputs is "dropped out". This should mainly prevent overfitting. High values might lead to underfitting, meaning the model will not "remember" what you told it. Low values might reduce the model's ability to generalize over your data (because it just "clings" to the training data provided).
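The effect of r and lora_alpha can be made tangible with a little arithmetic. The sketch below counts the trainable parameters a LoRA adapter adds to a single linear layer and computes the lora_alpha / r scaling of standard LoRA. The 4096 x 4096 layer size is an assumption chosen only for illustration, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scales the adapter update by lora_alpha / r."""
    return lora_alpha / r


# Assumption: a single 4096 x 4096 linear layer, chosen only for illustration.
d_in = d_out = 4096
full = d_in * d_out

for r in (16, 64, 256):
    adapter = lora_trainable_params(d_in, d_out, r)
    print(f"r={r:>3}: {adapter:>9,} adapter params ({adapter / full:.2%} of the layer)")

print("scaling for lora_alpha=8, r=16:", lora_scaling(8, 16))
```

Even at a high rank, the adapter stays a small fraction of the layer it modifies, which is where the memory savings during training come from.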

Almost there. Before running the training process, let's define the training hyperparameters. Refer to the code comments for the 'easier' parameters and to the section below the code sample for a more detailed description of what they entail.

training_params = TrainingArguments(
    output_dir="llama3-finetuned",          # directory to save to
    num_train_epochs=3,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch",
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper, keep it at that
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper, keep it at that
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper, keep it at that
    lr_scheduler_type="constant",
    push_to_hub=False,                      # don't push model to hugging face hub
    report_to="tensorboard",                # report metrics to tensorboard
)

The Hugging Face Training Arguments explained

num_train_epochs: Total number of training epochs to perform. 3 is good for most use cases. As a rule of thumb, when your training loss stops declining, you can stop training. Check the loss curve in TensorBoard: if the training loss stays more or less constant after 1 or 2 epochs, reduce this parameter.
per_device_train_batch_size: Batch size per GPU for training, i.e. how many examples are processed at once on each device. In general, set it as high as your GPU memory allows, as this makes the training process faster. Combine with gradient_accumulation_steps.
gradient_accumulation_steps: Number of batches over which the gradients are accumulated before the optimizer updates the weights. So, after how many batches do you want to update the model? While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let’s say, the per_device_train_batch_size=4 without gradient accumulation hits the GPU’s limit. If you would like to train with batches of size 64, do not set the per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources. (From Hugging Face)
optim: Which optimizer to use. adamw_torch or adamw_torch_fused are best from our experience.
lr_scheduler_type: The learning rate scheduler to use. constant keeps the same learning rate throughout training. cosine gradually decays the learning rate over the course of training. For QLoRA applications, keep constant.
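The interplay of per_device_train_batch_size and gradient_accumulation_steps boils down to simple arithmetic. Here is a minimal sketch with hypothetical helper names. The step count is an approximation, since the trainer's exact rounding (and sequence packing) can change the numbers, and the 1,000 training examples are an assumed figure.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def approx_optimizer_steps(num_examples, num_epochs, batch_size):
    """Rough count of optimizer updates; the trainer's exact rounding may differ."""
    return math.ceil(num_examples / batch_size) * num_epochs


# Both settings from the gradient accumulation example give the same effective batch size.
print(effective_batch_size(1, 64))
print(effective_batch_size(4, 16))

# The values used in this guide's TrainingArguments, on a single GPU.
guide_batch = effective_batch_size(3, 2)
print(guide_batch)

# Assumption: 1,000 training examples, only to make the arithmetic concrete.
print(approx_optimizer_steps(1000, num_epochs=3, batch_size=guide_batch))
```

With logging_steps=10, the last number also tells you roughly how many points to expect on the loss curve: about one tenth of it.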

Okay, we are finally there: we can put it all together, instantiate our Trainer class, and run the training process:

trainer = SFTTrainer(
    model=model,
    args=training_params,
    train_dataset=train_data,
    peft_config=qlora,                # The LoRA config defined above, not the quantization config
    max_seq_length=2048,              # Maximum number of tokens
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={                  # These settings are default as shown by many Hugging Face tutorials
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

trainer.train()
trainer.save_model()

The training script summarized

Find the Python code for LLM finetuning summarized below:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import setup_chat_format, SFTTrainer

train_data = load_dataset("json", data_files="dataset_training.json", split="train")

model_id = "meta-llama/Meta-Llama-3-8B"  # The model to finetune, Hugging Face id

quantization = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization,
    attn_implementation="flash_attention_2",  # Needs the flash-attn package; use 'eager' if not NVIDIA Ampere or later
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"
model, tokenizer = setup_chat_format(model, tokenizer)

qlora = LoraConfig(
    lora_alpha=8,       # LoRA scaling factor. Set to either 8, 16, 32, 64 or 128
    r=16,               # Rank, set to 16, 32, 64, 128 or 256. 16 seems best for most applications
    lora_dropout=0.05,  # From QLoRA paper. Keep at 0.05
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_params = TrainingArguments(
    output_dir="llama3-finetuned",          # directory to save to
    num_train_epochs=3,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch",
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper, keep it at that
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper, keep it at that
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper, keep it at that
    lr_scheduler_type="constant",
    push_to_hub=False,                      # don't push model to hugging face hub
    report_to="tensorboard",                # report metrics to tensorboard
)

trainer = SFTTrainer(
    model=model,
    args=training_params,
    train_dataset=train_data,
    peft_config=qlora,                # The LoRA config defined above, not the quantization config
    max_seq_length=2048,              # Maximum number of tokens
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={                  # These settings are default as shown by many Hugging Face tutorials
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

trainer.train()
trainer.save_model()

Use the finetuned Large Language Model

Now that we have trained our model, we want to use (and, as a matter of fact, evaluate) it.

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, BitsAndBytesConfig

model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetuned",  # Directory to load the model and adapters from
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # Quantize the base model during loading
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./llama3-finetuned")

# Create your input messages for the LLM
messages = [
    {"role": "system", "content": "Your system message"},
    {"role": "user", "content": "Your first prompt"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Keep only the newly generated tokens and turn them back into text
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

For our example, the results look pretty promising!
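To go from eyeballing outputs to an actual evaluation, you can loop over your held-out test rows and compare the model's answers with the expected ones. The sketch below uses plain exact-match scoring, a deliberately simple stand-in of our choosing that only suits tasks with short, unambiguous answers. generate_fn is a hypothetical callable that wraps your model call. Here it is replaced by a dummy so the example runs without a GPU.

```python
def exact_match_accuracy(rows, generate_fn):
    """Share of rows where the generated answer equals the expected one."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        prediction = generate_fn(row["prompt"])
        if prediction.strip().lower() == row["completion"].strip().lower():
            hits += 1
    return hits / len(rows)


# Dummy stand-in for the finetuned model, for illustration only.
def dummy_generate(prompt):
    canned = {"2 + 2 = ?": "4", "Capital of France?": "paris"}
    return canned.get(prompt, "")


test_rows = [
    {"prompt": "2 + 2 = ?", "completion": "4"},
    {"prompt": "Capital of France?", "completion": "Paris"},
    {"prompt": "Largest planet?", "completion": "Jupiter"},
]
print(f"exact match: {exact_match_accuracy(test_rows, dummy_generate):.2f}")
```

For free-form answers, exact match is too strict. A manual review or a task-specific metric is the better choice there.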

If you want to deploy this model to production, read our follow-up piece on How to securely self host your own LLM with vLLM and Caddy.
