In this tip, we will see how to fine-tune Llama 2 (or any other foundation LLM) on custom datasets using a collection of libraries from Hugging Face: transformers, peft, trl, and datasets.

First, install dependencies:

pip install -q huggingface_hub
pip install -q -U trl transformers accelerate peft
pip install -q -U datasets bitsandbytes einops wandb
pip install -q ipywidgets
pip install -q scipy

and import all the modules we will need:

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Depending on the foundation model you are using, you may need to authenticate to Hugging Face (Llama 2, for instance, is gated behind a license agreement):

huggingface-cli login

Data

A good format for structuring a training dataset is a .jsonl file, where each line is a JSON object representing one training example, with an input for the model and the associated expected output:

{"input": "What is the 'ultraviolet catastrophe'?", "output": "It is the misbehavior of a formula for higher frequencies."}
{"input": "Where did Ibn Battuta travel to after his visit to the Chagatai Khanate?", "output": "Constantinople"}
{"input": "From where did Ibn Battuta travel to Yemen after the hajj?", "output": "He traveled via the Red Sea."}

Now we can load the datasets using the datasets library as follows (note that the json loader places everything under a single 'train' split, regardless of the file name):

train_dataset = load_dataset('json', data_files='train.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='test.jsonl', split='train')
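
It is worth sanity-checking what was loaded; a Dataset prints its schema and can be indexed like a list:

print(train_dataset)     # schema: features ['input', 'output'] and num_rows
print(train_dataset[0])  # the first example, as a plain dict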

To present the examples as prompts to the LLM, we need to create a formatting function prompt_func (SFTTrainer applies it to batches of examples, so each field arrives as a list):

def prompt_func(example):
  # called on a batch of examples, so each field is a list
  texts = []
  for i in range(len(example['input'])):
    texts.append(f"### Question: {example['input'][i]}\n ### Answer: {example['output'][i]}")
  return texts
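
To check the template, you can apply the function to a small slice; slicing a Dataset returns a dict of lists, the same batched shape SFTTrainer will pass in:

preview = prompt_func(train_dataset[:1])
print(preview[0])
# ### Question: What is the 'ultraviolet catastrophe'?
#  ### Answer: It is the misbehavior of a formula for higher frequencies.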

Fine-tuning

Next, we load the foundation LLM, quantized to 4 bits so that it fits comfortably in GPU memory:

base_model_name = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the data type used by QLoRA
    bnb_4bit_compute_dtype=torch.float16,  # run the actual matmuls in fp16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True
)

base_model.config.use_cache = False   # the KV cache is only useful at inference time
base_model.config.pretraining_tp = 1  # disable tensor-parallel slicing of the pretrained weights
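
As a quick sanity check, transformers can report the model's in-memory size; with 4-bit weights, a 7B-parameter model should come in at a few gigabytes:

print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:.2f} GB")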

Also, load the tokenizer, reusing the end-of-sequence token as the padding token (Llama 2 does not define one):

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Then, we define the training arguments in a TrainingArguments object (with per_device_train_batch_size=4 and gradient_accumulation_steps=4, the effective batch size is 16):

training_args = TrainingArguments(
    output_dir="./Llama-2-7b-hf-fine-tuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=50,
    max_steps=1000,
    logging_dir="./logs",         # directory for storing logs
    save_strategy="steps",        # save a checkpoint every save_steps
    save_steps=50,                # save a checkpoint every 50 steps
    evaluation_strategy="steps",  # evaluate every eval_steps
    eval_steps=50,                # evaluate on the eval set every 50 steps
    do_eval=True                  # run evaluation during training
)

max_seq_length = 512  # examples longer than this are truncated by the trainer

As well as the config for the LoRA adapter; r is the rank of the low-rank update matrices, and lora_alpha scales their contribution to the adapted weights:

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
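
For intuition about what r=64 means, here is a back-of-envelope count of the trainable parameters, assuming PEFT's default LoRA target modules for Llama (the q_proj and v_proj attention projections) and Llama-2-7b's dimensions:

hidden_size = 4096  # Llama-2-7b hidden dimension
num_layers = 32     # Llama-2-7b decoder layers
r = 64

# each adapted weight matrix W gets two trainable low-rank factors:
# A with shape (r, d_in) and B with shape (d_out, r)
params_per_matrix = r * (hidden_size + hidden_size)
lora_params = params_per_matrix * 2 * num_layers  # q_proj and v_proj per layer
print(f"{lora_params / 1e6:.1f}M trainable parameters")  # ~33.6M, well under 1% of 7B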

Now we can wire everything into an SFTTrainer, which applies the LoRA adapter to the base model and tokenizes the examples via our formatting function:

trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    formatting_func=prompt_func,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)
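
Because peft_config was passed in, SFTTrainer wraps the model in a PeftModel, so you can confirm the trainable-parameter count matches the estimate above:

trainer.model.print_trainable_parameters()
# trainable params: ... || all params: ... || trainable%: ...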

Finally, we kick off the fine-tuning:

trainer.train()
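
When training completes, save the final adapter weights so the inference step below can find them; Trainer.save_model with no argument writes to training_args.output_dir (intermediate checkpoints also land in checkpoint-* subdirectories):

trainer.save_model()  # saves the LoRA adapter weights and config to ./Llama-2-7b-hf-fine-tuned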

Inference

To use the fine-tuned version of the model, we need to load the weights of the base model and then apply the QLoRA adapter weights, which were saved by the PEFT library, on top of it.

First step, load the base model:

base_model_name="meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Second step, load the QLoRA adapter weights and apply them on top of the base model (PeftModel.from_pretrained attaches the adapter without permanently merging it):

model = PeftModel.from_pretrained(base_model, "./Llama-2-7b-hf-fine-tuned")

Now, we can use the model to run inference:

eval_prompt = f"### Question: What is the stance on Ibn Battuta's Rihla?\n ### Answer: "

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()  # switch to evaluation mode (disables dropout)
with torch.no_grad():
    toks = model.generate(**model_input, max_new_tokens=100)[0]
    print(tokenizer.decode(toks, skip_special_tokens=True))
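
Depending on the model's generation_config, generate may decode greedily; to get more varied completions you can enable sampling explicitly (the values below are illustrative, not tuned):

with torch.no_grad():
    toks = model.generate(
        **model_input,
        max_new_tokens=100,
        do_sample=True,   # sample from the distribution instead of greedy decoding
        temperature=0.7,  # < 1 sharpens the distribution
        top_p=0.9,        # nucleus sampling: keep the smallest token set with 90% mass
    )[0]
    print(tokenizer.decode(toks, skip_special_tokens=True))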