In this tip, we will fine-tune an LLM with two techniques: full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT). We will use the FLAN-T5 LLM, which is a high-quality instruction-tuned model.

Different LLM tuning techniques (Source: AWS Machine Learning Blog)

First, install the required packages for the LLM, datasets and PEFT.

pip install -q torch torchdata transformers datasets loralib peft

Then, import the necessary modules:

import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer

Then, load the pre-trained FLAN-T5 model from HuggingFace.

model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Dataset

For fine-tuning, we can use custom entries or take a few entries from an existing dataset. In our case, we will use the DialogSum dataset from Hugging Face, which contains 10,000+ dialogues with corresponding manually labeled summaries and topics.

dataset = load_dataset("knkarthick/dialogsum")
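
To get a feel for the data, we can inspect the splits and peek at one training example (DialogSum records contain dialogue, summary and topic fields):

print(dataset)  # train / validation / test splits

example = dataset['train'][0]
print(example['dialogue'])
print(example['summary'])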

We need to convert the prompt-response pairs of the dataset into instructions for the LLM. For instance:

Summarize the following conversation.

    Person 1: Hi.
    Person 2: Hi, how are you?
    
Summary:
Person 1 and Person 2 are greeting each other.

Let’s define a helper function that takes an example from the dataset and converts it to a prompt:

def prompt_func(example):
    START = 'Summarize the following conversation.\n\n'
    END = '\n\nSummary: '
    prompt = [START + dialogue + END for dialogue in example["dialogue"]]
    return prompt

Then define a helper function to preprocess the prompt-response dataset into tokens:

def tokenize_function(example):
    prompt = prompt_func(example)
    input_ids = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    labels = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return {'input_ids': input_ids, 'labels': labels}

Now, we apply the tokenize_function on the different splits in the dataset (train, validation and test) in batches.

tokenized_datasets = dataset.map(tokenize_function, batched=True)
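
Since the Trainer only needs the tokenized tensors, we can also drop the original text columns (a small optional step; the column names below assume the standard DialogSum schema):

tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])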

Optionally, we can subsample examples from the dataset to speed up training:

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Full Fine-Tuning

We will use the Hugging Face Trainer for full fine-tuning as follows:

output_dir = f'./fulltuned-DialogSum-training'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Then start the training by simply calling train:

trainer.train()
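
Once full fine-tuning completes, we can save the resulting model and tokenizer (the checkpoint path below is just an illustrative choice):

fulltuned_model_path = "./fulltuned-DialogSum-checkpoint"

trainer.model.save_pretrained(fulltuned_model_path)
tokenizer.save_pretrained(fulltuned_model_path)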

Parameter Efficient Fine-Tuning (PEFT)

PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results. PEFT includes fine-tuning techniques like Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!).

In our case, we will use LoRA, which allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). With LoRA, we freeze the underlying LLM and only train the adapter. After fine-tuning with LoRA, the original LLM remains unchanged and a newly trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit percentage of the original LLM size (MBs vs GBs).

At inference time, the LoRA adapter needs to be combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can re-use the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.

First, we need to set up the PEFT/LoRA configuration for fine-tuning with a new layer/parameter adapter:

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Note: the r hyper-parameter defines the rank/dimension of the adapter to be trained.

Then, we add the LoRA adapter layers/parameters to our base LLM:

peft_model = get_peft_model(original_model, lora_config)
peft_model.print_trainable_parameters()

Then, define the training arguments and create a Trainer instance:

output_dir = f'./peft-DialogSum-training'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1    
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

And train the PEFT adapter:

peft_trainer.train()

Once training finishes, we save the adapter parameters and the tokenizer:

peft_model_path="./peft-DialogSum-checkpoint"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

At inference time, we need to add the PEFT adapter to the original LLM:

from peft import PeftModel, PeftConfig

base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_model = PeftModel.from_pretrained(
    base_model,
    peft_model_path,
    torch_dtype=torch.bfloat16,
    is_trainable=False # inference mode
)
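
As a quick sanity check, we can summarize a dialogue from the test split with the loaded PEFT model (a minimal sketch that re-uses the same prompt format as in training):

dialogue = dataset['test'][0]['dialogue']
prompt = f'Summarize the following conversation.\n\n{dialogue}\n\nSummary: '

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = peft_model.generate(input_ids=input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If you prefer to serve a single standalone model instead of base model plus adapter, PEFT can also merge the adapter weights into the base model with peft_model.merge_and_unload().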

Evaluation

For evaluating the resulting models from the fine-tuning techniques discussed earlier, you can refer to a previous tip on evaluating LLMs qualitatively (human evaluation) and quantitatively (with ROUGE metrics) - link.
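
For a minimal quantitative sketch, the rouge metric from the Hugging Face evaluate library can be used (you may need to pip install evaluate rouge_score; the two lists of summaries below are assumed to have been generated beforehand):

import evaluate

rouge = evaluate.load('rouge')

# peft_summaries: summaries generated by the fine-tuned model (hypothetical list of strings)
# human_summaries: the corresponding reference summaries from the dataset
results = rouge.compute(predictions=peft_summaries, references=human_summaries)
print(results)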