Best way to fine-tune your LLM using a T4 GPU. Part 3/3

Jair Neto
11 min read · Dec 27, 2023

In the previous Medium posts, we covered the theory behind the tools that let us fine-tune a Large Language Model (LLM) today. If you have not read them yet, you can find them here and here.

In this post, you’ll walk through the code for loading and preprocessing the data, implementing the techniques covered in the previous posts, fine-tuning a model, pushing it to Hugging Face, and running inference with it.

Introduction

To start, we need to define the task that we want to achieve, the model that we are going to use, and last but not least the data. But when do we need to fine-tune?

Before fine-tuning, we can try simpler methods. First, we can use the original model and analyze how it behaves on the task with prompt engineering alone. Then we can try techniques such as one-shot or few-shot prompting to see if the model gets the answers right. Depending on the task, you can also use Retrieval Augmented Generation (RAG).

If the model’s answers are still not good after all that, you can switch to a better model or try fine-tuning. It’s worth noting that these techniques are not mutually exclusive and can be combined for a more nuanced approach.
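As a rough illustration of the few-shot idea, the sketch below assembles a prompt from a handful of solved examples followed by the new question. The function name `build_few_shot_prompt`, the `Question:`/`SQL:` labels, and the sample pairs are assumptions made for illustration; they are not part of the original notebook.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved (question, answer) pairs, then the open question."""
    parts = []
    for solved_question, solved_answer in examples:
        parts.append(f"Question: {solved_question}\nSQL: {solved_answer}\n")
    # The last block leaves the answer empty so the model completes it.
    parts.append(f"Question: {question}\nSQL:")
    return "\n".join(parts)


# Hypothetical demonstration pairs.
shots = [
    ("How many users are there?", "SELECT COUNT(*) FROM users"),
    ("List all product names.", "SELECT name FROM products"),
]
prompt = build_few_shot_prompt(shots, "How many orders are there?")
print(prompt)
```

If a prompt like this already gives good answers, fine-tuning may not be needed at all.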

Load Model

Now let’s get to the fun part: the code for fine-tuning an LLM. First we need to choose a model to fine-tune. We are going to use meta-llama/Llama-2-7b-hf, which we will get from the Hugging Face platform.

The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.

To download this model, you first need to get approval by following these steps:

  1. Get approval from Hugging Face (https://huggingface.co/meta-llama/Llama-2-7b-hf).
  2. Get approval from Meta (https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
  3. Create a WRITE access token on Hugging Face (https://huggingface.co/settings/tokens).

Note: Make sure your email address on your Hugging Face account is the same as the one you enter on Meta’s website for approval.

To load the model, you can use the function below, which receives the model_name and the bnb_config. The bnb_config is where we configure the quantization of the model.

from typing import Tuple

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def load_model_tokenizer(model_name: str, bnb_config: BitsAndBytesConfig) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
    """
    Load the model and tokenizer from the HuggingFace model hub using quantization.

    Args:
        model_name (str): The name of the model.
        bnb_config (BitsAndBytesConfig): The quantization configuration of BitsAndBytes.

    Returns:
        Tuple[AutoModelForCausalLM, AutoTokenizer]: The model and tokenizer.
    """

    # Load the quantized weights and let the library place them on the available device(s)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config = bnb_config,
        device_map = "auto",
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token = True)

    # The Llama 2 tokenizer has no padding token, so reuse the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
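To get a feeling for why quantization matters on a T4 (which has roughly 16 GB of memory), here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: the parameter count is rounded to 7 billion, and it ignores activations, gradients, optimizer state, and the quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # rounded parameter count for a "7B" model (approximation)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions, the 16-bit weights alone would take most of the card, while the 4-bit weights leave room for the rest of the training state.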

Task

The second step is to pick a specific task that you want the model to perform and that the original model is not so good at. In our case, we are going to train the model to do text-to-SQL: the input is a query instruction in plain English plus a table schema, and the output is the resulting SQL query.

Data

The data is probably the most important part: to fine-tune successfully, you need a good amount of high-quality data. As a rule of thumb, 1,000 high-quality examples are a good starting point. In our case, we used b-mc2/sql-create-context.

There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column names and their data types. By providing just the CREATE TABLE statement as context, we can hopefully provide better grounding for models without having to provide actual rows of data, limiting token usage and exposure to private, sensitive, or proprietary data.

Some examples of the data

To load and shuffle the data, you just need to run the code below, which uses the Hugging Face load_dataset function.

from datasets import load_dataset

dataset = load_dataset("b-mc2/sql-create-context")
shuffled_dataset = dataset['train'].shuffle(seed=42)

Data Preprocess

However, we still cannot use the data as is to train the model. First, we need to do some preprocessing.

In our case, we want the LLM to receive an instruction to write a SQL query from a natural language question, so we need to format the prompt accordingly.

from typing import Dict


def format_prompt(example: Dict[str, str]) -> Dict[str, str]:
    """
    Format the prompt for the model.

    Args:
        example (Dict[str, str]): A dataset example with the keys "question", "context" and "answer".

    Returns:
        Dict[str, str]: The same example with the formatted prompt stored under the key "text".
    """

    final_text = """### Instructions:
Your task is to convert a question into a SQL query, given a Postgres database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT table1.col1, table2.col1 FROM table1 JOIN table2 ON table1.id = table2.id`.
- When creating a ratio, always cast the numerator as float

### Input:
Generate a SQL query that answers the question `{question}`.
This query will run on a database whose schema is represented in this string:
{context}

### Response:
{answer}
### End
""".format(question = example['question'], context = example["context"], answer = example["answer"])

    example["text"] = final_text

    return example

Now we have examples like this:

### Instructions:
Your task is to convert a question into a SQL query, given a Postgres database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT table1.col1, table2.col1 FROM table1 JOIN table2 ON table1.id = table2.id`.
- When creating a ratio, always cast the numerator as float

### Input:
Generate a SQL query that answers the question `how many models produced where the plant is castle bromwich?`.
This query will run on a database whose schema is represented in this string:
CREATE TABLE table_250309_1 (models_produced VARCHAR, plant VARCHAR)

### Response:
SELECT COUNT(models_produced) FROM table_250309_1 WHERE plant = "Castle Bromwich"
### End

But it does not end there. To turn human-readable text into a format that LLMs can understand and work with, we need to tokenize it. The tokenizer takes plain text as input, performs tokenization, and produces outputs that include token IDs and attention masks. After this step, the data is ready for training.

Source: https://docs.ai21.com/docs/tokenizer-tokenization

from functools import partial
from typing import List

from datasets import DatasetDict
from transformers import AutoTokenizer


def preprocess_dataset(tokenizer: AutoTokenizer,
                       max_length: int,
                       seed: int,
                       columns_to_remove: List[str],
                       dataset: DatasetDict) -> DatasetDict:
    """
    Preprocess the dataset for training.

    Args:
        tokenizer (AutoTokenizer): The tokenizer.
        max_length (int): The maximum number of tokens per example.
        seed (int): The seed for shuffling the dataset.
        columns_to_remove (List[str]): The columns to remove from the dataset.
        dataset (DatasetDict): The Hugging Face dataset.

    Returns:
        DatasetDict: The preprocessed dataset.
    """

    # Tokenize in batches and drop the raw text columns that are no longer needed
    _preprocessing_function = partial(tokenize_batch, max_length = max_length, tokenizer = tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched = True,
        remove_columns = columns_to_remove,
    )

    # Keep only the examples that are shorter than the maximum length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    dataset = dataset.shuffle(seed = seed)

    return dataset
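The function above relies on a tokenize_batch helper that is not shown in this post. A minimal sketch of what it might look like follows, assuming the formatted prompts live in a "text" column and that the tokenizer is called the way Hugging Face tokenizers are (a list of texts plus `max_length` and `truncation`). The `toy_tokenizer` is only a stand-in so the example runs without downloading anything.

```python
from typing import Any, Callable, Dict, List


def tokenize_batch(batch: Dict[str, List[str]], tokenizer: Callable[..., Any], max_length: int) -> Any:
    """Tokenize the "text" column of a batch, truncating to max_length tokens."""
    return tokenizer(batch["text"], max_length=max_length, truncation=True)


def toy_tokenizer(texts: List[str], max_length: int, truncation: bool) -> Dict[str, List[List[int]]]:
    """Stand-in tokenizer: one fake token id per whitespace-separated word."""
    ids = [[len(word) for word in text.split()] for text in texts]
    if truncation:
        ids = [seq[:max_length] for seq in ids]
    return {"input_ids": ids, "attention_mask": [[1] * len(seq) for seq in ids]}


batch = {"text": ["select all rows", "a b c d e f"]}
out = tokenize_batch(batch, tokenizer=toy_tokenizer, max_length=4)
print([len(seq) for seq in out["input_ids"]])
```

With a helper like this, an over-long example is truncated to exactly max_length tokens, so the `len(sample["input_ids"]) < max_length` filter in preprocess_dataset then removes it instead of training on a prompt that was cut off.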

QLoRA

In the coding section, the QLoRA process is divided into two components: quantization and the addition of LoRA matrices. To facilitate quantization with a Large Language Model (LLM) loaded using Hugging Face, we leverage the bitsandbytes library.

To add the new LoRA weight matrices, we use a library named peft (Parameter-Efficient Fine-Tuning). Together, these libraries let us implement all the QLoRA optimizations discussed in the previous posts, including 4-bit Normal Floats, double quantization, and LoRA adapters.

In the function below, we establish all the configurations necessary to optimize the use of QLoRA.

from typing import List, Tuple, Union

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig


def get_qlora_configs(load_in_4bit: bool,
                      bnb_4bit_use_double_quant: bool,
                      bnb_4bit_quant_type: str,
                      bnb_4bit_compute_dtype: torch.dtype,
                      r: int,
                      lora_alpha: int,
                      target_modules: Union[List[str],str],
                      lora_dropout: float,
                      bias: str,
                      task_type: str) -> Tuple[BitsAndBytesConfig, LoraConfig]:
    """
    Create the configurations needed to use the QLoRA techniques.

    Args:
        load_in_4bit (bool): This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from
            `bitsandbytes`.
        bnb_4bit_use_double_quant (bool): This flag is used for nested quantization where the quantization constants from the first quantization are
            quantized again.
        bnb_4bit_quant_type (str): This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types
            which are specified by `fp4` or `nf4`.
        bnb_4bit_compute_dtype (torch.dtype): This sets the computational type which might be different than the input type. For example, inputs might be
            fp32, but computation can be set to bf16 for speedups.
        r (int): Lora attention dimension.
        lora_alpha (int): The alpha parameter for Lora scaling.
        target_modules (Union[List[str],str]): The names of the modules to apply Lora to.
        lora_dropout (float): The dropout probability for Lora layers.
        bias (str): Bias type for Lora. Can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the
            corresponding biases will be updated during training. Be aware that this means that, even when disabling
            the adapters, the model will not produce the same output as the base model would have without adaptation.
        task_type (str): The task type for the model.

    Returns:
        Tuple[BitsAndBytesConfig, LoraConfig]: The configuration for BitsAndBytes and Lora.
    """

    # Quantization settings used when loading the base model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit = load_in_4bit,
        bnb_4bit_use_double_quant = bnb_4bit_use_double_quant,
        bnb_4bit_quant_type = bnb_4bit_quant_type,
        bnb_4bit_compute_dtype = bnb_4bit_compute_dtype,
    )

    # Settings for the trainable LoRA adapters
    lora_config = LoraConfig(
        r = r,
        lora_alpha = lora_alpha,
        target_modules = target_modules,
        lora_dropout = lora_dropout,
        bias = bias,
        task_type = task_type,
    )

    return bnb_config, lora_config
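To make the role of r more concrete, the sketch below counts how many trainable weights a single LoRA adapter adds to one linear layer: the adapter consists of a matrix B of shape (d_out × r) and a matrix A of shape (r × d_in). The layer sizes and the rank used here are illustrative values, not settings taken from this post.

```python
def lora_added_params(d_out: int, d_in: int, r: int) -> int:
    """Weights added by one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)


d_out, d_in, r = 4096, 4096, 16  # illustrative sizes (assumption)

full = d_out * d_in                       # frozen weights of the original layer
added = lora_added_params(d_out, d_in, r)  # trainable LoRA weights

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {added:,} ({100 * added / full:.2f}% of the layer)")
```

Doubling r doubles the number of trainable weights, which is why small ranks keep fine-tuning cheap.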

Fine-tuning

For fine-tuning the model we need to first follow some steps.

Step 1: Prepare the model

from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM


def preprare_model_for_fine_tune(model: AutoModelForCausalLM,
                                 lora_r: int,
                                 lora_alpha: int,
                                 lora_dropout: float,
                                 bias: str,
                                 task_type: str) -> AutoModelForCausalLM:
    """
    Prepares the model for fine-tuning.

    Args:
        model (AutoModelForCausalLM): The model that will be fine-tuned.
        lora_r (int): Lora attention dimension.
        lora_alpha (int): The alpha parameter for Lora scaling.
        lora_dropout (float): The dropout probability for Lora layers.
        bias (str): Bias type for Lora. Can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the
            corresponding biases will be updated during training. Be aware that this means that, even when disabling
            the adapters, the model will not produce the same output as the base model would have without adaptation.
        task_type (str): The task type for the model.

    Returns:
        AutoModelForCausalLM: The model prepared for fine-tuning.
    """
    # Enable gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # Prepare the quantized model for training
    model = prepare_model_for_kbit_training(model)

    # Get LoRA module names
    target_modules = find_all_linear_names(model)

    # Create PEFT configuration for these modules and wrap the model to PEFT
    peft_config = create_peft_config(lora_r, lora_alpha, target_modules, lora_dropout, bias, task_type)
    model = get_peft_model(model, peft_config)

    # The generation cache is not needed during training
    model.config.use_cache = False

    return model

In the code above, we start by calling gradient_checkpointing_enable to significantly reduce GPU memory usage. It frees a substantial amount of memory at the cost of a modest slowdown, because parts of the graph are recomputed during back-propagation. For more details, read the paper Training Deep Nets with Sublinear Memory Cost. Next, following the PEFT library’s recommended practice, we prepare the quantized model for training with prepare_model_for_kbit_training. Finally, we apply LoRA to get the PEFT model and set use_cache to False, since the generation cache is not needed during training and does not work together with gradient checkpointing.
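The find_all_linear_names helper used above is not shown in this post. One possible sketch is given below: it walks the model’s modules and collects the short names of every layer of a given class, skipping the output head. Unlike the call in the post, this version takes the layer class as a second argument (an assumption made so the example runs without a GPU); for a 4-bit model one would presumably pass `bitsandbytes.nn.Linear4bit`. The `Toy*` classes are stand-ins for a real model.

```python
from typing import List, Tuple, Type


def find_all_linear_names(model, layer_cls: Type) -> List[str]:
    """Collect the short names of all modules that are instances of layer_cls."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, layer_cls):
            # "model.layers.0.self_attn.q_proj" -> "q_proj"
            names.add(full_name.split(".")[-1])
    # The output head is usually left out of the LoRA targets.
    names.discard("lm_head")
    return sorted(names)


class ToyLinear:  # stand-in for a (quantized) linear layer
    pass


class ToyNorm:  # stand-in for a layer that should not get an adapter
    pass


class ToyModel:
    def named_modules(self) -> List[Tuple[str, object]]:
        return [
            ("model.layers.0.self_attn.q_proj", ToyLinear()),
            ("model.layers.0.self_attn.v_proj", ToyLinear()),
            ("model.layers.1.self_attn.q_proj", ToyLinear()),
            ("model.layers.0.input_layernorm", ToyNorm()),
            ("lm_head", ToyLinear()),
        ]


names = find_all_linear_names(ToyModel(), ToyLinear)
print(names)
```

The resulting list is what would be handed to the LoRA configuration as target_modules.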

Step 2: Configure the trainer parameters

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

print_trainable_parameters(model)

# Training parameters
trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = per_device_train_batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        warmup_steps = warmup_steps,
        learning_rate = learning_rate,
        fp16 = fp16,
        logging_steps = logging_steps,
        output_dir = output_dir,
        optim = optim,
        num_train_epochs = num_train_epochs
    ),
    # mlm = False selects causal language modeling (next-token prediction)
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False)
)

Here we set the parameters that we need to fine-tune the model. For more information about them, see the Hugging Face TrainingArguments documentation.
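The print_trainable_parameters helper called above is not defined in this post. A minimal sketch follows, assuming only that the model exposes `named_parameters()` and that each parameter has `numel()` and `requires_grad`, as PyTorch parameters do. The `ToyParam` and `ToyModel` classes are stand-ins so the example runs without PyTorch.

```python
from dataclasses import dataclass
from typing import List, Tuple


def count_trainable_parameters(model) -> Tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable, total = 0, 0
    for _, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


def print_trainable_parameters(model) -> None:
    trainable, total = count_trainable_parameters(model)
    print(f"trainable params: {trainable:,} || all params: {total:,} "
          f"|| trainable%: {100 * trainable / total:.4f}")


@dataclass
class ToyParam:  # stand-in for a PyTorch parameter
    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size


class ToyModel:
    def named_parameters(self) -> List[Tuple[str, ToyParam]]:
        return [
            ("base.weight", ToyParam(1000, False)),   # frozen base weights
            ("lora_A.weight", ToyParam(16, True)),    # trainable adapter
            ("lora_B.weight", ToyParam(16, True)),    # trainable adapter
        ]


counts = count_trainable_parameters(ToyModel())
print_trainable_parameters(ToyModel())
```

With LoRA, the trainable share is expected to be a small fraction of the total, which is a quick sanity check before starting a long training run.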

Step 3: Fine-tune the model, save the metrics, save the model, and free the GPU memory to avoid Out Of Memory errors.

def fine_tune(model: AutoModelForCausalLM, trainer: Trainer, output_dir: str) -> None:
    """
    Fine-tune the model.

    Args:
        model (AutoModelForCausalLM): The model to fine-tune.
        trainer (Trainer): The trainer with the training configuration.
        output_dir (str): The output directory to save the model.
    """

    print("Training...")

    train_result = trainer.train()

    save_metrics(train_result, trainer)
    save_model(model, output_dir)
    free_memory(model, trainer)
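The save_metrics and save_model helpers are not shown in this post. The sketch below is one way they could be written, assuming the trainer is a Hugging Face Trainer (which provides `log_metrics`, `save_metrics`, and `save_state`) and that the model has a `save_pretrained` method. The actual helpers in the notebook may differ.

```python
import os


def save_metrics(train_result, trainer) -> None:
    """Log the training metrics and write them, plus the trainer state, to disk."""
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()


def save_model(model, output_dir: str) -> None:
    """Save the fine-tuned model to output_dir, creating the directory if needed."""
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)
```

Because the model was wrapped with get_peft_model, save_pretrained writes only the LoRA adapter weights at this point, not a full copy of the base model.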

Pushing the model

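Now that we have our fine-tuned model, we need to push it to the Hugging Face Hub in order to use it.

import os

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map = "auto", torch_dtype = torch.bfloat16)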

model = model.merge_and_unload()

# Save fine-tuned model at a new location
output_merged_dir = "results/sql_classification_llama2_7b/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok = True)
model.save_pretrained(output_merged_dir, safe_serialization = True)

# Save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)

# Push fine-tuned model and tokenizer to Hugging Face Hub
model.push_to_hub(new_model, use_auth_token = True)
tokenizer.push_to_hub(new_model, use_auth_token = True)

The code above does the following:

  1. Load the fine-tuned model.
  2. Merge the LoRA matrix weights into the base model.
  3. Save the merged model.
  4. Save the tokenizer.
  5. Push the model and tokenizer to Hugging Face Hub.
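To illustrate what merging means in step 2, the toy example below folds a LoRA update into a weight matrix as W + (alpha / r) · B · A and checks that the merged matrix gives the same output as applying the base weights and the adapter separately. It uses tiny hand-written matrices and plain Python lists; it is a sketch of the arithmetic, not the code that merge_and_unload runs.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy values (assumptions): a 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[3.0, 4.0]]      # shape (r x d_in)
B = [[1.0], [2.0]]    # shape (d_out x r)
alpha, r = 2.0, 1
x = [[1.0], [1.0]]    # one input column vector

merged = merge_lora(W, A, B, alpha, r)
y_merged = matmul(merged, x)

# Same result when the adapter is applied as a separate branch.
base = matmul(W, x)
branch = matmul(B, matmul(A, x))
y_separate = [[base[i][0] + (alpha / r) * branch[i][0]] for i in range(len(base))]

print(merged)
print(y_merged == y_separate)
```

After merging, the model no longer needs the adapter at inference time, since the extra matrices have been absorbed into the base weights.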

Inference

Now, to use your brand new fine-tuned model, you just need to load it from the Hugging Face Hub. Replace “<fine_tuned_model_name>” and “<tokenizer_name>” with the names that you uploaded to the Hub.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device('cuda')
model = AutoModelForCausalLM.from_pretrained("<fine_tuned_model_name>", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("<tokenizer_name>", trust_remote_code=True)

After you load the model, you need to create a function in order to use it for returning the SQL query.

In the function, we format the prompt and use a regex to extract only the SQL query.

def print_inference(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, question: str, context: str, answer: str) -> None:
    """
    Print the inference from the model.

    Args:
        model (AutoModelForCausalLM): Fine-tuned model.
        tokenizer (AutoTokenizer): Tokenizer.
        question (str): The natural language question.
        context (str): The database schema.
        answer (str): The query answer.
    """

    message = f'''
### Instructions:
Your task is to convert a question into a SQL query, given a Postgres database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT table1.col1, table2.col1 FROM table1 JOIN table2 ON table1.id = table2.id`.
- When creating a ratio, always cast the numerator as float

### Input:
Generate a SQL query that answers the question `{question}`.
This query will run on a database whose schema is represented in this string:
{context}

### Response:
'''
    inputs = tokenizer(message, return_tensors="pt", return_attention_mask=False)

    outputs = model.generate(**inputs, max_length=400)

    print_extracted_answer(tokenizer.batch_decode(outputs)[0])

Now you are ready to go and use the model to generate SQL queries.
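The print_extracted_answer helper is not shown in this post. A possible sketch is given below, assuming the generated text follows the training format, where the query sits between the `### Response:` and `### End` markers. The regular expression and the `extract_answer` name are illustrative choices, not the notebook’s actual code.

```python
import re


def extract_answer(generated_text: str) -> str:
    """Return the text between '### Response:' and '### End' (or the end of the text)."""
    match = re.search(
        r"### Response:\s*(.*?)\s*(?:### End|$)",
        generated_text,
        flags=re.DOTALL,
    )
    return match.group(1).strip() if match else ""


def print_extracted_answer(generated_text: str) -> None:
    print(extract_answer(generated_text))


# Hypothetical model output, including text generated after the end marker.
sample = (
    "### Input:\nGenerate a SQL query ...\n\n"
    "### Response:\nSELECT COUNT(*) FROM t\n### End\n### Instructions: extra text"
)
extracted = extract_answer(sample)
print_extracted_answer(sample)
```

Cutting at the end marker matters because the model may keep generating after the query until it reaches max_length.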

Below are some examples generated by the fine-tuned model.

You can run the code used in this post in this Colab.

Next Steps

  • Compare the results from the fine-tuned model with the original model.
  • If the results are not satisfactory, fine-tune the model with more data or select observations with higher quality.
  • Build a UI and deploy the model so anyone can use it, and collect feedback in the UI in order to improve the model.

Acknowledgments

This post was inspired by the posts of Kshitiz Sahay.

P.S.

Now that you know the theory and the code behind fine-tuning a model, you can use a library to fine-tune a model easily, even locally (if you have the needed hardware).

One option is to use AutoTrain from Hugging Face.

Another option is using services like https://www.lamini.ai/.

Feel free to reach out to me with any comments on my LinkedIn account, and thank you for reading this post.

If you like what you read, be sure to 👏 it below, share it with your friends, and follow me so you do not miss this series of posts.

References

https://colab.research.google.com/drive/1aC0S9V31eHX87RY6cjCeBo397DM9NUWd#scrollTo=tCbnhnxtnhvh

Jair Neto

ML engineer / Analytics engineer | UCI & UFCG Alumni