In our previous article, we delved into the fundamentals of fine-tuning a Large Language Model (LLM) on a T4 GPU, exploring concepts like gradients, optimizers, and quantization. Now, let’s take the next step in this journey.

In this post, we’ll introduce you to a powerful technique known as LoRA (Low-Rank Adaptation). LoRA, as detailed in the paper “LoRA: Low-Rank Adaptation of Large Language Models” published in October 2021, is the key that unlocks the path to creating your very own custom LLM, fine-tuned with your unique data.

# Introduction

Let’s begin with the main hypothesis of the paper:

*When adapting to a specific task, Aghajanyan et al. (2020) shows that the pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, we hypothesize the updates to the weights also have a low “intrinsic rank” during adaptation.*

But let’s translate these hypotheses into plain English. To start, let me remind you of some concepts that you probably learned in Linear Algebra.

# Low Intrinsic Dimension

This idea centers on the ability to describe what a Large Language Model (LLM) has to learn during fine-tuning in a much smaller subspace than its full set of weights, without causing major changes in its behavior. To grasp this concept, picture a massive black and white image, a thousand pixels by a thousand, with just a single dot. It’s easy to see that you don’t need a million pixel values to precisely represent that single dot. In such cases, the pixel matrix of the image has a low intrinsic dimension because you can convey its content with much less data, making it a more efficient representation. This notion is similar to how we can capture essential information using a smaller subspace in Linear Algebra.

*Inspired by this, we hypothesize the updates to the weights also have a low “intrinsic rank” during adaptation*

The rank of a matrix is the maximum number of its rows or columns that are linearly independent. In other words, it is the size of the largest set of columns (or rows) in which none can be expressed as a linear combination of the others.

So if we have the matrix A, the rank is equal to 1, because the second column is a linear combination of the first: it is just the first column multiplied by 2.

But in the example B below, the rank is 2, because no column can be expressed as a linear combination of the other.
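To make the two cases concrete, here is a small NumPy check. The matrices below are hypothetical stand-ins that match the description (they are not the original figures): in `A` the second column is twice the first, and in `B` no column is a multiple of the other.

```python
import numpy as np

# Hypothetical stand-ins for the matrices described above
# (assumed values, chosen only to match the text).
A = np.array([[1, 2],
              [2, 4],
              [3, 6]])   # second column = 2 * first column
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])   # no column is a multiple of the other

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
print(rank_A, rank_B)  # prints: 1 2
```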

# LoRA

Here’s where the magic of LoRA comes into play. Instead of fine-tuning all the individual matrix weights of a Large Language Model (LLM), LoRA takes a different approach. It introduces two low-rank matrices, and the key here is to keep these matrices as small as possible, containing the fewest elements necessary. This clever technique allows for efficient adaptation without the need to fine-tune every single matrix weight.

In traditional fine-tuning, we usually train all of the model’s weights, as we can see in the image below.

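In LoRA, we replace the huge weight update matrix ΔW with the product of two much smaller matrices, B and A, assuming that

$$W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},$$

and the rank r ≪ min(d,k). Now we only train A and B, while the pretrained weights W0 stay frozen. You can see an image representation of the process below.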
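As a rough illustration of the decomposition, here is a minimal NumPy sketch of a single LoRA layer. The sizes are made up for the example. The zero initialization of B, the Gaussian initialization of A, and the `alpha / r` scaling follow the LoRA paper rather than anything stated in this post, so treat them as assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4              # illustrative sizes, with r << min(d, k)
alpha = 8                         # scaling hyperparameter (from the LoRA paper)

W0 = rng.normal(size=(d, k))              # pretrained weights: frozen
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random Gaussian init
B = np.zeros((d, r))                      # trainable, zero init so BA = 0 at the start

def lora_forward(x):
    """h = W0 x + (alpha / r) * B (A x); the full d x k update is never materialized."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
h = lora_forward(x)

full_params = W0.size             # d * k
lora_params = A.size + B.size     # r * (d + k)
print(full_params, lora_params)   # prints: 3072 448
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training only ever touches the `r * (d + k)` entries of A and B.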

# Advantages of LoRA

## Memory and Speed

LoRA improves training efficiency and reduces the hardware requirements by up to three times when using adaptive optimizers. This is because we no longer need to calculate gradients or maintain optimizer states for most parameters. Instead, we only optimize the small matrices that we add to the model, and the number of trainable parameters can be as small as 0.01% of the original model’s parameters.
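To get a feel for the savings, the short calculation below compares the trainable parameters of one adapted matrix with the full matrix. The dimensions (`d = k = 4096`, `r = 8`) are assumptions picked for illustration, not values from the paper.

```python
def lora_param_fraction(d: int, k: int, r: int) -> float:
    """Fraction of a d x k weight matrix that LoRA trains at rank r."""
    full = d * k          # parameters in the original matrix
    lora = r * (d + k)    # parameters in B (d x r) plus A (r x k)
    return lora / full

# Assumed, illustrative dimensions
fraction = lora_param_fraction(d=4096, k=4096, r=8)
print(f"{fraction:.2%} of the matrix is trainable")
```

For this single matrix the fraction is about 0.39%. Measured against the whole model it gets smaller still when only some of the weight matrices receive adapters.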

## Adaptability

Since we only have to train two small matrices, we can share one pre-trained model and create multiple small LoRA modules for various tasks. We can ‘freeze’ the shared model and efficiently switch between tasks by swapping the matrices A and B, reducing the need for extensive storage and speeding up the task-switching process. For example, we could have N LoRA modules, each trained for a specific task, and add whichever one we need to the LLM weights.

## No additional delay at inference

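To deploy in production, we just need to compute W0 + BA once and then perform inference as usual. This works because both W0 and BA are d × k, so the merged matrix has the same shape, and the same inference cost, as the original.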
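The merge and the task switch described above can be sketched in a few lines of NumPy. The two adapters here are random placeholders standing in for trained modules, and the scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 32, 24, 2                      # illustrative sizes

W0 = rng.normal(size=(d, k))             # shared frozen base weights
# Two hypothetical adapters, e.g. one per task (random placeholders)
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))

x = rng.normal(size=k)

# Unmerged: base path plus a separate adapter path
h_unmerged = W0 @ x + B1 @ (A1 @ x)

# Merged: fold the adapter into the weights once, then use a single matmul
W_task1 = W0 + B1 @ A1
h_merged = W_task1 @ x

# Switching tasks: subtract one adapter and add another
W_task2 = W_task1 - B1 @ A1 + B2 @ A2
```

The merged and unmerged outputs agree up to floating-point error, which is why serving the merged weights adds no latency compared with the original model.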

## Robust to catastrophic forgetting

Catastrophic forgetting is the phenomenon that occurs when we fine-tune a deep learning model and the model ‘forgets’ what it has learned. Since we freeze the LLM weights and train only the new adapter weights, LoRA lets the LLM learn new tasks with a lower risk of losing prior knowledge.

# QLoRA

But there was still room for improvement, and in May 2023 the paper “QLoRA: Efficient Finetuning of Quantized LLMs” was released.

QLoRA introduces three main improvements to LoRA.

## 4-bit NormalFloat

An information-theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit integers and 4-bit floats.
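The idea behind a quantile-based data type can be sketched as follows. This is a simplified illustration, not the exact NF4 table: it places 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1], whereas the real NF4 codebook is built slightly differently so that zero is represented exactly. The function names are made up for this example.

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_codebook(bits: int = 4) -> np.ndarray:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()

def quantize(w, codebook):
    """Scale a block by its absolute maximum, then pick the nearest level."""
    absmax = np.abs(w).max()
    idx = np.abs(w[:, None] / absmax - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax   # 4-bit codes plus one constant per block

def dequantize(idx, absmax, codebook):
    return codebook[idx] * absmax

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=64)       # one block of normally distributed weights
codebook = normal_quantile_codebook()
idx, absmax = quantize(w, codebook)
w_hat = dequantize(idx, absmax, codebook)
max_error = np.abs(w - w_hat).max()
```

Because the levels are denser near zero, where most normally distributed weights live, the rounding error is spent where it matters least.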

## Double Quantization

A method that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model).
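The 0.37-bit figure can be reproduced with a short calculation. The block sizes and bit widths below are the ones reported in the QLoRA paper (64 weights per block, 8-bit second-level quantization with blocks of 256); they are not stated in this post, so treat them as assumptions of the sketch.

```python
# Settings as reported in the QLoRA paper (assumed here)
BLOCK = 64      # weights sharing one quantization constant
BLOCK2 = 256    # first-level constants sharing one second-level constant

# Overhead per weight without double quantization: one 32-bit constant per block
bits_before = 32 / BLOCK

# With double quantization: 8-bit constants, plus 32-bit second-level constants
bits_after = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

saved_bits = bits_before - bits_after

n_params = 65e9
saved_gb = n_params * saved_bits / 8 / 1e9   # bits -> bytes -> GB

print(f"{saved_bits:.3f} bits per parameter, about {saved_gb:.1f} GB for 65B parameters")
```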

## Paged Optimizers

Using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.

# How does QLoRA work?

- First, we quantize the LLM to 4 bits (NF4).
- Each element in the weight matrices is then stored using only 4 bits.
- Then we do the LoRA training in 16-bit precision (BFloat16).
- During training, the frozen model weights are dequantized (converted back) from NF4 to BFloat16 only when they are needed for a computation. The stored copy stays in NF4, and only the LoRA adapters are updated.
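The steps above can be sketched as a toy training step in NumPy. Several things here are stand-ins, so read it as an illustration of the data flow rather than the real implementation: a uniform 16-level codebook replaces NF4, `float32` replaces BFloat16 (NumPy has no bfloat16), the loss is a plain squared error, and the gradients are written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2                       # illustrative sizes
codebook = np.linspace(-1.0, 1.0, 16)     # uniform 4-bit levels, a stand-in for NF4

# 1) Quantize the frozen base weights once: 4-bit codes plus one scale
W = rng.normal(size=(d, k))
scale = np.abs(W).max()
codes = np.abs(W[..., None] / scale - codebook).argmin(axis=-1).astype(np.uint8)

def dequantize(codes, scale):
    # float32 stands in for the BFloat16 compute dtype
    return (codebook[codes] * scale).astype(np.float32)

# 2) LoRA adapters live in the higher-precision compute dtype
A = rng.normal(scale=0.01, size=(r, k)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def train_step(A, B, x, target, lr=0.1):
    W_hat = dequantize(codes, scale)      # temporary copy, discarded after the step
    Ax = A @ x
    y = W_hat @ x + B @ Ax                # forward pass
    err = y - target                      # d(loss)/dy for loss = 0.5 * ||y - target||^2
    grad_B = np.outer(err, Ax)            # gradients only for the adapters
    grad_A = np.outer(B.T @ err, x)
    loss = 0.5 * float(err @ err)
    return A - lr * grad_A, B - lr * grad_B, loss

x = rng.normal(size=k).astype(np.float32)
target = rng.normal(size=d).astype(np.float32)
codes_before = codes.copy()

losses = []
for _ in range(2):
    A, B, loss = train_step(A, B, x, target)
    losses.append(loss)
```

Note that `codes` is never written to: the 4-bit base weights are read, dequantized, used, and thrown away on every step, while A and B carry all of the learning.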

# Summary

To summarize, QLoRA has one storage data type (usually 4-bit NormalFloat) and one computation data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters, which use 16-bit BrainFloat.

The key takeaway is that QLoRA significantly cuts the average memory needs for fine-tuning a 65B parameter model from over 780GB of GPU memory to less than 48GB. Importantly, this efficiency doesn’t compromise runtime or predictive performance when compared to a 16-bit fully finetuned baseline.

# Paper Results

The research findings consistently support the effectiveness of 4-bit QLoRA fine-tuning with the NF4 data type, which performs on par with 16-bit full fine-tuning and 16-bit LoRA fine-tuning across established academic benchmarks. The paper also reports that NF4 outperforms FP4 and that double quantization has no detrimental impact.

The study covers an extensive range of over 1,000 fine-tuned models, offering insights into instruction following and chatbot performance across various datasets, model types, and scales, including large parameter models. Notably, QLoRA fine-tuning on a small, high-quality dataset leads to state-of-the-art results, even with smaller models.

The research also presents GPT-4 evaluations as a cost-effective alternative to human evaluations for assessing chatbot performance, and it raises questions about the reliability of current chatbot benchmarks. Additionally, it provides an analysis of where Guanaco, the paper’s family of QLoRA-tuned models, falls short compared to ChatGPT, and makes all models and code, including CUDA kernels for 4-bit training, available to the community.

# Conclusion

You’ve learned how LoRA, with its smart approach of using low-rank matrices, opens up new possibilities for tailoring LLMs to specific tasks while reducing memory requirements and enhancing adaptability. The recent advancements in QLoRA further refine this process by introducing 4-bit NormalFloat, double quantization, and paged optimizers, which collectively optimize the efficiency of fine-tuning. These developments are not just theoretical; they have been rigorously tested and found to match the performance of traditional 16-bit methods across various benchmarks.

This marks an exciting step toward democratizing the use of Large Language Models and fine-tuning for a broader audience.

With LoRA and QLoRA, we have the tools to create more efficient, adaptive, and cost-effective language models, heralding a new era of possibilities for AI and NLP applications.

Now, the 7 billion parameters, loaded as 4-bit NormalFloat (NF4), occupy a mere 3.5GB of memory. We allocate 1GB for LoRA parameters, another 1GB for gradients, and 4GB for optimizer states. With an additional 4GB set aside for peak activation memory during training, we now require just 13.5GB for the fine-tuning process. You’ve grasped the complexities and their solutions for training your custom LLM with your data.
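The budget above is simple enough to check with a few lines of arithmetic. The per-item allocations are the figures quoted in the paragraph; only the base-weight size is derived.

```python
GB = 1e9
n_params = 7e9
bits_per_weight = 4                      # 4-bit NormalFloat storage

base_weights_gb = n_params * bits_per_weight / 8 / GB

budget_gb = {
    "base weights (NF4)": base_weights_gb,
    "LoRA parameters": 1.0,
    "gradients": 1.0,
    "optimizer states": 4.0,
    "peak activations": 4.0,
}
total_gb = sum(budget_gb.values())

for name, size in budget_gb.items():
    print(f"{name:>20}: {size:4.1f} GB")
print(f"{'total':>20}: {total_gb:4.1f} GB")
```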

# Up Next

In the upcoming article, we’ll roll up our sleeves and dive into the code, putting all these concepts into action. Stay tuned for the practical implementation!

Feel free to reach out to me with any comments on my LinkedIn account, and thank you for reading this post.

If you like what you read be sure to 👏 it below, share it with your friends and follow me to not miss this series of posts.