Demystifying Large Language Models for Everyone: Fine-Tuning Your Own LLM. Part 1/3
The first thing to keep in mind is that LLM stands for Large Language Model. These are immense models comprising billions of parameters, and their huge size, coupled with the vast amount of data used for training, lends them their almost magical capabilities.
However, this presents a challenge for many of us, as these models demand substantial computational resources. Sam Altman once mentioned that training GPT-4 came with a staggering cost of 100 million dollars.
In this series, we will delve into the first solution that allows us to fine-tune an LLM using a standard 16GB T4 GPU, making this powerful technology more accessible to a broader audience.
What you will learn
- Optimizer States
- Techniques to decrease the memory usage of LLMs
In this example, we’ll be working with Meta’s Llama 7B model, a model with 7 billion parameters. Each parameter is represented as a 32-bit floating-point number, occupying 4 bytes of memory. Simply loading this model into memory requires a staggering 28 gigabytes: 7 billion (model parameters) × 4 bytes (32-bit float size) = 28 GB. On a 16GB T4 GPU, we would immediately hit an “Out of Memory” (OOM) error.
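The arithmetic above can be sketched in a couple of lines of Python (the helper name `model_memory_gb` is mine, and “GB” here means 10⁹ bytes for simplicity):

```python
# Rough memory estimate for just holding a model's weights in memory.
def model_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

weights_gb = model_memory_gb(7e9, 4)  # 7B params * 4 bytes (fp32)
print(weights_gb)  # 28.0 -- already well beyond a 16GB T4
```

The same helper works for any precision: halving `bytes_per_param` to 2 (fp16) gives 14 GB, which is the trade-off discussed later in this article.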
However, loading the model is just the beginning. The fine-tuning process involves two additional memory-intensive steps: managing the gradients and the optimizer’s states. These steps further add to the memory requirements, making efficient memory management an essential aspect of working with such large-scale models.
Let’s Start with the Gradients
Gradients play a pivotal role in training machine learning models.
They are what enables a model to actually learn from the data and improve its performance over the course of training. During training, the model learns using a technique called Gradient Descent.
To learn more about Gradient Descent you can read my article about gradient descent in the linear regression model.
To summarize: in Gradient Descent we minimize a loss function by iteratively adjusting model parameters based on calculated gradients and a learning rate. Gradients are essentially partial derivatives that indicate the direction and magnitude of the parameter changes needed to minimize the model’s loss.
Typically, there’s one gradient for each model parameter. Therefore, in the case of a 7-billion-parameter model, we’re dealing with 7 billion gradients, each occupying 4 bytes of memory. This totals to a memory requirement of 28 gigabytes, just for storing the gradients.
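To make the idea concrete, here is a minimal gradient descent sketch for a single-parameter toy problem (the loss function and learning rate are illustrative choices, not anything from an actual LLM):

```python
# Minimal gradient descent sketch: minimize loss(w) = (w - 3)^2.
# The gradient d(loss)/dw = 2 * (w - 3) gives the direction and magnitude
# of each update, scaled by the learning rate -- exactly the role each of
# the 7 billion gradients plays during LLM fine-tuning.
def gradient(w: float) -> float:
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter value
lr = 0.1   # learning rate
for _ in range(100):
    w -= lr * gradient(w)  # step against the gradient

print(round(w, 4))  # 3.0 -- converged to the loss minimum
```

An LLM does the same thing, except `w` is a tensor with billions of entries, and each entry needs its own stored gradient.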
Now that you know about gradients and gradient descent: to fine-tune large language models we could use Stochastic Gradient Descent (SGD), but in practice we usually use an optimizer called Adam, and often a variant known as AdamW. (https://towardsdatascience.com/why-adamw-matters-736223f31b5d)
This is the process where we update the LLM’s parameter weights, and for that we need to store two additional state values per parameter, Vt and St, as you can see in the image below. So 7B × 4 bytes × 2 (Vt and St) = 56 GB.
In total: 28 + 28 + 56 = 112 GB needed to train the model using 32-bit floating-point variables.
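As a rough sketch (not PyTorch’s actual implementation), here is what a single Adam update for one parameter looks like; note the two persistent buffers `v` and `s` (the Vt and St above), which are what double the optimizer’s memory footprint. The hyperparameter defaults follow the values commonly quoted for Adam:

```python
import math

# Sketch of one Adam update for a single parameter. The two state
# buffers v (first moment) and s (second moment) must be kept between
# steps -- one pair per parameter, hence 7B * 4 bytes * 2 = 56GB.
def adam_step(w, grad, v, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    v = b1 * v + (1 - b1) * grad       # moving average of gradients
    s = b2 * s + (1 - b2) * grad ** 2  # moving average of squared gradients
    v_hat = v / (1 - b1 ** t)          # bias correction (t = step number)
    s_hat = s / (1 - b2 ** t)
    w = w - lr * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s                     # v and s persist across steps

w, v, s = 1.0, 0.0, 0.0
w, v, s = adam_step(w, grad=0.5, v=v, s=s, t=1)
```

AdamW differs mainly in how weight decay is applied, but its memory footprint is the same: two extra values per parameter.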
Among the techniques available to tackle the memory challenge, a straightforward approach involves transitioning from 32-bit floating-point precision to 16-bit precision. This adjustment effectively reduces memory consumption by half, transforming the requirement from 112 gigabytes to 56 gigabytes. However, this approach comes with a trade-off: a 16-bit float has only 65,536 distinct bit patterns and can represent values no larger than about ±65,504, whereas a 32-bit float has roughly 4 billion distinct patterns and reaches about ±3.4 × 10³⁸.
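You can see this range limit directly from Python’s standard library: the `struct` module’s `"e"` format packs a float into IEEE-754 half precision (the helper name `to_fp16` is mine):

```python
import struct

# fp16's largest finite value is 65504; struct refuses to pack anything
# bigger, which makes the reduced range of 16-bit floats easy to observe.
def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

print(to_fp16(65504.0))  # 65504.0 -- the largest finite fp16 value
try:
    to_fp16(70000.0)     # beyond fp16's representable range
except OverflowError:
    print("70000.0 does not fit in fp16")
```

This is why mixed-precision training keeps some quantities (like the optimizer states) in higher precision: large intermediate values can overflow fp16.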
Quantization is essentially the process of mapping continuous values to discrete values: in our case, from 32-bit floating-point numbers to 8-bit integers.
But this is a challenge, because now we have to map billions of numbers onto the 256 values an int8 can represent (255 in the symmetric schemes often used for weights), without losing too much model accuracy.
The quantization process usually consists of two key steps: calibration and quantization itself.
The goal of calibration is to find the α factor that maximizes precision; once α is calibrated, we use the formula below to perform the quantization.
We’re looking at a range of numbers, and anything beyond a certain range (let’s call it the “α-interval”) will be cut off or clipped. Anything within this range will be rounded to the nearest int8 number. It’s important to choose this range carefully. If the range is too big, it can include a lot of numbers, but that might lead to rough approximations and high errors, both from clipping and rounding. So, picking the right range is often about finding a balance between making mistakes by cutting off numbers and making mistakes by rounding them.
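The clip-and-round step described above can be sketched in a few lines. This is an illustrative absmax-style symmetric quantizer, not any particular library’s implementation, and the toy weights and α value are my own:

```python
# Symmetric int8 quantization sketch: values beyond the α-interval are
# clipped, values inside it are scaled to [-127, 127] and rounded.
# Choosing α well is exactly the calibration problem described above.
def quantize_int8(values, alpha):
    scale = 127.0 / alpha
    out = []
    for v in values:
        v = max(-alpha, min(alpha, v))  # clip to the α-interval
        out.append(round(v * scale))    # round to the nearest int8 level
    return out

def dequantize(q, alpha):
    scale = 127.0 / alpha
    return [x / scale for x in q]

weights = [0.01, -0.5, 0.73, 2.5]      # toy fp32 weights
q = quantize_int8(weights, alpha=1.0)  # 2.5 is clipped to the interval edge
print(q)  # [1, -64, 93, 127]
```

Notice the two error sources in action: 2.5 is clipped to the boundary (a clipping error), while 0.01 lands on the nearest integer level (a rounding error). A larger α would avoid the clipping but make every rounding step coarser.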
So now we would need 7 × 1 + 7 × 1 + 7 × 2 × 1 = 28 GB (where 7 is the number of model parameters in billions, 1 is one byte per int8 value, and 2 stands for the two optimizer-state values we need to keep). We are now much closer to fitting the model into the T4’s 16 GB of GPU memory.
In this article you learned about:
- Model Size: Large Language Models (LLMs) are massive, with billions of parameters, making them resource-intensive to work with.
- Gradients: Storing gradients for a 7-billion-parameter LLM alone requires a whopping 28GB of memory.
- Optimizers: Advanced optimizers like AdamW are commonly used in LLM fine-tuning, contributing to memory requirements.
- Precision Change: Transitioning from 32-bit floating-point precision to 16-bit can halve memory usage, but it comes with a trade-off of reduced representational capacity.
- Quantization: Quantization is a technique that maps continuous values to discrete ones, a strategy employed to reduce memory usage. However, it necessitates a careful balance between memory efficiency and model accuracy.
Feel free to reach out to me with any comments on my LinkedIn account, and thank you for reading this post.
If you like what you read, be sure to 👏 it below, share it with your friends, and follow me so you don’t miss the rest of this series.
In the next article, I am going to talk more about a technique called Low-Rank Adaptation (LoRA), which will continue to help us in the process of fine-tuning LLMs.