LLM Fine-tuning: Old school, new school, and everything in between
In the previous blog, as part of our LLM series, we explored various LLMs and their characteristics, and outlined some helpful LLM tooling. We also highlighted the distinction between closed-source and open-source LLMs. In this blog, we dive deep into LLM finetuning. In the following sections, we explain where and why finetuning an LLM is useful. We also provide a historical overview of different LLM finetuning approaches and discuss emergent techniques in the field.
Training LLMs involves several stages, such as pretraining, supervised finetuning, and advanced techniques like reinforcement learning from human feedback (RLHF), to produce a conversational assistant that excels at many different tasks, resembling a “Jack of all trades”. In the pretraining stage, a large transformer model is trained on an even larger dataset. A suitable dataset consists of trillions of words scraped from the internet. For example, one data recipe curated by RedPajama contains data scraped from popular websites such as Wikipedia, Stack Exchange and GitHub. This dataset contains 1.2 trillion tokens and takes around 3 TB of disk space. The task for this stage is simply to learn to fill in the blanks, for example, predicting the next word in the sentence “The sky is ”. A model trained in such a self-supervised fashion creates an internal representation of language: it learns statistical information about which word is likely to come next for a given input context. Training on many diverse datasets enables these pretrained models to establish a base understanding of natural language. Such pretrained models are also referred to as foundation models.
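To make the idea of "learning which word is likely to come next" concrete, here is a minimal sketch of a bigram model, which simply counts which word follows which. It is a deliberately tiny stand-in for a transformer, not how LLMs are actually built; the corpus and the function names `train_bigram` and `next_word_probs` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def next_word_probs(model, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = model[word.lower()]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy corpus, invented for this example.
corpus = [
    "the sky is blue",
    "the sky is blue today",
    "the sky is cloudy",
    "the sea is blue",
]

model = train_bigram(corpus)
probs = next_word_probs(model, "is")
print(probs)                          # distribution over words seen after "is"
print(max(probs, key=probs.get))      # most likely continuation
```

A real LLM conditions on the whole preceding context rather than a single word, and learns its probabilities with a neural network instead of a lookup table, but the training signal is the same: predict the next token.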
Pretrained base models are good at one thing: completions. They produce sentences that are grammatically correct and fluent, and are sometimes referred to as stochastic parrots 🦜. Out of the box, they are not especially good at NLP tasks such as sentiment analysis or text classification. This is where the supervised finetuning phase plays an important role. Fine-tuning takes these pretrained base models and trains them on domain- or task-specific datasets in a supervised fashion. Fine-tuning leverages the general knowledge learnt through pretraining and adapts it to a specific domain, resembling a “Master of one”. This process is also known as “transfer learning”, because it allows a model to transfer knowledge gained from one task to another. Pretraining models is a resource-intensive process. Fine-tuning models, on the other hand, requires comparatively modest compute resources for the additional training.
Fine-tuning a pretrained model is useful for quickly developing high-quality models for a specific task with minimal training data. There is also a meaningful privacy benefit: where the application domain requires sensitive data to be used to fine-tune a base model, this can all be done on premise, owing to the relatively modest computational resources required for finetuning. There are different approaches to finetuning: some change the weights of the base model, while others add a few extra layers on top of the base model and train only these additional layers. We will explain the various approaches to finetuning LLMs in the next section.
Historical Fine-tuning techniques
In the previous section, we looked at where and why finetuning is useful. In this section, we focus on the question of how to fine-tune. We outline different approaches to finetuning an LLM. Broadly, these approaches can be categorized into two groups. The first group consists of three old-school approaches, which are not limited to finetuning LLMs. The second group consists of innovative approaches for parameter efficient finetuning, emerging from research into LLMs. In the following section, we will explore each of the two groups in greater detail.
All the approaches we discuss below use a pre-trained LLM as a base model. Fine-tuning is performed using this base model on a labelled, domain-specific dataset. Fine-tuning involves copying the weights from a pre-trained network and tuning them on the downstream task. Below we outline three approaches that slightly tweak the training process.
1. Feature-based approach: In the feature-based approach, the pretrained LLM acts as a feature extractor. For a given input, e.g. “This movie is great”, the pretrained LLM outputs a fixed-size, n-dimensional array, e.g. a 512-dimensional vector. We create a separate classifier, which can be a simple logistic regression model or a more complex multi-layer, fully connected network, with the last layer containing as many neurons as there are output classes in the NLP classification task. The input to this classifier is the n-dimensional array produced by the pretrained model. The output of the classifier gives the probability of the input belonging to each class. As noted in the first figure above, training changes only the weights of the separate classifier and keeps the weights of the pretrained LLM frozen. This is why the pretrained LLM acts as a feature extractor, while a separate network uses these features to train a classification model.
2. Finetuning I: This approach is similar to the feature-based approach. Instead of creating a separate classifier, we append a few dense layers to the end of the pretrained LLM. While training, we freeze the weights of the pretrained LLM and only change the weights of the newly added layers. Freezing here means that the weights of the pretrained LLM do not change during training. It has been observed through experimentation that the performance of this approach is slightly better than that of the feature-based approach.
3. Finetuning II: This approach builds upon Finetuning I, except that we unfreeze the entire network. While training, all the weights of the model are changed. This can lead to the problem of catastrophic forgetting, where new features overwrite previously learnt features. This approach is resource intensive and expensive, as all parameters of the LLM are involved. In practice, this approach provides superior results compared to the other two.
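The feature-based approach can be sketched in a few lines of plain Python. In the sketch below, `frozen_extractor` is a hypothetical stand-in for a pretrained LLM (it just buckets words into a small fixed-size vector), and the classifier is a logistic regression trained with gradient descent. The names, the 8-dimensional feature size and the toy reviews are all assumptions made for illustration; the point is only that training touches the classifier's weights and nothing else.

```python
import math

DIM = 8  # stand-in for the n-dimensional output of a pretrained model

def frozen_extractor(text):
    """Hypothetical stand-in for a frozen pretrained LLM: maps text to a
    fixed-size vector. It has no trainable state, so training cannot change it."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[sum(map(ord, word)) % DIM] += 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, text):
    """Probability that `text` belongs to the positive class."""
    features = frozen_extractor(text)
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train_classifier(examples, epochs=500, lr=0.1):
    """Logistic regression on top of the frozen features.
    Only `weights` and `bias` (the separate classifier) are updated."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = frozen_extractor(text)
            error = predict(weights, bias, text) - label  # d(log loss)/d(logit)
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Toy labelled dataset (1 = positive, 0 = negative), invented for this example.
examples = [
    ("this movie is great", 1),
    ("this movie is terrible", 0),
    ("great acting", 1),
    ("terrible plot", 0),
]

weights, bias = train_classifier(examples)
print(round(predict(weights, bias, "this movie is great"), 2))
print(round(predict(weights, bias, "this movie is terrible"), 2))
```

Finetuning I trains the same kind of parameters, except that the new layers are attached to the end of the pretrained model rather than kept as a separate network. Finetuning II would additionally update the parameters inside the extractor itself.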
Interestingly, we use the Finetuning II approach to finetune an LLM for summarizing legal documents. More details about the process and the application can be found here.
None of the three approaches above is specific to the NLP domain; they also work well in the vision domain, where typical tasks include image segmentation, image classification and object detection.
One major drawback of using the old school approaches for finetuning current, state-of-the-art LLMs is the sheer number of parameters in bleeding-edge LLMs. Consider, for example, the task of finetuning Falcon LLM, which comes in two variants, 7B and 40B, where B represents billions of parameters. For inference alone, the 7B model requires on the order of 16 GB of RAM, with the 40B model needing as much as 90 GB of RAM. Finetuning would be quite expensive and resource intensive for multibillion-scale LLMs. The larger the base model, the more expensive it is to train all the layers.
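A rough back-of-the-envelope calculation shows where numbers of this size come from. The sketch below assumes 16-bit weights (2 bytes per parameter) for inference and, for full finetuning with the Adam optimizer in 32-bit precision, roughly 16 bytes per parameter (4 for the weight, 4 for its gradient and 8 for the two optimizer moments). These byte counts are common rules of thumb rather than figures from the Falcon documentation; activations and framework overhead are ignored, and the parameter counts are the nominal 7B and 40B.

```python
GB = 1e9  # decimal gigabytes

def inference_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, assuming 16-bit storage."""
    return num_params * bytes_per_param / GB

def full_finetune_memory_gb(num_params, bytes_per_param=16):
    """Assumed 16 bytes/param: fp32 weight + gradient + two Adam moments."""
    return num_params * bytes_per_param / GB

# Nominal parameter counts for the two Falcon variants.
models = {"Falcon-7B": 7e9, "Falcon-40B": 40e9}

for name, n in models.items():
    print(f"{name}: ~{inference_memory_gb(n):.0f} GB to load, "
          f"~{full_finetune_memory_gb(n):.0f} GB to fully fine-tune")
```

Under these assumptions the weights alone take about 14 GB and 80 GB, slightly below the 16 GB and 90 GB quoted above, which leaves room for activations and runtime overhead. Updating every parameter pushes the estimate to roughly 112 GB and 640 GB, which is what makes full finetuning impractical on a single commodity GPU.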
Parameter Efficient FineTuning (PEFT) addresses the problem of finetuning LLMs by training only a subset of parameters. This might be a set of newly added parameters or a selection of existing model parameters. PEFT approaches also help mitigate the problem of catastrophic forgetting, as they leave all or most of the existing weights of the pretrained model unchanged. The literature on PEFT approaches has exploded in the past few years. The authors of the paper “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning” provide an excellent overview of 30 different PEFT methods. At a high level, the various PEFT approaches can be classified into 3 main groups based on their underlying approach: Additive-based, Selection-based and Re-parameterization-based, as shown in the diagram above. Two sub-categories of the Additive-based approaches are Adapter-like methods and Soft prompts.
1. Additive methods: The idea of additive methods is to add extra parameters to an existing pretrained model and train only these new parameters. Adding extra parameters can increase the compute per training step, but memory efficiency improves because gradients and optimizer states are only needed for the new parameters; techniques like quantization of the frozen weights can reduce memory further. Additive PEFT methods therefore enable the finetuning of much larger networks or the use of larger microbatch sizes, which improves training throughput on GPUs.
2. Adapter methods: These methods are a type of additive PEFT method that adds small fully-connected layers after Transformer sub-layers. One adapter paper comes within 0.4% of the performance of a fully finetuned BERT model on GLUE using BERT with adapters, adding only 3.6% of parameters per task compared to training 100% of parameters in full finetuning.
3. Soft prompting methods: In soft prompting approaches such as soft prompt tuning or prefix tuning, a small set of trainable parameters is introduced alongside the model input or prepended to each transformer block. Training only these parameters using gradient descent improves the performance on the target task. One prefix tuning paper achieved results comparable to a fully finetuned GPT-2 model while training only 0.1% of the parameters.
4. Selective methods: The selective approach finetunes only a subset of the existing parameters; its simplest form finetunes only the top layers of the network. Amongst the different approaches, the BitFit paper introduces a sparse finetuning method where only the bias terms of the model are modified. Using BERT, this approach finetunes only 0.09% of parameters and achieves results comparable to finetuning the entire model.
5. Reparametrization-based methods: Reparametrization-based parameter-efficient finetuning methods leverage low-rank representations to minimize the number of trainable parameters. The Low-Rank Adaptation (LoRA) paper adds trainable rank decomposition matrices alongside the weight matrices of each transformer layer and trains only those newly added weights. For a given weight matrix W of shape (n, n), W itself stays frozen and its update is represented as the product of two additional matrices, A of shape (n, rank) and B of shape (rank, n). The parameter count increases by only (n * rank + rank * n), and only these new parameters are trained. This approach has been tested on models with up to 175 billion parameters and is reported to outperform adapter and selective methods.
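The low-rank idea behind LoRA can be sketched in plain Python. The code below follows the shapes used above (W is n by n, A is n by rank, B is rank by n) and computes the adapted layer's output as x W + x A B. Initialising B to zeros, so that the adapted layer starts out identical to the frozen one, is a choice made for this sketch, as are the helper names `matmul`, `lora_param_counts` and `lora_forward`; this is an illustration of the technique, not the reference implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_counts(n, rank):
    """Return (parameters in W, extra trainable parameters in A and B)."""
    return n * n, n * rank + rank * n

def lora_forward(x, W, A, B):
    """Output of the adapted layer: x @ W + x @ A @ B, with W kept frozen."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[b + u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]

random.seed(0)
n, rank = 4, 2
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]         # frozen
A = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(n)]   # trainable
B = [[0.0] * n for _ in range(rank)]                                   # trainable, starts at zero
x = [[1.0, 2.0, 3.0, 4.0]]

# With B at zero, the low-rank update contributes nothing yet.
print(lora_forward(x, W, A, B) == matmul(x, W))

full, extra = lora_param_counts(1024, 8)
print(f"W: {full} parameters, LoRA adds {extra} ({100 * extra / full:.2f}%)")
```

For a 1024 by 1024 weight matrix and rank 8, the two small matrices add 16,384 trainable parameters against 1,048,576 frozen ones, which is about 1.56% of the original count.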
LLMs can also be adapted to a task using just a few examples supplied in the prompt. Learning from a handful of examples in this way is known as few-shot learning or in-context learning. LLMs are known to perform better when given a small amount of task-specific data. In this process, none of the LLM’s weights are updated. Later in our series on LLMs, we will go into more detail on different approaches, such as zero-shot learning and prompting techniques like chain-of-thought (CoT) prompting, that require no training or finetuning of the LLM at all. These techniques are useful in scenarios where we don’t have direct access to the model.
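As a small illustration of in-context learning, the sketch below assembles a few-shot prompt from labelled examples. The "Review:/Sentiment:" layout, the instruction text and the function name `build_few_shot_prompt` are assumptions made for this example; the resulting string would simply be sent to an LLM as input, with no weights being updated.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Classify the sentiment of each review."):
    """Build a prompt that shows the model a few solved examples
    followed by the unsolved query it should complete."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

# Toy examples, invented for illustration.
examples = [
    ("This movie is great", "positive"),
    ("The plot was a mess", "negative"),
]

prompt = build_few_shot_prompt(examples, "I would watch it again")
print(prompt)
```

The model is expected to continue the text after the final "Sentiment:" line, using the examples in the prompt as its only task-specific guidance.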
In this second part of our LLM series, we dived into the exciting area of finetuning LLMs. Finetuning is an effective way to tailor LLMs to specific business requirements without requiring extensive computational resources. Finetuning can be thought of as standing on the shoulders of giants, the giants being multibillion-parameter LLMs. We explained why finetuning is required on top of pretrained LLMs that act as stochastic parrots 🦜. To put it simply, finetuning involves copying the weights from a pretrained network and modifying them on the downstream task. Under historical finetuning techniques, we visited two schools of approaches. The old school approaches evolved and were widely used before the LLM era. The LLM era introduced models ranging from 60 million to 175 billion parameters, making the use of old school approaches infeasible due to resource constraints. This gave rise to Parameter Efficient FineTuning (PEFT) approaches that can efficiently finetune LLMs with billions of parameters while minimizing compute resources and cost.
In the last part of the LLM series, we will explore LLM prompting techniques and in-context learning, which don’t require any finetuning or training. Stay tuned.