Introduction
Did you know that fine-tuning a GPT-3 model costs around $100,000 in computing resources, while a LoRA approach can reduce this cost by 99.9%? 🤯 In 2024, fine-tuning is no longer a question of “if” but “how” – and most importantly, which technique to choose.
With the explosion of LLMs (Large Language Models), every team wants to adapt these models to their proprietary data. But here’s the problem: fine-tuning billions of parameters requires A100 GPUs for weeks. That’s where PEFT (Parameter-Efficient Fine-Tuning) techniques like LoRA and Adapters come in.
In this article, you’ll discover the concrete differences between these approaches, understand when to use each one, and get a practical guide to start your first efficient fine-tuning. Whether you’re managing a startup or a data team in a large corporation, you’ll leave with a clear roadmap.
🎯 Understanding Fine-Tuning: The Essential Basics
What Is Fine-Tuning Really?
Fine-tuning is the art of adapting a pre-trained model to a specific task. Imagine you buy a sports car: it already runs very well. Fine-tuning is like optimizing it for a particular circuit by adjusting the suspension and engine.
Concretely, you take a model like LLaMA-2 or Mistral that has learned from terabytes of text, and you specialize it on your data: customer support, legal documentation, or internal code. The model retains its general knowledge but becomes an expert in your domain.
The Problem with Classic Fine-Tuning
Traditional fine-tuning (full fine-tuning) updates all parameters of the model. For a 7-billion-parameter model, that represents:
- 28 GB of memory just to store the weights in FP32
- 112 GB additional for gradients and optimizer states
- Days of computation on high-performance GPUs
According to a Hugging Face study (2024), only 12% of companies can afford this luxury. That’s why PEFT techniques have exploded in popularity.
🔬 PEFT: The Fine-Tuning Revolution
The Principle of Parameter-Efficient Fine-Tuning
PEFT is based on a fascinating observation: you don’t need to modify all parameters of a model to adapt it. It’s like learning a new language when you already speak three: your brain doesn’t completely rebuild itself, it just adds new connections.
PEFT methods freeze the majority of the original model’s parameters and only train a small subset. Result: you reduce memory requirements by 90-99% while maintaining comparable performance.
The Three Major PEFT Families
- Adapters: add small modules between existing layers
- LoRA (Low-Rank Adaptation): decomposes the weight updates into products of smaller matrices
- Prompt Tuning: optimizes only input embeddings
Each family has its strengths. A typical Adapter module represents 0.5-3% of total parameters. LoRA can go down to 0.1%. Prompt tuning doesn’t even touch the model architecture.
⚙️ LoRA In-Depth: The Star Technique
How Does LoRA Work?
LoRA starts from an elegant mathematical observation. When you fine-tune a model, weight changes (ΔW) often have a low rank – meaning they can be approximated by the product of two smaller matrices.
Instead of modifying a weight matrix W of dimension 4096×4096, LoRA learns two small matrices B and A of dimensions 4096×8 and 8×4096, so that ΔW = B·A. The number of trainable parameters drops from roughly 16.8 million to only 65,536. That’s a 99.6% reduction! 🚀
# Simplified LoRA implementation example with PyTorch
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Original matrix (frozen during training)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False
        # LoRA matrices (trainable)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Classic computation + LoRA adaptation
        result = torch.matmul(x, self.weight.T)
        lora_result = torch.matmul(torch.matmul(x, self.lora_A.T), self.lora_B.T)
        return result + lora_result * self.scaling

# Usage
layer = LoRALayer(in_features=768, out_features=768, rank=8)
# Only lora_A and lora_B will be optimized (about 12k params instead of 590k)
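A quick sanity check on the class above: count the parameters that actually require gradients.

# Only the LoRA matrices are trainable
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} trainable / {total:,} total")  # 12,288 trainable / 602,112 total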
Critical LoRA Hyperparameters
The rank (r) determines adaptation capacity. A rank of 8 is often sufficient for simple tasks. For highly specialized domains (medical, legal), go up to 64 or 128.
Alpha controls adaptation intensity. The typical alpha/rank ratio is between 1 and 2. Too high, and you risk catastrophic forgetting (the model forgets its general knowledge).
According to a Microsoft Research analysis (2023), LoRA with r=16 achieves 95-98% of full fine-tuning performance on most NLP benchmarks, with only 0.2% of trainable parameters.
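To get a feel for what the rank costs, here is a tiny helper (illustrative only) that computes the trainable LoRA parameters for a single 4096×4096 projection, the typical size in a 7B model:

# Trainable LoRA parameters for one 4096x4096 projection, by rank
def lora_params(d_in=4096, d_out=4096, rank=8):
    return rank * (d_in + d_out)

for r in (8, 16, 64, 128):
    pct = lora_params(rank=r) / (4096 * 4096)
    print(f"r={r:>3}: {lora_params(rank=r):>9,} params ({pct:.2%} of the full matrix)")
# r=  8:    65,536 params (0.39% of the full matrix)
# r= 16:   131,072 params (0.78% of the full matrix)
# r= 64:   524,288 params (3.12% of the full matrix)
# r=128: 1,048,576 params (6.25% of the full matrix)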
🧩 Adapters: The Modular Approach
Architecture and Functioning
Adapters insert small feedforward networks between Transformer layers. Each adapter typically contains:
- A dense layer that reduces dimensionality (bottleneck)
- A non-linear activation
- A dense layer that restores original dimension
- A residual connection
The idea is to create “plug-and-play” modules that you can activate or deactivate. Want to adapt your model to medical French? Activate the French adapter AND the medical adapter simultaneously. This kind of stacking and composition is far more natural with Adapters than with LoRA.
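To make that concrete, here is a minimal PyTorch sketch of a bottleneck adapter (just the idea described above, not the adapter-transformers implementation):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, then add the residual."""
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # reduces dimensionality
        self.activation = nn.GELU()                          # non-linear activation
        self.up = nn.Linear(bottleneck_size, hidden_size)    # restores original dimension

    def forward(self, hidden_states):
        # Residual connection: the adapter only learns a small correction
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# Adapters can be stacked after a frozen Transformer block,
# e.g. a "french" adapter followed by a "medical" adapter
french, medical = BottleneckAdapter(), BottleneckAdapter()
x = torch.randn(1, 16, 768)  # (batch, sequence, hidden)
out = medical(french(x))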
Advantages and Use Cases
Adapters shine in multi-task scenarios. A Google Research study (2024) shows that with 12 different adapters, you can specialize a single base model on 12 domains, with only 4% additional memory per domain.
Concrete use case: Mistral AI used Adapters for their multilingual support system. A single model with 20 language-specific adapters, allowing responses in French, German, Spanish without duplicating the entire model.
📊 Comparison Table: Choosing the Right Approach
| Criterion | Full Fine-tuning | LoRA | Adapters | Prompt Tuning |
|---|---|---|---|---|
| Trainable parameters | 100% | 0.1-1% | 0.5-3% | <0.01% |
| GPU memory required | Very high (140+ GB) | Low (20-30 GB) | Medium (30-40 GB) | Very low (15 GB) |
| Training time | Baseline (100%) | 30-40% | 40-60% | 20-30% |
| Performance vs baseline | 100% | 95-98% | 93-97% | 85-92% |
| Multi-task | ❌ Expensive | ⚠️ Possible but limited | ✅ Native | ❌ |
| Implementation ease | Medium | Very easy | Medium | Easy |
| Best for | Unlimited budget | General case | Multi-domain | Quick tests |
💡 Key insight: For 90% of enterprise use cases, LoRA is the best performance/cost compromise.
🔧 Concrete Enterprise Use Cases
FinTech Startup: Transaction Classification
Context: A fintech wants to categorize 100 million transactions per year into 200 business categories.
Chosen approach: LoRA on LLaMA-2 7B
- Data: 50,000 labeled transactions
- Rank: 16, alpha: 32
- Training time: 6 hours on 1× A10G
- Cost: ~$20 on cloud
Results: 94.2% accuracy (vs 91.7% with base model), deployed to production in 2 weeks.
E-commerce Scale-up: Multilingual Customer Support
Context: Automate responses in French, German, and Italian.
Chosen approach: Adapters on Mistral 7B
- 3 adapters (one per language) + 1 “e-commerce” adapter
- Adapter composition for each language
- Total memory: 7.5 GB (base model) + 1.2 GB (adapters)
Results: 87% of tickets automatically resolved, response time divided by 5.
💻 How to Get Started: Practical Guide
Step 1: Assess Your Needs
Before coding, answer these questions:
- How many different tasks? (1 = LoRA, 3+ = Adapters)
- What GPU budget? (<30 GB = LoRA or Prompt Tuning)
- What acceptable inference latency? (an unmerged LoRA adapter adds ~5% latency; merged into the base weights, the overhead disappears)
- Do you need to compose capabilities? (Yes = Adapters)
Step 2: Prepare Your Data
For efficient fine-tuning, aim for a minimum of 1,000 quality examples. Three common formats:
- Instruction-following: {"instruction": "...", "input": "...", "output": "..."}
- Question-answer: {"question": "...", "answer": "..."}
- Completion: {"prompt": "...", "completion": "..."}
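In practice, the training file is usually a JSONL file with one example per line. A fictional record in the instruction-following format looks like this:

{"instruction": "Categorize this bank transaction into one of the business categories.", "input": "CB CARREFOUR PARIS 42.90 EUR", "output": "groceries"}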
⚠️ Common mistake: imbalanced data. If 90% of your examples come from one category, your model will be biased toward it. Apply oversampling or data augmentation.
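A quick way to catch this is to check the label distribution before training. The sketch below uses the Hugging Face datasets library and assumes a hypothetical data.jsonl file whose "output" field holds the category:

from collections import Counter
from datasets import load_dataset, concatenate_datasets

# Hypothetical file: data.jsonl with an "output" field holding the category
ds = load_dataset("json", data_files="data.jsonl", split="train")
print(Counter(ds["output"]).most_common(5))  # does one category dominate?

# Naive oversampling of an under-represented category (illustration only)
minority = ds.filter(lambda ex: ex["output"] == "refund")
balanced = concatenate_datasets([ds, minority, minority]).shuffle(seed=42)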
Step 3: Choose Your Tools
For LoRA:
- Library: peft from Hugging Face (the reference)
- Framework: transformers + accelerate
- No-code interface: Axolotl, OpenLLaMa
For Adapters:
- Library: adapter-transformers
- Framework: compatible with the transformers API
# Minimal installation for LoRA
pip install transformers peft accelerate bitsandbytes
pip install datasets trl # For dataset management and RLHF
Step 4: Recommended Training Configuration
To start with LoRA on a 7B model:
from peft import LoraConfig, get_peft_model

# Conservative configuration (works in 80% of cases)
lora_config = LoraConfig(
    r=16,                                  # Rank - start small
    lora_alpha=32,                         # Scaling = 2 × rank
    target_modules=["q_proj", "v_proj"],   # Attention layers
    lora_dropout=0.05,                     # Regularization
    bias="none",                           # Generally not necessary
    task_type="CAUSAL_LM"                  # For text generation
)

# Apply LoRA to your base model
model = get_peft_model(base_model, lora_config)
print(f"Trainable parameters: {model.num_parameters(only_trainable=True):,}")
# Typical output: ~8.4M parameters (about 0.12% of the total for a 7B model)
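The snippet above assumes a base_model is already loaded. Here is one minimal way to do that and launch training with the standard transformers Trainer; the checkpoint name and tokenized_dataset are placeholders, and the hyperparameters are only a starting point:

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder: use the checkpoint you actually need
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # halves memory versus FP32
    device_map="auto",           # spreads layers across available GPUs
)

# lora_config and get_peft_model come from the previous snippet
model = get_peft_model(base_model, lora_config)

training_args = TrainingArguments(
    output_dir="./lora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,          # LoRA tolerates higher learning rates than full fine-tuning
    num_train_epochs=3,
    logging_steps=10,
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM: labels = inputs
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,  # placeholder: your tokenized dataset from Step 2
    data_collator=collator,
)
trainer.train()
model.save_pretrained("./lora-out")  # saves only the small LoRA weights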
Essential Resources to Go Further
📚 Official documentation:
🛠️ Tools and templates:
- Axolotl: complete framework for fine-tuning
- LitGPT: optimized implementations
- Unsloth: 2-5× acceleration of LoRA training
❓ FAQ: Frequently Asked Questions
What’s the difference between PEFT and LoRA?
PEFT (Parameter-Efficient Fine-Tuning) is a generic term for all techniques that fine-tune efficiently. LoRA is a specific method of PEFT that uses low-rank matrix decomposition. It’s like saying “car” (PEFT) vs “Tesla Model 3” (LoRA).
Can you combine LoRA and Adapters on the same model?
Yes, but it’s rarely useful in practice. Both techniques modify different parts of the model: LoRA adjusts projection matrices, Adapters add layers. Combine them only if you have a very specific use case requiring both flexibility AND maximum efficiency. Generally, choose one or the other.
How much data is needed for good LoRA fine-tuning?
For simple tasks (classification, extraction), 500-1000 examples are often sufficient. For complex generation or technical domains, aim for 5000-10000 examples. Quality trumps quantity: 1000 well-annotated examples beat 10,000 noisy examples. Test first with a small subset to validate the approach.
Does LoRA degrade base model performance?
Not significantly if well configured. Stanford benchmarks (2024) show that LoRA with rank ≥ 16 achieves 95-98% of full fine-tuning performance. The main risk is catastrophic forgetting if alpha is too high: the model forgets its general capabilities. Start with alpha/rank = 2 and adjust progressively.
Can you use LoRA on any model?
Technically yes, but it’s optimized for Transformers. LoRA targets the attention projections (Q, K, V) where the weight matrices are large. On CNNs or RNNs, the benefit is smaller. For LLMs (GPT, LLaMA, Mistral, Falcon) it’s a perfect fit, and for vision Transformers (ViT) it also works very well.
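The module names to target simply differ from one architecture to another. As a rough illustration (module names taken from the Hugging Face implementations of each family):

from peft import LoraConfig

# LLaMA / Mistral: separate Q, K, V, O projections
llama_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
# GPT-2: Q, K and V are fused into a single "c_attn" projection
gpt2_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                         target_modules=["c_attn"])
# ViT-style vision encoders expose "query" / "value" projections
vit_config = LoraConfig(r=16, lora_alpha=32, target_modules=["query", "value"])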
🎯 Conclusion: Your 2025 Roadmap
Three key points to remember from this article:
1. LoRA is the new standard: with 0.1% of trainable parameters, you get 95%+ of full fine-tuning performance. It’s the default choice for 90% of use cases.
2. Adapters for multi-task: if you manage multiple domains or languages simultaneously, their modular architecture is unbeatable. One base model, multiple capabilities.
3. Start small, scale progressively: test first with 1000 examples and rank=8. Measure, iterate, increase only if necessary. Over-engineering kills more AI projects than under-engineering.
The PEFT landscape is evolving rapidly. Techniques like QLoRA (quantization + LoRA) and AdaLoRA (adaptive rank) push the limits even further. But master the fundamentals first before chasing the latest innovation.
👉 To go further: check out our article on RAG vs fine-tuning to learn how to reduce your costs even more.
Was this article useful? Share it with your data team and join our newsletter to receive practical AI techniques in production every week.

