LLMs are advanced neural networks trained on massive datasets to perform a wide range of natural language processing (NLP) tasks. They have billions of parameters, which represent the weights and biases learned during training.
For instance, GPT-3, one of the most well-known LLMs, has 175 billion parameters. These parameters enable the model to understand complex language patterns, but this huge size introduces several challenges:
- Computational requirements: Training and running LLMs demand powerful GPUs and extensive energy consumption.
- Storage limitations: Storing these models, especially in environments like mobile devices, is often impractical.
- Inference latency: Real-time applications struggle with the delays caused by large model sizes.
We can mitigate these problems with quantization, a technique that reduces the precision of the numerical representations (parameter values) within the model.
What Is Quantization in AI/ML?
As discussed above, quantization is a technique for optimizing machine learning models by reducing the precision of their weights and activations. Instead of using 32-bit floating-point (FP32) numbers, quantization maps these values to lower-precision formats such as 8-bit integers (INT8). This reduction significantly decreases the memory and computational requirements of the model.
The main goals of LLM quantization are:
- Reducing model size
- Improving inference speed
- Lowering energy consumption, making it ideal for edge and mobile applications
How Quantization Works
In simple terms, quantization works by mapping high-precision values to a smaller set of lower-precision values.
For example, consider weights in a model with values ranging from −234.1 to 251.51. These might be scaled and rounded to fit within the range of −128 to 127 (the range of INT8). While this transformation introduces some loss of precision, advanced techniques help mitigate its impact on model accuracy.
The mapping follows three steps:
1. Calculate the scale factor.
2. Calculate the zero point.
3. Calculate the quantized values.
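The minimal sketch below applies these steps to the example range above using standard asymmetric (min/max) quantization and prints the resulting example quantized tensor. The specific weight values and variable names are illustrative, not taken from any particular library.

import torch

# Example weights spanning roughly the range mentioned above (values are illustrative).
weights = torch.tensor([-234.1, -50.0, 0.0, 101.7, 251.51])

q_min, q_max = -128, 127                      # INT8 range
r_min, r_max = weights.min().item(), weights.max().item()

# 1. Scale factor: how many real-valued units each integer step represents.
scale = (r_max - r_min) / (q_max - q_min)     # ~1.90 for this range

# 2. Zero point: the integer that represents the real value 0.
zero_point = round(q_min - r_min / scale)     # ~ -5 for this range

# 3. Quantize: divide by the scale, shift by the zero point, round, and clamp to INT8.
quantized = torch.clamp(torch.round(weights / scale) + zero_point, q_min, q_max).to(torch.int8)

# Dequantize to see how much precision was lost.
dequantized = (quantized.float() - zero_point) * scale

print("scale:", scale)
print("zero point:", zero_point)
print("quantized tensor:", quantized)
print("max reconstruction error:", (weights - dequantized).abs().max().item())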
Different Types of Quantization Techniques
Quantization can be applied using two main techniques:
Post-Training Quantization (PTQ)
Post-training quantization involves quantizing a pre-trained model without retraining it, so it requires minimal computational resources.
This technique is easy to implement and does not require access to the original training data, making it highly convenient. However, this simplicity can come at the cost of significant accuracy loss, particularly for models performing complex tasks.
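As a rough sketch of PTQ, PyTorch's dynamic quantization can convert the weights of selected layer types to INT8 after training is complete. The toy model below is only a placeholder; a full LLM example follows later in this article.

import torch
import torch.nn as nn

# A small stand-in for an already-trained model (illustrative only).
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()  # PTQ works on the finished model; no retraining involved

# Replace the Linear layers with versions whose weights are stored as INT8;
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)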
Quantization-Aware Training
Quantization-aware training incorporates quantization into the training process itself, so the model learns to compensate for the reduced precision as it trains.
This minimizes accuracy loss and ensures reliability in applications where precision is critical. However, it requires more computational resources and longer training times than post-training quantization.
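A rough sketch of how QAT looks with PyTorch's eager-mode workflow is shown below. The tiny model, layer sizes, and dummy training loop are placeholders to illustrate the steps, not a recipe for a real LLM.

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy model used only to illustrate the QAT workflow."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()    # fake-quantizes the inputs
        self.fc = nn.Linear(16, 4)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quant observers

# Dummy training loop: the model learns while "seeing" simulated INT8 precision.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(10):
    x = torch.randn(8, 16)
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
quantized_model = torch.quantization.convert(model)  # swap in real INT8 modules
print(quantized_model)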
Quantization: Practical Example
Implementing quantization involves several steps, depending on the technique and framework used.
Install required libraries
!pip install transformers -q
!pip install torch -q
Import necessary modules
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
Load pre-trained model and tokenizer
model_name = "EleutherAI/pythia-410m"
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Estimate memory usage
– Number of parameters: roughly 400 million
– Each parameter is stored in FP32 (4 bytes).
– Total memory usage:
400 × 10⁶ × 4 bytes = 1,600 × 10⁶ bytes = 1,600 MB = 1.6 GB.
Define a function to calculate the size of a model
def calc_model_size(model):
    # Count every parameter registered on the model.
    total_params = sum(p.numel() for p in model.parameters())
    # Bytes per value, assuming FP32 storage (4 bytes).
    dtype_size = torch.tensor(0, dtype=torch.float32).element_size()
    model_size_in_bytes = total_params * dtype_size
    model_size_in_mb = model_size_in_bytes / (1024 ** 2)
    model_size_in_gb = model_size_in_bytes / (1024 ** 3)
    print(f"Total Parameters: {total_params}")
    print(f"Model Size: {model_size_in_mb:.2f} MB ({model_size_in_gb:.2f} GB)")
calc_model_size(model)
OUTPUT:
Total Parameters: 405334016
Model Size: 1546.23 MB (1.51 GB)
Quantize the model
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},    # quantize only the Linear layers
    dtype=torch.float16   # store their weights in FP16 instead of FP32
)
calc_model_size(quantized_model)
OUTPUT:
Total Parameters: 51611648
Model Size: 196.88 MB (0.19 GB)
Note that the reported parameter count drops because quantize_dynamic replaces the Linear layers with quantized modules whose packed weights no longer appear in model.parameters(). The helper therefore counts only the remaining full-precision parameters (mainly the embeddings and layer norms), so treat this figure as a rough indication of the savings rather than an exact size.
Pros and Cons of Quantization
Advantages
- Reduced model size: Lower precision means smaller memory requirements, making models deployable on edge devices.
- Faster inference: Reduced computational demands accelerate real-time processing.
- Energy efficiency: Quantization lowers energy consumption, which is crucial for battery-powered devices.
Disadvantages
- Accuracy loss: Lower precision can lead to reduced performance, especially for precision-sensitive tasks.
- Complexity in implementation: Advanced techniques like quantization-aware training require significant expertise and computational resources.
Conclusion
Quantization is an important technique that enables Large Language Models to operate efficiently in diverse environments. By reducing model size and computational requirements, it addresses many of the challenges of deploying LLMs in the real world.