The Value of Quantization in the LLM World

LLMs are advanced neural networks trained on massive datasets to perform a wide range of natural language processing (NLP) tasks. They have billions of parameters, which represent the weights and biases learned during training.

For instance, GPT-3, one of the most well-known LLMs, has 175 billion parameters. These parameters enable the model to understand complex language patterns, but this huge size introduces challenges such as:

  • Computational requirements: Training and running LLMs demand powerful GPUs and extensive energy consumption.
  • Storage limitations: Storing these models, especially in environments like mobile devices, is often impractical.
  • Inference latency: Real-time applications struggle with the delays caused by large model sizes.

We can overcome these problems with quantization, which reduces the precision of the numerical representations (parameter values) within the model.

What Is Quantization in AI/ML?

As discussed earlier, quantization is a technique used to optimize machine learning models by reducing the precision of their weights and activations. Instead of using 32-bit floating-point (FP32) numbers, quantization maps these values to lower-precision formats such as 8-bit integers (INT8). This reduction significantly decreases the memory and computational requirements of the model.

Here are the main goals of LLM quantization:

  • Reducing model size
  • Improving inference speed
  • Lowering energy consumption, making it ideal for edge and mobile applications

How Quantization Works

In simple terms, quantization works by mapping high-precision values to a smaller set of lower-precision values.

For example, consider weights in a model with values ranging from −234.1 to 251.51. These might be scaled and rounded to fit within the range of −128 to 127 (the range of INT8). While this transformation introduces some loss of precision, advanced techniques help mitigate its impact on model accuracy.

The mapping involves three steps: calculate the scale factor, calculate the zero point, and calculate the quantized values, which together produce the quantized tensor.
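Below is a minimal sketch of these steps using the standard asymmetric (affine) quantization formulas, applied to the example range above; the tensor values between the minimum and maximum are made up purely for illustration.

import torch

# Example weight tensor (illustrative values spanning the stated range)
weights = torch.tensor([-234.1, -45.9, 0.0, 67.3, 251.51])

# Target INT8 range
qmin, qmax = -128, 127

# Step 1: the scale factor maps the float range onto the integer range
scale = (weights.max() - weights.min()) / (qmax - qmin)        # ≈ 1.904

# Step 2: the zero point aligns the float minimum with qmin
zero_point = int(round(qmin - (weights.min() / scale).item())) # ≈ -5

# Step 3: scale, shift, round, and clamp to get the quantized values
quantized = torch.clamp(torch.round(weights / scale) + zero_point, qmin, qmax).to(torch.int8)

# Dequantizing shows the small precision loss introduced by rounding
dequantized = (quantized.float() - zero_point) * scale

print("Scale factor:", scale.item())
print("Zero point:", zero_point)
print("Quantized tensor:", quantized)
print("Dequantized tensor:", dequantized)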

Different Types of Quantization Techniques

Quantization can be applied using two main techniques:

Post-Training Quantization (PTQ)

Post-training quantization involves quantizing a pre-trained model without retraining it. This approach requires minimal computational resources.

This technique is easy to implement and does not require access to the original training data, making it highly convenient. However, this simplicity can come at the cost of significant accuracy loss, particularly for models performing complex tasks.
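As an illustration, the sketch below applies PyTorch's dynamic quantization, one common post-training approach, to a small, hypothetical model; the layer sizes are arbitrary, and no retraining or calibration data is involved.

import torch
import torch.nn as nn

# A small, hypothetical pre-trained model used only for illustration
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Post-training (dynamic) quantization: Linear weights are converted to INT8
# after training, without touching the original training data
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)

# The quantized model is used exactly like the original
with torch.no_grad():
    output = quantized_model(torch.randn(1, 128))
print(output.shape)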

Quantization-Aware Training

Quantization-Aware Training (QAT) incorporates quantization into the training process itself, so the model learns to compensate for the reduced precision.

It minimizes accuracy loss and ensures reliability in applications where precision is critical. However, it requires more computational resources and longer training times than other techniques.
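A minimal sketch of PyTorch's eager-mode QAT workflow on a small, hypothetical model is shown below; the model, dummy data, and training loop are made up for illustration, and applying QAT to an actual LLM involves a far more elaborate setup.

import torch
import torch.nn as nn

# Small, hypothetical model with quant/dequant stubs marking the quantized region
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet()
model.train()

# Attach a QAT configuration and insert fake-quantization observers
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model)

# Short fine-tuning loop on dummy data: the model learns to tolerate
# the rounding errors introduced by simulated quantization
optimizer = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(20):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Convert the fake-quantized model into a true INT8 model
prepared.eval()
quantized = torch.quantization.convert(prepared)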

Quantization: Practical Example

Implementing quantization involves several steps, depending on the technique and framework used.

Install required libraries

!pip install transformers -q
!pip install torch -q

Import necessary modules

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

Load pre-trained model and tokenizer

model_name = "EleutherAI/pythia-410m"

model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Estimate memory usage

– Number of parameters: 400 million
– Each parameter is stored in FP32 (4 bytes).
– Total memory usage:
   400 × 10⁶ × 4 bytes = 1600 × 10⁶ bytes = 1600 MB = 1.6 GB

Define a function to calculate the size of a model

def calc_model_size(model):
    # Count every parameter registered on the model
    total_params = sum(p.numel() for p in model.parameters())

    # Assume each parameter is stored as FP32 (4 bytes per value)
    dtype_size = torch.tensor(0, dtype=torch.float32).element_size()
    model_size_in_bytes = total_params * dtype_size

    model_size_in_mb = model_size_in_bytes / (1024 ** 2)
    model_size_in_gb = model_size_in_bytes / (1024 ** 3)

    print(f"Total Parameters: {total_params}")
    print(f"Model Size: {model_size_in_mb:.2f} MB ({model_size_in_gb:.2f} GB)")

calc_model_size(model)

OUTPUT:
Total Parameters: 405334016
Model Size: 1546.23 MB (1.51 GB)

Quantize the model

# Dynamic quantization: only nn.Linear modules are replaced, with their
# weights stored in FP16
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.float16
)

# Note: the replaced Linear modules keep their weights in packed form rather
# than as regular parameters, so calc_model_size() now only counts the
# remaining FP32 parameters (embeddings and layer norms)
calc_model_size(quantized_model)

OUTPUT:
Total Parameters: 51611648
Model Size: 196.88 MB (0.19 GB)

Pros and Cons of Quantization

Advantages

  • Reduced model size: Lower precision means smaller memory requirements, making models deployable on edge devices.
  • Faster inference: Reduced computational demands accelerate real-time processing.
  • Energy efficiency: Quantization lowers energy consumption, which is crucial for battery-powered devices.

Disadvantages

  • Accuracy loss: Lower precision can lead to reduced performance, especially for sensitive tasks.
  • Complexity in implementation: Advanced techniques like Quantization-Aware Training require significant expertise and computational resources.

Conclusion

Quantization is an important technique that enables Large Language Models to operate efficiently in diverse environments. By reducing model size and computational requirements, it helps overcome the challenges associated with deploying LLMs in the real world.
