The Yandex Research team, in partnership with researchers from IST Austria, Neural Magic, and KAUST, has developed two cutting-edge techniques for compressing large language models: Additive Quantization for Language Models (AQLM) and PV-Tuning. Combined, these methods reduce model size by up to eight times while preserving 95% of response quality. The techniques are designed to optimize resources and improve efficiency when running large language models. The research article detailing this approach has been presented at the International Conference on Machine Learning (ICML), currently taking place in Vienna, Austria.
Key features of AQLM and PV-Tuning
For LLM compression, AQLM makes use of additive quantization, a technique traditionally employed for information retrieval. The proposed method allows LLMs to be deployed on everyday devices such as home PCs while maintaining, and even improving, model accuracy under extreme compression. As a result, memory use is significantly reduced.
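To make the idea concrete, the short sketch below illustrates the general principle of additive quantization: a group of weights is approximated by the sum of a few codewords drawn from small codebooks, so only integer code indices and the shared codebooks need to be stored. This is a conceptual illustration under assumed sizes (group length, number of codebooks, codebook size), not the authors' AQLM implementation, which learns codebooks and searches for codes far more carefully.

```python
import numpy as np

# Conceptual sketch of additive quantization (not the AQLM implementation).
# A group of d weights is approximated by the sum of M codewords, one taken
# from each of M codebooks with K entries. Only the M integer codes
# (M * log2(K) bits per group) and the shared codebooks are stored.

rng = np.random.default_rng(0)
d, M, K = 8, 2, 256  # group size, number of codebooks, codebook size (assumed values)

codebooks = rng.normal(size=(M, K, d)).astype(np.float32)  # learned offline in practice
weights = rng.normal(size=d).astype(np.float32)            # one group of model weights

# Greedy encoding: for each codebook in turn, pick the codeword that best
# reduces the remaining residual. Real methods use stronger search and training.
residual = weights.copy()
codes = []
for m in range(M):
    errors = ((residual[None, :] - codebooks[m]) ** 2).sum(axis=1)
    best = int(errors.argmin())
    codes.append(best)
    residual -= codebooks[m][best]

reconstruction = sum(codebooks[m][codes[m]] for m in range(M))
print("codes:", codes)
print("mean squared error:", float(((weights - reconstruction) ** 2).mean()))
```

With the sizes assumed above, each group of eight weights is described by two one-byte codes, roughly 2 bits per weight, which is where an approximately eight-fold reduction from 16-bit weights comes from (ignoring the small shared codebook overhead).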
PV-Tuning addresses the errors that can arise during model compression. Combining AQLM with PV-Tuning yields the best results: compact models that produce high-quality responses even on limited computing resources.
Method evaluation and recognition
The methods' efficacy was rigorously evaluated on popular open-source models such as Llama 2, Llama 3, and Mistral. Even after compressing these large language models eight-fold, the researchers retained 95% of answer quality on the English-language benchmarks WikiText2 and C4.
Who can benefit from AQLM and PV-Tuning
The new methods offer significant resource savings to companies developing and deploying open-source LLMs and proprietary language models. For example, after compression, the Llama 2 model with 13 billion parameters can run on a single GPU instead of four, cutting hardware costs by up to eight times. This means that entrepreneurs, independent researchers, and LLM enthusiasts can run advanced models such as Llama on regular PCs.
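A rough back-of-the-envelope calculation shows why the GPU requirement drops. The sketch below counts only the weights and ignores activations, the KV cache, and quantization metadata, so it is an estimate rather than a measured footprint.

```python
# Back-of-the-envelope memory estimate: weights only, ignoring activations,
# the KV cache, and quantization metadata such as codebooks.
params = 13e9                       # Llama 2 13B parameter count
fp16_gib = params * 2 / 1024**3     # 2 bytes per weight in FP16
compressed_gib = fp16_gib / 8       # ~8x compression, i.e. roughly 2 bits per weight

print(f"FP16 weights:   {fp16_gib:.1f} GiB")        # ~24.2 GiB
print(f"8x compressed:  {compressed_gib:.1f} GiB")  # ~3.0 GiB
```

Under these assumptions, the compressed weights fit comfortably on a single consumer GPU, while the uncompressed FP16 weights alone already exceed a typical 24 GB card.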
Exploring new LLM applications
AQLM and PV-Tuning make it practical to deploy models offline on devices with limited computing resources, opening up new use cases for smartphones, smart speakers, and other devices. With powerful LLMs embedded in them, these devices can offer text and image generation, voice assistance, personalized recommendations, and even real-time language translation, all without an active internet connection.
Furthermore, because models compressed with these methods require fewer computations, they can run up to four times faster.
Implementation and access
AQLM and PV-Tuning are already available on GitHub to developers and researchers worldwide. The authors' demo materials include instructions for training compressed LLMs for a variety of applications. Developers can also download popular open-source models that have already been compressed with these techniques.
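For readers who want to try a pre-compressed checkpoint, the sketch below shows the standard Hugging Face transformers loading pattern. The repository name is a placeholder rather than a real checkpoint, and AQLM-quantized weights additionally require the aqlm inference package; consult the authors' GitHub materials for the actual model links and up-to-date installation instructions.

```python
# Minimal loading sketch with Hugging Face transformers. The repository name is
# a placeholder: substitute one of the compressed checkpoints linked from the
# authors' GitHub repository. AQLM checkpoints also need: pip install aqlm[gpu]
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-compressed-llm"  # placeholder, not a real repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place the (now much smaller) weights on the available GPU
    torch_dtype="auto",
)

prompt = "Compressed language models make it possible to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```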
ICML highlight
A scientific article by Yandex Research on the AQLM compression method has been featured at ICML, one of the world's most prestigious machine learning conferences. Co-authored with researchers from IST Austria and experts from AI startup Neural Magic, the work marks a significant advance in LLM compression technology.
Read the full research article.