
Yandex Research Team's New Techniques For Compressing Huge Language Models

The Yandex Research team, in partnership with researchers from IST Austria, Neural Magic, and KAUST, has created two cutting-edge techniques for compressing huge language models: PV-Tuning and Additive Quantization for Language Models (AQLM). When combined, these methods provide up to an 8-fold decrease in model size while retaining 95% of response quality. The techniques are meant to optimize resources and improve efficiency when running large language models. The research article detailing the approach has been presented at the International Conference on Machine Learning (ICML), currently taking place in Vienna, Austria.

Key features of AQLM and PV-Tuning

For LLM compression, AQLM takes advantage of additive quantization, a technique previously employed for information retrieval. The proposed method allows LLMs to be deployed on everyday devices such as home PCs while maintaining, and even improving, model accuracy under extreme compression. As a result, memory use is significantly decreased.
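
To make the idea concrete, here is a minimal NumPy sketch of additive (residual) quantization, the building block that AQLM extends: each small group of weights is approximated by a sum of vectors drawn from several learned codebooks, so only short integer codes plus the shared codebooks need to be stored. The group size, codebook sizes, and the plain k-means fitting below are illustrative assumptions, not the AQLM algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "weight matrix" of one linear layer (illustrative size, fp32 baseline).
W = rng.standard_normal((256, 64)).astype(np.float32)

GROUP = 8        # weights quantized jointly as one group
NUM_BOOKS = 2    # number of additive codebooks
BOOK_SIZE = 256  # code vectors per codebook (one byte per index)

groups = W.reshape(-1, GROUP)  # (num_groups, GROUP)

def kmeans(x, k, iters=10):
    """Plain k-means, used here as a stand-in for learned codebooks."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = x[mask].mean(0)
    return centers

# Additive (residual) quantization: each codebook encodes what the previous
# ones missed, and the selected code vectors are summed at decode time.
residual = groups.copy()
codebooks, codes = [], []
for _ in range(NUM_BOOKS):
    C = kmeans(residual, BOOK_SIZE)
    idx = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    codebooks.append(C)
    codes.append(idx.astype(np.uint8))
    residual = residual - C[idx]

# Decode: sum the chosen code vectors from every codebook.
W_hat = sum(C[idx] for C, idx in zip(codebooks, codes)).reshape(W.shape)

orig_bits = W.size * 32
quant_bits = sum(c.size * 8 for c in codes) + sum(C.size * 32 for C in codebooks)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
# Codebook overhead dominates at this toy scale; in a real LLM layer it is
# amortized over far more weight groups.
print("compression ratio:", orig_bits / quant_bits)
```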

Errors that can occur during the model compression process are addressed by PV-Tuning. Combining AQLM with PV-Tuning yields the best results: compact models that produce high-quality responses even with limited computing resources.

Method evaluation and recognition

The methods' efficacy was thoroughly evaluated on well-known open-source models such as Llama 2, Llama 3, and Mistral. Even after compressing these large language models eight-fold, the researchers were able to retain 95% of answer quality relative to the original models, as measured on the English-language benchmarks WikiText2 and C4.
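
As an illustration of how such benchmark scores are typically measured, the sketch below computes perplexity on the WikiText-2 test set using the Hugging Face transformers and datasets libraries. The small gpt2 checkpoint is only a placeholder; running the same measurement on an original and a compressed model gives the kind of relative quality comparison cited above.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in an original or compressed LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the WikiText-2 test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

max_len, nlls, n_tokens = 1024, [], 0
with torch.no_grad():
    for i in range(0, ids.size(1), max_len):
        chunk = ids[:, i : i + max_len]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # transformers shifts labels internally
        ntok = chunk.size(1) - 1
        nlls.append(out.loss * ntok)      # loss is the mean NLL per predicted token
        n_tokens += ntok

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"perplexity: {ppl.item():.2f}")  # lower is better
```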

Who can benefit from AQLM and PV-Tuning

The new methods provide significant resource savings for companies that develop and deploy open-source LLMs and proprietary language models. For example, after compression the Llama 2 model with 13 billion parameters can run on a single GPU instead of four, resulting in up to an eight-fold reduction in hardware expenses. It also means that entrepreneurs, individual researchers, and LLM enthusiasts can run sophisticated LLMs like Llama on regular PCs.
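
A rough back-of-the-envelope calculation shows why a single GPU becomes sufficient. Assuming a 16-bit baseline and roughly 2 bits per weight after eight-fold compression, and ignoring activations, the KV cache, and quantization overhead such as codebooks, the weights of a 13-billion-parameter model shrink from about 26 GB to roughly 3 GB:

```python
# Back-of-the-envelope weight-memory estimate (assumptions: 16-bit baseline,
# ~2 bits per weight after 8x compression; ignores activations, KV cache,
# and quantization overhead such as codebooks).
params = 13e9

fp16_gb = params * 16 / 8 / 1e9        # ~26 GB of weights alone
compressed_gb = params * 2 / 8 / 1e9   # ~3.25 GB at ~2 bits per weight

print(f"16-bit weights:     {fp16_gb:.1f} GB")
print(f"compressed weights: {compressed_gb:.1f} GB")
```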

Exploring new LLM applications

AQLM and PV-Tuning make it practical to deploy models offline on devices with constrained computing resources, opening up new use cases for smartphones, smart speakers, and other devices. With powerful LLMs embedded in them, these devices can offer personalized recommendations, text and image generation, voice assistance, and even real-time language translation, all without requiring a live internet connection.

Furthermore, because models compressed with these approaches require fewer computations, they can run up to four times faster.

Implementation and access

AQLM and PV-Tuning are already available on GitHub to developers and researchers worldwide. The authors' demo materials include instructions for training compressed LLMs for a variety of applications. Developers can also download well-known open-source models that have already been compressed with these techniques.
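
As a usage illustration, the snippet below loads an AQLM-compressed checkpoint through the Hugging Face transformers integration (assuming `pip install aqlm[gpu] transformers accelerate` and a CUDA GPU). The model id is only an example and should be verified against the checkpoints published by the authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint id (assumption) - check the authors' repository / Hugging
# Face Hub for the actual published AQLM-compressed models.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # the compressed model now fits on a single GPU
)

inputs = tok("Additive quantization compresses LLMs by", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```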

ICML highlight


A scientific article by Yandex Research on the AQLM compression method has been featured at ICML, one of the world's most prestigious machine learning conferences. Co-authored with researchers from IST Austria and experts from AI startup Neural Magic, the work marks a significant advancement in LLM compression technology.

Read the full research article.
