A popular technique for making AI models more efficient has its drawbacks.
Quantization, one of the most widely used techniques for making AI models more efficient, has its limits, and the industry may be approaching them fast.
In the context of AI, quantization means reducing the number of bits needed to represent information, much like answering "noon" when someone asks for the time instead of spelling out the hour, minute, and second. The simpler answer is still correct, it just carries less detail, and how much precision is actually needed depends on the context.
AI models are made up of several components that can be quantized, most notably the parameters, the internal variables a model uses to make predictions or decisions. This is practical because models perform millions of calculations every time they run: quantized models, whose parameters are represented with fewer bits, are less mathematically demanding and therefore need fewer computational resources. (To be clear, this is a different process from "distillation," which is a more involved, selective pruning of parameters.)
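As a rough illustration of what quantizing parameters means in practice, here is a minimal sketch that maps 32-bit floating-point weights onto 8-bit integers with a single scale factor and then reconstructs approximate values. The weight values and the symmetric scheme are assumptions made for the example, not how any particular framework or the models discussed here implement it.

```python
import numpy as np

# A handful of 32-bit floating-point "parameters" (illustrative values only).
weights_fp32 = np.array([0.8132, -1.2077, 0.0453, 2.1189, -0.3306], dtype=np.float32)

def quantize_int8(w):
    """Map float weights onto 8-bit integers using a single scale factor."""
    scale = np.abs(w).max() / 127.0                              # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # round to the nearest level
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float values from the 8-bit representation."""
    return q.astype(np.float32) * scale

q, scale = quantize_int8(weights_fp32)
approx = dequantize(q, scale)

print("original:  ", weights_fp32)
print("int8 codes:", q)
print("recovered: ", approx)
print("max error: ", np.abs(weights_fp32 - approx).max())
```

Each parameter now occupies one byte instead of four, at the cost of a small rounding error per value.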
However, quantization may come with more trade-offs than previously thought. A study by researchers at several prestigious universities indicates that quantized models perform worse if the original, unquantized version was trained for a long time on large volumes of data. In other words, past a certain point it may be better to simply train a smaller model than to shrink a large one. That is bad news for AI companies that train extremely large models, in the hope of improving answer quality, and then quantize them to reduce serving costs.
These effects are already being observed. It was recently reported that quantizing Meta's Llama 3 model tends to be "more detrimental" than quantizing other models, possibly because of how it was trained. Tanishq Kumar, a Harvard mathematics student and the study's lead author, said that the number one cost in AI is and will remain inference, and that his team's work shows that one important way of reducing it may not be sustainable in the long run.
Contrary to what one might assume, the inference cost of an AI model, incurred each time it answers a question, as when ChatGPT responds to a user, is often higher in aggregate than the cost of training it. By one estimate, Google spent around 191 million dollars to train one of its flagship models, yet using that model to generate 50-word answers to just half of all Google Search queries would cost the company roughly 6 billion dollars a year.
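To see how a one-off training bill gets dwarfed by recurring inference spending, here is a hedged back-of-the-envelope calculation. Every input below, the query volume, answer length in tokens, and per-token serving price, is an assumption chosen purely to illustrate the arithmetic at roughly that scale; none of the figures come from Google or from the study.

```python
# Back-of-the-envelope inference cost estimate. All inputs are illustrative assumptions.
queries_per_day = 9_000_000_000     # assumed daily search volume
share_answered = 0.5                # answer half of the queries with the model
tokens_per_answer = 65              # a ~50-word answer is roughly 65 tokens
cost_per_1k_tokens = 0.06           # assumed serving cost in dollars per 1,000 tokens

tokens_per_day = queries_per_day * share_answered * tokens_per_answer
cost_per_day = tokens_per_day / 1_000 * cost_per_1k_tokens
cost_per_year = cost_per_day * 365

print(f"tokens generated per day: {tokens_per_day:,.0f}")
print(f"inference cost per day:   ${cost_per_day:,.0f}")
print(f"inference cost per year:  ${cost_per_year:,.0f}")  # lands in the billions
```

The point is not the exact numbers but the shape of the calculation: a per-token cost multiplied by an enormous, recurring query volume quickly outgrows any one-time training expense.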
AI labs have adopted the strategy of training models on massive datasets under the belief that "scaling" the data and computational power used in training will translate into more competent AI. For instance, Meta trained Llama 3 with a dataset of 15 trillion tokens, compared to the 2 trillion used to train the previous generation, Llama 2. However, there is evidence suggesting that scaling may eventually yield diminishing returns, as seen in recent models trained by Anthropic and Google that did not meet internal expectations.
Given that labs seem unwilling to train models on smaller datasets, the question is whether there are ways to make models less susceptible to degradation. Kumar suggests that training models in "low precision" could make them more robust. Here, "precision" refers to the number of digits a numerical data type can represent accurately. Most current models are trained at 16-bit precision and then quantized to 8 bits; the conversion loses precision in some of the model's components (such as its parameters), which can be thought of as rounding certain calculations.
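As a small sketch of what losing precision looks like, using NumPy's built-in floating-point types rather than any model-specific format, the same value keeps fewer and fewer significant digits as the bit width shrinks:

```python
import numpy as np

x = 0.123456789  # an arbitrary value a model parameter might take

print(np.float64(x))  # 64 bits: ~15-16 significant decimal digits survive
print(np.float32(x))  # 32 bits: ~7 significant decimal digits survive
print(np.float16(x))  # 16 bits: only ~3-4 significant decimal digits survive
```

The same idea carries over to 8- and 4-bit formats: fewer bits means fewer distinguishable values, so every stored parameter is effectively rounded.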
Hardware makers such as Nvidia are pushing for lower precision in the inference of quantized models, and the company's new Blackwell chip supports 4-bit precision. Kumar warns, however, that unless the original model has a remarkably large number of parameters, precisions lower than 7 or 8 bits can lead to a noticeable drop in quality.
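To build intuition for why dropping below 8 bits hurts, the toy sketch below (not a reproduction of the study's methodology) round-trips the same set of random stand-in weights through a simple uniform quantizer at several bit widths and measures the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=100_000).astype(np.float32)  # stand-in for model parameters

def roundtrip_error(w, bits):
    """Mean squared error after symmetric uniform quantization at the given bit width."""
    levels = 2 ** (bits - 1) - 1               # 127 levels for 8 bits, 7 for 4 bits, etc.
    scale = np.abs(w).max() / levels
    recovered = np.clip(np.round(w / scale), -levels, levels) * scale
    return np.mean((w - recovered) ** 2)

for bits in (16, 8, 6, 4):
    print(f"{bits:>2}-bit round trip, MSE: {roundtrip_error(weights, bits):.2e}")
```

The rounding error grows steeply as the bit width falls, which matches the intuition behind Kumar's warning, though the study itself measures model quality after training rather than raw rounding error.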
Despite the study's technical depth, the underlying message is that AI models are not fully understood, and shortcuts that work in other kinds of computation do not necessarily work here. Kumar stresses that there are limits to the optimizations that can be made, and that his work is meant to add nuance to discussions about training and running models at ever-lower precision. The study is relatively small in scope, and the team plans to run more tests, but Kumar expects one lesson to hold: when it comes to reducing inference costs, there is no free lunch.