Distillation can make AI models smaller and cheaper
The original version of this story appeared in Quanta Magazine.
The Chinese company DeepSeek released a chatbot called R1 earlier this year, and it drew a great deal of attention. Most of it focused on the fact that a relatively small, little-known company said it had built a chatbot that rivaled the performance of those from the world's most famous AI companies, while using a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.
Some of that attention involved an element of accusation. Sources alleged that DeepSeek had obtained, without authorization, knowledge from OpenAI's proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.
But distillation, also called knowledge distillation, is a widely used AI tool, a subject of computer science research going back a decade and one that big tech companies apply to their own models. "Distillation is one of the most important tools that companies have today to make models more efficient," said Enric Boix-Adserà.
Dark knowledge
The idea for distillation began with a 2015 paper by three researchers at Google, including Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel laureate. At the time, researchers often ran ensembles of models, "many models glued together," said Oriol Vinyals, a principal scientist at Google DeepMind. "But it was incredibly cumbersome and expensive to run all the models in parallel," Vinyals said. "We were fascinated by the idea of distilling that onto a single model."
The researchers thought they might make progress by addressing a notable weakness in machine-learning algorithms: wrong answers were all treated as equally bad, no matter how wrong they were. In an image-classification model, for instance, "confusing a dog with a fox was penalized the same way as confusing a dog with a pizza." The researchers suspected that ensemble models did contain information about which wrong answers were less bad than others. Perhaps a smaller "student" model could use the information from the big "teacher" model to more quickly grasp the categories it was supposed to sort images into. Hinton called this "dark knowledge," invoking an analogy with the dark matter of cosmology.
After discussing the possibility with Hinton, Vinyals developed a way to get the large teacher model to pass more information about the image categories to a smaller student model. The key was to home in on the "soft targets" in the teacher model, where it assigns a probability to each possibility rather than committing to a single answer. One model, for example, calculated that there was a 30 percent chance that an image showed a dog, 20 percent that it showed a cat, 5 percent that it showed a cow, and 0.5 percent that it showed a car. By using these probabilities, the teacher model effectively revealed to the student that dogs are quite similar to cats, not so different from cows, and quite distinct from cars. The researchers found that this information helped the student learn to identify images of dogs, cats, cows, and cars more efficiently. A big, complicated model could be reduced to a leaner one with hardly any loss of accuracy.
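To make the soft-target idea concrete, here is a minimal sketch of that style of distillation in PyTorch. The network sizes, temperature, and loss weighting are illustrative assumptions rather than details from the original paper: the student is trained to match the teacher's temperature-softened probability distribution as well as the true labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup: 4 classes (dog, cat, cow, car), toy teacher and student networks.
NUM_CLASSES, FEATURES = 4, 64
teacher = nn.Sequential(nn.Linear(FEATURES, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))
student = nn.Sequential(nn.Linear(FEATURES, 32), nn.ReLU(), nn.Linear(32, NUM_CLASSES))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) KL divergence between the temperature-softened teacher and
    student distributions (the 'soft targets') and (b) ordinary cross-entropy
    on the hard labels. T and alpha are tunable, assumed values here."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# One illustrative training step on random data.
x = torch.randn(8, FEATURES)
labels = torch.randint(0, NUM_CLASSES, (8,))
with torch.no_grad():
    teacher_logits = teacher(x)  # the teacher is frozen; only its outputs are used
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
```

In practice the teacher would be a large pretrained model and the student a much smaller one; the temperature controls how much of the teacher's "dark knowledge" about near-miss classes is exposed to the student.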
Explosive growth
The idea was not an immediate hit. The paper was rejected from a conference, and Vinyals, discouraged, turned to other topics. But distillation arrived at an important moment. Around this time, engineers were discovering that the more training data they fed into neural networks, the more effective those networks became. The size of models soon exploded, and so did their capabilities, but the costs of running them climbed in step with their size.
Many researchers turned to distillation as a way to build smaller models. In 2018, for instance, Google researchers unveiled a powerful language model called BERT, which the company soon began using to help parse billions of web searches. But BERT was big and costly to run, so the following year other developers distilled a smaller version called DistilBERT, which became widely used in business and research. Distillation gradually became ubiquitous, and it is now offered as a service by companies such as Google, OpenAI, and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.
Since distillation requires access to the innards of the teacher model, it is not possible for a third party to sneakily distill data from a closed-source model such as OpenAI's o1, as DeepSeek was thought to have done. That said, a student model can still learn quite a bit from a teacher model simply by prompting the teacher with certain questions and using the answers to train its own models, an almost Socratic approach to distillation.
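A rough sketch of that query-based approach is below. The query_teacher() helper is a hypothetical stand-in for a call to any chat API, and the prompts and file format are invented for illustration; this is the general pattern of collecting a teacher's answers as training data for a student, not DeepSeek's actual pipeline.

```python
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a closed-source teacher model's chat API.
    Only the teacher's text answers are visible from outside; its weights and internal
    probabilities are not. Returns a placeholder so the sketch runs end to end."""
    return f"[teacher's answer to: {prompt}]"

# Prompts the student should learn to answer (illustrative examples).
prompts = [
    "Explain why the sky is blue in two sentences.",
    "What is 17 * 24? Show your reasoning step by step.",
]

# Collect (prompt, teacher answer) pairs as supervised fine-tuning data.
dataset = []
for prompt in prompts:
    answer = query_teacher(prompt)
    dataset.append({"prompt": prompt, "completion": answer})

# Save in a generic JSONL format that fine-tuning tools commonly accept.
with open("distilled_pairs.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")

# A smaller student model would then be fine-tuned on distilled_pairs.jsonl.
```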
Meanwhile, other researchers continue to find new applications. In January, the NovaSky lab at UC Berkeley showed that distillation works well for training chain-of-thought reasoning models, which use multistep "thinking" to better answer complicated questions. The lab says its Sky-T1 model cost less than $450 to train and achieved results comparable to a much larger open-source model. "We were really amazed at how well distillation worked in this setting," said Dacheng Li, a doctoral student and co-lead of the NovaSky team. "Distillation is a fundamental technique in AI."
The original story was reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation, whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.