Meta on Tuesday unveiled the latest incarnation of Llama, its family of large language models (LLMs). The company says Llama 3.1 is the first open-source “frontier model,” a term generally reserved for the largest and most capable AI models.
Llama 3.1 comes in multiple sizes, and the largest, “405B,” is noteworthy not only for the scale of computing it involves: with 405 billion neural “weights,” or parameters, it is larger than prominent open-source models such as Nvidia’s Nemotron 4, Google’s Gemma 2, and Mistral’s Mixtral. It is also significant for three choices that the Meta team made.
Also: Meta inches toward open-source AI with new Llama 3.1
Taken together, the three decisions are a tour de force of neural network engineering and are at the heart of how the company built and trained Llama 3.1 405B. They complement advances Meta showed with Llama 2 that suggested ways to slim down deep learning’s total compute budget.
(An “AI model” is the part of an AI program that contains the neural-net parameters and activation functions, the key elements that determine how the program behaves.)
First, Llama 3.1 405B dispenses with what’s called a “mixture of experts,” the approach Google uses for its newest closed-source model, Gemini 1.5, and that Mistral uses for its Mixtral models.
A mixture of experts splits the neural weights into alternate groups, or “experts,” and routes each input to only some of them, so that a subset of the weights is used to make any given prediction. Meta’s researchers instead “opted for a standard decoder-only transformer model architecture,” the near-ubiquitous building block introduced by Google researchers in 2017. The researchers claim this choice makes the model more stable during training.
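To make the contrast concrete, here is a minimal sketch, in PyTorch, of a dense feed-forward block of the kind used in a standard decoder-only transformer alongside a simple top-k routed mixture-of-experts block. It is not Meta’s or Mistral’s code; the class names, sizes, and routing scheme are illustrative.

```python
# A minimal sketch (not Meta's or Mistral's code) contrasting a dense
# feed-forward block, as in a standard decoder-only transformer, with a
# mixture-of-experts block that routes each token to a few expert networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Dense feed-forward block: every weight is used for every token."""
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MoEFFN(nn.Module):
    """Mixture-of-experts block: a router picks the top-k experts per token,
    so only a subset of the weights is active for any given prediction."""
    def __init__(self, d_model: int = 64, d_hidden: int = 256,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x)                            # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)  # ten token embeddings of width 64
print(DenseFFN()(tokens).shape, MoEFFN()(tokens).shape)
```

In the dense case every parameter participates in every prediction; in the routed case the gating network’s discrete choices reduce the compute per token but add moving parts during training, which is the stability trade-off the Meta researchers cite.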
Also: Switzerland now requires all government software to be open source
Second, to improve the results of the plain-vanilla transformer-based model, Meta’s researchers describe an ingenious approach to training the model in stages. It’s well known that both the amount of training data and the amount of compute used can be balanced in an optimal way to produce better predictions.
As described in the formal paper for Llama 3.1, the researchers revisited existing “scaling laws,” which predict how well a model will do at its training objective of next-word prediction, given the size of the model and the amount of training data. That approach doesn’t really tell how good a model is at carrying out a “downstream” task, such as a standardized test of reasoning.
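The best-known published law of this kind, the “Chinchilla” scaling law from DeepMind (Hoffmann et al., 2022), gives a feel for the form. The sketch below uses that paper’s fitted constants; it is offered as background, not as anything from Meta’s paper.

```python
# The "Chinchilla" scaling law (Hoffmann et al., 2022), shown for background:
# predicted pretraining loss falls as a power law in parameter count N and
# training tokens D. The constants are that paper's published fit, not Meta's.
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted cross-entropy loss: E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Training compute in FLOPs is commonly approximated as C ≈ 6 * N * D, so a
# fixed budget C forces a trade-off between model size and training tokens.
print(chinchilla_loss(405e9, 15.6e12))  # roughly Llama 3.1 405B's reported scale
```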
Instead, Meta came up with its own scaling law. The company progressively increased both the amount of training data and the amount of compute, checking over multiple iterations to see how well the resulting trained model does on the downstream tasks.
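The paper describes this as a two-step extrapolation: relate training compute to how well the model scores the correct answers on a benchmark, then relate that signal to accuracy. Below is a minimal sketch of that shape with invented numbers; the run sizes, measurements, and sigmoid mapping are all assumptions for illustration, not Meta’s data or code.

```python
# A sketch of the two-step idea with invented numbers (not Meta's data):
# 1) from small runs, fit how a downstream signal improves with training
#    compute; 2) map that signal to benchmark accuracy and extrapolate.
import numpy as np

# Hypothetical small-scale runs: training FLOPs and the measured negative
# log-likelihood the resulting model assigns to correct benchmark answers.
compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
downstream_nll = np.array([1.10, 0.95, 0.82, 0.71, 0.62])

# Step 1: power-law fit, NLL ≈ exp(intercept) * compute**slope.
slope, intercept = np.polyfit(np.log(compute), np.log(downstream_nll), 1)

def predict_nll(flops: float) -> float:
    return float(np.exp(intercept + slope * np.log(flops)))

# Step 2: map predicted NLL to accuracy with a sigmoid whose constants are
# simply assumed here (in practice it would be fit to earlier models' scores).
def nll_to_accuracy(nll: float, midpoint: float = 0.55, steepness: float = 8.0) -> float:
    return 1.0 / (1.0 + np.exp(steepness * (nll - midpoint)))

big_run = 3.8e25  # roughly the training compute reported for Llama 3.1 405B
print("predicted NLL at frontier scale:", predict_nll(big_run))
print("predicted benchmark accuracy:", nll_to_accuracy(predict_nll(big_run)))
```

The payoff of such a fit is that a handful of cheaper runs can be used to forecast how a far larger model will score on the tasks that matter before committing the full training budget.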