
I broke Meta’s Llama 3.1 405B with one question (which GPT-4o mini gets right)

[Image credit: Muhammed Abdullah Kurtar/Anadolu via Getty Images]

Meta last week unveiled its largest large language model to date, Llama 3.1 405B, which the company claims is the first open-source “frontier model,” meaning a model that can compete with the best that closed source has to offer, such as OpenAI’s GPT-4 and Google’s Gemini 1.5.

It turns out that Llama 3.1 can be broken just as easily as those models, or even more easily. Much as I broke Gemini 1.5 with a language-translation query when it first became available, I was able to make Llama 3.1 resort to gibberish with my very first question.

Also: Beware of AI ‘model collapse’: How training on synthetic data pollutes the next generation

The Google Gemini fail is such a beautiful example of a simple question that it has become my go-to first question for testing large language models. Sure enough, I was able to use it to break Meta’s Llama 3.1 405B on the first try.

It’s a corner case, you could say: a question about the Georgian verb “ყოფნა,” meaning “to be.” Except that the country of Georgia, situated in the Caucasus region between the Black Sea and the Caspian Sea, is home to almost four million speakers of the Georgian language.

Messing up the conjugation of the most important verb for a language spoken by four million people seems a bit more than a corner case.

In any event, I submitted my query to Llama 3.1 405B in the following form:

What is the conjugation of the Georgian verb ყოფნა? 

Also: I caused Google’s Gemini 1.5 Pro to fail with my first prompt

I submitted the question both on Meta’s Meta AI site, where you can use Llama 3.1 405B for free, and on HuggingFace’s HuggingChat, where you can create chatbots from any open-source AI model with a public code repository.

I also tried the query on a third-party, commercially hosted chatbot, Groq. In all cases the response was gibberish.
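(If you'd rather reproduce the test programmatically than through a chat interface, a minimal sketch along the following lines should work, assuming the 405B Instruct model is being served through Hugging Face's serverless Inference API and that you have an access token; the model ID below is the repository name Meta published, and the token is a placeholder.)

```python
from huggingface_hub import InferenceClient

# Minimal sketch: send the same one-question test to the hosted model.
# Assumes the 405B Instruct repo is served via the Inference API and
# that "hf_..." is replaced with your own access token.
client = InferenceClient(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    token="hf_...",  # placeholder; use your own token
)

response = client.chat_completion(
    messages=[{"role": "user",
               "content": "What is the conjugation of the Georgian verb ყოფნა?"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```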

First, here’s the correct answer, from OpenAI’s GPT-4o mini:

(Most of the other LLMs and chatbots, including Google’s Gemini, now answer this question correctly.)

[Image: GPT-4o mini correctly conjugates ყოფნა. Source: OpenAI]

At first, the Meta AI site protested, offering a message that ყოფნა was too complicated. After I insisted, it came up with a ridiculous made-up set of words. Here’s Llama 3.1 405B’s answer:

[Image: Llama 3.1 405B’s made-up conjugation of ყოფნა. Source: Meta AI]

As you’ll notice in comparison with the correct answer above (ვარ, ხარ, არის, and so on), the Llama 3.1 answers aren’t even close.

The HuggingFace and Groq versions didn’t even protest; they directly offered up the same kind of ridiculous answer. HuggingFace’s response contained a different set of gibberish words from the ones offered by the Meta AI site:

[Image: HuggingChat’s response, with its own set of gibberish conjugations. Source: HuggingChat]

The utter failure of Llama 3.1 on a foreign-language question is particularly galling given that Meta’s researchers talk at length in their technical paper about how Llama 3.1 advances on the prior version in what they call “multilinguality,” meaning support for many languages beyond English.

The authors solicited a lot of extra human feedback on language answers. “We collect high-quality, manually annotated data from linguists and native speakers,” they write. “These annotations mostly consist of open-ended prompts that represent real world use case.”

Also: 3 ways Meta’s Llama 3.1 is an advance for Gen AI

Some interesting aspects of the failure case hint at what is going on with Llama 3.1 405B. The fake first-person answer, “ვაყოფ,” certainly sounds, even to my non-native ears, like a legitimate Georgian word. The “ვ-” is a common first-person conjugation prefix, and “-ოფ” is a valid Georgian suffix.


So it may be that the model is over-generalizing: finding a quick way to answer by producing synthetic forms, if you will, forms that follow patterns valid across much of the language but that fail when applied without regard to exceptions.
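To make that hypothesis concrete, here is a toy sketch of my own (not anything from Meta's paper): apply the regular first-person pattern, the prefix ვ- plus a thematic suffix, to the root of ყოფნა, and you get a plausible-looking form that is nonetheless wrong, because the verb's real present-tense forms are suppletive.

```python
# Toy illustration of the over-generalization hypothesis (my framing,
# not Meta's). Georgian first-person verb forms commonly take the
# prefix ვ-, so pattern-matching on regular verbs can produce a
# plausible-looking but fake form from the root of ყოფნა.
def naive_first_person(root: str, suffix: str) -> str:
    """Over-generalized first-person singular: ვ + root + suffix."""
    return "ვ" + root + suffix

fake = naive_first_person("ყოფ", "ობ")  # -> "ვყოფობ": plausible nonsense

# In reality, ყოფნა is irregular: its present-tense forms are
# suppletive and don't contain the root ყოფ at all.
actual = "ვარ"  # "I am"

print(f"pattern-generated: {fake}  actual: {actual}")
```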

It’s interesting that Llama 3.1 405B’s answers can vary with multiple attempts. Here, for example, when the question is tried again, the model outputs a valid table of conjugations for the present tense: 

[Image: Llama 3.1 405B gets the present tense of ყოფნა right on a second attempt. Source: Meta AI]

But when prompted for the future tense, the model almost gets it right, but not quite. It fails to add the first-person prefix ვ- to the very first conjugation in the table:

[Image: Llama 3.1 405B fails on the future tense of ყოფნა, omitting the first-person prefix. Source: Meta AI]
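(This kind of slip is easy to flag automatically. Here's a hypothetical spot-check of my own devising, not Meta's evaluation code, using the standard future-tense forms of ყოფნა: any first-person form that doesn't begin with ვ- gets flagged.)

```python
# Hypothetical spot-check (mine, not Meta's) for the slip described
# above: first-person future-tense forms of ყოფნა must start with ვ-.
EXPECTED_FUTURE = {
    "1sg": "ვიქნები",  "2sg": "იქნები",  "3sg": "იქნება",
    "1pl": "ვიქნებით", "2pl": "იქნებით", "3pl": "იქნებიან",
}

def first_person_has_prefix(form: str) -> bool:
    """Georgian first-person verb forms take the prefix ვ-."""
    return form.startswith("ვ")

print(first_person_has_prefix(EXPECTED_FUTURE["1sg"]))  # True: "ვიქნები" passes

# Simulate the model's error: it emitted "იქნები" (the second-person
# form) where the first-person "ვიქნები" belongs.
model_first_person = "იქნები"
print(first_person_has_prefix(model_first_person))      # False -> flagged
```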

Also interesting is the fact that Llama 3.1 405B’s smaller cousin, 70B, actually gets the right answer for the present tense on the very first try. That suggests that all the extra training and computing power that has gone into the larger 405B version can, at least in some cases, actually degrade the results.

I’d imagine Meta’s engineers need to look closely at their corner cases and failure instances and see if their software is over-generalizing. 

Note that the researchers made extensive use of synthetic data to “fine-tune” the model and supplement the human feedback they gathered. It’s an open question whether synthetic data used at great scale contributes to over-regularization, as an article last week in the journal Nature suggests.