The Arab AI safety stress test

The Equity Cost of Language: Why AI Rewards Arabic and Abandons Aboriginal Tongues

Consider that you've been tasked with teaching an AI your native language — one that isn't English. But you're not permitted to show it a single piece of written text: no digital artifacts in the way of books, articles, internet forms, or dictionaries. It's to learn purely from your voice. How well do you think it would actually learn?

If you consider how we currently build and train large language models, stripping away the written word would be akin to raising a skyscraper without concrete and steel. The entire architecture relies, fundamentally, on ingested text.

At the time of writing in 2026, linguists are witnessing a shifting landscape of AI language support, with two language families set on completely diverging trajectories. On one side are Arabic dialects, which are experiencing a boom; on the other, Australian Indigenous languages, which remain simply inaccessible to off-the-shelf AI. The contrast lays bare the "equity cost" of current foundation models. Put differently, the very architecture of modern AI creates a structural asymmetry — one that powerfully rewards languages nourished by vast amounts of standardized, internet-based text.

Proprietary AI requires billions, if not trillions, of text tokens to learn patterns. But for languages that are primarily oral and consequently lack a web-scale presence, the architecture leaves them behind. Let us focus on that requirement for text first, because it precisely explains the Arabic AI boom. Prior to 2022, if you tried to run a casual, slang-laden social media post from Beirut through an AI translator, it would have been a futile exercise — an immensely unreliable means of unlocking the meaning. But between 2024 and 2026, something shifted. Here's why.

Diglossia and the Blind Spot

In the Arabic-speaking world, there is a phenomenon known as diglossia — two layers of language operating simultaneously. First, there is Modern Standard Arabic, or MSA: the formal language of the news broadcast, of published literature, of government documents and formal education. Accordingly, there is an ocean of MSA text on the internet, and the AI had plenty of formal textbooks to read. The catch with MSA, one that English speakers rarely appreciate, is that Arabic-speaking people do not speak it at home, hanging out with friends, or when texting. Their everyday speech is in regional dialects — Egyptian, Levantine, Khaleeji, and so on.

The gap between the formal MSA that the AI knew and the colloquial dialects that people actually used every day was a blind spot. Recently, however, informal social media data has been harvested far more deliberately, so that foundation models might conquer these dialects.

A few flagship, Arabic-centric model families have matured of late: Jais out of the UAE, AceGPT v2, Allam from Saudi Arabia, and Fanar from Qatar. Fanar 227B stands as the most dialect-aware of the bunch. It uses a Gemma 3 backbone to master Khaleeji, Levantine, and Egyptian dialects. Rather than building the AI from scratch, the developers borrowed the base model from Google's open-weight Gemma 3 — which already carries a structural understanding of how language works in general — and then fine-tuned it deeply and specifically on Arabic regional data.

The Accuracy Paradox

But looking at the performance data for Fanar 2, an obvious contradiction warrants attention. Fanar 2 gained 9.1 points in general Arabic knowledge on its evaluation tests, yet in the specific category of dialects it gained only 3.5. This makes little sense: if Fanar is explicitly designed and marketed as a dialect-focused model, why should the dialect category show only minor, incremental improvement? This accuracy paradox is, indeed, one of the most frustrating problems facing AI developers.

To understand it, we have to look at how we actually test an AI's translation skills — through automated evaluation metrics. The BLEU score stands for bilingual evaluation understudy. At its core, it is essentially a matching game: it measures how many specific words, or sequences of words, in the AI's output exactly match a human-translated answer key. The problem is that those reference materials — the answer keys — stubbornly default to formal Modern Standard Arabic.

Say the AI is translating a phrase into Levantine Arabic, and it correctly uses the authentic regional word biddi for "want." Or perhaps it reaches for the Maghrebi word makla for food. The question the metric must settle is which exact words a human would naturally use. Yet the BLEU answer key strictly prefers the formal MSA equivalent. So the automated metric scans the output, sees biddi instead of the formal MSA oreed, and simply declares: match error. The metric penalizes the AI for being authentically dialectal. And so, across iterative training epochs, the model that quietly defaults to formal, robotic-sounding MSA actually scores higher on the test than the model that acquires the true regional dialect. The AI is punished for doing its job correctly — like a student who has mastered regional street slang, only for the exam to be graded against an Oxford dictionary.

This means the real-world capability of models like Fanar 2 is likely much higher than their official test scores suggest. So if the AI is getting that good at Egyptian and Levantine, does that mean the Arabic dialect problem is solved?

Persistent Failure Modes

Not quite. There remain persistent failure modes. The Maghrebi dialects — Moroccan Darija and Algerian in particular — are still the most weakly covered regions across the board. And Mesopotamian Iraqi is so poorly represented that, in parsing an Iraqi social media post, the AI will be significantly less reliable than it would be with an Egyptian one.

Yet the most intriguing edge case is not a traditional spoken dialect at all. It is Arabizi: romanized Arabic chat script. It uses the standard English alphabet, but presses numbers into service to represent Arabic sounds that don't exist in English. Arabists will be familiar with the number 3, which frequently substitutes for the Arabic letter ayn.

Arabizi emerged organically in the early days of the internet and texting, when Arabic keyboards simply weren't widely available or supported on mobile phones. But, curiously, this organic chat script has become a literal safety loophole for the largest frontier AI models — a means of jailbreaking foundation models. When prompted in Arabizi, the failure rate is staggering: researchers found that the rate at which the AI produced unsafe or toxic outputs jumped to 12.12%.

It all comes down to tokenization. When an AI processes text, it chops words into smaller pieces called tokens. With safety guardrails in place, LLMs are supposed to refuse harmful requests — but those guardrails were built primarily using standard Arabic-script tokens. When an AI safety filter scans a prompt written in Arabizi, it sees a jumble of Latin letters and numbers that, to the safety classifier, appear like English gibberish, which it waves through as harmless. The deeper semantic layers of the model, however — the pattern-recognition parts — recognize the underlying Arabic meaning. So the technical quirk is this: the model processes the harmful request and generates a response, bypassing the safety net entirely.

Researchers identified a fix in 2026, and it does not require retraining the entire model. As a workaround, a developer simply prepends an invisible system instruction that says, in effect, translate this Arabizi into Arabic script before processing the user's request. The unsafe output rate then plummets. By forcing the AI to rewrite the secret note into the formal alphabet, all those standard safety guardrails activate — and researchers found that forcing that internal translation step actually improves the AI's overall comprehension of the prompt as well.

When There Is No Text at All

All of this success, though — even the ability to patch that Arabizi safety loophole — relies crucially on millions of people having smartphones, texting one another, and leaving a vast digital footprint of written text. Which brings us to the other issue: what happens when a language's origins reside in a purely oral culture, where its richness is nowhere evident in a digital text footprint?

Out of the box, the massive 2024-era models — OpenAI's Whisper, Google's Universal Speech Model, and Meta's Massively Multilingual Speech (MMS) — support absolutely zero Australian Aboriginal or Torres Strait Islander languages. When Meta released MMS, it explicitly claimed support for over 1,100 languages. But that figure omits Australian Indigenous languages entirely, because Meta's model learned almost wholly from Christian missionary recordings of the New Testament.

For hundreds of low-resource languages, the only professionally translated, standardized text with an accompanying high-quality audio recording is the Bible. So Meta scraped that data to teach the AI how spoken sounds map to written words across the globe. That process excluded Australia owing to a massive geographical bias in the dataset: of the hundreds of Indigenous languages in Australia, only one — Kriol — has a complete translated Bible.

So it is fair to ask what prevents fieldwork with microphones alongside speaker communities, and expanded efforts to record oral histories. But data scarcity is only half the battle. There is a fundamental linguistic hurdle that standard AI architectures simply cannot clear.

As explored in the 2024 study by Le Ferrand et al., who took state-of-the-art models and fine-tuned them on ten different Indigenous languages: they took one of those languages, Bininj Kunwok, and fed it an hour of carefully curated, spontaneous speech — an hour of high-quality audio being usually enough to get a baseline speech-recognition system working. That holds true for a language like English or Spanish. But with Bininj Kunwok, the model failed spectacularly, producing the worst word error rate in the entire study. The reason comes down to grammar.

Polysynthesis and the Limits of the Word

The technical term is polysynthetic morphology. In English, we build a sentence from many separate, isolated words. "He is sitting on the chair" — that is six distinct words separated by spaces, and the AI can process each one individually. But Bininj Kunwok is polysynthetic: you can encode that entire English clause into a single, complex phonological word. A whole sentence in one word — the subject, the verb, the object, the tense, all baked into one continuous string of sounds through various prefixes and suffixes attached to a root.

Standard AI metrics calculate success using word error rate; they are looking for individual words. Word error rate tallies mistakes based on entire words being substituted, inserted, or deleted. So if an AI tries to transcribe a polysynthetic word in Bininj Kunwok — a word that contains the meaning of an entire sentence — and gets 95% of the sounds perfectly correct but misses one tiny suffix at the very end, the metric grades the entire word as 100% wrong. Because polysynthetic languages allow you to stack these prefixes and suffixes indefinitely to create slight variations in meaning, a vocabulary explosion results, with a practically infinite number of possible unique words. The AI metrics, by contrast, are built on the Western assumption that words are small, discrete Lego bricks separated by spaces. Presented with polysynthesis, the probabilistic, token-driven math simply breaks.

It is the same accuracy paradox we saw with the Arabic dialects. But instead of being penalized for slang, the AI is penalized at a deep, structural, grammatical level. This traditional approach — what researchers call extractive automatic speech recognition, or ASR — simply isn't working, and it has prompted an institutional pivot in Australia and a phasing out of the old tools. Coedl's Elpís platform, a major tool for a while, is essentially in maintenance mode now, given its reliance on older acoustic technology that foundation models have surpassed. But, as we have just discussed, foundation models break when they encounter polysynthesis. So the center of gravity has shifted toward the Language Data Commons of Australia, or LDaCA, which archives something approaching 18,500 hours of audio.

So if foundation models can't process it, what are researchers actually doing with it?

From Extraction to Sovereignty

This is where leading researchers like Steven Bird are changing the paradigm — pivoting away from extractive ASR, and no longer merely trying to suck audio data out of these communities to feed a Big Tech algorithm. They are moving toward community-controlled, participatory design. Instead of asking how we can force this language into an AI model, they ask how the technology might actually serve the speech community.

It is all about data sovereignty. They are building workflows for community linguists. Success is no longer measured by an abstract word error rate; it is measured by whether a tool actually saves a human linguist time as they transcribe community archives.

But it does leave us with two completely different trajectories. On one hand, Arabic dialects are surfing a massive wave of social media data, finding clever technical hacks to bypass safety-evaluation blind spots. On the other, Australian Indigenous languages are hitting the limits of how AI processes grammar, forcing researchers to rethink the entire purpose of language technology.

Is there any room for hope that oral, polysynthetic languages might yet secure functional AI support?

A Game-Changer on the Horizon

There is one major potential game-changer on the horizon, and it returns us to Meta. In November 2025, the company released something called Omnilingual ASR — the successor to that MMS model that had relied entirely on Bible translations. The key innovation of Omnilingual ASR is a zero-shot decoder.

To appreciate why that matters, consider the alternative. A conventional decoder functions by recording a sound and attempting to match it against a pre-written dictionary of known words. Since polysynthetic languages contain an infinite number of words, such a dictionary is rendered theoretically useless. A zero-shot decoder relies on no dictionary at all. It maps acoustic features — the raw sounds — directly to written characters, based on universal phonetic patterns it has learned. It can transcribe a language it has never explicitly been trained to understand. Meta built this using a 3,350-hour corpus of transcribed speech across 350 underserved languages.

Significantly, success here is measured differently — at a more granular level, using character error rates rather than word error rates. Character error rates fall below 10% for the vast majority of those 350 languages, which comes down to the fact that measuring character by character doesn't punish the AI for getting one letter wrong in a sentence-length word. Instead, it credits the AI for the 95% of characters it got right.

Researchers are still actively confirming exactly which Australian Indigenous languages are included, and functioning well, in the model. But this zero-shot approach is, realistically, the most plausible near-term hope for out-of-the-box support for larger Australian languages by 2026 or 2027.

The Core Takeaway

AI loves the internet. If millions of people type a dialect, the AI will eventually ingest and master it. But that same gap is actively widening for oral-only, structurally complex languages. Its reliance on text tokens, and on the spaces between words, is built in a way that struggles to comprehend languages that don't fit the text-heavy mold. This is going to put downward pressure on linguistic diversity, and on worldwide digital assimilation more broadly — especially if AI foundation models become the underlying operating system of the future internet. Speaking an unsupported language might then be treated by the system as noise: an error to be autocorrected, or actively filtered out.