When an Arabic LLM team reports a headline score, the first question worth asking is: evaluated on what? The Arabic evaluation landscape in 2026 has matured considerably since the days of translated MMLU, but it still has enough gaps that a model can benchmark impressively on public leaderboards and then perform poorly on real Saudi user queries. The gap between benchmark performance and production performance is not random — it is predictable from which datasets were used for evaluation and which weren't.
This article covers the major Arabic evaluation benchmarks in use in 2026, what each measures and misses, how to read OALL leaderboard rankings without being misled, and how organisations building KSA-facing products can build custom evaluation datasets that close the gap between benchmark and production.
Why Translated Benchmarks Lie About Arabic Model Quality
The history of multilingual NLP evaluation is largely a history of translating English benchmarks and treating high scores as evidence of genuine multilingual capability. For Arabic, this was especially misleading because of how fundamentally Arabic differs from English at every linguistic level.
Translated Arabic benchmarks carry three systematic distortions:
- Translationese syntax. Machine-translated Arabic retains English phrase-order patterns even when grammatically adjusted. Arabic questions translated from English tend to follow SVO order rather than the VSO patterns that dominate native Arabic writing. Models that learn to answer translated questions may perform worse on native-authored queries that follow natural Arabic structure.
- Cultural reference displacement. MMLU questions about the American civil rights movement, US tax law, or Western classical music translate into Arabic without becoming culturally relevant to Arabic speakers. A model that answers these translated questions well has demonstrated English knowledge in Arabic script — not Arabic world knowledge.
- Diacritical noise. Most formal MSA writing omits diacritics (harakat), but machine translation often introduces them inconsistently. Models fine-tuned on diacritised translated text can be brittle on undiacritised native Arabic input, which is the standard form in nearly every production context.
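One way to probe the diacritics issue is to score the same items in their diacritised and stripped forms and compare. The following is a minimal sketch, not a reference implementation: the set of code points treated as harakat is an assumption, and `predict` is a hypothetical callable standing in for whatever model wrapper you use.

```python
import re

# Assumed harakat set: U+064B to U+0652 (tanwin, short vowels, shadda, sukun)
# plus superscript alef U+0670. Extend this if your data uses other marks.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed harakat code points removed."""
    return HARAKAT.sub("", text)


def diacritic_ratio(text: str) -> float:
    """Fraction of characters that are harakat; 0.0 for empty input."""
    if not text:
        return 0.0
    return len(HARAKAT.findall(text)) / len(text)


def accuracy(predict, items) -> float:
    """items is a list of (text, gold_label) pairs; predict maps text to a label."""
    return sum(predict(text) == gold for text, gold in items) / len(items)


def robustness_gap(predict, items) -> float:
    """Accuracy on the original text minus accuracy on the stripped text."""
    stripped = [(strip_diacritics(text), gold) for text, gold in items]
    return accuracy(predict, items) - accuracy(predict, stripped)
```

A large positive gap suggests the model leans on diacritics that real users will not supply. `diacritic_ratio` can also serve as a rough filter for spotting inconsistently diacritised items in a translated evaluation set.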
The practical consequence: teams that benchmarked on translated Arabic MMLU through 2023–2024 were optimising models for translated-query patterns. When those models encountered real Saudi social media, government portals, or financial queries, the performance gap was often 12–20 percentage points on intent accuracy — large enough to make the difference between a usable product and one that requires human fallback on most queries.
The solution isn't to dismiss benchmarks entirely. It is to understand exactly what each benchmark measures and supplement with task-specific, natively-authored evaluation data for your use case.
ArabicMMLU: The Native-Authored Academic Benchmark
ArabicMMLU is the most significant development in Arabic LLM evaluation since AraBench. Released in 2024 and continuously expanded, it contains over 14,000 multiple-choice questions natively authored in Arabic rather than translated, across 40 subjects including mathematics, Islamic jurisprudence, Arabic literature, science, and social studies, calibrated to KSA and GCC school and university curricula.
Key characteristics that distinguish ArabicMMLU from its translated predecessors:
| Dimension | ArabicMMLU | Translated MMLU-Arabic |
|---|---|---|
| Authorship | Native Arabic, GCC-calibrated | English → Arabic machine/human translation |
| Cultural relevance | High — includes KSA curriculum subjects | Low — US/Western cultural references |
| Islamic Studies coverage | Yes — Fiqh, Quran, Hadith | Absent or marginal |
| Dialect coverage | MSA-dominant (some formal Khaleeji) | MSA translation only |
| Question count (approx.) | 14,000+ | Varies by translation version |
Where ArabicMMLU is strong: academic reasoning in MSA, subject-matter knowledge that aligns to KSA educational standards, Islamic jurisprudence tasks that no translated benchmark covers meaningfully. For teams building educational AI, government-facing conversational products, or knowledge-base QA for KSA enterprise, ArabicMMLU is the right primary benchmark.
Where ArabicMMLU is weak: it is almost entirely MSA. Khaleeji dialectal inputs — which dominate Saudi consumer social media, customer service, and conversational AI — are not well represented. A model that achieves 74% on ArabicMMLU might still struggle with Hejazi colloquial text from Jeddah-based customers, or with the code-switched Arabic-English that dominates Saudi professional communication. ArabicMMLU measures academic knowledge, not conversational capability.
AlGhafa: Arabic Cultural and Literary Depth
AlGhafa (Arabic: الغفى, referencing deep knowledge) is an evaluation suite developed to fill a specific gap in ArabicMMLU: cultural, historical, and literary Arabic knowledge. Where ArabicMMLU focuses on academic subjects analogous to MMLU's Western curricula, AlGhafa tests knowledge that is distinctly Arabic — pre-Islamic poetry (Mu'allaqat), classical Arabic prose, Arab historical events, and cultural heritage knowledge that educated native speakers would be expected to possess.
AlGhafa includes several task types that expose weaknesses invisible in ArabicMMLU:
- Classical Arabic comprehension. Questions drawn from فصحى classical texts require understanding of grammatical structures that differ substantially from modern MSA. Models trained primarily on contemporary Arabic web text typically score 15–25 points lower on classical Arabic than on modern MSA, revealing a training data gap relevant to legal, religious, and heritage applications.
- Arabic proverbs and idiom comprehension. Arabic is among the richest languages for proverbs (أمثال). AlGhafa tests comprehension of idioms that require cultural knowledge to interpret correctly — a capability that direct translation pipelines cannot acquire. This is the most discriminating task type for models intended for culturally-aware applications.
- Historical events and geography. Arab-specific historical events, geography of the Arab world, and MENA geopolitical knowledge. Models with strong Western knowledge bases but limited Arabic-world training data show large gaps here.
For practical evaluation strategy: AlGhafa scores correlate well with performance on government documentation tasks, Islamic finance product descriptions, and heritage/cultural content applications. They are less predictive of performance on tech-sector customer service queries or logistics tracking applications. Match your evaluation dataset to your use case — not to the benchmark that makes your model look best.
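Whether a given benchmark actually predicts your task can be checked directly once you have scores for a handful of candidate models on both the public benchmark and your own task set. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration under assumptions: the function names are made up for this example, and with only a few models the estimate is noisy.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either list is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Pass one list of benchmark scores and one list of task scores, in the same model order. A value near 1 means the benchmark ranks models the way your task does; a value near 0 means the benchmark tells you little about your use case.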
The OALL Leaderboard: Reading It Without Being Misled
The Open Arabic LLM Leaderboard (OALL), hosted on Hugging Face, is the primary public ranking for Arabic language model comparisons in 2026. As of mid-2026 it aggregates results across six evaluation dimensions: ArabicMMLU, AlGhafa, AraAGI, AraTrust, ACVA (Arabic Cultural Values Alignment), and an Arabic reading comprehension component.
The composite OALL score has become a standard citation in Arabic model papers and vendor comparisons. It is substantially better than citing translated MMLU alone, but several caveats apply when reading leaderboard rankings:
Caveat 1: OALL is MSA-dominant
All six OALL datasets are in Modern Standard Arabic. Dialectal capability — Khaleeji, Egyptian, Levantine, Maghrebi — is entirely absent from the composite score. A model can rank first on OALL and still produce incoherent Najdi Arabic or fail to parse Gulf code-switching. For consumer-facing Saudi or UAE products, OALL scores predict very little about real user experience.
Caveat 2: Contamination risk
ArabicMMLU and AlGhafa have been publicly available long enough that their contents may appear in the training data of models submitted to OALL. The leaderboard does not currently verify training data exclusion. Models that have been trained on OALL benchmark data — intentionally or not — will overstate their generalisation capability. Cross-reference OALL rankings with performance on private or recently-released evaluation sets before making procurement decisions.
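For teams that control their own fine-tuning data, a simple word n-gram overlap check gives a first signal of contamination. The sketch below is one common heuristic, not the leaderboard's method. It assumes you can read a sample of the training corpus, which is rarely possible for third-party models, and the default window of 8 words is an arbitrary choice.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous word n-grams in whitespace-tokenised text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n: int = 8) -> float:
    """Share of benchmark items that share at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    flagged = sum(
        1 for item in benchmark_items if word_ngrams(item, n) & corpus_grams
    )
    return flagged / len(benchmark_items)
```

Exact n-gram matching misses paraphrased or re-diacritised copies, so a low rate is weak evidence of a clean corpus, while a high rate is a strong warning. Normalising diacritics and punctuation before matching would tighten the check.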
Caveat 3: Task-type weighting doesn't match product needs
OALL's composite score weights each sub-benchmark equally. For a Saudi government chatbot, Islamic jurisprudence and cultural values alignment should be weighted heavily. For a GCC fintech application, those dimensions matter far less than Arabic reading comprehension and entity extraction accuracy on financial documents. Use OALL as a broad sanity check, not as a procurement decision framework.
Notable models on OALL as of mid-2026: SDAIA's ALLaM (the KSA sovereign LLM, developed under Vision 2030's national AI strategy) consistently ranks highly on ArabicMMLU-adjacent tasks, reflecting its training on KSA-curated Arabic corpora. ALLaM and Jais (the latter from G42 in the UAE) both show strong AlGhafa performance due to their Gulf-centric training data. International models such as GPT-4o, Gemini, and Claude score competitively on MSA academic tasks but lag on Khaleeji dialectal evaluation, which OALL does not capture.
Other Benchmarks Worth Knowing: AraTrust, AraAGI, ACVA
Beyond ArabicMMLU and AlGhafa, the OALL suite includes three additional evaluation dimensions that target different model capabilities:
- AraTrust: An alignment benchmark testing whether Arabic models follow safety, truthfulness, and helpfulness constraints when responding in Arabic. It includes adversarial prompts designed to elicit harmful, misleading, or culturally inappropriate outputs. For teams deploying Arabic models in KSA government or healthcare contexts, AraTrust performance is a meaningful proxy for deployment risk. Models that score well on ArabicMMLU but poorly on AraTrust are academic performers that would require significant safety fine-tuning before production deployment in sensitive MENA contexts.
- AraAGI: Tests general reasoning capability in Arabic, including mathematical word problems, logical reasoning chains, and multi-step inference tasks. It is the Arabic counterpart to the AGIEval suite. High AraAGI performance indicates that the model can reason in Arabic rather than reasoning in English and translating the output, a distinction visible in latency and intermediate chain-of-thought quality.
- ACVA (Arabic Cultural Values Alignment): Arguably the most distinctive OALL component. ACVA evaluates whether model outputs align with Arabic cultural values across a range of social, familial, and religious contexts. For products deployed in KSA and the GCC, cultural values alignment is not a marginal concern — it is a commercial and regulatory necessity. A model that answers medical questions in a way inconsistent with Islamic jurisprudence, or suggests social behaviour at odds with Saudi cultural norms, will not be deployed at scale regardless of its ArabicMMLU score. ACVA score is the single OALL component most relevant to Vision 2030 enterprise deployments.
For Arabic model selection across most KSA enterprise use cases, a reasonable decision framework weights: ACVA (cultural alignment) ≥ ArabicMMLU (academic knowledge) > AlGhafa (cultural depth) > AraTrust (safety) > AraAGI (reasoning). Adjust weights based on your vertical: healthcare teams should up-weight AraTrust, fintech teams should up-weight AraAGI, and educational platforms should promote ArabicMMLU to the primary weight.
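The weighting can be made concrete with a small scoring function. In the sketch below the numeric weights are hypothetical values chosen only to respect the ordering above, and the two sets of model scores are invented for illustration. The reading comprehension component is left out because the ordering above does not place it.

```python
# Hypothetical weights consistent with the ordering
# ACVA >= ArabicMMLU > AlGhafa > AraTrust > AraAGI.
ENTERPRISE_WEIGHTS = {
    "ACVA": 0.30,
    "ArabicMMLU": 0.25,
    "AlGhafa": 0.20,
    "AraTrust": 0.15,
    "AraAGI": 0.10,
}

# Equal weighting, mirroring how the composite is described.
EQUAL_WEIGHTS = {name: 1.0 for name in ENTERPRISE_WEIGHTS}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-benchmark scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


# Invented scores for two imaginary models, for illustration only.
model_a = {"ACVA": 60, "ArabicMMLU": 70, "AlGhafa": 70, "AraTrust": 80, "AraAGI": 90}
model_b = {"ACVA": 85, "ArabicMMLU": 75, "AlGhafa": 70, "AraTrust": 65, "AraAGI": 60}

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(
        name,
        round(weighted_score(scores, EQUAL_WEIGHTS), 2),
        round(weighted_score(scores, ENTERPRISE_WEIGHTS), 2),
    )
```

With these invented numbers the ranking flips: the first model leads under equal weighting and the second leads under the enterprise weights. This is why an equal-weight composite cannot serve as the procurement decision on its own.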
Building Custom Arabic Evaluation: A Practical Framework
Public benchmarks measure what the benchmark authors decided to measure. Custom evaluation measures whether your model works for your users. For KSA and GCC products, the gap between these is large enough that teams relying on OALL rankings alone will frequently be surprised by production performance.
A practical custom Arabic evaluation framework has five components:
- Production query sampling. Extract a stratified sample of real user queries from your product (or from analogous Arabic-language products in your domain if you are pre-launch). Aim for 500–1,000 queries representative of the query distribution: topic mix, dialect mix, query length distribution, and proportion of code-switched input. This is the single most valuable step — it anchors evaluation to actual user behaviour rather than benchmark distributions.
- Native-speaker annotation of ground truth. For each sampled query, have native-speaker annotators with domain expertise produce reference answers or evaluation criteria. For Khaleeji-facing products, annotators must have certified sub-dialect coverage (Najdi, Hejazi, Emirati as relevant). Using MSA-fluent annotators for Khaleeji evaluation is a common error that produces ground truth with the same dialect blind spots you are trying to evaluate against.
- Dialect-stratified evaluation. Separate your evaluation set by dialect tier and report metrics per tier. Aggregating MSA and Khaleeji performance hides the gaps that matter commercially. Report MSA, Khaleeji, and code-switched performance as three distinct numbers, not a combined figure.
- PDPL-compliant data handling. If your evaluation data includes real user queries, Saudi PDPL Article 29 restrictions on cross-border data transfer apply. Anonymise user identifiers and strip device and location metadata. Before sending data to annotation vendors, either keep processing within the Kingdom or rely on contractual mechanisms that satisfy PDPL transfer requirements.
- Blind comparison against baseline. Evaluate your model alongside at least one public baseline (ALLaM, Jais, or a general-purpose international model) on your custom set. This calibrates your custom benchmark against known capability levels and protects against benchmark overfitting on your custom evaluation data.
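The sampling, dialect-stratified reporting, and baseline comparison steps above can be sketched in a few functions. This is a minimal illustration under assumptions: each record is a dict with a hypothetical `tier` key (for example `msa`, `khaleeji`, `code_switched`), result records also carry a boolean `correct` key, and how correctness is judged against the native-speaker references is left to your own grading process.

```python
import random
from collections import defaultdict


def stratified_sample(queries, total, seed=0):
    """Sample roughly `total` queries, keeping each tier's share of the pool.

    Rounding (and the minimum of one per tier) means the result can differ
    slightly from `total`.
    """
    by_tier = defaultdict(list)
    for query in queries:
        by_tier[query["tier"]].append(query)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for tier in sorted(by_tier):
        group = by_tier[tier]
        k = max(1, round(total * len(group) / len(queries)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


def accuracy_by_tier(results):
    """Per-tier accuracy from records with 'tier' and 'correct' keys."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record["tier"]] += 1
        hits[record["tier"]] += int(record["correct"])
    return {tier: hits[tier] / totals[tier] for tier in totals}


def tier_report(model_results, baseline_results):
    """Model vs baseline accuracy per tier, never a single combined figure."""
    model = accuracy_by_tier(model_results)
    baseline = accuracy_by_tier(baseline_results)
    return {
        tier: {
            "model": model[tier],
            "baseline": baseline[tier],
            "delta": model[tier] - baseline[tier],
        }
        for tier in model
        if tier in baseline
    }
```

Keeping the report keyed by tier means a strong MSA number cannot mask a weak Khaleeji or code-switched number. The per-tier `delta` shows where the model actually differs from the public baseline.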
The annotation cost for a 500-query custom Arabic evaluation set with native-speaker reference answers runs approximately AUD 8,000–15,000 depending on dialect complexity and domain specialisation — a small fraction of model training costs, and a much smaller cost than shipping a model that fails in production and requires emergency fine-tuning and redeployment.
Vision 2030 Context: Why Evaluation Rigour Matters Now
Saudi Arabia's Vision 2030 AI strategy is producing a new class of Arabic AI procurement decisions at scale. SDAIA, the National Center for AI (NCAI), PIF-backed technology companies, and Saudi government ministries are all in active procurement cycles for Arabic LLM capabilities. The evaluation rigour of their vendor selection processes has increased substantially in 2025–2026.
The practical implication for vendors: citing a high OALL score is no longer sufficient for serious KSA enterprise procurement. Procurement teams at major Saudi organisations now typically ask for:
- ACVA cultural values alignment score, broken out separately from composite OALL
- Khaleeji dialectal performance on a custom evaluation set provided by the procurement team
- Proof of training data provenance demonstrating that KSA or GCC data was used
- Data handling documentation demonstrating PDPL compliance in training data collection
- Performance on domain-specific tasks relevant to the intended deployment (government services, healthcare, fintech)
For model developers targeting the KSA market, this means the annotation investment is not just about training data — it is about building the evaluation infrastructure that procurement teams are demanding. Teams that have invested in native-authored, dialect-certified evaluation sets have a material advantage in competitive bid processes.
The emerging standard, as of 2026, is a “dual evaluation” approach: public benchmark scores (OALL composite) as table stakes, plus a client-provided or independently-administered domain-specific evaluation as the differentiating evidence. Annotation vendors that can produce both the training data and the evaluation infrastructure are positioned to support the full Arabic AI development cycle rather than just the labelling phase.
FAQ
What is ArabicMMLU and why does it matter for Arabic LLM evaluation?
ArabicMMLU is a native-authored Arabic multitask benchmark covering 40 subjects calibrated to KSA and GCC curricula, including Islamic jurisprudence subjects absent from translated alternatives. It is the closest Arabic equivalent to MMLU and the primary headline benchmark for Arabic foundation model comparisons.
What is the OALL leaderboard and how is it structured?
OALL (Open Arabic LLM Leaderboard) aggregates results across six Arabic benchmarks — ArabicMMLU, AlGhafa, AraAGI, AraTrust, ACVA, and reading comprehension — into a composite score. It is the primary public ranking for Arabic model comparisons but is MSA-dominant with no dialectal coverage.
Why do translated English benchmarks mislead Arabic model evaluation?
Translated benchmarks retain English syntax patterns, Western cultural references, and diacritical noise from machine translation. They test translation quality as much as Arabic comprehension. Models optimised for translated benchmarks can underperform by 12–20 percentage points on real Saudi user queries compared to their benchmark scores.
What does AlGhafa evaluate that ArabicMMLU doesn't?
AlGhafa focuses on classical Arabic literary and cultural knowledge — pre-Islamic poetry, Arabic proverbs, Arab historical events, and cultural heritage. It is the most discriminating benchmark for models intended for cultural heritage, educational, or government applications where deep Arabic cultural knowledge matters.
How should Vision 2030 organisations approach custom Arabic evaluation?
Sample production queries from your real user distribution, have native-speaker annotators who cover the relevant Khaleeji sub-dialects (Najdi, Hejazi, Emirati) produce reference answers, report dialect-stratified metrics separately, handle data under PDPL Article 29 requirements, and compare against a public baseline. A 500-query custom evaluation set costs approximately AUD 8,000–15,000, which is small relative to model training and deployment costs.
Can ArabicMMLU scores predict real-world Saudi product performance?
ArabicMMLU is a reliable predictor for academic knowledge tasks in MSA but a weak predictor of dialectal conversational performance. A model at 74% on ArabicMMLU can still fail on Hejazi colloquial customer queries or Gulf code-switched input. Supplement with a custom dialect-specific evaluation set before making deployment decisions for Saudi consumer products.
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.