LLM Training · June 2026 · 15 min read

Arabic LLM Evaluation: ArabicMMLU, AlGhafa, and Building Custom Benchmarks

Translated English benchmarks give Arabic LLMs deceptively high scores. ArabicMMLU, AlGhafa, and the OALL leaderboard are closer to the truth — but still miss the production gaps that matter for GCC products. This guide explains each benchmark, where it fails, and how to build custom evaluation data that actually predicts whether your model will work in Saudi Arabia.

When an Arabic LLM team reports a headline score, the first question worth asking is: evaluated on what? The Arabic evaluation landscape in 2026 has matured considerably since the days of translated MMLU, but it still has enough gaps that a model can benchmark impressively on public leaderboards and then perform poorly on real Saudi user queries. The gap between benchmark performance and production performance is not random — it is predictable from which datasets were used for evaluation and which weren't.

This article covers the major Arabic evaluation benchmarks in use in 2026, what each measures and misses, how to read OALL leaderboard rankings without being misled, and how organisations building KSA-facing products can build custom evaluation datasets that close the gap between benchmark and production.

Why Translated Benchmarks Lie About Arabic Model Quality

The history of multilingual NLP evaluation is largely a history of translating English benchmarks and treating high scores as evidence of genuine multilingual capability. For Arabic, this was especially misleading because of how fundamentally Arabic differs from English at every linguistic level.

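Translated Arabic benchmarks carry three systematic distortions: they retain English syntax patterns, they carry Western cultural references, and they pick up diacritical noise from machine translation. As a result, they test translation quality as much as Arabic comprehension.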

The practical consequence: teams that benchmarked on translated Arabic MMLU through 2023–2024 were optimising models for translated-query patterns. When those models encountered real Saudi social media, government portals, or financial queries, the performance gap was often 12–20 percentage points on intent accuracy — large enough to make the difference between a usable product and one that requires human fallback on most queries.

The solution isn't to dismiss benchmarks entirely. It is to understand exactly what each benchmark measures and supplement with task-specific, natively-authored evaluation data for your use case.

ArabicMMLU: The Native-Authored Academic Benchmark

ArabicMMLU is the most significant development in Arabic LLM evaluation since AraBench. First published in 2024 and continuously expanded, it contains over 14,000 multiple-choice questions natively authored in Arabic (not translated) across 40 subjects including mathematics, Islamic jurisprudence, Arabic literature, science, and social studies, calibrated to KSA and GCC school and university curricula.

Key characteristics that distinguish ArabicMMLU from its translated predecessors:

Dimension | ArabicMMLU | Translated MMLU-Arabic
Authorship | Native Arabic, GCC-calibrated | English → Arabic machine/human translation
Cultural relevance | High: includes KSA curriculum subjects | Low: US/Western cultural references
Islamic Studies coverage | Yes: Fiqh, Quran, Hadith | Absent or marginal
Dialect coverage | MSA-dominant (some formal Khaleeji) | MSA translation only
Question count (approx.) | 14,000+ | Varies by translation version

Where ArabicMMLU is strong: academic reasoning in MSA, subject-matter knowledge that aligns to KSA educational standards, Islamic jurisprudence tasks that no translated benchmark covers meaningfully. For teams building educational AI, government-facing conversational products, or knowledge-base QA for KSA enterprise, ArabicMMLU is the right primary benchmark.

Where ArabicMMLU is weak: it is almost entirely MSA. Khaleeji dialectal inputs — which dominate Saudi consumer social media, customer service, and conversational AI — are not well represented. A model that achieves 74% on ArabicMMLU might still struggle with Hejazi colloquial text from Jeddah-based customers, or with the code-switched Arabic-English that dominates Saudi professional communication. ArabicMMLU measures academic knowledge, not conversational capability.
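Code-switched input can at least be flagged mechanically before it reaches annotators. The sketch below is a rough heuristic, not a dialect identifier: it only checks which scripts a query's letters belong to, so it cannot separate MSA from Hejazi, and Arabic typed in Latin characters (Arabizi) will be labelled as Latin. The function name `script_mix` and its four labels are illustrative assumptions, not part of any benchmark tooling.

```python
import unicodedata


def script_mix(text: str) -> str:
    """Label a query 'arabic', 'latin', 'mixed', or 'other' by letter script.

    Heuristic only: it counts letters whose Unicode names start with
    ARABIC or LATIN and ignores digits, punctuation, and diacritics.
    """
    arabic = 0
    latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic and latin:
        return "mixed"
    if arabic:
        return "arabic"
    if latin:
        return "latin"
    return "other"


# Hypothetical example queries
for query in ["كيف حالك", "أبغى أسوي reset للباسورد", "track my order"]:
    print(script_mix(query), "|", query)
```

Queries labelled `mixed` are candidates for a separate code-switched evaluation tier. The dialect label itself still has to come from native-speaker annotators.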

AlGhafa: Arabic Cultural and Literary Depth

AlGhafa (Arabic: الغفى, referencing deep knowledge) is an evaluation suite developed to fill a specific gap in ArabicMMLU: cultural, historical, and literary Arabic knowledge. Where ArabicMMLU focuses on academic subjects analogous to MMLU's Western curricula, AlGhafa tests knowledge that is distinctly Arabic — pre-Islamic poetry (Mu'allaqat), classical Arabic prose, Arab historical events, and cultural heritage knowledge that educated native speakers would be expected to possess.

AlGhafa includes several task types that expose weaknesses invisible in ArabicMMLU:

For practical evaluation strategy: AlGhafa scores correlate well with performance on government documentation tasks, Islamic finance product descriptions, and heritage/cultural content applications. They are less predictive of performance on tech-sector customer service queries or logistics tracking applications. Match your evaluation dataset to your use case — not to the benchmark that makes your model look best.

The OALL Leaderboard: Reading It Without Being Misled

The Open Arabic LLM Leaderboard (OALL), hosted on Hugging Face, is the primary public ranking for Arabic language model comparisons in 2026. As of mid-2026 it aggregates results across six evaluation dimensions: ArabicMMLU, AlGhafa, AraAGI, AraTrust, ACVA (Arabic Cultural and Value Alignment), and an Arabic reading comprehension component.

The composite OALL score has become a standard citation in Arabic model papers and vendor comparisons. It is substantially better than citing translated MMLU alone, but several caveats apply when reading leaderboard rankings:

Caveat 1: OALL is MSA-dominant

All six OALL datasets are in Modern Standard Arabic. Dialectal capability — Khaleeji, Egyptian, Levantine, Maghrebi — is entirely absent from the composite score. A model can rank first on OALL and still produce incoherent Najdi Arabic or fail to parse Gulf code-switching. For consumer-facing Saudi or UAE products, OALL scores predict very little about real user experience.

Caveat 2: Contamination risk

ArabicMMLU and AlGhafa have been publicly available long enough that their contents may appear in the training data of models submitted to OALL. The leaderboard does not currently verify training data exclusion. Models that have been trained on OALL benchmark data, intentionally or not, will overstate their generalisation capability. Cross-reference OALL rankings with performance on private or recently released evaluation sets before making procurement decisions.
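If you control the training corpus, or a sample of it, a first-pass contamination check is straightforward. The sketch below flags benchmark items that share at least one word n-gram with any corpus document. It is a minimal illustration under stated assumptions: the helper names, the default `n=8`, and the diacritic-stripping normalisation are choices made here, not a method used by OALL, and exact n-gram overlap will miss paraphrased or re-translated leakage.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Lower-case and drop combining marks (for example Arabic diacritics)."""
    return "".join(
        ch for ch in text.lower() if unicodedata.category(ch) != "Mn"
    )


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in `text` after normalisation."""
    tokens = re.findall(r"\w+", normalise(text))
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    Items shorter than n words yield no n-grams and are never flagged.
    """
    if not benchmark_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(
        1 for item in benchmark_items if ngrams(item, n) & corpus_grams
    )
    return flagged / len(benchmark_items)
```

A non-zero rate is a reason to inspect the flagged items by hand. A zero rate does not prove the benchmark is unseen, only that no verbatim overlap was found in the documents checked.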

Caveat 3: Task-type weighting doesn't match product needs

OALL's composite score weights each sub-benchmark equally. For a Saudi government chatbot, Islamic jurisprudence and cultural values alignment should be weighted heavily. For a GCC fintech application, those dimensions matter far less than Arabic reading comprehension and entity extraction accuracy on financial documents. Use OALL as a broad sanity check, not as a procurement decision framework.
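The effect of equal weighting is easy to see with a few lines of arithmetic. The sketch below recomputes a composite under use-case weights. All scores and weights are invented for illustration, and the key names (including `ReadingComp`) are hypothetical labels rather than official OALL identifiers.

```python
from typing import Dict, Optional


def composite(scores: Dict[str, float],
              weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of sub-benchmark scores; equal weights if none given."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total


# Hypothetical scores for two models (percent correct)
model_a = {"ArabicMMLU": 70, "AlGhafa": 70, "AraAGI": 60,
           "AraTrust": 80, "ACVA": 80, "ReadingComp": 60}
model_b = {"ArabicMMLU": 68, "AlGhafa": 62, "AraAGI": 66,
           "AraTrust": 74, "ACVA": 70, "ReadingComp": 74}

# Assumed fintech weighting: reading comprehension and reasoning matter most
fintech = {"ArabicMMLU": 1, "AlGhafa": 1, "AraAGI": 2,
           "AraTrust": 1, "ACVA": 1, "ReadingComp": 3}

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, round(composite(scores), 2), round(composite(scores, fintech), 2))
```

With these made-up numbers, model A leads on the equal-weight composite while model B leads once reading comprehension and reasoning are up-weighted, which is the kind of ranking flip a single headline score hides.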

Notable models on OALL as of mid-2026: SDAIA's ALLaM (the KSA sovereign LLM, developed under Vision 2030's AI national strategy) consistently ranks highly on ArabicMMLU-adjacent tasks, reflecting its training on KSA-curated Arabic corpora. ALLaM and Jais (from G42 in the UAE) both show strong AlGhafa performance due to their Gulf-centric training data. International models (GPT-4o, Gemini, Claude) score competitively on MSA academic tasks but lag on Khaleeji dialectal evaluation not captured by OALL.

Other Benchmarks Worth Knowing: AraTrust, AraAGI, ACVA

Beyond ArabicMMLU and AlGhafa, the OALL suite includes three additional evaluation dimensions that target different model capabilities:

For Arabic model selection across most KSA enterprise use cases, a reasonable starting point ranks the benchmarks by importance: ACVA (cultural alignment) ≥ ArabicMMLU (academic knowledge) > AlGhafa (cultural depth) > AraTrust (safety) > AraAGI (reasoning). Adjust the ordering for your vertical: healthcare teams should up-weight AraTrust, fintech teams should up-weight AraAGI, and educational platforms should move ArabicMMLU to the top.

Building Custom Arabic Evaluation: A Practical Framework

Public benchmarks measure what the benchmark authors decided to measure. Custom evaluation measures whether your model works for your users. For KSA and GCC products, the gap between these is large enough that teams relying on OALL rankings alone will frequently be surprised by production performance.

A practical custom Arabic evaluation framework has five components:

  1. Production query sampling. Extract a stratified sample of real user queries from your product (or from analogous Arabic-language products in your domain if you are pre-launch). Aim for 500–1,000 queries representative of the query distribution: topic mix, dialect mix, query length distribution, and proportion of code-switched input. This is the single most valuable step — it anchors evaluation to actual user behaviour rather than benchmark distributions.
  2. Native-speaker annotation of ground truth. For each sampled query, have native-speaker annotators with domain expertise produce reference answers or evaluation criteria. For Khaleeji-facing products, annotators must have certified sub-dialect coverage (Najdi, Hejazi, Emirati as relevant). Using MSA-fluent annotators for Khaleeji evaluation is a common error that produces ground truth with the same dialect blind spots you are trying to evaluate against.
  3. Dialect-stratified evaluation. Separate your evaluation set by dialect tier and report metrics per tier. Aggregating MSA and Khaleeji performance hides the gaps that matter commercially. Report: MSA performance, Khaleeji performance, code-switched performance as three distinct numbers — not a combined figure.
  4. PDPL-compliant data handling. If your evaluation data includes real user queries, Saudi PDPL Article 29 restrictions on cross-border data transfer apply. Anonymise user identifiers, strip device and location metadata, and process evaluation data within the GCC or under contractual mechanisms that satisfy PDPL adequacy requirements before sending to annotation vendors.
  5. Blind comparison against baseline. Evaluate your model alongside at least one public baseline (ALLaM, Jais, or a general-purpose international model) on your custom set. This calibrates your custom benchmark against known capability levels and protects against benchmark overfitting on your custom evaluation data.
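Steps 3 and 5 above can be sketched together: score the candidate and a baseline on the same items, then report accuracy per dialect tier instead of one pooled number. The record layout, the tier labels, and the system names `candidate` and `baseline` are assumptions made for this example. The tier labels are expected to come from native-speaker annotation, not from automatic detection.

```python
from collections import defaultdict


def accuracy_by_tier(records, system):
    """Accuracy of one system per dialect tier.

    Each record is a dict such as:
    {"tier": "khaleeji", "correct": {"candidate": True, "baseline": False}}
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["tier"]] += 1
        hits[rec["tier"]] += int(rec["correct"][system])
    return {tier: hits[tier] / totals[tier] for tier in totals}


def tier_report(records, candidate="candidate", baseline="baseline"):
    """Per-tier accuracy for both systems plus the candidate-minus-baseline gap."""
    cand = accuracy_by_tier(records, candidate)
    base = accuracy_by_tier(records, baseline)
    return {
        tier: {
            "candidate": cand[tier],
            "baseline": base[tier],
            "delta": cand[tier] - base[tier],
        }
        for tier in sorted(cand)
    }


# Tiny hypothetical result set: two graded items per tier
records = [
    {"tier": "msa", "correct": {"candidate": True, "baseline": True}},
    {"tier": "msa", "correct": {"candidate": True, "baseline": False}},
    {"tier": "khaleeji", "correct": {"candidate": False, "baseline": True}},
    {"tier": "khaleeji", "correct": {"candidate": True, "baseline": True}},
    {"tier": "code_switched", "correct": {"candidate": False, "baseline": False}},
    {"tier": "code_switched", "correct": {"candidate": True, "baseline": False}},
]

for tier, row in tier_report(records).items():
    print(tier, row)
```

In this toy data the candidate beats the baseline on MSA and code-switched input but trails it on Khaleeji, a pattern that a single pooled accuracy figure would not show. With only a few hundred queries per tier, treat small gaps as noise until they hold up on a larger sample.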

The annotation cost for a 500-query custom Arabic evaluation set with native-speaker reference answers runs approximately AUD 8,000–15,000 (roughly AUD 16–30 per query), depending on dialect complexity and domain specialisation. That is a small fraction of model training costs, and far less than the cost of shipping a model that fails in production and requires emergency fine-tuning and redeployment.

Vision 2030 Context: Why Evaluation Rigour Matters Now

Saudi Arabia's Vision 2030 AI strategy is producing a new class of Arabic AI procurement decisions at scale. SDAIA, the National Center for AI (NCAI), PIF-backed technology companies, and Saudi government ministries are all in active procurement cycles for Arabic LLM capabilities. The evaluation rigour of their vendor selection processes has increased substantially in 2025–2026.

The practical implication for vendors: citing a high OALL score is no longer sufficient for serious KSA enterprise procurement. Procurement teams at major Saudi organisations now typically ask for:

For model developers targeting the KSA market, this means the annotation investment is not just about training data — it is about building the evaluation infrastructure that procurement teams are demanding. Teams that have invested in native-authored, dialect-certified evaluation sets have a material advantage in competitive bid processes.

The emerging standard, as of 2026, is a “dual evaluation” approach: public benchmark scores (OALL composite) as table stakes, plus a client-provided or independently-administered domain-specific evaluation as the differentiating evidence. Annotation vendors that can produce both the training data and the evaluation infrastructure are positioned to support the full Arabic AI development cycle rather than just the labelling phase.

FAQ

What is ArabicMMLU and why does it matter for Arabic LLM evaluation?

ArabicMMLU is a native-authored Arabic multitask benchmark covering 40 subjects calibrated to KSA and GCC curricula, including Islamic jurisprudence subjects absent from translated alternatives. It is the closest Arabic equivalent to MMLU and the primary headline benchmark for Arabic foundation model comparisons.

What is the OALL leaderboard and how is it structured?

OALL (Open Arabic LLM Leaderboard) aggregates results across six Arabic benchmarks — ArabicMMLU, AlGhafa, AraAGI, AraTrust, ACVA, and reading comprehension — into a composite score. It is the primary public ranking for Arabic model comparisons but is MSA-dominant with no dialectal coverage.

Why do translated English benchmarks mislead Arabic model evaluation?

Translated benchmarks retain English syntax patterns, Western cultural references, and diacritical noise from machine translation. They test translation quality as much as Arabic comprehension. Models optimised for translated benchmarks can underperform by 12–20 percentage points on real Saudi user queries compared to their benchmark scores.

What does AlGhafa evaluate that ArabicMMLU doesn't?

AlGhafa focuses on classical Arabic literary and cultural knowledge — pre-Islamic poetry, Arabic proverbs, Arab historical events, and cultural heritage. It is the most discriminating benchmark for models intended for cultural heritage, educational, or government applications where deep Arabic cultural knowledge matters.

How should Vision 2030 organisations approach custom Arabic evaluation?

Sample production queries from your real user distribution, have native-speaker annotators with the relevant sub-dialect coverage (for example Najdi or Hejazi) produce reference answers, report dialect-stratified metrics separately, handle data under PDPL Article 29 requirements, and compare against a public baseline. A 500-query custom evaluation set costs approximately AUD 8,000–15,000, which is small relative to model training and deployment costs.

Can ArabicMMLU scores predict real-world Saudi product performance?

ArabicMMLU is a reliable predictor for academic knowledge tasks in MSA but a weak predictor of dialectal conversational performance. A model at 74% on ArabicMMLU can still fail on Hejazi colloquial customer queries or Gulf code-switched input. Supplement with a custom dialect-specific evaluation set before making deployment decisions for Saudi consumer products.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
