
Where Do Arabic NLP Datasets Come From — and How Do You Build Your Own?

Arabic NLP datasets divide cleanly into two camps: large but dialect-poor public corpora, and small but high-quality custom collections. Understanding which to use — and how to fill the gaps — is the first decision every MENA AI team faces.

26 June 2026 · 13 min read

Direct Answer

Arabic NLP datasets come from three main sources: public corpora (OSCAR, Arabic Gigaword, MADAR), licensed commercial collections, and custom-built datasets using consented native-speaker annotation. Public sources cover Modern Standard Arabic well but underrepresent dialects — Khaleeji, Egyptian, Moroccan Darija. For MENA product AI, building a custom dialect-specific dataset via native-speaker annotation delivers 15–25 percentage-point model accuracy improvements over public-only training data.

The State of Arabic NLP Dataset Coverage in 2026

Arabic is the fifth most spoken language globally with approximately 420 million native speakers, yet it remains dramatically under-resourced in publicly available NLP datasets. A 2023 analysis published in the ACL Anthology estimated that Arabic accounts for less than 1.5% of Common Crawl web data despite representing over 5% of internet users — and that the available Arabic text skews heavily toward Modern Standard Arabic (MSA) used in formal journalism and government communications.

The practical consequence: teams training Arabic NLP models on freely available data are training on a register of Arabic that most of their users never actually speak. Gulf state residents use Khaleeji dialect in conversation and social media. Egyptian users write in Egyptian Arabic. Moroccan users code-switch between Darija and French. None of these dialects are well-represented in any major public corpus.

For teams building Arabic NLP annotation workflows, understanding this gap is foundational. It determines whether your model will behave correctly for your actual user population.

Major Public Arabic NLP Datasets and Their Limitations

The principal public Arabic corpora each solve different problems — and each has a different failure mode for production NLP work.

OSCAR (Open Super-large Crawled Aggregated coRpus) is the most widely used Arabic web crawl dataset, extracting Arabic content from Common Crawl. The 2022 release contains roughly 89 billion Arabic tokens. Volume is high; quality control is minimal. Teams using OSCAR without perplexity filtering and deduplication routinely encounter spam, duplicate content, and Arabic text rendered with incorrect Unicode encoding. It is reasonable as a pre-training base; it is not sufficient as a sole source.
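As a rough illustration of the cleanup described above, here is a minimal sketch of a pre-filter for crawled Arabic text. It covers Unicode normalisation, exact-duplicate removal, and a crude script check only; perplexity filtering needs a language model and is not shown. The function names and the thresholds (`min_arabic_ratio`, `min_chars`) are assumptions chosen for the example, not values taken from OSCAR or any published pipeline.

```python
import hashlib
import re
import unicodedata

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
TATWEEL = "\u0640"  # elongation character, carries no lexical content


def normalize(text: str) -> str:
    """NFKC folds Arabic presentation forms back to base letters."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def arabic_ratio(text: str) -> float:
    """Share of non-space characters that fall in the Arabic block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if ARABIC_CHAR.match(ch)) / len(chars)


def clean_corpus(lines, min_arabic_ratio=0.6, min_chars=20):
    """Yield normalised lines, skipping short, non-Arabic, and duplicate ones."""
    seen = set()
    for raw in lines:
        text = normalize(raw)
        if len(text) < min_chars or arabic_ratio(text) < min_arabic_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalisation
        seen.add(digest)
        yield text
```

Hashing the normalised text catches exact duplicates only; near-duplicate detection (for example MinHash) would be a separate step.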

Arabic Gigaword (LDC) is the most reliable MSA corpus, drawn from major Arabic newswires including Al-Ahram, Al-Hayat, and An-Nahar. Coverage is high-quality for formal written Arabic but limited to a single register — useful for news summarisation, document classification, and formal NER, but poor training data for conversational or social media applications.

MADAR (Multi-Arabic Dialect Applications and Resources) is the most cited dialect corpus, covering 25 Arabic city dialects with parallel translations. However, MADAR was built for dialect identification research, not for downstream NLP task training. Its sentence volume per dialect is too small for fine-tuning production intent classifiers or sentiment models at scale.

CAMeL-DA (Columbia Arabic Morphological Analyser Dialect Annotated) and the Prague Arabic Dependency Treebank (PADT) provide morphological annotation for syntactic tasks but are research-oriented datasets with restricted licensing.

The summary: no public dataset handles Arabic dialects at production NLP scale. Teams that need Khaleeji intent classification, Saudi Arabic sentiment analysis, or Egyptian Arabic entity recognition must build their own or commission custom Arabic data labeling from native-speaker annotation services.

Dialect Balance: Why Your Arabic Corpus Ratios Matter

If you are building a model for Saudi users and your training corpus is 90% Egyptian Arabic and 10% Khaleeji, your model will perform poorly on Saudi inputs — even if both are labelled "Arabic." Dialect imbalance is one of the most commonly missed failure modes in Arabic NLP projects.

According to a 2024 study in the Arabic NLP literature reviewing dialect identification models, top-performing Arabic dialect identification systems still confuse Gulf sub-dialects (Saudi Najdi vs Emirati vs Bahraini) at rates above 30% on short utterances. If automated dialect identification is unreliable, blind mixing of dialect data will corrupt your training signal.

The recommended practice is dialect-stratified corpus design: decide target dialect ratios based on your product's user geography before data collection begins. For a Saudi-first product, a reasonable ratio might be 60% Khaleeji / 20% MSA / 10% Egyptian / 10% Levantine. The Egyptian and Levantine shares are included not because those dialects are the target, but because some cross-dialect coverage limits degradation when speakers of other dialects use the product. Native annotators must confirm dialect assignment for every batch.
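To make the ratio planning concrete, the sketch below turns target percentages into per-dialect sentence quotas and reports how far a collected batch falls short. The 60/20/10/10 split is the example ratio from the text; the function names and dialect keys are hypothetical.

```python
from collections import Counter

# Example split from the text for a Saudi-first product (percentages).
TARGET_PERCENTS = {"khaleeji": 60, "msa": 20, "egyptian": 10, "levantine": 10}


def dialect_quotas(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Convert integer percentages into sentence counts that sum to `total`."""
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    quotas = {d: total * p // 100 for d, p in percents.items()}
    # Integer division can leave a few sentences unassigned;
    # hand them to the largest dialects first.
    leftover = total - sum(quotas.values())
    for d in sorted(percents, key=percents.get, reverse=True)[:leftover]:
        quotas[d] += 1
    return quotas


def shortfall(confirmed_labels, quotas: dict[str, int]) -> dict[str, int]:
    """How many more sentences each dialect needs, given annotator-confirmed labels."""
    counts = Counter(confirmed_labels)
    return {d: max(0, q - counts[d]) for d, q in quotas.items()}
```

For a 12,000-sentence corpus, `dialect_quotas(12000, TARGET_PERCENTS)` gives 7,200 Khaleeji, 2,400 MSA, and 1,200 each of Egyptian and Levantine sentences.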


PDPL, Licensing, and Data Provenance for Arabic Datasets

Saudi Arabia's Personal Data Protection Law (PDPL), administered by SDAIA, imposes consent requirements on any collection of personal data for AI training purposes. Unlike GDPR, which has a public-interest research exemption that can cover some academic NLP data collection, PDPL requires explicit consent for personal data processing — and social media posts and customer service transcripts containing Arabic dialect text are classified as personal data.

Teams collecting Twitter/X Arabic data, scraping Saudi news comment sections, or sourcing WhatsApp conversation data for training must establish a compliant legal basis. The safest approach for commercial AI projects is consented data collection: recruiting native speakers who explicitly agree to their text being used for AI training, with compensation, and with clear data handling terms.

Public dataset licensing is a separate concern. Many Arabic NLP datasets published by universities are licensed for non-commercial research only (e.g. LDC licences, CC-BY-NC-ND). Using LDC Arabic Gigaword in a commercial product without an LDC commercial licence is a licence violation. Teams should audit every component of their training data for commercial-use permission before deployment.
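One lightweight way to run that audit is to keep a manifest of training-data components and flag anything whose licence is not on an explicit allow-list. The sketch below is illustrative only: the component names, licence identifiers, and the allow-list are all made up for the example, and the actual allow-list is a decision for your legal team.

```python
from dataclasses import dataclass

# Hypothetical allow-list: licence identifiers your legal review has
# cleared for commercial use. Anything not listed is flagged.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0", "owned-with-consent"}


@dataclass
class Component:
    name: str
    licence_id: str


def audit(components):
    """Return the names of components that are not cleared for commercial use."""
    return [c.name for c in components if c.licence_id not in COMMERCIAL_OK]


# Made-up manifest for illustration.
manifest = [
    Component("web_crawl_subset", "CC0-1.0"),
    Component("newswire_corpus", "research-only"),
    Component("consented_chat_logs", "owned-with-consent"),
    Component("university_dialect_set", "CC-BY-NC-ND-4.0"),
]
flagged = audit(manifest)
```

Denying by default means an unrecognised or misspelled licence string gets flagged for review rather than silently passing.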

For teams subject to UAE PDPL, Qatar Law No. 13, or Bahrain's PDPL — which share GDPR-adjacent frameworks — the consent requirements are similar in substance. Cross-border transfer restrictions are more complex: training data collected in Saudi Arabia and processed on compute infrastructure outside the Kingdom requires attention to PDPL data localisation provisions.

How to Build a Custom Arabic NLP Dataset: The Process

Custom Arabic dataset construction follows a five-stage process that differs from generic NLP data collection in several important ways.

Stage 1 — Corpus Design. Define dialect mix, domain coverage, and task schema before any data is collected. For intent classification, enumerate every intent class and the minimum examples per class (typically 200–500 for fine-tuning). For NER, define entity types, ambiguous boundary rules, and the handling of transliterated foreign names — a significant category in Gulf Arabic business text.
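A small check along these lines can catch under-filled intent classes before annotation scales up. This is a sketch under assumptions: the intent names are invented, and the default minimum of 200 is simply the lower end of the range quoted above.

```python
from collections import Counter


def class_gaps(schema, labels, min_per_class=200):
    """Return {intent: examples still needed} for every class below the minimum.

    `schema` is the full list of intent classes, so classes with zero
    examples are reported too.
    """
    counts = Counter(labels)
    unknown = set(counts) - set(schema)
    if unknown:
        raise ValueError(f"labels outside the schema: {sorted(unknown)}")
    return {
        intent: min_per_class - counts[intent]
        for intent in schema
        if counts[intent] < min_per_class
    }
```

For example, with a schema of `track_order`, `refund`, and `cancel`, and a batch containing 250 `track_order` and 120 `refund` labels, the function reports that `refund` needs 80 more examples and `cancel` needs 200.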

Stage 2 — Source Identification and Consent. Identify text sources — customer service logs, social media posts, survey responses, broadcast media transcripts — and establish legal basis. Consented speaker recruitment is the most reliable route for dialect-specific collections. Annotation vendors who specialise in Arabic typically maintain panels of dialect-verified native speakers with standing consent agreements.

Stage 3 — Annotation Guideline Development. Arabic annotation guidelines require more detail than English equivalents. They must specify how to handle: diacritics on classical Arabic loanwords, code-switching between Arabic and English or French, transliterated brand names, numerals (Arabic-Indic vs Western Arabic), and ambiguous dialect tokens that appear in multiple dialects with different meanings.
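Several of these guideline decisions can be backed by small, deterministic helpers so that annotators and preprocessing agree. The sketch below shows three: mapping Arabic-Indic digits to Western digits, stripping diacritics, and flagging Arabic/Latin code-switching. Whether to normalise or preserve each feature is a guideline choice; the helper names here are hypothetical.

```python
import re

# Arabic-Indic digits (U+0660 to U+0669) and the Eastern Arabic-Indic
# variants (U+06F0 to U+06F9) mapped to Western digits 0-9.
_DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
_DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})

# Short-vowel and related marks (U+064B to U+0652) plus superscript alef.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

_LATIN = re.compile(r"[A-Za-z]")
_ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def to_western_digits(text: str) -> str:
    return text.translate(_DIGIT_MAP)


def strip_diacritics(text: str) -> str:
    return _DIACRITICS.sub("", text)


def is_code_switched(text: str) -> bool:
    """True when a sentence mixes Arabic-script and Latin-script letters."""
    return bool(_LATIN.search(text) and _ARABIC_LETTER.search(text))
```

A script-mixing flag like `is_code_switched` only detects Latin-script insertions such as English product names; code-switching written entirely in Arabic script still needs a native-speaker judgement.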

Stage 4 — Pilot Annotation and IAA Measurement. Run a pilot batch of 200–500 records with at least three native-speaker annotators per record. Calculate inter-annotator agreement: Cohen's kappa when comparing two annotators, Fleiss' kappa when three or more annotators label each record, as in this pilot design. For Arabic NER, a pilot kappa below 0.75 indicates guideline ambiguity that must be resolved before scaling. For sentiment, below 0.70 is a warning signal.
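For reference, Fleiss' kappa can be computed in a few lines of plain Python. The sketch below assumes every record was labelled by the same number of annotators and that the input is a list of per-record label lists; the function name and input shape are choices made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-record label lists.

    Every record must carry the same number of labels (one per annotator).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each record needs the same number (>= 2) of labels")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: average pairwise agreement within each record.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall label distribution.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        return 1.0  # only one label was ever used; agreement is trivially perfect
    return (p_observed - p_expected) / (1 - p_expected)
```

Comparing the result against the thresholds above (0.75 for NER, 0.70 for sentiment) gives a simple go/no-go signal for the pilot.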

Stage 5 — Production Annotation with Rolling QA. Scale to full volume with a gold standard sample (typically 5–10% of records re-annotated by a senior reviewer). Track per-annotator accuracy against the gold standard and route borderline records to adjudication. Report final inter-annotator agreement alongside the dataset as a quality credential — particularly important for research partnerships and regulatory submissions.
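The rolling QA step can be sketched as two small functions: one scores each annotator against the gold sample, the other lists records to send to adjudication. Treating "borderline" as "annotators disagree on the label" is an assumption made for this example, as are the tuple layout and function names.

```python
from collections import defaultdict


def annotator_accuracy(annotations, gold):
    """Accuracy per annotator on records that have a gold label.

    `annotations` is an iterable of (record_id, annotator_id, label);
    `gold` maps record_id to the senior reviewer's label.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for record_id, annotator, label in annotations:
        if record_id not in gold:
            continue  # record is outside the gold sample
        seen[annotator] += 1
        hits[annotator] += int(label == gold[record_id])
    return {a: hits[a] / seen[a] for a in seen}


def needs_adjudication(annotations):
    """Record ids where annotators did not all choose the same label."""
    labels = defaultdict(set)
    for record_id, _, label in annotations:
        labels[record_id].add(label)
    return sorted(rid for rid, seen_labels in labels.items() if len(seen_labels) > 1)
```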

Case Study: Building a Saudi E-Commerce Chatbot Intent Dataset

A Saudi Arabian e-commerce platform came to AI Taggers with an intent classifier that performed at 61% accuracy on real customer messages — despite having been trained on a 40,000-example dataset sourced from a translated English intent corpus and augmented with MSA-labelled examples.

The diagnosis was straightforward: the existing dataset contained almost no Khaleeji dialect text. Real Saudi customer messages used Gulf-specific vocabulary ("أبغى" for "أريد," diminutive constructions, code-switching with English product names), and the classifier had never seen these patterns in training.
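A diagnosis like this can be roughly approximated in code by measuring how often known dialect markers appear in a dataset. The sketch below uses only the marker cited above ("أبغى"); a real lexicon would be curated by native speakers and be much larger. The orthographic folding (alef variants, final alef maqsura) is an assumption about how users spell the word informally.

```python
import re

_ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")  # hamza and madda forms of alef


def fold(token: str) -> str:
    """Fold common spelling variants so informal spellings still match."""
    token = _ALEF_VARIANTS.sub("\u0627", token)
    return token.replace("\u0649", "\u064A")  # alef maqsura -> ya


# Illustrative one-word lexicon taken from the example in the text.
KHALEEJI_MARKERS = {fold("أبغى")}


def marker_rate(sentences, markers=KHALEEJI_MARKERS):
    """Fraction of sentences containing at least one dialect marker token."""
    if not sentences:
        return 0.0
    hits = sum(
        1 for s in sentences if any(fold(tok) in markers for tok in s.split())
    )
    return hits / len(sentences)
```

Comparing the marker rate of the training set with that of real customer messages gives a quick, if crude, signal of dialect mismatch; it does not replace native-speaker review.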

The project involved three stages over six weeks. In week one, we designed a dialect-stratified corpus: 12,000 sentences targeting Khaleeji Saudi dialect across 28 intent classes, sourced from consented customer message archives with personal identifiers removed. In weeks two and three, native Najdi and Hijazi Arabic annotators completed annotation with a per-class minimum of 200 examples and a pilot kappa of 0.83. In weeks four through six, the annotated dataset was used to fine-tune an AraBERT base model, with held-out validation on a separate 2,000-sentence test set of real customer messages.

Results: intent classification accuracy improved from 61% to 84% on the real-message test set — a 23 percentage-point gain. False escalations to human agents dropped by 41%, saving an estimated AUD $4,200 per month in agent-handling costs. The client subsequently expanded the program to add Egyptian Arabic and Levantine dialect coverage for their regional expansion.

This outcome pattern — large gains from dialect-correct annotation that a translated or MSA corpus cannot deliver — is consistent across the Arabic NLP projects we run. Related reading: Arabic Sentiment Analysis: The Complete Guide for MENA AI Teams covers how the same dialect problem manifests in sentiment tasks.

Fine-Tuning vs Training From Scratch: What Arabic Dataset Volume You Actually Need

A common misconception in Arabic NLP projects is that you need millions of annotated examples. You don't — if you are fine-tuning a strong pre-trained Arabic language model.

Strong base models exist for Arabic in 2026. AraBERT (Arabic BERT), CAMeL-BERT, GigaBERT-v4, and AraGPT2 are all publicly available and encode substantial Arabic language knowledge from pre-training on hundreds of gigabytes of MSA text. For many production tasks, fine-tuning these base models on 5,000–20,000 high-quality dialect-annotated examples outperforms training a smaller model from scratch on millions of lower-quality records.

The key insight from published Arabic NLP benchmarks is that quality beats volume at the fine-tuning stage. A 2024 comparison published at the ArabicNLP workshop showed that fine-tuning AraBERT on 8,000 Khaleeji-dialect sentiment examples produced higher F1 scores on Gulf social media evaluation sets than fine-tuning on 50,000 mixed-dialect examples from OSCAR. The quality of dialect matching between training and evaluation data matters more than raw sentence count.

For teams evaluating Arabic LLMs, our post on Arabic LLM Evaluation: ArabicMMLU, AlGhafa, and Building Custom Benchmarks covers how to measure model capability in a way that surfaces real dialect weaknesses rather than MSA benchmark inflation.

When to Commission Custom Arabic Annotation vs Use Public Data

The decision framework is simpler than it looks. Use public corpora for pre-training and domain-agnostic feature learning. Commission custom annotation for any task where your end users speak a specific dialect, come from a specific geography, or use domain vocabulary that formal MSA text does not contain.

Concretely: if you are building an Arabic news classifier or MSA document summariser, public corpora are likely adequate. If you are building a Saudi government chatbot, a GCC banking NLP system, or an Egyptian e-commerce search model, you need custom dialect-annotated data. There is no middle ground — a model trained on the wrong dialect will fail in production regardless of how much data it saw.

Our Arabic NLP annotation service covers the full pipeline: dialect-stratified corpus design, consented native-speaker data collection, multi-pass annotation with IAA measurement, PDPL-compliant data handling, and export in your required format (JSONL, CoNLL-2003, Hugging Face Datasets). For teams beginning a new Arabic AI project, our Arabic Data Annotation Guide for Saudi and GCC AI Teams provides the full strategic framework.
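For readers unfamiliar with these export formats, here is a minimal sketch of writing intent records as JSONL and NER sentences in a simplified CoNLL-style layout. Note the assumptions: the field names (`text`, `intent`) are invented, and the two-column token/tag layout is a common simplification, whereas the original CoNLL-2003 files carry four columns.

```python
import json


def to_jsonl(records) -> str:
    """One JSON object per line; ensure_ascii=False keeps Arabic text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def to_conll(sentences) -> str:
    """Two-column token/tag lines, with a blank line between sentences.

    `sentences` is a list of sentences, each a list of (token, BIO tag) pairs.
    """
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"
```

Keeping `ensure_ascii=False` matters in practice: with the default setting, Arabic text is written as `\uXXXX` escapes, which is valid JSON but hard to review by eye.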

Frequently Asked Questions

What are the best publicly available Arabic NLP datasets?

The most widely used public Arabic NLP datasets include OSCAR (filtered Common Crawl Arabic), the Arabic Gigaword corpus (newswire), MADAR (25 Arabic city dialects), PADT (Prague Arabic Dependency Treebank), and ArSentD-LEV (Levantine sentiment). Most skew heavily toward Modern Standard Arabic. Dialect coverage, particularly for Khaleeji, Moroccan Darija, and Sudanese Arabic, remains sparse in public sources.

Can I use Common Crawl data to train an Arabic NLP model?

Common Crawl Arabic data exists in large volume but comes with significant quality problems: high noise from HTML artefacts, spam, and auto-generated content; near-exclusive MSA focus with minimal dialect coverage; and inconsistent Unicode normalisation. Teams typically apply aggressive quality filters before using it. For production models targeting dialect speakers, Common Crawl alone will not produce a performant model without supplementary dialect-specific annotation.

Does Saudi Arabia's PDPL restrict training data collection?

Yes. Saudi Arabia's Personal Data Protection Law (PDPL), enforced by SDAIA, requires consent before collecting personal data for AI training. Text data that can identify individuals (social media posts, customer service transcripts, medical records) falls under PDPL. Compliant approaches include consented data collection programs, synthetic data generation, and working with annotation vendors who have established PDPL-compliant pipelines.

How many annotated examples does an Arabic NLP model need?

For fine-tuning pre-trained Arabic models (AraBERT, CAMeL-BERT): intent classification typically needs 200–500 annotated examples per class; NER needs 5,000–15,000 labelled sentences; sentiment classification needs 2,000–10,000 examples. High-quality dialect-matched data at these volumes consistently outperforms larger volumes of mixed or MSA data.

What makes Arabic dialect annotation different from MSA annotation?

MSA annotation follows a consistent grammar and vocabulary standard. Dialect annotation cannot: Khaleeji, Egyptian, Levantine, and Moroccan Darija each have distinct vocabularies, phonological patterns, and code-switching behaviours. Automated dialect identification is unreliable on mixed-dialect inputs. Native-speaker annotators from the specific dialect region are the only reliable quality control mechanism.

How long does it take to build a custom Arabic NLP dataset?

A 10,000-sentence NER dataset for a single Arabic dialect using consented collection typically takes 4–8 weeks: 1–2 weeks for data sourcing and consent, 1–2 weeks for annotation guideline development and pilot, and 2–4 weeks for production annotation and QA. Sentiment classification datasets run 2–4 weeks at similar volume.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
