By the end of 2026, every serious GCC enterprise will have an Arabic LLM strategy. Some will train. Most will fine-tune. A few will rebrand existing models with Arabic adapters and call it a day. All of them will run into the same wall: training data.
This guide is for the people inside that strategy — the ML engineers and applied research leads who actually have to make the dataset choices. We'll cover pre-training corpora, SFT data design, RLHF preference modelling, eval benchmark selection, and the specific dataset mistakes that we see derail Arabic LLM projects in the wild.
The Four Data Layers of an Arabic LLM
A modern LLM training stack spans four data layers. Each has Arabic-specific requirements that English-trained instincts get wrong.
- Pre-training corpus — the raw text scale that teaches the model language.
- SFT (supervised fine-tuning) — instruction-response pairs that teach the model to follow tasks.
- RLHF / DPO preference data — human preference judgments that align the model with human taste.
- Evaluation benchmarks — held-out tests that tell you whether any of the above is working.
Layer 1: Pre-Training Corpus
For continued pre-training on top of an English-strong base model (the most common path for GCC teams), 10-50 billion Arabic tokens of clean, deduplicated, register-balanced data is the realistic target. For a from-scratch foundation model with Arabic as a primary language, you're looking at 500B+ tokens.
The hard part is the word "clean". Open Arabic corpora (OSCAR, CC100, mC4) have recurring quality problems:
- Machine-translated content masquerading as native Arabic
- Repeated boilerplate from Arabic news syndication
- Heavy register skew toward news and away from conversational Arabic
- Misclassified Arabic dialect samples (Maghrebi mixed in with MSA, Egyptian labelled as MSA)
- Sensitive religious content with inconsistent handling
A production-grade Arabic pre-training corpus is built, not downloaded. The minimum bar:
- Source diversity: news, books, government, academic, conversational, social media — with deliberate ratios, not whatever Common Crawl happens to surface
- Dialect distribution: 70/30 MSA-to-dialect by default; adjust based on product target
- Deduplication: exact + near-duplicate at the document and paragraph level
- Quality filtering: classifier-based filtering for translationese, broken encoding, ad-heavy content
- PII scrubbing: non-negotiable if you're going anywhere near GCC personal data
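To make the deduplication bullet concrete, here is a minimal sketch of exact plus near-duplicate removal at the paragraph level. It is an illustration under our own assumptions, not a production pipeline: the function names, the 3-word shingles and the 0.8 Jaccard threshold are all hypothetical choices, and the pairwise comparison would have to be swapped for MinHash/LSH at corpus scale.

```python
import hashlib
import re

# Arabic diacritics (tashkeel, U+064B-U+0652) plus tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda / hamza above / hamza below, folded to bare alef.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Light normalization so trivially different copies compare equal."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return _WHITESPACE.sub(" ", text).strip()


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the normalized text, used for near-duplicate checks."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe_paragraphs(paragraphs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by shingle overlap.

    Assumption: O(n^2) comparison is acceptable for a small sample only.
    """
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for para in paragraphs:
        digest = hashlib.sha256(normalize(para).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        current = shingles(para)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a paragraph we already kept
        kept.append(para)
        kept_shingles.append(current)
    return kept
```

The same function can be run on whole documents first and on paragraphs second. Syndicated news boilerplate tends to show up at the paragraph pass.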
Layer 2: Supervised Fine-Tuning (SFT)
SFT is where Arabic LLM projects either find their footing or quietly become demo-only. Translating an English SFT dataset (Alpaca, Dolly, even high-quality datasets like Tulu) produces models that follow instructions in Arabic but feel like they're reading from a script. The model never learned how Arabic speakers actually phrase requests.
A strong Arabic SFT dataset is 10K-50K records and covers eight task families:
- Open-ended Q&A grounded in MENA context
- Summarisation of Arabic news, documents, and conversations
- Classification with Arabic-native label schemas
- Code generation with Arabic problem statements
- Reasoning chains in Arabic (not translated chain-of-thought)
- Multi-turn dialogue in target dialect
- Safety refusals calibrated to GCC cultural and religious norms
- Knowledge grounding for KSA, UAE and broader MENA facts
Quality over quantity is not a cliché here — it's the result. We have repeatedly seen 20K hand-crafted native pairs outperform 200K machine-translated examples on every downstream eval. The cost difference is real (native SFT pairs run $3-8 each at production quality) but the model gap is much larger than the cost gap.
SFT cost-quality reality check:
A 20K Arabic SFT dataset runs roughly $60K-$160K at native quality. That's less than two months of one senior ML engineer. If your Arabic LLM project budget can't support native SFT, you're building a model you won't be able to ship.
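The arithmetic behind that range is simple enough to put in a budgeting script. The sketch below only multiplies record count by the $3-8 per-pair range quoted above; the function name and defaults are ours, and it ignores project management, tooling and rework.

```python
def sft_cost_range(
    num_pairs: int,
    low_per_pair: float = 3.0,
    high_per_pair: float = 8.0,
) -> tuple[float, float]:
    """Return the (low, high) annotation cost in USD for native SFT pairs."""
    if num_pairs < 0:
        raise ValueError("num_pairs must be non-negative")
    return num_pairs * low_per_pair, num_pairs * high_per_pair


low, high = sft_cost_range(20_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $60,000 to $160,000
```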
Layer 3: RLHF and Preference Data
RLHF (or its lighter-weight cousin DPO) teaches the model what humans prefer. That preference signal has to come from humans whose preferences match your target users. This is where translated preference data falls apart catastrophically.
If you collect preferences from English speakers comparing Arabic outputs, you get a model that produces Arabic in the shape English speakers prefer. The vocabulary is wrong. The register skews formal where it should be casual. The greetings feel scripted. Saudi users notice immediately. Egyptian users notice immediately. Even users who don't consciously identify the problem rate the experience as "off".
Arabic-native preference data needs three things:
- Annotators who are native speakers of the target dialect (or MSA, depending on use case)
- Rubrics that capture Arabic-specific quality signals (register appropriateness, dialect consistency, cultural fluency, religious sensitivity)
- Multi-turn preference judgments, not just single-turn ratings — Arabic conversational coherence breaks across turns in distinct ways
Volume target: 5K-30K preference pairs for a competent reward model. The cost is meaningful ($10K-$240K range), which is why we recommend scoping with a 1K-pair pilot before committing to the full run.
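One way to keep those three requirements visible in the data itself is to bake them into the record schema. The sketch below is a hypothetical layout, not a standard: the class and field names are ours, the rubric dimensions are the four listed above, and the `prompt` / `chosen` / `rejected` keys follow a convention that many DPO training setups accept (check what your trainer expects).

```python
from dataclasses import dataclass, field

RUBRIC_DIMENSIONS = (
    "register_appropriateness",
    "dialect_consistency",
    "cultural_fluency",
    "religious_sensitivity",
)


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class PreferencePair:
    dialect: str                 # e.g. "khaleeji", "egyptian", "msa"
    context: list[Turn]          # full conversation so far, not just the last turn
    response_a: str
    response_b: str
    preferred: str               # "a" or "b", chosen by a native annotator
    annotator_id: str
    # Optional per-dimension winner, e.g. {"dialect_consistency": "b"}.
    rubric_winner: dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        if self.preferred not in ("a", "b"):
            raise ValueError("preferred must be 'a' or 'b'")
        unknown = set(self.rubric_winner) - set(RUBRIC_DIMENSIONS)
        if unknown:
            raise ValueError(f"unknown rubric dimensions: {sorted(unknown)}")

    def to_dpo_example(self) -> dict[str, str]:
        """Flatten the multi-turn context into a prompt/chosen/rejected record."""
        self.validate()
        prompt = "\n".join(f"{t.role}: {t.content}" for t in self.context)
        if self.preferred == "a":
            chosen, rejected = self.response_a, self.response_b
        else:
            chosen, rejected = self.response_b, self.response_a
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Storing the dialect and annotator ID on every pair also makes it possible to audit the 1K-pair pilot for coverage before scaling up.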
Layer 4: Evaluation Benchmarks
If you're measuring your Arabic model on translated English benchmarks, you're measuring the wrong thing. The benchmarks you need for production-credible Arabic LLM evaluation:
- ArabicMMLU — general academic knowledge. Useful but, like its English parent, doesn't correlate strongly with real product quality.
- AlGhafa — MENA cultural and contextual knowledge. Stronger signal for products targeting Arab users.
- OALL (Open Arabic LLM Leaderboard) suite — broader Arabic-focused eval ensemble.
- Dialect-identification eval — does the model output in the requested dialect when asked, and does it stay there across turns?
- Custom internal benchmark — this is the one that matters most. Hand-write 200-500 eval items in your target dialect that reflect your actual product use cases. Run this on every model checkpoint.
We help teams build that custom internal benchmark as part of our Arabic NLP dataset service. It is the highest-leverage eval investment most teams skip.
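As a rough sketch of what "run this on every model checkpoint" can look like in code, the harness below scores a checkpoint against hand-written items and reports pass rates overall and per dialect. Everything here is an assumption for illustration: `EvalItem`, `run_benchmark` and the per-item `passes` check are hypothetical names, and `generate` stands in for whatever inference call your stack exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalItem:
    item_id: str
    prompt: str
    dialect: str                      # target dialect of this item
    passes: Callable[[str], bool]     # item-specific check on the model output


def run_benchmark(
    generate: Callable[[str], str],
    items: list[EvalItem],
) -> dict[str, float]:
    """Return the pass rate overall and for each dialect in the item set."""
    totals: dict[str, int] = {}
    passed: dict[str, int] = {}
    for item in items:
        ok = item.passes(generate(item.prompt))
        for key in ("overall", item.dialect):
            totals[key] = totals.get(key, 0) + 1
            passed[key] = passed.get(key, 0) + int(ok)
    return {key: passed[key] / totals[key] for key in totals}
```

Logging the returned dictionary per checkpoint gives a simple regression signal: a new checkpoint that improves the overall rate but drops one dialect is easy to spot.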
The 5 Pitfalls That Derail Arabic LLM Projects
From the projects we see in scoping calls and pilots, these are the recurring failure modes:
Pitfall 1: The translation trap
Building the entire data stack on translated English data, then being surprised when the model feels foreign. Symptom: passes evals, fails with real users.
Pitfall 2: Dialect blindness
Training on "Arabic" as if it were one language, then discovering the chatbot speaks Khaleeji to Cairo users and MSA to Riyadh teenagers.
Pitfall 3: Eval shopping
Cherry-picking benchmarks that flatter the model and ignoring the ones that don't. Particularly common in launch posts — beware of any Arabic LLM claim that doesn't show OALL or AlGhafa numbers.
Pitfall 4: Single-pass SFT
Annotating SFT with one annotator per record and no adjudication. The label noise rate is high enough to materially degrade SFT effectiveness.
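A minimal sketch of the alternative, assuming each record gets an accept/reject style label from two independent annotators: measure agreement with Cohen's kappa and send every disagreement to a third person for adjudication. The function names are ours, and the label set is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labelled at random
    # with their own observed label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def needs_adjudication(labels_a: list[str], labels_b: list[str]) -> list[int]:
    """Indices of records where the two annotators disagree."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


annotator_1 = ["accept", "accept", "reject", "accept"]
annotator_2 = ["accept", "reject", "reject", "accept"]
print(cohens_kappa(annotator_1, annotator_2))        # 0.5
print(needs_adjudication(annotator_1, annotator_2))  # [1]
```

A low kappa on a pilot batch usually points at ambiguous guidelines rather than careless annotators, so it is worth checking before the full run.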
Pitfall 5: Compliance afterthought
Building the dataset on offshore annotation, then realising you can't legally use it for KSA enterprise customers. Solving this after the fact costs more than getting it right upstream.
Build vs Buy: When Each Makes Sense
For pre-training corpora, build with a specialist partner — the volume is too large and the cleaning too specialised for in-house teams to do efficiently. For SFT and RLHF, the right answer is almost always "buy from a partner who works with your team on annotation guidelines". You provide the rubric and acceptance criteria; the partner provides the native-speaker workforce, QA, and dual annotation.
For evaluation benchmarks, build with the partner but keep the test set private. The eval benchmark is your moat against benchmark contamination and your only honest signal on real product quality. Treat it like your model weights.
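Keeping the test set private is half of the job; the other half is checking that its items have not leaked into your training data. A simple, admittedly blunt way to do that is a word n-gram overlap check, sketched below. The function names and the 8-word window are our assumptions, and both sides should be normalized the same way before comparison.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous word n-grams in the text (empty if the text is shorter)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_items(
    eval_items: dict[str, str],
    training_docs: list[str],
    n: int = 8,
) -> list[str]:
    """Return IDs of eval items that share any n-gram with the training docs."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= word_ngrams(doc, n)
    return sorted(
        item_id
        for item_id, text in eval_items.items()
        if word_ngrams(text, n) & train_ngrams
    )
```

Any item this flags should be rewritten or retired, since a score on it no longer says anything about generalisation.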
Scoping an Arabic LLM dataset?
We've scoped SFT, RLHF and eval datasets for several MENA AI teams. Book a 30-minute scoping call and we'll work through dataset size, dialect mix, budget and timeline together.
Related Reading
- The complete guide to Arabic data annotation: the broader playbook
- Khaleeji vs MSA dialect strategy: getting your training data mix right
- Saudi Arabia AI boom and Vision 2030: market context
- Our Arabic NLP dataset service
FAQ
How much Arabic pre-training data do I need?
10-50B tokens for continued pre-training on top of an English-strong base. 500B+ for a from-scratch Arabic-primary foundation model.
What's a reasonable SFT dataset size?
10K-50K high-quality native pairs. 20K hand-crafted native pairs outperform 200K machine-translated examples on every benchmark we've measured.
ArabicMMLU vs AlGhafa — which one?
Both, plus a custom internal benchmark. ArabicMMLU is academic knowledge; AlGhafa is MENA cultural context. Neither alone is enough.
Can I use translated RLHF data?
No. Preference signal must come from native speakers in your target dialect, or the model learns English speakers' preferences expressed in Arabic.
What dialect mix should I target?
Default to 70% MSA / 30% dialect. For a KSA-targeted product, weight the dialect 30% heavily toward Khaleeji. For a pan-Arab consumer product, spread it across Egyptian, Levantine and Khaleeji.
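To turn a mix like this into token targets, a small helper is enough. The sketch below uses integer arithmetic to avoid rounding surprises; the function name is ours, and the 3:1:1 Khaleeji-heavy weighting in the example is a hypothetical illustration, not a recommendation.

```python
def token_budget(
    total_tokens: int,
    dialect_weights: dict[str, int],
    msa_percent: int = 70,
) -> dict[str, int]:
    """Split a token budget into MSA plus weighted dialect shares.

    Dialect shares are rounded down, so the parts may sum to slightly
    less than total_tokens.
    """
    if not 0 <= msa_percent <= 100:
        raise ValueError("msa_percent must be between 0 and 100")
    weight_sum = sum(dialect_weights.values())
    if weight_sum <= 0:
        raise ValueError("dialect weights must sum to a positive number")
    msa_tokens = total_tokens * msa_percent // 100
    dialect_tokens = total_tokens - msa_tokens
    budget = {"msa": msa_tokens}
    for name, weight in dialect_weights.items():
        budget[name] = dialect_tokens * weight // weight_sum
    return budget


# Hypothetical KSA-targeted mix on a 20B-token continued pre-training run.
ksa_mix = token_budget(
    20_000_000_000,
    {"khaleeji": 3, "egyptian": 1, "levantine": 1},
)
print(ksa_mix["msa"])       # 14000000000
print(ksa_mix["khaleeji"])  # 3600000000
```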
Scope your Arabic LLM dataset
Tell us your target dialect mix, dataset size and timeline. We'll respond with a tailored scoping doc within 24 hours.
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.