What's a reasonable SFT dataset size for Arabic instruction tuning?

10K-50K high-quality native-Arabic instruction-response pairs is a strong starting point for SFT. Quality matters more than quantity: 20K hand-crafted native pairs outperform 200K machine-translated examples on every benchmark we've seen.

Should I evaluate on ArabicMMLU or AlGhafa?

Use both, plus a custom internal benchmark. ArabicMMLU measures general academic knowledge, AlGhafa covers MENA cultural context. Neither alone tells the full story. A KSA-targeted product also needs Saudi-specific eval items your annotators write fresh.

Can I use translated English RLHF data for Arabic?

No. RLHF teaches the model what humans prefer. If your preference data was rated by English speakers on translated outputs, the model learns English speakers' preferences expressed in Arabic — which produces models that feel foreign to Arabic users. RLHF must come from native Arabic speakers.

What's the right dialect mix for my Arabic LLM training data?

Default is roughly 70% MSA, 30% spread across target dialects. A KSA-targeted product weights heavily toward Khaleeji within that 30%. A pan-Arab consumer product distributes more evenly across Egyptian, Levantine and Khaleeji. Maghrebi and Iraqi require explicit targeting if they're priorities.

How to Build an Arabic LLM: Training Data & Pitfalls

By the end of 2026, every serious GCC enterprise will have an Arabic LLM strategy. Some will train. Most will fine-tune. A few will rebrand existing models with Arabic adapters and call it a day. All of them will run into the same wall: training data.

This guide is for the people inside that strategy — the ML engineers and applied research leads who actually have to make the dataset choices. We'll cover pre-training corpora, SFT data design, RLHF preference modelling, eval benchmark selection, and the specific dataset mistakes that we see derail Arabic LLM projects in the wild.

The Four Data Layers of an Arabic LLM

Modern LLM training draws on four data layers. Each has Arabic-specific requirements that English-trained instincts get wrong.

Pre-training corpus — the raw text scale that teaches the model language.
SFT (supervised fine-tuning) — instruction-response pairs that teach the model to follow tasks.
RLHF / DPO preference data — human preference judgments that align the model with human taste.
Evaluation benchmarks — held-out tests that tell you whether any of the above is working.

Layer 1: Pre-Training Corpus

For continued pre-training on top of an English-strong base model (the most common path for GCC teams), 10-50 billion Arabic tokens of clean, deduplicated, register-balanced data is the realistic target. For a from-scratch foundation model with Arabic as a primary language, you're looking at 500B+ tokens.

The hard part is the word "clean". Open Arabic corpora (OSCAR, CC100, mC4) contain extensive issues:

Machine-translated content masquerading as native Arabic
Repeated boilerplate from Arabic news syndication
Heavy register skew toward news and away from conversational Arabic
Misclassified Arabic dialect samples (Maghrebi mixed in with MSA, Egyptian labelled as MSA)
Sensitive religious content with inconsistent handling

A production-grade Arabic pre-training corpus is built, not downloaded. The minimum bar:

Source diversity: news, books, government, academic, conversational, social media — with deliberate ratios, not whatever Common Crawl happens to surface
Dialect distribution: 70/30 MSA-to-dialect by default; adjust based on product target
Deduplication: exact + near-duplicate at the document and paragraph level
Quality filtering: classifier-based filtering for translationese, broken encoding, ad-heavy content
PII scrubbing: non-negotiable if you're going anywhere near GCC personal data
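
To make the deduplication item concrete, here is a minimal Python sketch of paragraph-level exact and near-duplicate removal. It is an illustration under stated assumptions, not a production pipeline: the function names, the 5-character shingle size and the 0.8 Jaccard threshold are hypothetical choices, and normalisation is applied only for comparison, never to the text you keep.

```python
import hashlib
import re
import unicodedata

# Arabic short-vowel marks (U+064B-U+0652), superscript alef (U+0670), tatweel (U+0640).
_MARKS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def normalize(text: str) -> str:
    """Normalise text for duplicate detection only; the original is what gets kept."""
    text = unicodedata.normalize("NFKC", text)
    text = _MARKS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams; short texts yield a single shingle."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_paragraphs(paragraphs, threshold: float = 0.8):
    """Keep the first copy of each paragraph, dropping exact and near duplicates."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for paragraph in paragraphs:
        norm = normalize(paragraph)
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        current = shingles(norm)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept_shingles.append(current)
        kept.append(paragraph)
    return kept
```

The pairwise comparison here is quadratic, so it only suits small samples. At 10-50B tokens you would likely swap the inner loop for MinHash with locality-sensitive hashing, and run the same logic at document level as well.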

Layer 2: Supervised Fine-Tuning (SFT)

SFT is where Arabic LLM projects either find their footing or quietly become demo-only. Translating an English SFT dataset (Alpaca, Dolly, even high-quality datasets like Tulu) produces models that follow instructions in Arabic but feel like they're reading from a script. The model never learned how Arabic speakers actually phrase requests.

A strong Arabic SFT dataset is 10K-50K records and covers eight task families:

Open-ended Q&A grounded in MENA context
Summarisation of Arabic news, documents, and conversations
Classification with Arabic-native label schemas
Code generation with Arabic problem statements
Reasoning chains in Arabic (not translated chain-of-thought)
Multi-turn dialogue in target dialect
Safety refusals calibrated to GCC cultural and religious norms
Knowledge grounding for KSA, UAE and broader MENA facts

Quality over quantity is not a cliché here — it's the result. We have repeatedly seen 20K hand-crafted native pairs outperform 200K machine-translated examples on every downstream eval. The cost difference is real (native SFT pairs run $3-8 each at production quality) but the model gap is much larger than the cost gap.

SFT cost-quality reality check:

A 20K Arabic SFT dataset runs roughly $60K-$160K at native quality. That's less than two months of one senior ML engineer. If your Arabic LLM project budget can't support native SFT, you're building a model you won't be able to ship.
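
The arithmetic behind that range is simple enough to script when you are comparing dataset sizes. This is a minimal sketch using the $3-8 per-pair figure quoted above; the function name and defaults are ours for illustration, and your actual vendor rates may differ.

```python
def sft_cost_range(n_pairs: int, low_per_pair: float = 3.0, high_per_pair: float = 8.0):
    """Return (low, high) total cost in USD for a native SFT dataset."""
    return n_pairs * low_per_pair, n_pairs * high_per_pair


for size in (10_000, 20_000, 50_000):
    low, high = sft_cost_range(size)
    print(f"{size:>6} pairs: ${low:,.0f} - ${high:,.0f}")
```

At 20K pairs this reproduces the $60K-$160K figure; the same function shows how the budget scales across the 10K-50K band.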

Layer 3: RLHF and Preference Data

RLHF (or its lighter-weight cousin DPO) teaches the model what humans prefer. That preference signal has to come from humans whose preferences match your target users. This is where translated preference data falls apart catastrophically.

If you collect preferences from English speakers comparing Arabic outputs, you get a model that produces Arabic in the shape English speakers prefer. The vocabulary is wrong. The register skews formal where it should be casual. The greetings feel scripted. Saudi users notice immediately. Egyptian users notice immediately. Even users who don't consciously identify the problem rate the experience as "off".

Arabic-native preference data needs three things:

Annotators who are native speakers of the target dialect (or MSA, depending on use case)
Rubrics that capture Arabic-specific quality signals (register appropriateness, dialect consistency, cultural fluency, religious sensitivity)
Multi-turn preference judgments, not just single-turn ratings — Arabic conversational coherence breaks across turns in distinct ways

Volume target: 5K-30K preference pairs for a competent reward model. The cost is meaningful: roughly $10K-$240K, or about $2-$8 per pair at those volumes. That is why we recommend scoping with a 1K-pair pilot before committing to the full run.
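
One way to enforce the three requirements above is to carry them in the record itself and reject pairs that miss them. The sketch below is hypothetical throughout: the field names, rubric keys and validation rules are our assumptions, and the prompt/chosen/rejected output is only a layout that many DPO training scripts accept, so check what your trainer expects.

```python
from dataclasses import dataclass, field

# Rubric dimensions taken from the list above; the key names are illustrative.
RUBRIC_KEYS = ("register", "dialect_consistency", "cultural_fluency", "religious_sensitivity")


@dataclass
class PreferencePair:
    prompt: str              # may hold a full multi-turn conversation
    chosen: str
    rejected: str
    target_dialect: str      # e.g. "khaleeji", "egyptian", "msa"
    annotator_dialect: str   # the annotator's native dialect
    rubric: dict = field(default_factory=dict)


def validate(pair: PreferencePair) -> list:
    """Return a list of problems; an empty list means the pair is usable."""
    problems = []
    if pair.annotator_dialect != pair.target_dialect:
        problems.append("annotator is not native in the target dialect")
    missing = [key for key in RUBRIC_KEYS if key not in pair.rubric]
    if missing:
        problems.append("missing rubric scores: " + ", ".join(missing))
    if pair.chosen == pair.rejected:
        problems.append("chosen and rejected responses are identical")
    return problems


def to_dpo_record(pair: PreferencePair) -> dict:
    """Flatten a validated pair into a prompt/chosen/rejected dictionary."""
    return {"prompt": pair.prompt, "chosen": pair.chosen, "rejected": pair.rejected}
```

Running `validate` over the 1K-pair pilot is a cheap way to find out how many pairs fail before the full run is commissioned.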

Layer 4: Evaluation Benchmarks

If you're measuring your Arabic model on translated English benchmarks, you're measuring the wrong thing. The benchmarks you need for production-credible Arabic LLM evaluation:

ArabicMMLU — general academic knowledge. Useful but, like its English parent, doesn't correlate strongly with real product quality.
AlGhafa — MENA cultural and contextual knowledge. Stronger signal for products targeting Arab users.
OALL (Open Arabic LLM Leaderboard) suite — broader Arabic-focused eval ensemble.
Dialect-identification eval — does the model output in the requested dialect when asked, and does it stay there across turns?
Custom internal benchmark — this is the one that matters most. Hand-write 200-500 eval items in your target dialect that reflect your actual product use cases. Run this on every model checkpoint.
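
A custom benchmark only pays off if it is trivial to rerun on every checkpoint. Here is a minimal sketch of such a harness, reporting accuracy per dialect. Everything in it is an assumption for illustration: `generate` stands in for whatever calls your model, `grade` for your own scoring rule (exact match, rubric or human review), and the item keys are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def run_benchmark(
    items: List[dict],
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Score one checkpoint on the internal benchmark, broken down by dialect."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in items:
        output = generate(item["prompt"])
        totals[item["dialect"]] += 1
        if grade(output, item["reference"]):
            correct[item["dialect"]] += 1
    return {dialect: correct[dialect] / totals[dialect] for dialect in totals}
```

Reporting per dialect rather than one aggregate number makes regressions visible when a checkpoint improves on MSA but slips on the dialect your product targets.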

We help teams build that custom internal benchmark as part of our Arabic NLP dataset service. It is the highest-leverage eval investment most teams skip.

The 5 Pitfalls That Derail Arabic LLM Projects

From the projects we see in scoping calls and pilots, these are the recurring failure modes:

Pitfall 1: The translation trap

Building the entire data stack on translated English data, then being surprised when the model feels foreign. Symptom: passes evals, fails with real users.

Pitfall 2: Dialect blindness

Training on "Arabic" as if it were one language, then discovering the chatbot speaks Khaleeji to Cairo users and MSA to Riyadh teenagers.
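
One guard against this is to write the dialect mix down as explicit record counts before any data is collected. The sketch below starts from the 70/30 MSA-to-dialect default described earlier; the function name and the example weights inside the dialect share are hypothetical, since the right split depends on your product target.

```python
def allocate_mix(total: int, dialect_weights: dict, msa_percent: int = 70) -> dict:
    """Split a record budget into MSA plus per-dialect counts using integer maths."""
    msa = total * msa_percent // 100
    dialect_total = total - msa
    weight_sum = sum(dialect_weights.values())
    counts = {name: dialect_total * w // weight_sum for name, w in dialect_weights.items()}
    # Hand any rounding leftover to the highest-weighted dialects first.
    leftover = dialect_total - sum(counts.values())
    for name in sorted(dialect_weights, key=dialect_weights.get, reverse=True)[:leftover]:
        counts[name] += 1
    return {"msa": msa, **counts}


# Hypothetical KSA-leaning weights within the dialect share.
print(allocate_mix(20_000, {"khaleeji": 70, "egyptian": 15, "levantine": 15}))
```

Dialects you leave out of the weights get zero records, which makes a gap such as Maghrebi or Iraqi an explicit decision instead of an accident.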

Pitfall 3: Eval shopping

Cherry-picking benchmarks that flatter the model and ignoring the ones that don't. Particularly common in launch posts — beware of any Arabic LLM claim that doesn't show OALL or AlGhafa numbers.

Pitfall 4: Single-pass SFT

Annotating SFT with one annotator per record and no adjudication. The label noise rate is high enough to materially degrade SFT effectiveness.
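
With two annotators per record, you can measure the noise instead of guessing at it. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, and lists the records that need adjudication. The function names and labels are illustrative, and what counts as an acceptable kappa is a project decision we do not prescribe here.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators on the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def needs_adjudication(labels_a: list, labels_b: list) -> list:
    """Indices of records where the two annotators disagree."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

Records returned by `needs_adjudication` go to a third, senior annotator; a persistently low kappa usually points at the guidelines and not at the annotators.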

Pitfall 5: Compliance afterthought

Building the dataset on offshore annotation, then realising you can't legally use it for KSA enterprise customers. Solving this after the fact costs more than getting it right upstream.

Build vs Buy: When Each Makes Sense

For pre-training corpora, build with a specialist partner — the volume is too large and the cleaning too specialised for in-house teams to do efficiently. For SFT and RLHF, the right answer is almost always "buy from a partner who works with your team on annotation guidelines". You provide the rubric and acceptance criteria; the partner provides the native-speaker workforce, QA, and dual annotation.

For evaluation benchmarks, build with the partner but keep the test set private. The eval benchmark is your moat against benchmark contamination and your only honest signal on real product quality. Treat it like your model weights.
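
Keeping the test set private is only half of the job; it also helps to check that eval items have not leaked into the training data. The sketch below flags eval items that share any word n-gram with the training documents. It is a rough heuristic under stated assumptions: whitespace tokenisation, an 8-word default window and the function names are our illustrative choices.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams from whitespace tokens; texts shorter than n yield none."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_items: list, training_docs: list, n: int = 8) -> list:
    """Return indices of eval items that overlap the training data by an n-gram."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & train_grams]
```

Any flagged item should be rewritten or retired before the next checkpoint is scored, otherwise the benchmark stops being an honest signal.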

Scoping an Arabic LLM dataset?

We've scoped SFT, RLHF and eval datasets for several MENA AI teams. Book a 30-minute scoping call and we'll work through dataset size, dialect mix, budget and timeline together.

FAQ

How much Arabic pre-training data do I need?

10-50B tokens for continued pre-training on top of an English-strong base. 500B+ for a from-scratch Arabic-primary foundation model.

What's a reasonable SFT dataset size?

10K-50K high-quality native pairs. 20K hand-crafted native pairs outperform 200K machine-translated examples on every benchmark we've measured.

ArabicMMLU vs AlGhafa — which one?

Both, plus a custom internal benchmark. ArabicMMLU is academic knowledge; AlGhafa is MENA cultural context. Neither alone is enough.

Can I use translated RLHF data?

No. Preference signal must come from native speakers in your target dialect, or the model learns English speakers' preferences expressed in Arabic.

What dialect mix should I target?

Default 70% MSA / 30% dialect. KSA-targeted weights heavily toward Khaleeji within the dialect 30%. Pan-Arab consumer spreads across Egyptian, Levantine and Khaleeji.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
