Reinforcement learning from human feedback became the technique that turned capable base models into useful assistants. The InstructGPT paper (Ouyang et al., 2022) formalised it: supervised fine-tuning on demonstration data, reward model training on human comparisons, then PPO optimisation against that reward signal. GPT-4, Claude, Gemini, and most major production LLMs since rely on some variant of this pipeline. The ML community spent three years refining the training mechanics. Very little of that attention went to the annotation infrastructure that makes the human feedback signal valid in the first place.
This guide covers the annotation side: how to design preference pair tasks that produce genuine training signal, realistic scale requirements, why translated RLHF data degrades model quality, how DPO changes your data collection strategy, and the annotator profile and cost benchmarks you need for production-grade preference datasets.
What RLHF Actually Needs From Your Annotation Team
Standard annotation assigns a label to a single item. Preference annotation asks annotators to compare two or more model-generated responses to the same prompt and determine which is better — and by how much, and on what dimensions. This comparison structure is harder in every dimension: more cognitively demanding, more subjective, more prone to annotator fatigue, and more sensitive to annotator domain expertise.
The InstructGPT pipeline drew on three distinct datasets: prompts with labeller-written demonstrations for SFT (requires creativity and task diversity), labeller rankings of multiple model completions per prompt for the reward model (requires consistent quality judgement), and a prompt-only set for PPO, where the reward signal comes from the trained reward model rather than from fresh human labels. The two human annotation tasks have different annotator requirements and different acceptable inter-annotator agreement thresholds.
Meta's Llama 2 paper (Touvron et al., 2023) used separate reward models for helpfulness and safety, with 1,418,091 binary preference comparisons collected by annotators working to detailed guidelines. That investment reflects the real cost of reliable RLHF data: you cannot crowd-source your way to a reward model that generalises correctly. Annotation quality for this task type compounds through the training pipeline in a way that classification annotation does not, because a biased reward model shapes every downstream PPO update.
Designing Preference Pair Tasks That Generate Real Signal
The most common RLHF annotation failure is task design that produces near-tie comparisons at scale. When annotators consistently say “both responses are fine”, the reward model learns a noisy signal and produces a model that's marginally better than SFT alone. Strong reward models are trained on comparisons where one response is clearly superior, with the contrast concentrated on the dimensions that matter most for your use case.
Effective preference task design requires four decisions upfront:
- Evaluation dimensions. Helpfulness, honesty, harmlessness, instruction-following fidelity, coherence, conciseness, and domain accuracy are the most common. Pick 2–3 primary dimensions specific to your use case and include them in annotator rubrics. A coding assistant should weight technical correctness above style; a customer-service chatbot should weight politeness and task completion above exhaustiveness.
- Response contrast strategy. Pair intentionally unequal responses: for example, one from the current best model against one from an earlier checkpoint, or against one with a known failure mode (over-refusal, hallucinated confidence, irrelevant tangents). Systematic contrast generation produces more useful reward signal than randomly sampling two generations from the current model.
- Prompt diversity and distribution. Reward models trained on narrow prompt distributions overfit and fail to generalise. Prompts should cover the full operational distribution of your deployment scenario, including adversarial cases, ambiguous requests, and multi-turn contexts if applicable.
- Tie handling. Allow annotators to mark genuine ties, but flag datasets where tie rates exceed 15–20%. High tie rates signal either poor response contrast or annotator reluctance to commit. Investigate both before scaling.
The annotator rubric should make explicit that “slightly better” is a valid and useful answer. Teams that only collect strong-preference pairs miss the calibration signal that helps reward models handle the actual production distribution, which contains many near-equal response pairs.
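The tie-handling check described above is simple to automate per batch. The following is a minimal sketch, not a feature of any particular annotation tool: the label values (`"a"`, `"b"`, `"tie"`) and the function names `tie_rate` and `needs_review` are assumptions made for illustration, and the default threshold is the lower end of the 15–20% range.

```python
from collections import Counter

def tie_rate(labels):
    """Return the fraction of preference labels marked as a tie."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return counts["tie"] / len(labels)

def needs_review(labels, threshold=0.15):
    """Flag a batch whose tie rate exceeds the threshold (assumed 15%)."""
    return tie_rate(labels) > threshold

# Hypothetical batch of ten annotations: two ties out of ten.
batch = ["a", "b", "tie", "a", "a", "b", "tie", "b", "a", "a"]
print(f"tie rate: {tie_rate(batch):.0%}, review: {needs_review(batch)}")
```

Running this per prompt category, rather than only on the whole dataset, may help separate poor response contrast from annotator reluctance to commit.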
How Many Preference Pairs Do You Actually Need?
The answer ranges across three orders of magnitude depending on your goal. Here are the realistic benchmarks from published work and production annotation:
| Use case | Pairs needed | Reference |
|---|---|---|
| General-purpose assistant (full RLHF) | 500K–2M+ | Llama 2: 1.4M comparisons; InstructGPT: ~33K reward-model prompts, each with several ranked completions (smaller scope) |
| Domain-specific fine-tune (DPO, 7B–13B) | 10K–50K | Zephyr-7B trained on ~60K UltraFeedback pairs |
| Single-capability improvement (DPO) | 5K–15K | High contrast pairs only; quality > volume |
| Safety / refusal tuning | 20K–100K | Requires adversarial coverage; hard to synthesise |
| Reward model for PPO (general) | 100K–500K | Generalisation requires broad prompt coverage |
The Zephyr-7B result (Tunstall et al., 2023) is the most instructive case for teams working with constrained annotation budgets: a 7B model trained with DPO on roughly 60,000 pairs from UltraFeedback (synthetic preference data scored by GPT-4) achieved MT-Bench performance competitive with Llama 2 Chat 70B. The takeaway is not that 60K pairs is always enough, since UltraFeedback was specifically engineered for high contrast, but that quality of contrast and annotation discipline can substitute for raw volume up to a point. For models requiring generalisation across a wide instruction distribution, volume remains non-negotiable.
Why Translated RLHF Data Fails
The most common cost-cutting mistake in multilingual LLM development is translating an English RLHF dataset into the target language and using it as-is. This approach fails reliably — not because translation introduces factual errors, but because reward models learn preference patterns, not just content.
English-origin preference annotations encode English-language discourse norms: a preference for direct assertion over hedged language, a specific formality register, a particular structure in multi-step responses. When these annotations are applied to translated text, the reward model learns to associate those discourse patterns — not the underlying quality — with being preferred. The resulting model produces responses that read like translated English: grammatically correct, but tonal register is off, idiomatic phrases are wrong, and the model consistently reaches for formal registers where a native speaker would use a colloquial one.
For Arabic specifically, this manifests as models that produce Modern Standard Arabic (MSA) responses to clearly colloquial Khaleeji or Egyptian prompts — because English-origin RLHF data never trained a preference for dialectal appropriateness. The model learned “formal Arabic is preferred” as a proxy for “good English preferences translated to Arabic.”
The only solution is native-language prompt authorship and native-speaker annotation from the start. This applies to both the prompts used to generate responses and the annotators making preference decisions. Mixed-language annotation — Arabic prompts with English-trained annotators making preference calls — produces the same systematic biases.
DPO vs PPO: What Changes in Your Data Collection Strategy
The canonical RLHF pipeline (train a reward model, then optimise with PPO) has two stages that each need data: human preference pairs to train the reward model, then a pool of prompts for PPO, during which the reward model scores the policy's sampled responses. Direct Preference Optimisation (Rafailov et al., 2023) collapses this into a single stage: fine-tune the language model directly on (prompt, chosen, rejected) triples without an explicit reward model. The preference data format is the same, but the operational implications differ.
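To make the shared data format concrete, here is a minimal sketch of a preference record serialised as JSON Lines. The class name `PreferencePair`, the helper `to_jsonl`, and the example text are hypothetical; the field names simply mirror the (prompt, chosen, rejected) naming used above, so check what your training library actually expects before adopting them.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PreferencePair:
    """One preference example: a prompt plus the preferred and rejected response."""
    prompt: str
    chosen: str
    rejected: str

def to_jsonl(pairs):
    """Serialise pairs as JSON Lines, one record per line, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)

# Hypothetical example record.
pairs = [
    PreferencePair(
        prompt="Explain what a reward model does.",
        chosen="A reward model scores responses so a policy can be optimised against it.",
        rejected="It rewards the model.",
    )
]
print(to_jsonl(pairs))
```

Keeping `ensure_ascii=False` matters for non-English datasets, since it stores Arabic or other scripts as readable text rather than escape sequences.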
With PPO + reward model, errors in the reward model propagate through every PPO update. A reward model that learns to prefer verbose responses will systematically push the language model toward verbosity, even if no individual annotation endorsed that pattern. This “reward hacking” requires ongoing monitoring and periodic reward model retraining with corrective annotation. Your data collection operation needs a feedback loop.
With DPO, there is no separate reward model to be hacked — the language model fine-tunes directly on the contrast signal. This is both more efficient and more sensitive to dataset quality at the edges. Near-tie pairs that a reward model would smooth over become direct training signal in DPO. This means DPO datasets should be filtered more aggressively to remove weak-contrast pairs — ties and marginal preferences — before training, even if that reduces dataset size.
Operationally: DPO workflows benefit from a post-annotation filtering step that removes pairs where annotator confidence ratings fall below a threshold. Collecting annotator confidence alongside the preference judgement adds about 15% to annotation time but significantly improves DPO dataset quality.
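A post-annotation filter of this kind could look like the sketch below. Everything about the record layout is an assumption for illustration: the keys `response_a`, `response_b`, `label`, and `confidence`, the 1–5 confidence scale, and the default cut-off of 3 are hypothetical and should be replaced with whatever your annotation export and rubric actually define.

```python
def filter_for_dpo(records, min_confidence=3):
    """Drop ties and low-confidence judgements, then emit (prompt, chosen, rejected) dicts.

    Assumes each record has: prompt, response_a, response_b,
    label in {"a", "b", "tie"}, and confidence on a 1-5 scale.
    """
    kept = []
    for r in records:
        if r["label"] == "tie":
            continue  # ties carry no contrast for DPO
        if r["confidence"] < min_confidence:
            continue  # marginal preferences are filtered out
        if r["label"] == "a":
            chosen, rejected = r["response_a"], r["response_b"]
        else:
            chosen, rejected = r["response_b"], r["response_a"]
        kept.append({"prompt": r["prompt"], "chosen": chosen, "rejected": rejected})
    return kept

# Hypothetical annotation export.
records = [
    {"prompt": "p1", "response_a": "x", "response_b": "y", "label": "a", "confidence": 5},
    {"prompt": "p2", "response_a": "x", "response_b": "y", "label": "tie", "confidence": 4},
    {"prompt": "p3", "response_a": "x", "response_b": "y", "label": "b", "confidence": 2},
    {"prompt": "p4", "response_a": "x", "response_b": "y", "label": "b", "confidence": 4},
]
print(filter_for_dpo(records))
```

It is worth logging how many pairs each rule removes, so that a threshold that discards most of a batch is noticed before training.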
What Good Reward Model Training Data Looks Like
Reward models fail in predictable ways that can be addressed at data collection time. The four most common failure modes and their annotation-side countermeasures:
Length bias
Reward models trained on annotations from general audiences learn that longer responses appear more helpful, because annotators associate thoroughness with quality. The result is a model that pads responses with filler. Countermeasure: include explicit length-penalty examples in your comparison set; train annotators to distinguish exhaustiveness from relevance using rubric examples; make sure 10–15% of pairs are ones in which the shorter response is definitively better.
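One way to spot length bias before training is to measure how often the chosen response is the longer one. The sketch below is a rough audit under stated assumptions: the function name `longer_chosen_rate` is hypothetical, pairs are dicts with `chosen` and `rejected` keys, and length is counted by whitespace-separated tokens, which is a crude proxy that does not suit languages written without spaces.

```python
def longer_chosen_rate(pairs):
    """Fraction of pairs where the chosen response is longer than the rejected one.

    Pairs with equal token counts are ignored. Length is measured in
    whitespace-separated tokens (an assumption, not a real tokenizer).
    """
    longer = 0
    decided = 0
    for p in pairs:
        n_chosen = len(p["chosen"].split())
        n_rejected = len(p["rejected"].split())
        if n_chosen == n_rejected:
            continue
        decided += 1
        if n_chosen > n_rejected:
            longer += 1
    if decided == 0:
        raise ValueError("no pairs with differing lengths")
    return longer / decided

# Hypothetical pairs: the chosen response is longer in three of four cases.
pairs = [
    {"chosen": "a detailed and complete answer", "rejected": "short answer"},
    {"chosen": "yes", "rejected": "a long rambling reply with filler"},
    {"chosen": "clear steps one two three", "rejected": "no idea"},
    {"chosen": "thorough and correct", "rejected": "wrong"},
]
print(f"chosen is longer in {longer_chosen_rate(pairs):.0%} of pairs")
```

A rate far above one half does not prove bias on its own, since better answers are sometimes longer, but it is a reasonable trigger for a manual review of the rubric and the pair mix.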
Confident-sounding hallucination
Responses that state incorrect information confidently are frequently rated higher than responses that correctly hedge. Annotators without domain expertise cannot detect the error. Countermeasure: for any domain with factual content — medicine, law, finance, science — use credentialled domain-specialist annotators. Include calibration examples where annotators must identify factually incorrect but fluent responses as worse.
Sycophancy bias
When prompts include the user's stated opinion, annotators tend to prefer responses that validate that opinion — even if a neutral correction would be more useful. The model learns to agree with users rather than inform them. Countermeasure: include a deliberate sycophancy test set in your calibration batch; remove annotators who show systematic sycophancy preference before production annotation begins.
Over-refusal preference
Safety-focused annotation teams sometimes prefer refusals over helpful responses to ambiguous prompts, producing models that refuse far too broadly in production. This is among the most common complaints about commercial LLMs after major safety tuning cycles. Countermeasure: calibration examples should include borderline prompts where a helpful, bounded response is explicitly marked as preferred over blanket refusal. Safety and helpfulness rubrics should be explicitly balanced in annotator training.
Annotator Selection and IAA Targets for Preference Tasks
Preference annotation has structurally lower inter-annotator agreement than classification annotation. This is not a quality failure — it reflects genuine task subjectivity. Realistic IAA targets for preference tasks range from κ = 0.55 (borderline safety judgements by non-experts) to κ = 0.78 (binary helpfulness comparisons with clear contrast). The InstructGPT team screened annotators by requiring > 70% agreement with researcher-adjudicated examples before allowing production annotation — a sensible calibration gate that most annotation vendors skip.
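Cohen's κ corrects raw agreement for the agreement two annotators would reach by chance, which is why it is lower than simple percent agreement on the same labels. The following is a minimal from-scratch sketch for two annotators; the function name `cohens_kappa` and the example labels are illustrative assumptions, and a maintained library implementation is a better choice in production.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical pairwise preferences ("a" or "b") from two annotators on ten items.
ann_1 = ["a", "a", "b", "b", "a", "b", "a", "a", "b", "b"]
ann_2 = ["a", "a", "b", "a", "a", "b", "b", "a", "b", "b"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.2f}")
```

In this example the two annotators agree on 8 of 10 items, yet κ is 0.60 because half of that agreement would be expected by chance with balanced labels.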
Annotator selection by use case:
- General assistant tasks: College-educated native speakers with strong critical reading skills. Not crowd workers — the judgement demand exceeds what crowd platforms reliably source. Minimum 2-hour onboarding on rubric and calibration examples.
- Coding and reasoning tasks: Working software engineers or mathematicians. A non-programmer cannot reliably judge whether two code responses differ in correctness, security, or idiomatic quality. Budget for specialist annotators from the start.
- Medical and clinical tasks: Board-certified clinicians or credentialled medical writers with domain sign-off. Non-experts annotating medical preferences introduce systematic errors that reward models amplify. This is not negotiable for any model that will interact with clinical decision-making.
- Multilingual tasks: Native speakers of the target language and dialect. For Arabic, dialect match matters: a Levantine Arabic annotator rating Khaleeji responses will have lower agreement with other raters and systematically different preferences on formality and idiom. See our Arabic NLP annotation page for dialect-matched annotator sourcing.
Run calibration sets before production annotation regardless of annotator background. A calibration set of 50–100 pre-adjudicated examples lets you identify annotators whose preference judgements are systematically miscalibrated before they produce thousands of unusable data points. For more on IAA measurement approaches for annotation tasks, including when κ targets should differ by task type, see our guide to Cohen's kappa in annotation quality.
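A calibration gate can be expressed as a comparison of each annotator's labels against the pre-adjudicated answers. The sketch below assumes a hypothetical layout in which `gold` maps item IDs to adjudicated labels and `annotations` maps annotator IDs to their own item-to-label dicts; the function name `calibration_report` is illustrative, and the 70% default mirrors the agreement gate discussed earlier rather than a universal standard.

```python
def calibration_report(gold, annotations, min_agreement=0.70):
    """Score each annotator against adjudicated labels and apply a pass/fail gate.

    gold: {item_id: label}
    annotations: {annotator_id: {item_id: label}}
    An annotator passes when agreement strictly exceeds min_agreement.
    """
    report = {}
    for annotator, labels in annotations.items():
        scored = [item for item in gold if item in labels]
        if not scored:
            continue  # annotator has not labelled any calibration items
        agreement = sum(labels[i] == gold[i] for i in scored) / len(scored)
        report[annotator] = {
            "items": len(scored),
            "agreement": agreement,
            "passed": agreement > min_agreement,
        }
    return report

# Hypothetical calibration data; a real set would hold 50-100 items.
gold = {"q1": "a", "q2": "b", "q3": "a", "q4": "b"}
annotations = {
    "ann_1": {"q1": "a", "q2": "b", "q3": "a", "q4": "b"},
    "ann_2": {"q1": "a", "q2": "a", "q3": "b", "q4": "b"},
}
for name, row in calibration_report(gold, annotations).items():
    print(name, row)
```

Breaking the same report down by item type (for example sycophancy or over-refusal test items) can show whether an annotator's misses are random or systematic.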
Cost Benchmarks and Scaling Practical RLHF Data Collection
Preference annotation costs more per item than classification annotation — the comparison task takes longer, requires more qualified annotators, and needs higher adjudication rates. Production benchmarks for 2026:
| Task type | Cost per pair (AUD) | Notes |
|---|---|---|
| English general assistant (trained annotators) | $0.60–$1.80 | Single annotation + adjudication on disagreements |
| English coding / reasoning (specialist) | $3.00–$7.50 | SW engineer or mathematician annotators |
| Medical domain (credentialled) | $6.00–$20.00 | Dual annotation + clinical adjudicator |
| Arabic (native speaker, general) | $4.00–$9.00 | Dialect-matched; MSA vs Khaleeji priced separately |
| Arabic (domain specialist) | $9.00–$22.00 | Legal, medical, fintech with domain sign-off |
Volume discounts of 15–30% are typical at scale above 50,000 pairs. For teams running iterative RLHF — collecting new preference data after each training cycle — a retainer model with pre-qualified annotators reduces cycle time significantly compared to re-onboarding each iteration.
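The per-pair ranges and volume discounts above can be combined into a rough budget envelope. This is a back-of-envelope sketch only: the function name `estimate_cost` is hypothetical, and it assumes the discount applies to the whole order once the volume passes 50,000 pairs, which may not match how a given vendor structures pricing.

```python
def estimate_cost(pairs, price_low, price_high,
                  discount_threshold=50_000,
                  discount_low=0.15, discount_high=0.30):
    """Return (best_case, worst_case) total cost in AUD for a number of pairs.

    Assumption: above the threshold, the discount applies to the entire order.
    """
    low = pairs * price_low
    high = pairs * price_high
    if pairs > discount_threshold:
        low *= 1 - discount_high   # cheapest price with the largest discount
        high *= 1 - discount_low   # highest price with the smallest discount
    return low, high

# 60,000 general Arabic pairs at AUD $4.00-$9.00 per pair.
best, worst = estimate_cost(60_000, 4.00, 9.00)
print(f"AUD ${best:,.0f} to ${worst:,.0f}")
```

For 60,000 general Arabic pairs this gives roughly AUD $168,000 to $459,000, a spread wide enough that pinning down dialect, domain, and adjudication rate early has a large effect on the budget.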
Model-assisted annotation (using an early reward model to pre-rank responses, then having human annotators review borderline cases) can reduce annotation volume by 30–50% for the same effective training signal. This is worth implementing at scale above 100,000 pairs where the engineering overhead is justified. Below that threshold, full human annotation on all pairs remains the more reliable approach. For a broader look at annotation cost structures, see our data annotation pricing guide. For teams building Arabic LLMs specifically, our guide to Arabic LLM training data requirements covers the full data stack from pre-training corpus through RLHF.
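The model-assisted routing described above can be sketched as a split on the score gap between the two responses. The keys `score_a` and `score_b`, the function name `route_pairs`, and the margin of 0.5 are all hypothetical; the scores stand in for outputs of an early reward model, and the margin would need tuning against human labels on a held-out sample.

```python
def route_pairs(scored_pairs, margin=0.5):
    """Split pairs into auto-labelled and human-review queues by reward score gap.

    Pairs whose absolute score gap is below the margin are treated as
    borderline and sent to human annotators; the rest take the model's label.
    """
    auto, human = [], []
    for p in scored_pairs:
        gap = p["score_a"] - p["score_b"]
        if abs(gap) < margin:
            human.append(p)
        else:
            auto.append({**p, "label": "a" if gap > 0 else "b"})
    return auto, human

# Hypothetical reward-model scores for three response pairs.
scored = [
    {"id": 1, "score_a": 2.1, "score_b": 0.3},
    {"id": 2, "score_a": 1.0, "score_b": 1.2},
    {"id": 3, "score_a": -0.4, "score_b": 1.5},
]
auto, human = route_pairs(scored)
print(f"auto-labelled: {len(auto)}, sent to humans: {len(human)}")
```

Auditing a random sample of the auto-labelled queue with human annotators is a sensible safeguard, since an early reward model's confident calls can still carry the biases discussed earlier.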
Further Reading and Related Services
- Native speaker annotators: dialect-matched annotators for RLHF tasks in Arabic, Hebrew, Turkish, and more
- Arabic NLP annotation datasets: preference data, SFT datasets, and evaluation sets for Arabic LLM development
- Annotation pricing: transparent per-task pricing for preference annotation, SFT data, and RLHF workflows
- Cohen's kappa in annotation quality: IAA metrics for preference tasks, including why κ ≥ 0.68 is a realistic target for RLHF
- Data annotation quality metrics: broader quality framework for ML training datasets
- Building Arabic LLM training data: full data pipeline from pre-training through RLHF for Arabic foundation models
FAQ
How many preference pairs do I need to fine-tune an LLM with RLHF?
Scale ranges widely. InstructGPT used ~33K prompts, each with several ranked completions, for a focused reward model; Llama 2 used 1.4M comparisons for broad generalisation. For domain-specific DPO fine-tuning on a 7B–13B model, 10K–50K high-contrast pairs are typically viable. Zephyr-7B achieved competitive benchmark results with roughly 60K DPO pairs. Above a minimum threshold, quality of contrast between chosen and rejected responses determines effectiveness more than raw volume.
What is the difference between RLHF annotation and standard classification annotation?
Standard classification assigns a label to a single item. RLHF preference annotation compares two or more responses to the same prompt. It is more cognitively demanding, more subjective, and produces lower inter-annotator agreement (κ 0.55–0.78 vs 0.75–0.85 for classification). IAA thresholds must be calibrated to the specific task; generic Landis & Koch bands don't apply to preference annotation.
Why does translated RLHF data fail for non-English LLMs?
Reward models learn the preference patterns encoded in annotation, not just content. Translated annotations carry English discourse norms — formality levels, directness preferences, response structure — that don't transfer correctly. Models fine-tuned on translated RLHF data produce responses that are grammatically correct but tonally wrong for native speakers. Native-language prompt authorship and annotation is non-negotiable for non-English LLM alignment.
What is DPO and how does it change preference data requirements?
Direct Preference Optimisation (Rafailov et al., 2023) fine-tunes the LLM directly on (prompt, chosen, rejected) triplets without a separate reward model. The data format is identical to standard RLHF but DPO is more sensitive to near-tie pairs, which become direct training signal rather than being smoothed by a reward model. DPO datasets should be filtered to remove weak-contrast pairs before training; collecting annotator confidence scores enables this filtering.
What annotator profile should I use for RLHF preference tasks?
General assistant tasks: college-educated native speakers with strong critical reading skills. Coding/reasoning tasks: working software engineers. Medical tasks: board-certified clinicians. Multilingual tasks: native speakers of the target language and dialect, not L2 speakers. Crowd workers are unsuitable for RLHF preference annotation — the judgement quality required exceeds what crowd platforms reliably deliver. Calibration on pre-adjudicated examples before production annotation is essential for all profiles.
How much does RLHF preference data annotation cost?
English general assistant pairs (trained annotators): approximately AUD $0.60–$1.80 per pair. Technical/coding pairs (specialist annotators): $3.00–$7.50. Medical domain with credentialled clinicians: $6.00–$20.00. Arabic native-speaker annotation: $4.00–$22.00 depending on dialect and domain. Volume discounts of 15–30% apply above 50,000 pairs. Retainer arrangements with pre-qualified annotators reduce per-cycle costs for iterative RLHF workflows.
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.