Quality May 2026 13 min read

Cohen's Kappa in Annotation Quality: When 80% Is Bad and 99% Is Worse

Most ML teams report a single inter-annotator agreement number and call it quality assurance. It isn't. Cohen's kappa, Fleiss's kappa, and Krippendorff's alpha each measure different things — and misreading any one of them lets real quality problems hide in plain sight.

Inter-annotator agreement (IAA) has become the annotation industry's most-cited and least-understood quality metric. Teams hit a target kappa, declare success, and ship training data that quietly underperforms in production. Others see kappa scores below their threshold, panic-revise guidelines, and accidentally constrain their annotators into mechanical label-stamping that produces high IAA and low dataset utility.

This guide covers the actual mechanics of Cohen's kappa, why it can mislead, when to use Fleiss's kappa or Krippendorff's alpha instead, realistic IAA targets by task type, and the diagnostic workflow for finding the real cause when IAA is off.

What Cohen's Kappa Actually Measures

Cohen's kappa (κ) was introduced in 1960 by Jacob Cohen as a way to measure agreement between two raters on a categorical classification task while correcting for agreement that would occur by chance. The formula:

κ = (Po − Pe) / (1 − Pe)

Po = observed agreement rate  |  Pe = expected agreement by chance

The chance-correction is the critical piece. If your annotation task has three classes distributed as 70% positive, 20% neutral, 10% negative, and two annotators independently pick labels at random according to those marginals, they'll agree on roughly 54% of items by chance alone. Raw percentage agreement of 54% looks catastrophic; κ = 0.00 correctly signals there's no real agreement above chance.
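The arithmetic above can be reproduced in a few lines. This is a minimal sketch in plain Python, not a reference implementation; the function names `chance_agreement` and `cohen_kappa` are illustrative, and it assumes every item carries exactly one label from each of the two annotators.

```python
from collections import Counter


def chance_agreement(p_a, p_b):
    """Pe: probability that two independent raters pick the same label.

    p_a and p_b map each label to that rater's share of assignments.
    """
    return sum(p_a.get(label, 0.0) * p_b.get(label, 0.0)
               for label in set(p_a) | set(p_b))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Po: share of items on which the two annotators gave the same label
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = {label: count / n for label, count in Counter(labels_a).items()}
    p_b = {label: count / n for label, count in Counter(labels_b).items()}
    pe = chance_agreement(p_a, p_b)
    if pe == 1.0:
        # Both raters used one identical label throughout: kappa is undefined
        return float("nan")
    return (po - pe) / (1.0 - pe)


# The 70/20/10 example from the text: 0.49 + 0.04 + 0.01
marginals = {"positive": 0.7, "neutral": 0.2, "negative": 0.1}
pe_example = chance_agreement(marginals, marginals)
print(round(pe_example, 2))  # 0.54
```

If scikit-learn is already in your stack, `sklearn.metrics.cohen_kappa_score` computes the same statistic; the hand-rolled version is shown only to make the chance-correction step visible.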

The widely cited Landis & Koch (1977) interpretation bands are: κ < 0.00 poor; 0.00–0.20 slight; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.00 almost perfect. These bands were calibrated on clinical diagnostic agreement between medical professionals, not on NLP annotation, image labelling, or RLHF preference pairs, and the authors themselves described the cut-offs as arbitrary. Applying them uncritically to annotation quality is a category error that has misled ML teams for two decades.

In production annotation, the right kappa target is always task-specific. A binary bounding-box presence/absence task on high-contrast images should exceed κ 0.90. A five-class Arabic dialectal sentiment task should target κ 0.65. Neither threshold is “better” — they reflect fundamentally different task difficulties.

The Agreement Paradox: Why High Kappa Can Signal Broken Annotation

Three scenarios where high IAA is a warning sign rather than a quality indicator:

Scenario 1: The Pre-Filtered Dataset

A team building a document relevance classifier pre-filters incoming items to remove anything ambiguous before sending them to annotators. The remaining dataset is 92% clearly relevant, 8% clearly irrelevant. Two annotators apply “relevant” as the default and disagree only on the 8% boundary cases. Raw agreement: 94%. Cohen's κ on the irrelevant class: 0.61. The high raw agreement masks much weaker chance-corrected agreement exactly where the model needs training signal most: the boundary cases.

What to do: Measure per-class kappa separately and ensure your dataset has sufficient representation of hard cases. If κ on your minority class is below 0.55, the model will learn that class poorly regardless of the overall IAA figure.
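The per-class check described above can be sketched with a one-vs-rest binarisation: for each class, both annotators' labels are collapsed to "this class" versus "anything else" and kappa is computed on the result. This is one common way to get a per-class figure, not the only one; the helper names and the toy labels below are illustrative.

```python
def binary_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of booleans."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)
    return float("nan") if pe == 1.0 else (po - pe) / (1 - pe)


def per_class_kappa(labels_a, labels_b):
    """One-vs-rest kappa for every class seen by either annotator."""
    classes = sorted(set(labels_a) | set(labels_b))
    return {
        c: binary_kappa([x == c for x in labels_a],
                        [y == c for y in labels_b])
        for c in classes
    }


# Toy data: full agreement on the majority class, disagreement between
# the two minority classes on five items.
ann_a = ["pos"] * 80 + ["neu"] * 10 + ["neg"] * 10
ann_b = ["pos"] * 80 + ["neu"] * 5 + ["neg"] * 5 + ["neg"] * 10
result = per_class_kappa(ann_a, ann_b)
for label, kappa in result.items():
    print(label, round(kappa, 2))
```

On this toy data the majority class scores 1.0 while "neu" lands near 0.64 and "neg" near 0.77, which is the kind of gap a single overall figure hides.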

Scenario 2: The Over-Constrained Guideline

A team revises annotation guidelines after seeing κ = 0.68 on a toxicity classification task. They add decision trees, explicit keyword lists, and near-zero judgment zones. IAA jumps to 0.94. The model trained on the new data performs worse on production traffic than the model trained on the κ = 0.68 data. Why? The revised guidelines constrained annotators to label based on surface-level keyword matching, not contextual toxicity judgement. The training data now encodes rules, not the nuanced human assessment the model needed.

What to do: If revising guidelines causes IAA to jump more than 0.10 in a single revision cycle, audit whether the revision reduced genuine ambiguity or eliminated legitimate annotator judgment. Run both versions of guidelines on 200 held-out items and compare downstream model accuracy before committing to the stricter protocol.

Scenario 3: The 99% Kappa Red Flag

A well-designed multi-class classification task on diverse, real-world data should not produce κ above 0.92–0.94. If your annotation team reports κ > 0.95 on any task that involves genuine semantic judgement — sentiment analysis, intent classification, content moderation, instruction quality scoring — investigate before celebrating. This level of agreement typically means annotators are coordinating during annotation (explicitly or implicitly through shared reference resolution), the data distribution has been dramatically simplified, or the guidelines have collapsed into a lookup table.

What to do: Audit the annotation session logs. Check whether annotators worked on the same batches simultaneously. Rerun IAA on a fresh random sample, drawn independently and annotated without contact between annotators. If κ drops significantly on that sample, you have a coordination problem.

Fleiss's Kappa: Three or More Annotators

Cohen's kappa is strictly pairwise — it handles exactly two annotators. When three or more annotators each label every item, Fleiss's kappa (1971) generalises the metric to the full panel. The key difference is in how expected agreement is computed: Fleiss's uses the proportion of all assignments to each category across all annotators, producing a single summary score across the panel.
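A minimal sketch of that computation follows, assuming every item is labelled by the same number of annotators. The function name and the input layout (one list of labels per item) are choices made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa.

    ratings: list of per-item label lists; every item must have the same
    number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    totals = Counter()   # assignments per category across the whole panel
    p_bar = 0.0          # mean share of agreeing rater pairs per item
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items

    # Expected agreement from the pooled category proportions
    pe = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return float("nan") if pe == 1.0 else (p_bar - pe) / (1.0 - pe)


panel = [
    ["a", "a", "a"],
    ["a", "a", "b"],
    ["b", "b", "b"],
    ["a", "b", "b"],
]
print(round(fleiss_kappa(panel), 3))  # 0.333
```

The pooled proportions are what distinguish this from averaging pairwise Cohen's kappas: individual annotator marginals never enter the calculation. statsmodels also provides a `fleiss_kappa` function in `statsmodels.stats.inter_rater` if you prefer a library implementation.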

The practical implications for annotation operations:

When using dual-annotation workflows (two annotators per item, with an adjudicator resolving disagreements), report three data points: the raw agreement rate between the two annotators, Cohen's κ for each annotator pair, and each annotator's agreement rate with the adjudicator's final decision. Together these cover raw annotator agreement, chance-corrected agreement, and alignment with the adjudicated ground truth.

Krippendorff's Alpha: When Disagreement Has Degrees

Both Cohen's and Fleiss's kappa treat all disagreements as equally wrong. A clinical radiologist rating tumour burden as 3 when the ground truth is 4 (one step off on a 10-point scale) is treated identically to a rating of 9 (catastrophically wrong). This is the right behaviour for nominal categories with no intrinsic order. It is the wrong behaviour for ordinal, interval, or ratio data.

Krippendorff's alpha (α) uses a distance function that can be parameterised for the measurement level of your labels:

Label type  |  Distance function  |  Annotation examples
Nominal  |  Binary (0 or 1)  |  Entity type, intent class, topic category
Ordinal  |  Rank-based squared distance  |  Severity rating (1–5), sentiment intensity
Interval  |  Squared difference  |  Quality score (0–100), reading age estimate
Ratio  |  Squared proportional difference  |  Duration (seconds), bounding box area

The practical impact is significant. On a 5-point pain-scale annotation task, Cohen's κ (nominal) might yield 0.55 while Krippendorff's α (ordinal) yields 0.74 — because most disagreements are one-step off, which the ordinal metric penalises lightly. The right metric to report is α, because it reflects the actual quality of annotator alignment for a task where near-misses are much less harmful than catastrophic misses.
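To make the effect of the distance function concrete, here is a compact sketch of Krippendorff's α using the coincidence-matrix formulation, α = 1 − Do/De. It covers the nominal, ordinal, and interval levels only, treats `None` as a missing rating, and assumes numeric labels for the ordinal and interval cases. The function name and the toy matrix are illustrative.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, level="nominal"):
    """units: list of per-item rating lists (None = missing rating)."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a lone rating cannot be paired with anything
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    n_c = Counter()  # marginal totals of the coincidence matrix
    for (c, _k), weight in coincidences.items():
        n_c[c] += weight
    n = sum(n_c.values())
    ordered = sorted(n_c)

    def delta2(c, k):
        if level == "nominal":
            return float(c != k)
        if level == "interval":
            return float((c - k) ** 2)
        if level == "ordinal":
            lo, hi = min(c, k), max(c, k)
            between = sum(n_c[g] for g in ordered if lo <= g <= hi)
            return (between - (n_c[c] + n_c[k]) / 2) ** 2
        raise ValueError("level must be nominal, ordinal, or interval")

    d_o = sum(w * delta2(c, k) for (c, k), w in coincidences.items()) / n
    d_e = sum(n_c[c] * n_c[k] * delta2(c, k)
              for c in n_c for k in n_c) / (n * (n - 1))
    return float("nan") if d_e == 0 else 1.0 - d_o / d_e


# Toy matrix: three raters, fifteen items, many missing ratings.
rater_a = [None, None, None, None, None, 3, 4, 1, 2, 1, 1, 3, 3, None, 3]
rater_b = [1, None, 2, 1, 3, 3, 4, 3, None, None, None, None, None, None, None]
rater_c = [None, None, 2, 1, 3, 4, 4, None, 2, 1, 1, 3, 3, None, 4]
units = list(zip(rater_a, rater_b, rater_c))

alpha_nominal = krippendorff_alpha(units, "nominal")
alpha_ordinal = krippendorff_alpha(units, "ordinal")
alpha_interval = krippendorff_alpha(units, "interval")
```

On this toy matrix the nominal α is roughly 0.69, while the ordinal and interval versions come out near 0.81, because most of the disagreements are one step apart. The ratings are identical in all three runs; only the penalty for each disagreement changes.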

Specific annotation tasks where Krippendorff's α should replace kappa: radiology severity grading (DR grades 0–4, ASPECTS scores), toxicity intensity scoring, audio and speech quality assessment (MOS scores), instruction-following quality on a numeric scale, and any preference ranking task with more than two options.

Realistic IAA Targets by Task Type

These targets are calibrated to production annotation in 2026, not research benchmarks. They assume native-speaker annotators (for language tasks), domain-specialist annotators (for medical tasks), and calibration-before-production workflows. For crowd-worker annotation, add a buffer of 0.05–0.10 to these targets to achieve equivalent effective quality.

Task type  |  Metric  |  Target  |  Notes
Binary classification (unambiguous)  |  Cohen's κ  |  ≥ 0.82  |  Spam, bounding box presence, OCR correctness
Multi-class (3–5 classes, English)  |  Cohen's κ  |  ≥ 0.74  |  Topic classification, intent detection
Named entity recognition  |  Token-level κ  |  ≥ 0.78  |  Span boundary disagreements inflate error
Sentiment analysis (5-class, English)  |  Cohen's κ  |  ≥ 0.74  |  Social media harder than product reviews
Sentiment analysis (dialectal Arabic)  |  Cohen's κ  |  ≥ 0.65  |  Genuinely harder task, not annotator weakness
Medical image severity grading  |  Krippendorff's α (ordinal)  |  ≥ 0.75  |  Radiologist pairs; binary finding uses κ ≥ 0.72
RLHF preference pairs  |  Cohen's κ  |  ≥ 0.68  |  Instruction-following quality is inherently subjective
Toxicity / content safety  |  Cohen's κ  |  ≥ 0.76  |  Per-policy category; aggregate κ hides class gaps
Ranking / pairwise preference  |  Spearman ρ  |  ≥ 0.80  |  Kappa is wrong metric for ordinal rankings
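One way to keep these targets from drifting into tribal knowledge is to encode them as configuration and check every IAA report against them. The sketch below copies the targets from the table; the dictionary keys, metric identifiers, and the `check_iaa` function are hypothetical names chosen for this example.

```python
# Targets copied from the table above; keys and metric names are hypothetical.
IAA_TARGETS = {
    "binary_classification": ("cohen_kappa", 0.82),
    "multiclass_english": ("cohen_kappa", 0.74),
    "named_entity_recognition": ("token_kappa", 0.78),
    "sentiment_english_5class": ("cohen_kappa", 0.74),
    "sentiment_dialectal_arabic": ("cohen_kappa", 0.65),
    "medical_severity_grading": ("krippendorff_alpha_ordinal", 0.75),
    "rlhf_preference_pairs": ("cohen_kappa", 0.68),
    "toxicity": ("cohen_kappa", 0.76),
    "ranking": ("spearman_rho", 0.80),
}


def check_iaa(task_type, metric, score, crowd_buffer=0.0):
    """Compare a reported score with the target for its task type.

    crowd_buffer is the extra margin (0.05-0.10 in the text) applied when
    the labels come from crowd workers rather than specialists.
    """
    expected_metric, target = IAA_TARGETS[task_type]
    if metric != expected_metric:
        raise ValueError(
            f"{task_type} should be reported with {expected_metric}, got {metric}"
        )
    required = target + crowd_buffer
    return {"required": required, "passed": score >= required}


print(check_iaa("sentiment_dialectal_arabic", "cohen_kappa", 0.67))
```

Rejecting a mismatched metric outright is a deliberate choice here: a Spearman ρ of 0.80 and a κ of 0.80 are not interchangeable, so comparing across metrics would be misleading.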


Diagnosing Low IAA: A Four-Step Framework

Low inter-annotator agreement has four root causes. They require different fixes and the symptoms overlap enough that misidentification is common:

1. Guideline Ambiguity

Signature: IAA is low across all annotator pairs simultaneously on the same item types. No single annotator is an outlier — everyone disagrees in the same places.

Fix: Identify the two or three item types with the most concentrated disagreements. Add decision examples and edge-case resolution rules specifically for those types. Rerun calibration on 50 new items before returning to production volume. Do not add general restrictions — only targeted guidance for the observed failure modes.

2. Annotator Domain Mismatch

Signature: One annotator's pairwise kappa with all others is consistently 0.10–0.20 below the rest of the team. The underperforming annotator disagrees with everyone, including on items other annotators agree on easily.

Fix: Reassign. A Gulf Arabic–native annotator placed on Moroccan Darija tasks, or a general medical annotator on pathology-specific grading, will have structurally lower IAA that calibration alone cannot fix. The domain gap is real.
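The outlier signature above can be checked mechanically by averaging each annotator's pairwise kappa against everyone else. This is a rough sketch: comparing against the team median and using a 0.10 gap are assumptions made for the example (the gap mirrors the 0.10–0.20 range in the text), and the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, median


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    pe = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return float("nan") if pe == 1.0 else (po - pe) / (1.0 - pe)


def mean_pairwise_kappa(labels_by_annotator):
    """Average kappa of each annotator against every other annotator."""
    scores = {name: [] for name in labels_by_annotator}
    for x, y in combinations(labels_by_annotator, 2):
        kappa = cohen_kappa(labels_by_annotator[x], labels_by_annotator[y])
        scores[x].append(kappa)
        scores[y].append(kappa)
    return {name: mean(values) for name, values in scores.items()}


def flag_outliers(mean_kappas, gap=0.10):
    """Annotators whose mean kappa sits at least `gap` below the team median."""
    team_median = median(mean_kappas.values())
    return sorted(name for name, k in mean_kappas.items()
                  if team_median - k >= gap)


team = {
    "ann1": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "ann2": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "ann3": [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
}
means = mean_pairwise_kappa(team)
outliers = flag_outliers(means)
print(outliers)  # ['ann3']
```

With small teams a single outlier also drags down everyone else's average, so it is worth reading the full pairwise matrix before reassigning anyone.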

3. Genuine Task Difficulty

Signature: Low IAA is concentrated on items that, on review, are genuinely ambiguous. Multiple expert adjudicators also disagree on the same items. IAA on unambiguous items is well above threshold.

Fix: Lower the threshold for this item type, document the ambiguity in your data card, and add an “ambiguous” flag rather than forcing a label. Ambiguous items with diverse labels can be used as calibration and uncertainty estimation examples rather than discarded.

4. Data Distribution Skew

Signature: Raw agreement looks acceptable but kappa is below threshold. Label distribution shows one class at 80%+ of the dataset.

Fix: Stratified sampling to rebalance classes before production annotation. Alternatively, report per-class κ separately and acknowledge that the minority class has higher uncertainty. Use Krippendorff's α if the imbalance is structural to the task rather than a sampling artefact.
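The skew signature is easy to confirm numerically. The sketch below reports raw agreement, kappa, and the share of the largest class side by side; the 0.80 cut-off mirrors the "80%+" rule of thumb above, and the function name and toy labels are illustrative.

```python
from collections import Counter


def skew_report(labels_a, labels_b, majority_share=0.80):
    """Raw agreement, Cohen's kappa, and a class-skew flag for two annotators."""
    n = len(labels_a)
    raw = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    pe = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    kappa = float("nan") if pe == 1.0 else (raw - pe) / (1.0 - pe)
    pooled = count_a + count_b
    top_share = max(pooled.values()) / (2 * n)
    return {
        "raw_agreement": raw,
        "kappa": kappa,
        "top_class_share": top_share,
        "skewed": top_share >= majority_share,
    }


# Toy data: a 90/10 split with eight disagreements out of 100 items.
ann_a = ["maj"] * 86 + ["min"] * 6 + ["maj"] * 4 + ["min"] * 4
ann_b = ["maj"] * 86 + ["min"] * 6 + ["min"] * 4 + ["maj"] * 4
report = skew_report(ann_a, ann_b)
print(report)
```

Here raw agreement is 92% but kappa is only about 0.56, because a 90/10 split already produces 82% agreement by chance.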

IAA in Practice: The Metrics Stack That Actually Covers You

No single metric covers all the ways annotation quality can fail. A production-grade IAA reporting stack uses:

Document all five metrics in your data cards, not just the headline figure. Downstream model developers and compliance teams need the full picture — especially the per-class breakdown — to understand where training data uncertainty is highest and how that maps to expected model confidence on production traffic.

FAQ

What is a good Cohen's kappa for annotation quality?

It depends on the task. Binary classification targets κ ≥ 0.82. Multi-class classification (3–5 classes) targets κ ≥ 0.74. Dialectal Arabic sentiment can accept κ ≥ 0.65 because genuine linguistic ambiguity is irreducible. The Landis & Koch thresholds from 1977 were calibrated on clinical diagnosis, not NLP, so apply them to annotation only with task-specific calibration.

Why is very high inter-annotator agreement sometimes a problem?

κ > 0.95 on a non-trivial classification task typically signals over-constrained guidelines, pre-filtered easy data, or annotator coordination. Models trained on such data often underperform on production traffic because the training distribution doesn't represent genuine task difficulty.

When should I use Fleiss's kappa instead of Cohen's kappa?

Use Fleiss's kappa for team-level IAA reporting when three or more annotators label each item. Use individual Cohen's kappa pairs for annotator performance management — Fleiss's masks individual outliers that pairwise analysis surfaces.

What is Krippendorff's alpha and when should I use it?

Krippendorff's alpha weights disagreements by distance on ordinal or interval scales, so a one-step miss counts less than a five-step miss. Use it for severity ratings, toxicity scoring, MOS audio scores, and any task where near-misses are qualitatively different from gross errors.

How do I diagnose the root cause of low inter-annotator agreement?

Check whether low IAA is (a) spread across all annotator pairs on the same item types, which indicates guideline ambiguity; (b) driven by one annotator who is consistently out of step, which indicates domain mismatch; (c) clustered on items that experts also disagree on, which indicates genuine task difficulty; or (d) coupled with a high-raw-agreement/low-kappa pattern, which indicates class imbalance. Each cause has a different fix.

Does Cohen's kappa work for continuous or ranking tasks?

No. For continuous ratings, use Krippendorff's α with an interval distance function. For ranking tasks, where annotators order several candidates rather than choose one of a pair, use Spearman rank correlation or Kendall's tau. Applying Cohen's kappa to ordinal data counts every near-miss as a full disagreement, which inflates disagreement estimates and misrepresents annotator alignment.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
