What is annotation QA?

Annotation QA is the process of checking that labels in your training data are actually correct before that data trains a model. In practice it's a layered system: a spec everyone agrees on, a gold-standard set, a sampling rate, a relabelling loop for disagreements, and per-batch metric reporting. It is not 'have someone glance at 50 images at the end' — that's how bad datasets get shipped.

What does an annotation QA process actually look like?

A working process has six moving parts: a versioned annotation spec, a gold-standard reference set built by domain experts, a calibration round where every annotator labels the gold set and you measure inter-annotator agreement, blind double-annotation on a sample of every production batch, an adjudication queue where senior reviewers resolve disagreements, and per-batch QA reporting (accuracy vs gold, agreement, error class breakdown). Everything else is decoration.

How do you measure annotation quality?

The right metric depends on the task. For classification: accuracy against a gold set and Cohen's kappa for inter-annotator agreement. For bounding boxes and cuboids: IoU at agreed thresholds (0.5 / 0.7) plus orientation error for 3D. For segmentation: per-pixel IoU / mIoU. For NLP entity tasks: F1 against gold annotations. Underneath all of it is one rule: measure error against a real reference set, not against whatever the annotators agree on among themselves.
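For the bounding-box case, here is a minimal sketch of the IoU check. It assumes axis-aligned 2D boxes stored as (x_min, y_min, x_max, y_max). The names `Box`, `iou` and `passes` are illustrative, not part of any particular tool.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def passes(label: Box, gold: Box, threshold: float = 0.5) -> bool:
    """True when the annotator's box overlaps the gold box enough."""
    return iou(label, gold) >= threshold
```

A box shifted half its width off the gold box scores an IoU of one third, so it fails at both the 0.5 and the 0.7 threshold. Rotated boxes and 3D cuboids need a different intersection routine; this sketch does not cover them.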

What's a realistic gold-standard set size?

For most production projects, 200–500 examples per class, built and adjudicated by your most senior reviewer or a domain expert. Less than that and rare-class accuracy is noise; more than that and you're spending review hours that would be better used on production sampling. The exception is regulated domains (medical, autonomous driving, finance) where the regulator effectively sets the floor — there you build to satisfy the audit, not the model.
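To see why small gold sets give noisy per-class numbers, here is a rough sketch using the normal approximation to a binomial proportion. The 90% accuracy figure is a hypothetical input, and the approximation gets shaky at very small n or at accuracies near 100%, so treat the output as a guide rather than a guarantee.

```python
import math


def margin_of_error(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width for an accuracy measured on n gold examples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Hypothetical: an annotator who is truly 90% accurate on one class.
for n in (20, 50, 200, 500):
    print(f"n={n:>3}: 90% +/- {margin_of_error(0.90, n) * 100:.1f} points")
```

Under these assumptions the uncertainty is roughly 13 points at 20 examples, about 4 points at 200, and under 3 points at 500. The gains flatten out past that, which is consistent with the 200–500 range above.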

What sampling rate should QA use on production batches?

Start at 10% of every batch with blind double-annotation, plus a 100% pass against the gold set. Drop to 5% once a vendor has held above your quality bar for three consecutive batches. Push back up to 20% any time agreement drops or a new task type is introduced. Hard rule — never let the sampling rate get so low that a quiet regression goes a whole month without being caught.
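The sampling rule above can be written down as a small policy function. This is a sketch only: `next_sampling_rate` and its arguments are hypothetical names, and what counts as "agreement dropped" is left to the caller because the threshold depends on your quality bar.

```python
from typing import Sequence


def next_sampling_rate(
    recent_passes: Sequence[bool],
    agreement_dropped: bool = False,
    new_task_type: bool = False,
) -> float:
    """Fraction of the next batch to double-annotate blind.

    recent_passes: one entry per delivered batch, oldest first,
    True when the batch met the agreed quality bar.
    """
    # Any sign of trouble or novelty overrides the vendor's track record.
    if agreement_dropped or new_task_type:
        return 0.20
    # Three consecutive clean batches earn the reduced rate.
    if len(recent_passes) >= 3 and all(recent_passes[-3:]):
        return 0.05
    # Default starting rate.
    return 0.10
```

For example, `next_sampling_rate([True, True, True])` gives 0.05, while the same history with `new_task_type=True` goes straight to 0.20.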

How much does annotation QA add to project cost?

Honest answer — properly run QA usually sits at 15–25% of the production labelling cost. Skim QA (1–2%) costs less up front and is the most expensive thing you can do, because you only find the rot after the model fails and you're re-labelling at full rate. The teams who underestimate QA are also the teams who pay twice. The maths is genuinely not subtle.

Should you trust a vendor's self-reported QA numbers?

Not blindly. A vendor reporting 99.5% accuracy against their own gold set tells you almost nothing — they built the gold set, they QA against it, the number is partly marketing. What to ask for instead — let them QA against a small gold set you built and didn't share, and let them know up front that you'll be auditing 50–100 of their delivered labels independently. Vendors who push back on that arrangement are telling you something.

What are the warning signs an annotation dataset has quality problems?

A few signatures, all easy to spot: model accuracy plateaus surprisingly early and won't budge with more data; error analysis shows the model getting clean cases right and edge cases wildly wrong; inter-annotator agreement on the team is high but agreement with senior reviewers is low (group-think); rare classes have zero precision; and any vendor batch that arrives with a single combined 'quality score' and no per-class breakdown.

Annotation QA: The Honest Playbook for Catching Bad Labels Before They Wreck Your Model (2026)

Look, I'll be straight with you. The reason your model isn't hitting accuracy is probably not the architecture. It's probably that 12% of your bounding boxes have the heading flipped, your “medium severity” class is whatever the annotator felt that day, and the gold-standard set you're measuring against was built by the same person who wrote the guidelines — so the test is graded by the teacher. Classic.

Annotation QA is the boring, expensive, totally non-glamorous bit of training-data work that determines whether the model ships. This guide is what we actually do at AI Taggers when we run QA on a project — not the polished consultancy-deck version, the “here's what survives contact with real annotators on a Friday afternoon” version. If you take one thing away — QA is a system, not a step.

First, What QA Actually Is (And What It Isn't)

Annotation QA is the system of checks that catches bad labels before they end up in your training set. That's it. It's not a number on a delivery report. It's not “the senior reviewer glanced through”. It's a layered process — spec, gold set, calibration, sampling, adjudication, reporting — and skipping any one layer turns the rest into theatre.

The signature of a project without real QA is depressingly consistent. The dataset looks great on the delivery dashboard. The model trains. Accuracy plateaus weirdly early. Error analysis shows the model nailing easy cases and going to pieces on edge cases. Somebody re-labels a sample by hand and finds 8–15% of the labels are wrong in ways that quietly bias the model. Two months down the drain. We see this every quarter.

The Six Layers That Actually Make QA Work

A working QA stack has six layers. Skip one and the others can't cover for it. In order:

1. A versioned, written annotation spec

Not Slack messages. Not “Jess will know”. A document, in git, that defines every class, every edge case, every “what do we do when the object is partially occluded”, every borderline that comes up. Versioned, because the spec WILL change in week three and you need to know which version a given batch was labelled under.

If your annotation team can't answer “what do we do with X” without messaging the project lead, the spec isn't finished. That's genuinely most projects on day one. Get it finished before you scale, not after.

2. A gold-standard reference set

200–500 examples per class, labelled and adjudicated by your most senior reviewer (or, in regulated domains, a credentialed expert). This is the truth file. Every annotator labels it on day one. Every production batch is QA'd against it. The gold set is the only reason any other metric on the project means anything.

Critically, the gold set is not allowed to be built by the same people who QA the production work. If the umpire wrote the rulebook AND refs the match, the match isn't a match. We split gold-set construction and ongoing QA between different reviewers on principle.

3. Calibration before production

Every annotator labels the gold set blind. You measure their accuracy and inter-annotator agreement (Cohen's kappa for pairs of annotators, Fleiss' kappa for three or more). Anyone below your agreement threshold doesn't go on production. They get retrained on the cases they missed and re-tested. Sounds harsh. It isn't. It's cheaper than discovering the same problem at batch five.

What “agreement threshold” means in practice — for most production projects we want κ ≥ 0.80 between annotators on the gold set. Above 0.90 is a sign the task is too easy or the spec is too constraining. Below 0.70 is a sign the spec is broken — fix the spec, not the annotators. For the deeper maths, see our Cohen's kappa guide.
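For readers who want the number behind the threshold, here is a minimal from-scratch sketch of Cohen's kappa for two annotators, plus a reading of the cut-offs above. The function names and the wording of the verdicts are illustrative. The 0.70 to 0.80 band is interpreted here as "retrain and re-test", which is an assumption based on the calibration rule rather than something the thresholds state directly.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items where the labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's class frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def read_kappa(kappa: float) -> str:
    """Map a kappa value onto the thresholds discussed above."""
    if kappa < 0.70:
        return "fix the spec"
    if kappa < 0.80:
        return "below the production bar: retrain and re-test"
    if kappa <= 0.90:
        return "ready for production"
    return "check whether the task is too easy or the spec too constraining"
```

Two annotators who match on three of four items can still land at kappa 0.5 if their class frequencies make chance agreement high, which is why raw percent agreement is not a substitute.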

4. Blind double-annotation on every batch

Sample of every production batch — start at 10%, drop to 5% only after a vendor or team has hit your bar for three consecutive batches. The sample is labelled blind by a second annotator. Disagreements get routed to the adjudication queue. Per-class agreement gets reported.

Why blind — because annotators who can see the first label will anchor to it almost every time. The point of double-annotation is to catch the cases where two competent annotators genuinely see the same thing differently. If they can see each other's work, you've measured peer pressure, not agreement.
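A sketch of the mechanics, assuming labels are simple strings keyed by item id. The helper names (`pick_qa_sample`, `blind_task`, `disagreements`) are hypothetical. The one design point that matters is that the task handed to the second annotator never carries the first label.

```python
import random
from typing import Dict, List, Tuple


def pick_qa_sample(item_ids: List[str], rate: float = 0.10, seed: int = 0) -> List[str]:
    """Randomly choose the share of a batch that gets a second, blind pass."""
    if not item_ids:
        return []
    k = min(len(item_ids), max(1, round(len(item_ids) * rate)))
    return sorted(random.Random(seed).sample(item_ids, k))


def blind_task(item_id: str) -> Dict[str, str]:
    """What the second annotator sees: the item reference and nothing else."""
    return {"item_id": item_id}


def disagreements(
    first: Dict[str, str], second: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Items for the adjudication queue: (item_id, first label, second label)."""
    return [
        (item_id, first[item_id], second[item_id])
        for item_id in sorted(second)
        if first[item_id] != second[item_id]
    ]
```

With a 100-item batch at the default rate, `pick_qa_sample` returns 10 ids. Only the ids where the two passes differ reach the senior reviewer.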

5. An adjudication queue with a senior reviewer

Every disagreement from double-annotation ends up here. A senior reviewer makes the call, ideally adding a one-line note about why. Those notes are gold — they become the next version of the spec. Adjudication isn't overhead; it's how the spec gets sharper without anyone having to predict every edge case in advance.

The mistake teams make — letting the original annotator adjudicate their own work. Doesn't count. Adjudication must come from a different reviewer or it's just confirmation bias dressed up as process.

6. Per-batch QA reporting (the report that has to exist)

Every batch ships with a one-page report. Numbers we report on every batch we touch:

Accuracy vs gold set, overall and per class (per class — never just overall)
Inter-annotator agreement on the QA sample
Adjudication queue volume and resolution time
Top three error categories with examples
Any spec changes that landed during the batch

If a vendor delivers a batch with a single combined “99.x% accuracy” number and no per-class breakdown, that's a red flag. It almost always means the rare classes are tanking and the abundant classes are doing all the lifting. Ask for the breakdown. Watch the response.
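To make the point concrete, here is a small sketch with synthetic data: 100 gold items, 5 of them in a rare class. The class names and counts are made up for illustration. Recall per gold class is used as the per-class accuracy figure.

```python
from collections import defaultdict
from typing import Dict, Sequence


def per_class_accuracy(gold: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Share of each gold class that the delivered labels got right."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for g, lab in zip(gold, labels):
        total[g] += 1
        correct[g] += int(g == lab)
    return {cls: correct[cls] / total[cls] for cls in total}


# Synthetic batch: the abundant class is perfect, the rare class is mostly missed.
gold = ["car"] * 95 + ["motorcyclist"] * 5
labels = ["car"] * 95 + ["car"] * 4 + ["motorcyclist"]

overall = sum(g == lab for g, lab in zip(gold, labels)) / len(gold)
report = per_class_accuracy(gold, labels)

print(f"overall: {overall:.0%}")
for cls, acc in report.items():
    print(f"{cls}: {acc:.0%}")
```

The combined number comes out at 96% while the rare class sits at 20%. A single headline score hides exactly this.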

The Stuff That Gets Quietly Ignored

Rare classes are where projects die

A medical imaging dataset that's 95% “normal” can hit 95% accuracy by labelling everything as normal. We've all done worse. Stratify your sampling: every QA batch must include rare classes in a proportion that matches what the model needs to learn, not the dataset distribution. Otherwise rare-class performance is unmeasured, then it's zero, then the radiologists find out at the worst possible moment.
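One simple way to stratify is a fixed quota per class, so a rare class gets as many QA slots as an abundant one. That is an assumption made for this sketch; the right proportions depend on what the model needs to learn. `stratified_qa_sample` is a hypothetical helper name.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_qa_sample(
    items: List[Tuple[str, str]], per_class: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Draw up to per_class item ids from every class in the batch.

    items: (item_id, assigned_class) pairs from the production batch.
    """
    by_class: Dict[str, List[str]] = defaultdict(list)
    for item_id, cls in items:
        by_class[cls].append(item_id)

    rng = random.Random(seed)
    return {
        cls: rng.sample(ids, min(per_class, len(ids)))
        for cls, ids in sorted(by_class.items())
    }
```

On a batch of 95 “normal” and 5 “abnormal” items with `per_class=5`, both classes contribute 5 items to QA, where a plain 10% random draw would often include no abnormal cases at all. Note the strata here come from the assigned labels, so items mislabelled into the abundant class still need the ordinary random sample to be caught.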

Domain expertise is not interchangeable with effort

A diligent crowd annotator cannot QA medical imaging, regardless of how many hours you give them. A diligent crowd annotator also cannot QA Arabic sentiment if they don't speak Khaleeji. The honest version of QA scoping is — match the reviewer to the domain, and if you can't, the project needs a different vendor or a clinical-expert overlay (which is what we do on medical work — see clinical expert annotation).

The vendor self-marking problem

Most annotation vendors report their own QA numbers. Their QA is graded against their own gold set, which they built. The number can be 99.5% and still be useless. The fix is straightforward and we recommend it on every engagement — you (the client) build a small hidden gold set we don't see, you audit 50–100 of our delivered labels independently, we agree on what the bar is up front. We have never had a client regret doing this. The vendors who balk at it are telling you everything.

What QA Actually Costs (And Why Skim QA Costs More)

Properly run QA — gold set construction, calibration, 10% double-annotation, adjudication, per-batch reporting — usually lands around 15–25% on top of production labelling cost. Yes, that's real money. No, you don't get to skip it.

Skim QA (a 1–2% spot check at the end) looks cheap until the model fails and you re-label the affected batches at full rate. We've been brought in to fix projects where the “cheap” option ended up costing 2–3× the proper QA budget once the rework was tallied. The cost-benefit isn't subtle. For broader pricing context, see our data annotation pricing guide.
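The arithmetic, sketched with a hypothetical labelling budget of 100,000 (any currency). The 20% and 2% rates are picked from inside the ranges above, and the 2–3× multiplier is the rework range quoted in the previous paragraph. None of this is data from a specific project.

```python
def qa_cost(labelling_cost: float, qa_rate: float) -> float:
    """QA spend as a fraction of production labelling cost."""
    return labelling_cost * qa_rate


labelling = 100_000.0  # hypothetical production labelling budget

proper_qa = qa_cost(labelling, 0.20)  # inside the 15-25% range
skim_qa = qa_cost(labelling, 0.02)    # inside the 1-2% range

# Assumed outcome when skim QA misses the problem: total QA plus rework
# lands at 2-3x what proper QA would have cost.
skim_total_low = 2 * proper_qa
skim_total_high = 3 * proper_qa

print(f"proper QA:            {proper_qa:,.0f}")
print(f"skim QA up front:     {skim_qa:,.0f}")
print(f"skim QA after rework: {skim_total_low:,.0f} to {skim_total_high:,.0f}")
```

Under these assumptions the up-front saving is 18,000 and the eventual bill is 40,000 to 60,000 against 20,000 for doing it properly.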

Got a dataset you're not sure about?

Send us a sample — 1,000 labels, your hardest task, no commitment. We'll run our QA process on it and send back a per-class accuracy and error-class breakdown. Free.

See our QA & relabeling service

A Worked Example: How We'd Run QA on a Fresh Project

Quick walkthrough of how we'd set up a 50,000-image bounding-box project tomorrow morning:

Week 0: Spec finalised in git. 300-image gold set built and adjudicated by senior reviewer plus client representative. Per-class definitions locked.
Week 1: Calibration. Every annotator labels the gold set blind. Pass/fail bar at κ ≥ 0.80 vs gold. Two of eight annotators don't pass — they get an extra training day on the failure cases. One passes on retest; one is rotated off the project.
Weeks 2–8: Production. 10% of every batch double-annotated blind. Disagreements routed to a senior reviewer with sub-24-hour SLA. Batch reports go to the client Friday afternoon AEST every week.
Week 3 surprise: Per-class accuracy reveals “motorcyclist” is sitting at 87% (everything else above 96%). Spec didn't cover scooters vs e-bikes. We pause the class, the spec gets a new section, all motorcyclist labels from weeks 2–3 get re-checked under the new spec. Client signs the new spec version. Production resumes.
Week 8: Final delivery. Client's independent audit of 100 random labels — 97% accuracy. Their internal gold set agreement — κ 0.83. They train. The model works.

That's the boring, expensive, totally non-glamorous version. The “move fast and break things” version is the same schedule plus another six weeks fixing what broke. We've been on both sides.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.