Look, I'll be straight with you. The reason your model isn't hitting accuracy is probably not the architecture. It's probably that 12% of your bounding boxes have the heading flipped, your “medium severity” class is whatever the annotator felt that day, and the gold-standard set you're measuring against was built by the same person who wrote the guidelines — so the test is graded by the teacher. Classic.
Annotation QA is the boring, expensive, totally non-glamorous bit of training-data work that determines whether the model ships. This guide is what we actually do at AI Taggers when we run QA on a project — not the polished consultancy-deck version, the “here's what survives contact with real annotators on a Friday afternoon” version. If you take one thing away — QA is a system, not a step.
First, What QA Actually Is (And What It Isn't)
Annotation QA is the system of checks that catches bad labels before they end up in your training set. That's it. It's not a number on a delivery report. It's not “the senior reviewer glanced through”. It's a layered process — spec, gold set, calibration, sampling, adjudication, reporting — and skipping any one layer turns the rest into theatre.
The signature of a project without real QA is depressingly consistent. The dataset looks great on the delivery dashboard. The model trains. Accuracy plateaus weirdly early. Error analysis shows the model nailing easy cases and going to pieces on edge cases. Somebody re-labels a sample by hand and finds 8–15% of the labels are wrong in ways that quietly bias the model. Two months down the drain. We see this every quarter.
The Six Layers That Actually Make QA Work
A working QA stack has six layers. Skip one and the others can't cover for it. In order:
1. A versioned, written annotation spec
Not slack messages. Not “Jess will know”. A document, in git, that defines every class, every edge case, every “what do we do when the object is partially occluded”, every borderline that comes up. Versioned, because the spec WILL change in week three and you need to know which version a given batch was labelled under.
If your annotation team can't answer “what do we do with X” without messaging the project lead, the spec isn't finished. That's genuinely most projects on day one. Get it finished before you scale, not after.
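One way to make "which version was this batch labelled under" answerable is to stamp every label with the spec version when it is written. A minimal sketch, assuming hypothetical field names (`spec_version`, `label_class`) and a made-up `v1.0` / `v1.1` tagging scheme; it is not taken from any particular tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    item_id: str
    label_class: str
    batch: int
    spec_version: str  # hypothetical: git tag of the spec in force when labelled


def needs_recheck(records, changed_class, current_version):
    """Labels of `changed_class` made under any spec version other than the current one."""
    return [
        r for r in records
        if r.label_class == changed_class and r.spec_version != current_version
    ]


# Illustrative records only, not real project data.
records = [
    LabelRecord("img_001", "motorcyclist", batch=2, spec_version="v1.0"),
    LabelRecord("img_002", "car", batch=2, spec_version="v1.0"),
    LabelRecord("img_003", "motorcyclist", batch=4, spec_version="v1.1"),
]
stale = needs_recheck(records, "motorcyclist", current_version="v1.1")
print([r.item_id for r in stale])
```

When the spec changes for one class, the re-check list becomes a query rather than an archaeology exercise.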
2. A gold-standard reference set
200–500 examples per class, labelled and adjudicated by your most senior reviewer (or, in regulated domains, a credentialed expert). This is the truth file. Every annotator labels it on day one. Every production batch is QA'd against it. The gold set is the only reason any other metric on the project means anything.
Critically, the gold set is not allowed to be built by the same people who QA the production work. If the umpire wrote the rulebook AND refs the match, the match isn't a match. We split gold-set construction and ongoing QA between different reviewers on principle.
3. Calibration before production
Every annotator labels the gold set blind. You measure their accuracy and inter-annotator agreement (Cohen's kappa for a pair of annotators, Fleiss' kappa for three or more). Anyone below your agreement threshold doesn't go on production. They get retrained on the cases they missed and re-tested. Sounds harsh. It isn't. It's cheaper than discovering the same problem at batch five.
What “agreement threshold” means in practice — for most production projects we want κ ≥ 0.80 between annotators on the gold set. Above 0.90 is a sign the task is too easy or the spec is too constraining. Below 0.70 is a sign the spec is broken — fix the spec, not the annotators. For the deeper maths, see our Cohen's kappa guide.
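For a feel of how the threshold behaves, here is a minimal sketch of Cohen's kappa for one annotator against the gold set, plus a gate using the 0.70 / 0.80 / 0.90 bands described above. The function names and the toy labels are illustrative assumptions, and the verdict strings are ours:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement implied by each side's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def calibration_verdict(kappa):
    """Map kappa onto the bands from the text (verdict wording is illustrative)."""
    if kappa < 0.70:
        return "fix the spec"
    if kappa < 0.80:
        return "retrain and re-test"
    if kappa > 0.90:
        return "pass, but check the task is not too easy"
    return "pass"


# Toy data: ten items, one disagreement.
gold = ["car", "car", "bike", "car", "bike", "bike", "car", "bike", "car", "car"]
annotator = ["car", "car", "bike", "car", "bike", "car", "car", "bike", "car", "car"]
kappa = cohens_kappa(gold, annotator)
print(round(kappa, 3), calibration_verdict(kappa))
```

In this toy case the annotator matches gold on 9 of 10 items, yet kappa lands just under 0.80 because a good share of that agreement is what chance alone would produce. That gap is why we gate on kappa and not on raw agreement.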
4. Blind double-annotation on every batch
Take a sample of every production batch: start at 10%, and drop to 5% only after a vendor or team has hit your bar for three consecutive batches. The sample is labelled blind by a second annotator. Disagreements get routed to the adjudication queue. Per-class agreement gets reported.
Why blind — because annotators who can see the first label will anchor to it almost every time. The point of double-annotation is to catch the cases where two competent annotators genuinely see the same thing differently. If they can see each other's work, you've measured peer pressure, not agreement.
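The sampling rule and the "blind" part can both be enforced in code, so nobody has to rely on good intentions. A rough sketch, assuming a hypothetical record shape (`item_id`, `image`, `label`, `annotator`) and a fixed seed so the draw is reproducible:

```python
import random


def qa_sample_rate(recent_passes, base=0.10, reduced=0.05, streak=3):
    """Use the reduced rate only after `streak` consecutive passing batches."""
    last = recent_passes[-streak:]
    if len(last) == streak and all(last):
        return reduced
    return base


def draw_sample(item_ids, rate, seed):
    """Reproducible random sample of item ids for second-pass labelling."""
    rng = random.Random(seed)
    k = max(1, round(len(item_ids) * rate))
    return rng.sample(item_ids, k)


def blind_task(record):
    """Keep only what the second annotator needs: no first label, no annotator id."""
    return {"item_id": record["item_id"], "image": record["image"]}


# Hypothetical first-pass batch of 1,000 items.
batch = [
    {"item_id": i, "image": f"img_{i:05d}.jpg", "label": "car", "annotator": "ann_03"}
    for i in range(1000)
]
rate = qa_sample_rate([True, True, False])  # a recent miss keeps the rate at 10%
picked = draw_sample([r["item_id"] for r in batch], rate, seed=7)
tasks = [blind_task(batch[i]) for i in picked]
print(rate, len(tasks))
```

Stripping the first label at the point where the task is built is the design choice that matters: the second annotator's tool never receives it, so there is nothing to anchor to.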
5. An adjudication queue with a senior reviewer
Every disagreement from double-annotation ends up here. A senior reviewer makes the call, ideally adding a one-line note about why. Those notes are gold — they become the next version of the spec. Adjudication isn't overhead; it's how the spec gets sharper without anyone having to predict every edge case in advance.
The mistake teams make — letting the original annotator adjudicate their own work. Doesn't count. Adjudication must come from a different reviewer or it's just confirmation bias dressed up as process.
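The "different reviewer" rule is easy to state and easy to break under deadline pressure, so it helps to make the queue refuse the bad assignment. A small sketch with hypothetical field names and reviewer ids:

```python
def pick_adjudicator(disagreement, senior_reviewers):
    """Return the first senior reviewer who did not label the item."""
    involved = {disagreement["first_annotator"], disagreement["second_annotator"]}
    for reviewer in senior_reviewers:
        if reviewer not in involved:
            return reviewer
    raise ValueError("no independent reviewer available for this item")


def record_decision(disagreement, reviewer, final_label, note):
    """Store the ruling plus the one-line reason that feeds the next spec version."""
    if reviewer in (disagreement["first_annotator"], disagreement["second_annotator"]):
        raise ValueError("adjudicator must not be one of the original annotators")
    return {**disagreement, "adjudicator": reviewer, "final_label": final_label, "note": note}


# Hypothetical disagreement where one annotator is also a senior reviewer.
case = {
    "item_id": "img_00412",
    "first_annotator": "rev_a",
    "second_annotator": "ann_07",
    "labels": ("motorcyclist", "cyclist"),
}
reviewer = pick_adjudicator(case, ["rev_a", "rev_b"])
decision = record_decision(case, reviewer, "motorcyclist", "e-scooter with seat: treat as motorcyclist")
print(decision["adjudicator"], decision["final_label"])
```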
6. Per-batch QA reporting (the report that has to exist)
Every batch ships with a one-page report. Numbers we report on every batch we touch:
- Accuracy vs gold set, overall and per class (per class — never just overall)
- Inter-annotator agreement on the QA sample
- Adjudication queue volume and resolution time
- Top three error categories with examples
- Any spec changes that landed during the batch
If a vendor delivers a batch with a single combined “99.x% accuracy” number and no per-class breakdown, that's a red flag. It almost always means the rare classes are tanking and the abundant classes are doing all the lifting. Ask for the breakdown. Watch the response.
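To see how a combined number hides a failing rare class, here is a minimal per-class breakdown. The class names and counts are invented for illustration:

```python
from collections import defaultdict


def accuracy_report(gold, delivered):
    """Return (overall accuracy, {class: accuracy}) keyed by the gold class."""
    if len(gold) != len(delivered) or not gold:
        raise ValueError("need two equal-length, non-empty label lists")
    totals, correct = defaultdict(int), defaultdict(int)
    for g, d in zip(gold, delivered):
        totals[g] += 1
        correct[g] += int(g == d)
    overall = sum(correct.values()) / len(gold)
    per_class = {c: correct[c] / totals[c] for c in totals}
    return overall, per_class


# Invented batch: 990 common-class items all right, 10 rare-class items, 4 right.
gold = ["car"] * 990 + ["motorcyclist"] * 10
delivered = ["car"] * 990 + ["motorcyclist"] * 4 + ["cyclist"] * 6
overall, per_class = accuracy_report(gold, delivered)
print(f"overall={overall:.3f}", per_class)
```

The headline figure here is above 99% while the rare class sits at 40%, which is exactly the pattern a single combined number covers up.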
The Stuff That Gets Quietly Ignored
Rare classes are where projects die
A medical imaging dataset that's 95% “normal” can hit 95% accuracy by labelling everything as normal. We've all done worse. Stratify your sampling: every QA batch must include rare classes in a proportion that matches what the model needs to learn, not the dataset distribution. Otherwise rare-class performance is unmeasured, then it's zero, then the radiologists find out at the worst possible moment.
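Both halves of that point fit in a few lines: the "label everything normal" baseline, and a QA sample drawn by per-class quota instead of by dataset frequency. A sketch with an invented 950 / 50 split and a hypothetical quota of 50 per class:

```python
import random
from collections import Counter, defaultdict


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    return max(Counter(labels).values()) / len(labels)


def stratified_qa_sample(items, quota_per_class, seed):
    """items: (item_id, label) pairs. Draw up to the quota from each class."""
    by_class = defaultdict(list)
    for item_id, label in items:
        by_class[label].append(item_id)
    rng = random.Random(seed)
    return {
        label: rng.sample(ids, min(quota_per_class.get(label, 0), len(ids)))
        for label, ids in by_class.items()
    }


# Invented dataset: 95% "normal", 5% "abnormal".
items = [(f"scan_{i:04d}", "normal") for i in range(950)]
items += [(f"scan_{i:04d}", "abnormal") for i in range(950, 1000)]

baseline = majority_baseline_accuracy([label for _, label in items])
sample = stratified_qa_sample(items, {"normal": 50, "abnormal": 50}, seed=11)
print(baseline, {label: len(ids) for label, ids in sample.items()})
```

One caveat: accuracy measured on a quota sample like this is not an estimate of dataset-wide accuracy, so report it per class and don't average it into a headline number.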
Domain expertise is not interchangeable with effort
A diligent crowd annotator cannot QA medical imaging, regardless of how many hours you give them. A diligent crowd annotator also cannot QA Arabic sentiment if they don't speak Khaleeji. The honest version of QA scoping is — match the reviewer to the domain, and if you can't, the project needs a different vendor or a clinical-expert overlay (which is what we do on medical work — see clinical expert annotation).
The vendor self-marking problem
Most annotation vendors report their own QA numbers. Their QA is graded against their own gold set, which they built. The number can be 99.5% and still be useless. The fix is straightforward and we recommend it on every engagement: you (the client) build a small hidden gold set we don't see, you audit 50–100 of our delivered labels independently, and we agree on what the bar is up front. We have never had a client regret doing this. The vendors who balk at it are telling you everything.
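An audit of 50–100 labels is small, so it is worth knowing how wide the uncertainty is before arguing over a point or two. The sketch below uses a Wilson score interval, which is our choice for illustration and not something the process above prescribes; the 97-out-of-100 input is a made-up audit result:

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical audit: 97 of 100 independently checked labels were correct.
low, high = wilson_interval(97, 100)
print(f"{low:.3f} to {high:.3f}")
```

For 97 correct out of 100 the interval runs from roughly 92% to 99%. That is plenty to catch a vendor who claims 99.5% and delivers 85%, and not enough to separate 96% from 98%, so agree the bar with that in mind.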
What QA Actually Costs (And Why Skim QA Costs More)
Properly run QA — gold set construction, calibration, 10% double-annotation, adjudication, per-batch reporting — usually lands around 15–25% on top of production labelling cost. Yes, that's real money. No, you don't get to skip it.
Skim QA (a 1–2% spot check at the end) looks cheap until the model fails and you re-label the affected batches at full rate. We've been brought in to fix projects where the “cheap” option ended up costing 2–3× the proper QA budget once the rework was tallied. The cost-benefit isn't subtle. For broader pricing context, see our data annotation pricing guide.
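The arithmetic behind that comparison, as a sketch. The 100,000 labelling budget is a hypothetical figure chosen only to make the percentages concrete; the 15–25% and 2–3× ranges are the ones quoted above:

```python
def proper_qa_cost(labelling_cost, low_pct=0.15, high_pct=0.25):
    """Range for properly run QA as a share of production labelling cost."""
    return labelling_cost * low_pct, labelling_cost * high_pct


def skim_qa_total_cost(proper_range, low_mult=2, high_mult=3):
    """Range for skim QA once rework is tallied, as a multiple of proper QA."""
    low, high = proper_range
    return low * low_mult, high * high_mult


labelling_cost = 100_000  # hypothetical budget, any currency
proper = proper_qa_cost(labelling_cost)
skim = skim_qa_total_cost(proper)
print(f"proper QA: {proper[0]:,.0f} to {proper[1]:,.0f}")
print(f"skim QA after rework: {skim[0]:,.0f} to {skim[1]:,.0f}")
```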
Got a dataset you're not sure about?
Send us a sample — 1,000 labels, your hardest task, no commitment. We'll run our QA process on it and send back a per-class accuracy and error-class breakdown. Free.
See our QA & relabeling service
A Worked Example: How We'd Run QA on a Fresh Project
Quick walkthrough of how we'd set up a 50,000-image bounding-box project tomorrow morning:
- Week 0: Spec finalised in git. 300-image gold set built and adjudicated by senior reviewer plus client representative. Per-class definitions locked.
- Week 1: Calibration. Every annotator labels the gold set blind. Pass/fail bar at κ ≥ 0.80 vs gold. Two of eight annotators don't pass — they get an extra training day on the failure cases. One passes on retest; one is rotated off the project.
- Weeks 2–8: Production. 10% of every batch double-annotated blind. Disagreements routed to a senior reviewer with sub-24-hour SLA. Batch reports go to the client Friday afternoon AEST every week.
- Week 3 surprise: Per-class accuracy reveals “motorcyclist” is sitting at 87% (everything else above 96%). Spec didn't cover scooters vs e-bikes. We pause the class, the spec gets a new section, all motorcyclist labels from weeks 2–3 get re-checked under the new spec. Client signs the new spec version. Production resumes.
- Week 8: Final delivery. Client's independent audit of 100 random labels — 97% accuracy. Their internal gold set agreement — κ 0.83. They train. The model works.
That's the boring, expensive, totally non-glamorous version. The “move fast and break things” version is the same schedule plus another six weeks fixing what broke. We've been on both sides.
A Few Honest Things to Read Next
- Annotation QA & relabeling service
- Data QA & validation service
- Cohen's kappa: when 80% is bad and 99% is worse
- The metrics that actually matter
- Annotation guidelines that don't need constant revision
- How to choose a data annotation company
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.