Radiology AI is one of the most clinically established corners of medical AI. FDA-cleared products exist for chest X-ray triage, mammography screening, lung CT nodule detection, and intracranial haemorrhage flagging on emergency head CT, and every one of them lives or dies on training-data quality and provenance documentation. The annotation behind a radiology model isn't just labelling. It's clinical work supported by labelling tools.
This guide is what we'd hand a team scoping their first radiology AI training-data contract: what the work actually involves, the modality-specific differences between MRI, CT, X-ray and ultrasound, who has to do the labelling, the regulatory paperwork the annotation has to support, and the realistic cost. It's written from our experience running clinical-grade annotation workflows for diagnostic and research clients.
DICOM Isn't Just a File Format
Every CT, MRI, X-ray and ultrasound from clinical equipment lands as DICOM. Each file typically holds one image (multi-frame objects such as ultrasound cine clips hold several) plus a structured header carrying modality, acquisition parameters, patient context, and the study and series IDs that link slices together into 3D volumes. The header is what makes radiology data clinical-grade.
Generic image labellers strip the DICOM header and treat each slice as a flat PNG. The result: you lose volumetric context, the acquisition metadata the model needs for normalisation, and the audit trail the regulator wants. DICOM-native tooling is non-negotiable on clinical projects. The tooling we use preserves the DICOM structure end-to-end and exports annotations linked to the original study and series IDs.
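To make the point about study and series IDs concrete, here is a minimal sketch of how slices can be reassembled into ordered volumes from header fields alone. It assumes the headers have already been read into plain dictionaries keyed by standard DICOM attribute keywords (on a real project a DICOM library would do the reading); `make_header` and the UID strings are hypothetical test data, not part of any tool.

```python
from collections import defaultdict

def slice_position(header):
    """Position of a slice along the normal of its image plane, in mm."""
    rx, ry, rz, cx, cy, cz = header["ImageOrientationPatient"]
    # Plane normal = row direction cross column direction.
    normal = (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)
    px, py, pz = header["ImagePositionPatient"]
    return px * normal[0] + py * normal[1] + pz * normal[2]

def group_into_volumes(headers):
    """Group slice headers by (study UID, series UID) and order each group spatially."""
    groups = defaultdict(list)
    for header in headers:
        key = (header["StudyInstanceUID"], header["SeriesInstanceUID"])
        groups[key].append(header)
    return {key: sorted(slices, key=slice_position) for key, slices in groups.items()}

AXIAL = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]  # hypothetical axial orientation

def make_header(series_uid, sop_uid, z_mm):
    """Hypothetical helper that fabricates a header for the example."""
    return {
        "StudyInstanceUID": "study-1",
        "SeriesInstanceUID": series_uid,
        "SOPInstanceUID": sop_uid,
        "ImageOrientationPatient": AXIAL,
        "ImagePositionPatient": [0.0, 0.0, z_mm],
    }

# Slices arrive in arbitrary order, as they often do from a file listing.
headers = [
    make_header("series-A", "a3", 10.0),
    make_header("series-B", "b1", 0.0),
    make_header("series-A", "a1", 0.0),
    make_header("series-A", "a2", 5.0),
]
volumes = group_into_volumes(headers)
ordered = [h["SOPInstanceUID"] for h in volumes[("study-1", "series-A")]]
print(ordered)  # ['a1', 'a2', 'a3']
```

A flat PNG export has none of these fields, which is why the volume cannot be rebuilt reliably once the header is gone.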
Modality-Specific Annotation Notes
- X-ray: usually 2D, often paired projections (PA and lateral chest). Annotation is finding-level (bounding boxes or polygons around nodules, opacities, fractures) plus classification (normal vs abnormal, severity grading). Structured grading in the style of Lung-RADS where the use case calls for it; Lung-RADS itself is defined for low-dose CT lung screening, so on radiographs it serves only as a model for the grading scheme.
- CT: volumetric. Per-slice or per-volume segmentation for organs (liver, spleen, kidneys), tumour-region masks for oncology AI, intracranial haemorrhage detection on head CT, pulmonary nodule detection on chest CT. The volume context matters; per-slice labelling without volume-level QA is the single most common quality failure we see.
- MRI: multi-sequence (T1, T2, FLAIR, DWI, contrast-enhanced). Annotation has to be sequence-aware, because a lesion may appear on FLAIR but not T1. Typical tasks are brain tumour segmentation, MS lesion tracking, knee and spine segmentation, and prostate lesion grading (PI-RADS).
- Ultrasound: operator-dependent acquisition, still frames or sweep clips. Annotation is finding-level (boxes or polygons) plus classification, with strong inter-operator variability that QA has to acknowledge.
- Mammography: BI-RADS classification, microcalcification clusters, mass margins. Sub-specialty radiologist oversight is essential here.
- Nuclear medicine and PET: quantitative analysis, SUV thresholds, fused PET-CT annotation. A specialist sub-domain.
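The volume-level QA mentioned for CT can start with something very simple. The sketch below assumes each annotated structure is summarised as the list of slice indices on which it was labelled, and flags index ranges that are missing between the first and last labelled slice. The function name and the example indices are hypothetical, and a gap is only a prompt for review, since some structures legitimately appear as separate components.

```python
def find_slice_gaps(labelled_slices):
    """Return (first_missing, last_missing) ranges between labelled slices."""
    present = sorted(set(labelled_slices))
    gaps = []
    for previous, current in zip(present, present[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps

# Hypothetical per-slice liver labels from one CT volume.
liver_slices = [40, 41, 42, 44, 45, 49]
gaps = find_slice_gaps(liver_slices)
print(gaps)  # [(43, 43), (46, 48)]
```

A check like this only works when the annotation export keeps slice order and series membership, which is another reason to stay DICOM-native.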
Who Actually Does the Annotation
Layered teams, not single annotator types. What works in practice:
- Board-certified radiologists at the top of the loop. Build the spec, build the gold-standard reference set, adjudicate every borderline call, sign off on the QA process. Their time is the most expensive and where it counts most.
- Trained imaging annotators in the middle. Radiology technologists, imaging-science graduates, or medically-trained reviewers who have passed project-specific calibration. They handle the bulk of the volume on clearly-defined tasks.
- Specialist radiologists where the modality calls for it. Neuro-radiologists on brain MRI, breast radiologists on mammography, paediatric radiologists on paediatric work. Sub-specialty matters more than generalist seniority.
- QA on top. Double-annotation, gold-set checks every batch, adjudication queue back to the radiologists. See our broader annotation QA playbook and our clinical expert annotation service.
The wrong structure — generalist annotators with no medical training, no radiologist adjudication, “HIPAA-compliant because we have a BAA template” — produces datasets that look fine on the delivery report and fail at the FDA submission. We've been brought in to fix several of these and it's genuinely expensive.
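The QA layer described above (double-annotation, gold-set checks every batch, an adjudication queue back to the radiologists) can be sketched roughly as follows. This is an illustration under assumptions, not a description of any particular platform: the function names, the study IDs, and the 0.9 pass threshold are all hypothetical, and real projects set the threshold per task.

```python
def route_study(label_a, label_b):
    """Agreeing double reads are accepted; disagreements go to the radiologist queue."""
    return "accept" if label_a == label_b else "adjudicate"

def gold_set_pass(annotator_labels, gold_labels, min_agreement=0.9):
    """Check an annotator's batch against the gold-standard studies seeded into it."""
    shared = [study_id for study_id in gold_labels if study_id in annotator_labels]
    if not shared:
        return False  # no gold studies in the batch, so the batch cannot be verified
    hits = sum(annotator_labels[s] == gold_labels[s] for s in shared)
    return hits / len(shared) >= min_agreement

# Hypothetical double reads for three studies.
double_reads = {
    "s1": ("normal", "normal"),
    "s2": ("abnormal", "normal"),
    "s3": ("abnormal", "abnormal"),
}
adjudication_queue = [
    study_id for study_id, (a, b) in double_reads.items()
    if route_study(a, b) == "adjudicate"
]
print(adjudication_queue)  # ['s2']

# Hypothetical gold-set check: the annotator matched one of two gold studies.
gold = {"g1": "normal", "g2": "abnormal"}
batch = {"g1": "normal", "g2": "normal", "s1": "normal"}
print(gold_set_pass(batch, gold))  # False
```

Exact-match comparison suits classification labels; for segmentation or detection the equality test would be replaced by an overlap or distance threshold.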
HIPAA-Grade Handling: The Bar That Isn't Optional
For US-bound work, HIPAA isn't a checkbox — it's the floor. What that actually means in practice:
- De-identification before annotation. DICOM headers stripped of PHI per the Safe Harbor 18-identifier rule or expert-determination method.
- Business Associate Agreement (BAA) in place between the AI developer and the annotation vendor. Real BAA, not a templated checkbox.
- Encrypted in transit and at rest — TLS 1.2+ on the wire, AES-256 at rest, key management documented.
- Access controls and audit logging — who annotated which study, when, from where, under which protocol version.
- Secure infrastructure — VPC or private-cloud deployment available on request; some clients require on-prem.
Australian TGA-bound work and EU MDR/IVDR work have parallel requirements with their own specifics. The annotation contract has to support whichever regulatory regime your AI ships under — and if you might ship internationally, the strictest applicable regime usually wins. Build the documentation from day one; retrofitting it is painful and sometimes impossible.
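As an illustration of the de-identification and audit-logging items in the list above, here is a minimal sketch that works on a header already loaded into a dictionary. It is deliberately incomplete: `PHI_KEYWORDS` is an illustrative subset of DICOM attributes, not the full Safe Harbor identifier list, and the sketch ignores private tags and text burned into the pixels. The function names, the subject code, and the audit fields are assumptions made for the example.

```python
from datetime import datetime, timezone

# Illustrative subset only. A real Safe Harbor pass covers every identifier
# category, plus private tags and burned-in pixel text.
PHI_KEYWORDS = {
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "ReferringPhysicianName", "InstitutionName", "AccessionNumber", "StudyDate",
}

def deidentify(header, subject_code):
    """Drop listed PHI attributes and substitute an externally assigned subject code."""
    clean = {key: value for key, value in header.items() if key not in PHI_KEYWORDS}
    clean["PatientID"] = subject_code
    return clean

def audit_record(annotator_id, study_uid, protocol_version, source):
    """One log entry: who annotated which study, when, from where, under which protocol."""
    return {
        "annotator_id": annotator_id,
        "study_uid": study_uid,
        "protocol_version": protocol_version,
        "source": source,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical header and log entry.
raw = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN-12345",
    "StudyDate": "20240115",
    "Modality": "CT",
    "StudyInstanceUID": "study-1",
}
clean = deidentify(raw, subject_code="SUBJ-0001")
entry = audit_record("annotator-07", clean["StudyInstanceUID"], "v1.2", "vpn-gateway-a")
print(sorted(clean))  # ['Modality', 'PatientID', 'StudyInstanceUID']
```

The subject code is passed in rather than derived from the patient's identifiers, so the mapping back to the patient lives only in a separately controlled key table.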
Quality: Consensus Gold Standards, Per-Class Metrics
Radiologists genuinely disagree: inter-reader agreement on borderline lesions or BI-RADS calls often lands at a weighted kappa of around 0.6–0.8. That's not a failure; it's the nature of the task. A one-reader gold standard understates real-world variance, and a model trained against it tends to fail in clinical deployment. Consensus gold standards (two or three readers, blinded, with a defined resolution rule) are the bar.
Reported metrics are concordance against consensus, per modality, per finding class, every batch. For grading tasks, weighted kappa. For segmentation, Dice and Hausdorff distance. For detection, sensitivity and specificity at agreed operating points, plus AUC across thresholds. Per-class. Always per-class.
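For readers who want to see what those numbers are, here is a small pure-Python sketch of a majority-vote resolution rule, Dice, quadratic weighted kappa, and sensitivity and specificity. The resolution rule is one possible choice rather than a standard, the inputs are tiny hypothetical examples, and Hausdorff distance and AUC are left out for brevity. In practice each metric would be computed separately per finding class.

```python
from collections import Counter

def consensus_grade(reads):
    """Strict-majority grade across blinded reads, or None to escalate to adjudication."""
    grade, votes = Counter(reads).most_common(1)[0]
    return grade if votes > len(reads) / 2 else None

def dice(mask_a, mask_b):
    """Dice overlap of two binary masks given as equal-length sequences of 0/1."""
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2.0 * intersection / total

def weighted_kappa(reader_1, reader_2, n_grades):
    """Quadratic weighted kappa for ordinal grades coded 0 .. n_grades - 1."""
    n = len(reader_1)
    observed = [[0.0] * n_grades for _ in range(n_grades)]
    for a, b in zip(reader_1, reader_2):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_grades)]
    col = [sum(observed[i][j] for i in range(n_grades)) for j in range(n_grades)]
    disagreement = expected = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            weight = (i - j) ** 2 / (n_grades - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * row[i] * col[j]
    return 1.0 if expected == 0 else 1.0 - disagreement / expected

def sensitivity_specificity(predicted, truth):
    """Sensitivity and specificity for binary calls at one operating point."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    sens = tp / (tp + fn) if tp + fn else None
    spec = tn / (tn + fp) if tn + fp else None
    return sens, spec

print(consensus_grade([2, 2, 3]))           # 2
print(consensus_grade([1, 2, 3]))           # None
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))     # 0.5
print(weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], n_grades=3))  # 1.0
```

Reporting these per class matters because a strong aggregate score can hide a rare finding class on which agreement is poor.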
Scoping a radiology AI training-data project?
Send 5–10 representative studies from your hardest modality. We'll deliver radiologist-adjudicated annotation with consensus QA and a per-class concordance report in 72 hours. HIPAA-grade workflow, BAA on request.
See our radiology annotation service
What It Costs
Radiology annotation is genuinely one of the most expensive categories in commercial annotation. Board-certified radiologist time is the single biggest cost driver; volumetric work (CT and MRI 3D) is slower per study than 2D X-ray; consensus protocols multiply that by the size of the reader panel. Pricing is per study, per slice, per finding, or per organ depending on the task — and the only number that matters is the one a pilot on your studies produces. A flat per-study rate quoted sight-unseen is a guess.
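The way a consensus protocol multiplies reader time is simple arithmetic, sketched below. Every number in the example is a placeholder chosen to make the arithmetic easy to follow, not a quote or a benchmark; the function name and the adjudication parameters are assumptions, and real figures only come out of a pilot.

```python
def reader_hours(n_studies, minutes_per_study, panel_size,
                 adjudication_rate=0.0, adjudication_minutes=0.0):
    """Total reader hours: primary reads by every panel member plus adjudication."""
    primary = n_studies * minutes_per_study * panel_size
    adjudication = n_studies * adjudication_rate * adjudication_minutes
    return (primary + adjudication) / 60.0

# Placeholder inputs only: 1,000 studies at 6 reader-minutes each.
single_reader = reader_hours(1000, 6, panel_size=1)
three_reader_panel = reader_hours(1000, 6, panel_size=3)
print(single_reader, three_reader_panel)  # 100.0 300.0
```

Multiplying the hours by a reader rate gives the labour cost, which is why the panel size and the per-study reading time measured in a pilot drive the quote more than anything else.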
Related Reading
- Radiology annotation service
- MRI annotation service
- CT scan annotation service
- X-ray annotation service
- Clinical expert annotation
- Histopathology AI annotation guide
- Ophthalmology AI annotation guide
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.