AV annotation is its own discipline. The teams who treat it as “bounding boxes plus some LiDAR work” quote 2D rates and find out three months in that what they actually needed was fused, tracked, calibration-aware ground truth across six cameras and a LiDAR — at five to ten times the per-frame cost. We see this in incoming briefs every month, especially from ADAS teams scaling up to L3+ work for the first time.
This guide is what we'd hand an AV perception team scoping their first proper annotation contract. The sensor stack, the tasks, the sensor-fusion workflow, the formats, the edge-case discipline that separates safe models from optimistic ones, and the cost reality. No vendor-deck energy. Honest read.
The Sensor Stack You're Actually Annotating
Production AV annotation covers four sensor types in coordinated workflows:
- Surround cameras — typically 6 to 12 cameras producing 360 degree coverage, each running 10–30 fps. The most familiar modality and the cheapest per frame, but the volume of frames stacks up fast.
- LiDAR — one or more sensors, mechanical or solid-state, 32 to 128 channels. Produces dense point clouds with measured depth. Where the 3D ground truth lives. Covered in depth in the LiDAR & point cloud annotation guide.
- Radar — front-facing plus corner units, lower resolution than LiDAR but useful in rain and fog where LiDAR struggles. Annotation here is typically association with LiDAR or camera ground truth rather than standalone labelling.
- Ultrasonic / IMU / GNSS — close-range and ego-pose data that ties the rest of the sensor stack to vehicle position. Doesn't typically need direct annotation but has to be available so the annotator's view is consistent across modalities.
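To put "stacks up fast" in numbers, here is a minimal sketch that uses only the camera ranges quoted above (6 to 12 cameras at 10 to 30 fps). The function name is illustrative, and the count is raw frames before any subsampling a project might apply.

```python
def camera_frames_per_hour(num_cameras: int, fps: float) -> int:
    """Raw camera frames produced by one hour of recording."""
    seconds_per_hour = 3600
    return int(num_cameras * fps * seconds_per_hour)


# Low and high ends of the ranges quoted in the list above.
low = camera_frames_per_hour(num_cameras=6, fps=10)
high = camera_frames_per_hour(num_cameras=12, fps=30)
print(f"{low:,} to {high:,} camera frames per recorded hour")
```

Even the low end is a six-figure frame count per hour, which is why the sampling strategy matters before the per-frame rate does.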
The Six Annotation Tasks That Actually Run
- 2D bounding boxes on cameras — vehicles, pedestrians, cyclists, traffic signs, on every camera view. Tight discipline matters.
- 3D cuboids on LiDAR — the 7-DOF box per object, with cross-frame tracking IDs maintained across hundreds of frames. The most expensive label type on the project, and the one safety actually depends on.
- Semantic and panoptic segmentation — driveable surface, lane area, free-space estimation, scene parsing. Per-pixel labels on camera frames.
- Polyline annotation for lanes — lane lines, road edges, curbs, sometimes overhead structures. Lane geometry feeds path planning and HD-map construction.
- Traffic-light and traffic-sign classification — fine-grained class taxonomies, multi-language sign content where the data spans countries.
- Behaviour and intent labelling — is the pedestrian about to cross, is the cyclist signalling a turn, is the vehicle in front about to merge. The frontier of AV annotation, where mistakes carry the highest downstream cost.
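As a rough picture of what a tracked 3D cuboid label carries, the sketch below models the seven degrees of freedom plus a track ID. The class name, field names and category strings are hypothetical and do not follow any particular format's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuboid3D:
    # Seven degrees of freedom: centre (x, y, z), size (length, width, height), yaw.
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float        # radians, rotation about the vertical axis
    category: str     # hypothetical taxonomy label
    track_id: str     # must stay constant for the same object across frames
    frame_index: int


def track_lengths(labels: list[Cuboid3D]) -> dict[str, int]:
    """Count how many frames each track ID appears in."""
    counts: dict[str, int] = {}
    for label in labels:
        counts[label.track_id] = counts.get(label.track_id, 0) + 1
    return counts


# Three frames of one vehicle, two frames of one pedestrian (made-up values).
labels = [
    Cuboid3D(12.0, -1.5, 0.9, 4.5, 1.9, 1.6, 0.02, "vehicle", "veh-001", 0),
    Cuboid3D(12.8, -1.5, 0.9, 4.5, 1.9, 1.6, 0.02, "vehicle", "veh-001", 1),
    Cuboid3D(13.6, -1.4, 0.9, 4.5, 1.9, 1.6, 0.03, "vehicle", "veh-001", 2),
    Cuboid3D(6.0, 3.2, 0.9, 0.6, 0.6, 1.7, 1.50, "pedestrian", "ped-007", 1),
    Cuboid3D(6.0, 3.0, 0.9, 0.6, 0.6, 1.7, 1.50, "pedestrian", "ped-007", 2),
]
print(track_lengths(labels))
```

A track that is unexpectedly short is often the first visible sign of an ID that was dropped and reissued mid-sequence.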
Sensor Fusion: The Workflow That Actually Catches Errors
Annotation in production AV pipelines doesn't happen in one modality at a time. The annotator sees the LiDAR point cloud, the synchronised camera images, and the radar returns in a single coordinated view. A 3D cuboid placed in LiDAR is verified against the projected camera image. A camera detection that has no corresponding LiDAR cluster gets flagged for review.
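The "camera detection with no LiDAR counterpart" check can be sketched as a 2D overlap test between camera boxes and LiDAR cuboids already projected into the image. This is a simplified illustration, not a description of any specific tool: the function names and the 0.3 IoU threshold are assumptions.

```python
Box2D = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection over union of two axis-aligned image boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def unmatched_camera_boxes(
    camera_boxes: list[Box2D],
    projected_lidar_boxes: list[Box2D],
    min_iou: float = 0.3,  # assumed threshold, tune per project
) -> list[int]:
    """Indices of camera boxes with no sufficiently overlapping projected cuboid."""
    flagged = []
    for index, cam_box in enumerate(camera_boxes):
        best = max((iou_2d(cam_box, p) for p in projected_lidar_boxes), default=0.0)
        if best < min_iou:
            flagged.append(index)
    return flagged


camera_boxes = [(100.0, 100.0, 200.0, 200.0), (400.0, 120.0, 460.0, 260.0)]
projected_lidar_boxes = [(105.0, 95.0, 205.0, 195.0)]
flagged = unmatched_camera_boxes(camera_boxes, projected_lidar_boxes)
print(flagged)  # the second camera box has no LiDAR counterpart
```

Flagged boxes go to a reviewer rather than being deleted: the object may be real but too sparse in the point cloud to have been labelled.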
This catches errors no single modality can catch alone — but it depends entirely on the extrinsic calibration being right. If the camera-to-LiDAR transform is off by a degree, the projected cuboid won't line up with the image, annotators will “correct” the good LiDAR box to match the bad calibration, and the dataset is quietly poisoned. Calibration audit is part of every serious AV annotation contract, not an assumption.
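A quick back-of-envelope shows why one degree matters. The sketch assumes a pure rotational error about the vertical axis, a pinhole camera, and a point near the optical axis. The 1200-pixel focal length is a hypothetical value, not a figure from any specific sensor.

```python
import math


def lateral_offset_m(range_m: float, yaw_error_deg: float) -> float:
    """Sideways displacement of a projected point at a given range."""
    return range_m * math.tan(math.radians(yaw_error_deg))


def pixel_shift(focal_length_px: float, yaw_error_deg: float) -> float:
    """Approximate image shift for a point near the optical axis (pinhole model)."""
    return focal_length_px * math.tan(math.radians(yaw_error_deg))


offset = lateral_offset_m(range_m=50.0, yaw_error_deg=1.0)
shift = pixel_shift(focal_length_px=1200.0, yaw_error_deg=1.0)  # hypothetical focal length
print(f"{offset:.2f} m sideways at 50 m, about {shift:.0f} px in the image")
```

Under these assumptions a one-degree error moves a projected cuboid by roughly 0.87 m at 50 m, which is enough to make a correctly placed box look visibly wrong against the image.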
The Edge-Case Discipline That Separates Safe Models
Long-tail driving scenarios (night rain, low-sun glare, construction zones, emergency vehicles, unusual pedestrians like delivery riders or mobility scooters, kids playing near the road) are rare in raw recordings and disproportionately important to model safety. A dataset that mirrors the natural frequency of these cases trains a model that fails on them in production. Four practices counter that:
- Stratified sampling — explicit allocation of frames to edge-case categories, oversampled relative to natural frequency.
- Senior reviewer adjudication on borderline calls (is that a child or a small adult, is that vehicle actually merging or just drifting in lane).
- Per-edge-case QA reporting — accuracy and IoU broken out by scenario class, not just overall.
- Hardest-data pilots — every vendor evaluation should include a sample of your night-rain, dense-urban, low-sun scenes. Easy-data pilots tell you nothing.
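Stratified sampling with explicit quotas can be sketched in a few lines. The scenario names, pool sizes and quotas below are made up for illustration. In practice the scenario tag on each frame would come from your own metadata or a triage pass.

```python
import random
from collections import defaultdict


def stratified_sample(
    frames: list[tuple[str, str]],   # (frame_id, scenario)
    quotas: dict[str, int],          # scenario -> number of frames to select
    seed: int = 0,
) -> dict[str, list[str]]:
    """Pick a fixed number of frames per scenario instead of sampling uniformly."""
    rng = random.Random(seed)
    by_scenario: dict[str, list[str]] = defaultdict(list)
    for frame_id, scenario in frames:
        by_scenario[scenario].append(frame_id)

    selected = {}
    for scenario, quota in quotas.items():
        pool = by_scenario.get(scenario, [])
        # A quota larger than the pool is capped at what actually exists.
        selected[scenario] = rng.sample(pool, min(quota, len(pool)))
    return selected


# Hypothetical recording: 5% of frames are night rain.
frames = [(f"hw-{i}", "highway_day") for i in range(950)]
frames += [(f"nr-{i}", "night_rain") for i in range(50)]

picked = stratified_sample(frames, quotas={"highway_day": 100, "night_rain": 50})
total = sum(len(ids) for ids in picked.values())
print(f"night_rain share of annotated set: {len(picked['night_rain']) / total:.0%}")
```

In this toy case night rain goes from 5% of the raw recording to a third of the annotated set. A quota that cannot be met is itself a signal to go and collect more of that scenario.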
This is the discipline that matters more than any other for AV. A dataset 95% accurate on highway daylight is irrelevant if it's 60% on the cases the safety case actually depends on.
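The gap between overall and per-scenario accuracy is easy to demonstrate. The sketch below uses made-up counts chosen to match the 95% and 60% figures in the paragraph above. The function name and scenario tags are illustrative.

```python
from collections import defaultdict


def accuracy_by_scenario(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: one (scenario, label_is_correct) pair per reviewed label."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for scenario, is_correct in results:
        totals[scenario] += 1
        correct[scenario] += int(is_correct)
    return {scenario: correct[scenario] / totals[scenario] for scenario in totals}


# Hypothetical audit: 900 highway-daylight labels, 100 night-rain labels.
results = [("highway_day", True)] * 855 + [("highway_day", False)] * 45
results += [("night_rain", True)] * 60 + [("night_rain", False)] * 40

per_scenario = accuracy_by_scenario(results)
overall = sum(ok for _, ok in results) / len(results)
print(f"overall: {overall:.1%}")
for scenario, acc in per_scenario.items():
    print(f"{scenario}: {acc:.1%}")
```

The overall figure here is 91.5%, which looks healthy on a dashboard and hides the 60% on the scenario that matters. That is the argument for reporting every metric per scenario.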
Formats: KITTI, nuScenes, Waymo (Pick Before Day One)
- KITTI — the original benchmark. Per-frame text label files with 3D dimensions, location in camera coordinates, single yaw rotation. Simple, but camera-frame coordinates and limited multi-sensor support trip up modern projects.
- nuScenes — relational JSON for full 360 degree multi-sensor scenes with proper cross-time tracking. The modern default for production AV work.
- Waymo Open Dataset — protocol buffers with 7-DOF labels, tight synchronisation and tracking IDs. The most rigorous and the heaviest to tool for.
Yaw sign and coordinate frame are not the same across these formats. Mid-project conversion is possible but lossy and tedious. Lock the strictest format on day one, convert down if needed.
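To make the yaw point concrete, here is a minimal sketch that parses one KITTI-style label line and converts its camera-frame yaw to a LiDAR-frame yaw. The label line is invented, laid out in the KITTI object-label field order. The conversion assumes ideally aligned axes (camera: x right, y down, z forward; LiDAR: x forward, y left, z up) and ignores the real calibration matrices, which a production converter would have to apply.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def parse_kitti_label(line: str) -> dict:
    """Split one KITTI object label line into named fields."""
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),
        "occluded": int(f[2]),
        "alpha": float(f[3]),
        "bbox": tuple(float(v) for v in f[4:8]),             # left, top, right, bottom (px)
        "dimensions_hwl": tuple(float(v) for v in f[8:11]),  # height, width, length (m)
        "location_cam": tuple(float(v) for v in f[11:14]),   # x, y, z in camera frame (m)
        "rotation_y": float(f[14]),                          # yaw about the camera Y axis
    }


def kitti_yaw_to_lidar_yaw(rotation_y: float) -> float:
    """Camera-frame yaw to LiDAR-frame yaw, assuming ideally aligned axes."""
    return wrap_to_pi(-(rotation_y + math.pi / 2))


# Invented example line: a car roughly 20 m ahead, facing the same way as the ego vehicle.
line = "Car 0.00 0 -1.57 600.0 170.0 650.0 210.0 1.50 1.60 4.00 1.0 1.6 20.0 -1.57"
label = parse_kitti_label(line)
lidar_yaw = kitti_yaw_to_lidar_yaw(label["rotation_y"])
print(f"camera-frame yaw {label['rotation_y']:+.2f} rad, LiDAR-frame yaw {lidar_yaw:+.2f} rad")
```

The same physical heading reads as about -1.57 rad in one frame and about 0 rad in the other. Skipping this step during a conversion produces boxes that look plausible in position and are rotated by ninety degrees.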
Pricing: Why Generic Per-Object Rates Mislead
AV annotation is one of the most expensive categories in commercial CV because every cost dimension is maximised — multi-sensor, sequence-level tracking, edge-case oversampling, high IoU and orientation discipline. Pricing is per object per frame for cuboids, per frame for segmentation, with strong premiums for cross-frame tracking and fused sensor work.
A flat per-object rate quoted sight-unseen is a guess. The teams who pay 2–3x what they expected are the teams who quoted at highway-daylight rates and paid at dense-urban-with-rain rates. The honest scoping move — pilot on your hardest data, including the long-tail edge cases. Broader cost framework in the data annotation pricing guide.
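The arithmetic behind that overrun is simple. In the sketch below every number is a placeholder: the rate is a unitless 1.0, and the object densities (8 per frame on a highway, 20 per frame in dense urban traffic) are hypothetical values chosen only to show the mechanism, not measured figures.

```python
def cuboid_cost(
    frames: int,
    objects_per_frame: float,
    rate_per_object: float,
    tracking_premium: float = 0.0,  # fractional uplift, e.g. 0.25 for +25%
) -> float:
    """Per-object-per-frame pricing: cost scales with object density, not frame count alone."""
    return frames * objects_per_frame * rate_per_object * (1.0 + tracking_premium)


# Same frame count, same placeholder rate; only scene density differs.
highway = cuboid_cost(frames=1000, objects_per_frame=8, rate_per_object=1.0)
dense_urban = cuboid_cost(frames=1000, objects_per_frame=20, rate_per_object=1.0)
ratio = dense_urban / highway
print(f"dense urban costs {ratio:.1f}x the highway estimate at an identical per-object rate")
```

The rate never changed in this example. Only the scene did, which is why a quote built on easy footage tells you little about the invoice.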
Scoping an AV / ADAS annotation contract?
Send a 30-second sample from your hardest scene: night rain, dense urban, construction. We'll deliver fused multi-camera plus LiDAR ground truth in KITTI / nuScenes / Waymo, with per-class accuracy and orientation error on a gold set. Free.
See our autonomous vehicle annotation service
Quality and QA
Standard AV-grade quality metrics: 3D IoU at 0.5/0.7 thresholds, AP per class with mAP across classes, orientation error (AOE) or orientation similarity (AOS), per-class mIoU for segmentation, ID-switch rate and HOTA for tracking. Layered on top of that: inter-annotator agreement on a gold set sampled from the hardest data, per-batch QA reporting with per-class breakdowns, calibration audit on every batch. General framework in the annotation QA playbook; AV-specific discipline is the per-class and per-scenario stratification of every metric.
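Two of these metrics can be sketched directly. The IoU below is a deliberate simplification: it treats both boxes as axis-aligned and ignores yaw, whereas a full 3D IoU needs a rotated-polygon intersection in the ground plane. The function names and box values are illustrative.

```python
import math

Box3D = tuple[float, float, float, float, float, float]  # cx, cy, cz, length, width, height


def axis_aligned_iou_3d(a: Box3D, b: Box3D) -> float:
    """3D IoU with yaw ignored (simplification; real pipelines use rotated boxes)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis means no overlap at all
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def yaw_error(pred_yaw: float, gt_yaw: float) -> float:
    """Smallest absolute angle between two headings, in radians."""
    diff = (pred_yaw - gt_yaw + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


gold = (10.0, 0.0, 1.0, 4.0, 2.0, 1.5)
annotated = (10.5, 0.0, 1.0, 4.0, 2.0, 1.5)  # same size, shifted 0.5 m along x
iou = axis_aligned_iou_3d(gold, annotated)
print(f"IoU {iou:.3f}, passes 0.5: {iou >= 0.5}, passes 0.7: {iou >= 0.7}")
print(f"yaw error across the wrap-around: {yaw_error(3.1, -3.1):.3f} rad")
```

The wrap-around case is worth testing in any QA script: headings of 3.1 and -3.1 rad are less than five degrees apart, and a naive subtraction reports them as nearly opposite.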
Related Reading
- → Autonomous vehicle annotation service
- → 3D cuboid annotation guide
- → LiDAR & point cloud annotation guide
- → Semantic segmentation guide
- → Bounding box annotation guide
- → Lane detection annotation service
Get an AV/ADAS pilot in 72 hours
Send a short sequence from your hardest scene — we'll return fused cuboids, segmentation and lane polylines in your target format with per-class metrics.
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.