What is autonomous vehicle data annotation?

Autonomous vehicle data annotation is the labelling of camera, LiDAR, radar, and ultrasonic sensor data so AV and ADAS perception models can learn to detect, classify, track and predict objects in real-world driving scenes. It spans 2D bounding boxes on camera, 3D cuboids in LiDAR, semantic and panoptic segmentation, lane and polyline annotation, behaviour and intent labelling, and sensor-fusion ground truth. AV annotation is among the most demanding categories in commercial computer vision.

What annotation tasks run on an AV / ADAS dataset?

Six common ones: 2D bounding boxes on each camera, 3D cuboids on LiDAR (with cross-frame tracking), semantic or panoptic segmentation for driveable area and scene parsing, polyline annotation for lane lines, road edges and curbs, traffic-light and traffic-sign classification, and behaviour/intent labelling for pedestrians and cyclists. Sensor-fusion projects layer all of these on the same recording, with calibration metadata so each modality lines up.

What sensor stack does autonomous vehicle annotation cover?

Surround cameras (typically 6-12 cameras producing 360 degree coverage), one or more LiDAR sensors (mechanical or solid-state, 32 to 128 channels), radar (front-facing plus corners), ultrasonic close-range sensors, and increasingly the IMU and GNSS streams that link sensor data to vehicle pose. Production AV annotation has to handle all of these in a single coordinated workflow because the model trains on the fused view, not any single sensor in isolation.

What formats are used for AV annotation?

Three benchmarks dominate. KITTI — the original, simple per-frame text labels with camera and Velodyne LiDAR. nuScenes — relational JSON for full 360 degree multi-sensor scenes with proper cross-time tracking; the de-facto modern default. Waymo Open Dataset — protocol buffers with 7-DOF labels, tight synchronisation and tracking IDs; the most rigorous and the heaviest to tool. Pick the strictest format that covers your sensor stack and convert down; mid-project conversion loses yaw conventions and coordinate frames in subtle ways.

How do you handle edge cases in AV annotation?

Deliberately, with stratified sampling. Long-tail driving scenarios — night rain, low-sun glare, construction zones, emergency vehicles, unusual pedestrians (delivery riders, mobility scooters, kids playing) — are rare in raw recordings but disproportionately important to model safety. The annotation budget has to allocate explicitly to these edge cases, oversampled relative to their natural frequency, with senior-reviewer adjudication on the borderline calls. Datasets that don't do this train models that fail on the cases that matter.

How much does autonomous vehicle annotation cost?

It is one of the most expensive annotation categories because every dimension that drives cost is maximised: multi-sensor capture (LiDAR plus 6-12 cameras), sequence-level tracking across hundreds of frames, edge-case oversampling, high IoU and orientation discipline, and regulatory documentation. Pricing is usually per object per frame for cuboids and per frame for segmentation, with strong premiums for tracking and fused sensor work. The reliable way to scope it is a pilot on your hardest data (night, rain, dense urban), never on highway daylight.

How is AV annotation quality measured?

For 3D cuboids — volumetric IoU at 0.5 and 0.7 thresholds, mAP per class, Average Orientation Error (nuScenes AOE) or Average Orientation Similarity (KITTI). For segmentation — per-class mIoU, never just overall. For tracking — ID-switch rate and Higher-Order Tracking Accuracy (HOTA). Layered on top — inter-annotator agreement on a gold set built from a representative slice of the hardest data, with per-batch QA reporting. Anything reporting a single overall accuracy is hiding the per-class problem cases.
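As a rough illustration of the orientation check, the sketch below computes the smallest absolute yaw difference between an annotator's box and a gold-set box, then averages it over a batch. The function names and the sample angles are invented for this example; it follows the smallest-angle idea behind orientation-error metrics and is not a reimplementation of any benchmark's official scoring code.

```python
import math

def yaw_error(yaw_a: float, yaw_b: float) -> float:
    """Smallest absolute difference between two yaw angles, in radians."""
    # Wrap the raw difference into [-pi, pi) so 179 deg vs -179 deg counts as 2 deg.
    diff = (yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def mean_yaw_error(pairs):
    """Average yaw error over (annotated_yaw, gold_yaw) pairs."""
    if not pairs:
        raise ValueError("need at least one (annotated, gold) pair")
    return sum(yaw_error(a, g) for a, g in pairs) / len(pairs)

# Hypothetical annotator-vs-gold yaw pairs, written in degrees for readability.
pairs = [
    (math.radians(179), math.radians(-179)),  # 2 deg apart across the wrap
    (math.radians(10), math.radians(10)),     # identical
    (math.radians(90), math.radians(84)),     # 6 deg apart
]
print(math.degrees(mean_yaw_error(pairs)))
```

The wrap step is the part that tends to go wrong in ad hoc QA scripts: a naive subtraction would report the first pair as 358 degrees apart.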

Autonomous Vehicle Data Annotation: The Sensor Stack, The Formats, The Real Cost (2026)

AV annotation is its own discipline. The teams who treat it as “bounding boxes plus some LiDAR work” quote 2D rates and find out three months in that what they actually needed was fused, tracked, calibration-aware ground truth across six cameras and a LiDAR — at five to ten times the per-frame cost. We see this in incoming briefs every month, especially from ADAS teams scaling up to L3+ work for the first time.

This guide is what we'd hand an AV perception team scoping their first proper annotation contract. The sensor stack, the tasks, the sensor-fusion workflow, the formats, the edge-case discipline that separates safe models from optimistic ones, and the cost reality. No vendor-deck energy. Honest read.

The Sensor Stack You're Actually Annotating

Production AV annotation covers four sensor groups in coordinated workflows:

Surround cameras — typically 6 to 12 cameras producing 360 degree coverage, each running 10–30 fps. The most familiar modality and the cheapest per frame, but the volume of frames stacks up fast.
LiDAR — one or more sensors, mechanical or solid-state, 32 to 128 channels. Produces dense point clouds with measured depth. Where the 3D ground truth lives. Covered in depth in the LiDAR & point cloud annotation guide.
Radar — front-facing plus corner units, lower resolution than LiDAR but useful in rain and fog where LiDAR struggles. Annotation here is typically association with LiDAR or camera ground truth rather than standalone labelling.
Ultrasonic / IMU / GNSS — close-range and ego-pose data that ties the rest of the sensor stack to vehicle position. Doesn't typically need direct annotation but has to be available so the annotator's view is consistent across modalities.
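To put a number on how quickly camera frames stack up, here is a back-of-envelope count. The two rigs are hypothetical, picked from the low and high ends of the ranges above, and the one-minute clip length is an assumption for illustration.

```python
def camera_frames(num_cameras: int, fps: float, seconds: float) -> int:
    """Total camera images a rig produces over a clip of the given length."""
    return round(num_cameras * fps * seconds)

# Hypothetical rigs: 6 cameras at 10 fps versus 12 cameras at 30 fps, one minute each.
low = camera_frames(num_cameras=6, fps=10, seconds=60)
high = camera_frames(num_cameras=12, fps=30, seconds=60)
print(low, high)  # 3600 21600
```

Under these assumptions a single minute of driving yields somewhere between a few thousand and a few tens of thousands of images before any LiDAR sweep is counted.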

The Six Annotation Tasks That Actually Run

2D bounding boxes on cameras — vehicles, pedestrians, cyclists, traffic signs, on every camera view. Tight discipline matters.
3D cuboids on LiDAR — the 7-DOF box per object, with cross-frame tracking IDs maintained across hundreds of frames. The most expensive label type on the project, and the one safety actually depends on.
Semantic and panoptic segmentation — driveable surface, lane area, free-space estimation, scene parsing. Per-pixel labels on camera frames.
Polyline annotation for lanes — lane lines, road edges, curbs, sometimes overhead structures. Lane geometry feeds path planning and HD-map construction.
Traffic-light and traffic-sign classification — fine-grained class taxonomies, multi-language sign content where the data spans countries.
Behaviour and intent labelling — is the pedestrian about to cross, is the cyclist signalling a turn, is the vehicle in front about to merge. The frontier of AV annotation, where mistakes carry the highest downstream cost.
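For the 3D cuboid task, a minimal sketch of what one label record might look like, plus a check that a track ID is present on every frame between its first and last appearance. The class name, field names and the gap check are illustrative assumptions, not the schema of KITTI, nuScenes or Waymo.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Cuboid3D:
    """One 7-DOF box: centre (x, y, z), size (length, width, height) and yaw."""
    frame: int
    track_id: str
    category: str
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float

def find_track_gaps(cuboids):
    """Map each track ID to the frames missing between its first and last label."""
    frames_by_track = defaultdict(set)
    for c in cuboids:
        frames_by_track[c.track_id].add(c.frame)
    gaps = {}
    for track_id, frames in frames_by_track.items():
        expected = set(range(min(frames), max(frames) + 1))
        missing = sorted(expected - frames)
        if missing:
            gaps[track_id] = missing
    return gaps

# Hypothetical track: a car labelled on frames 0, 1 and 3 but not on frame 2.
labels = [
    Cuboid3D(f, "car_17", "car", 10.0 + f, 2.0, 0.8, 4.5, 1.9, 1.6, 0.0)
    for f in (0, 1, 3)
]
print(find_track_gaps(labels))  # {'car_17': [2]}
```

A gap is not automatically an error, since an object can be fully occluded for a frame, but each one is worth a reviewer's glance before the batch ships.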

Sensor Fusion: The Workflow That Actually Catches Errors

Annotation in production AV pipelines doesn't happen in one modality at a time. The annotator sees the LiDAR point cloud, the synchronised camera images, and the radar returns in a single coordinated view. A 3D cuboid placed in LiDAR is verified against the projected camera image. A camera detection that has no corresponding LiDAR cluster gets flagged for review.
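The "camera detection with no LiDAR counterpart" check can be sketched as a simple overlap test. This version assumes the 3D cuboids have already been projected into the image as 2D boxes, and the 0.3 overlap threshold is an arbitrary placeholder rather than a recommended value.

```python
def iou_2d(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) in pixels."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def unmatched_camera_boxes(camera_boxes, projected_cuboid_boxes, min_iou=0.3):
    """Camera boxes that overlap no projected cuboid by at least min_iou."""
    return [
        box
        for box in camera_boxes
        if all(iou_2d(box, proj) < min_iou for proj in projected_cuboid_boxes)
    ]

# Hypothetical frame: two camera boxes, one projected LiDAR cuboid.
camera_boxes = [(0, 0, 10, 10), (100, 100, 120, 120)]
projected = [(1, 1, 11, 11)]
print(unmatched_camera_boxes(camera_boxes, projected))  # [(100, 100, 120, 120)]
```

The second camera box has nothing behind it in the point cloud, so it goes to review: it may be a false positive, a reflection, or an object the LiDAR pass missed.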

This catches errors no single modality can catch alone — but it depends entirely on the extrinsic calibration being right. If the camera-to-LiDAR transform is off by a degree, the projected cuboid won't line up with the image, annotators will “correct” the good LiDAR box to match the bad calibration, and the dataset is quietly poisoned. Calibration audit is part of every serious AV annotation contract, not an assumption.
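To see why a one-degree extrinsic error matters, the sketch below projects a single point through an ideal pinhole camera with and without a one-degree rotation error about the camera's vertical axis. The focal length, principal point and point position are hypothetical, and lens distortion is ignored.

```python
import math

def rotate_about_y(point, angle_rad):
    """Rotate a camera-frame point (x right, y down, z forward) about the y axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)

def project_pinhole(point_cam, focal_px, cx, cy):
    """Project a camera-frame point to pixel coordinates with an ideal pinhole model."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)

# Hypothetical setup: 1000 px focal length, 1920x1080 image, object 30 m straight ahead.
focal_px, cx, cy = 1000.0, 960.0, 540.0
point = (0.0, 0.0, 30.0)

u_true, _ = project_pinhole(point, focal_px, cx, cy)
u_bad, _ = project_pinhole(rotate_about_y(point, math.radians(1.0)), focal_px, cx, cy)
shift_px = u_bad - u_true
print(round(shift_px, 1))  # 17.5
```

Under these assumptions the projected cuboid lands roughly 17 pixels to the side of where it belongs, which is more than enough for an annotator to "fix" a box that was correct in the point cloud.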

The Edge-Case Discipline That Separates Safe Models

Long-tail driving scenarios (night rain, low-sun glare, construction zones, emergency vehicles, unusual pedestrians like delivery riders or mobility scooters, kids playing near the road) are rare in raw recordings and disproportionately important to model safety. A dataset that mirrors the natural frequency of these cases trains a model that fails on them in production. Four practices counter this:

Stratified sampling — explicit allocation of frames to edge-case categories, oversampled relative to natural frequency.
Senior reviewer adjudication on borderline calls (is that a child or a small adult, is that vehicle actually merging or just drifting in lane).
Per-edge-case QA reporting — accuracy and IoU broken out by scenario class, not just overall.
Hardest-data pilots — every vendor evaluation should include a sample of your night-rain, dense-urban, low-sun scenes. Easy-data pilots tell you nothing.

This is the discipline that matters more than any other for AV. A dataset 95% accurate on highway daylight is irrelevant if it's 60% on the cases the safety case actually depends on.
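Stratified sampling with explicit edge-case quotas might look like the sketch below. The scenario names, pool sizes and quotas are invented for illustration; a real project would set quotas from its own safety case.

```python
import random
from collections import defaultdict

def stratified_sample(frames, quotas, seed=0):
    """Pick up to quotas[scenario] frame IDs per scenario.

    frames: iterable of (frame_id, scenario) pairs.
    quotas: dict mapping scenario name to the number of frames to annotate.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_scenario = defaultdict(list)
    for frame_id, scenario in frames:
        by_scenario[scenario].append(frame_id)
    selected = {}
    for scenario, quota in quotas.items():
        pool = by_scenario.get(scenario, [])
        selected[scenario] = rng.sample(pool, min(quota, len(pool)))
    return selected

# Hypothetical recording: night rain is 5% of the raw frames.
frames = [(i, "highway_day") for i in range(950)]
frames += [(i, "night_rain") for i in range(950, 1000)]

picked = stratified_sample(frames, {"highway_day": 100, "night_rain": 50})
total = sum(len(ids) for ids in picked.values())
print(len(picked["night_rain"]) / total)
```

In this made-up split, night rain goes from 5% of the raw recording to a third of the annotated set, which is the kind of deliberate skew the quota is there to produce.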

Formats: KITTI, nuScenes, Waymo (Pick Before Day One)

KITTI — the original benchmark. Per-frame text label files with 3D dimensions, location in camera coordinates, single yaw rotation. Simple, but camera-frame coordinates and limited multi-sensor support trip up modern projects.
nuScenes — relational JSON for full 360 degree multi-sensor scenes with proper cross-time tracking. The modern default for production AV work.
Waymo Open Dataset — protocol buffers with 7-DOF labels, tight synchronisation and tracking IDs. The most rigorous and the heaviest to tool for.

Yaw sign and coordinate frame differ across these formats. Mid-project conversion is possible but lossy and tedious. Lock the strictest format on day one and convert down if needed.
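As one example of a yaw convention mismatch, here is a sketch of converting a KITTI-style rotation_y (measured about the camera's downward-pointing y axis) into a yaw about an upward-pointing LiDAR z axis. It assumes idealised axis alignment between the two frames (camera: x right, y down, z forward; LiDAR: x forward, y left, z up) and ignores the small rotation in the real calibration, so treat it as an illustration of the sign and offset problem rather than a production converter.

```python
import math

def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def kitti_rotation_y_to_lidar_yaw(rotation_y: float) -> float:
    """Convert a camera-frame rotation_y to a LiDAR-frame yaw.

    Assumes the idealised axis alignment described above: the sign flips
    because the rotation axes point in opposite directions, and the
    quarter-turn offset comes from the different zero-heading axes.
    """
    return wrap_angle(-(rotation_y + math.pi / 2))

# An object heading straight ahead of the ego vehicle (rotation_y = -pi/2 in
# the camera frame) should come out with zero yaw in the LiDAR frame.
print(kitti_rotation_y_to_lidar_yaw(-math.pi / 2))  # 0.0
```

Dropping either the sign flip or the quarter-turn offset produces boxes that still look plausible in a top-down view for symmetric objects, which is why this class of error tends to surface late.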

Pricing: Why Generic Per-Object Rates Mislead

AV annotation is one of the most expensive categories in commercial CV because every cost dimension is maximised — multi-sensor, sequence-level tracking, edge-case oversampling, high IoU and orientation discipline. Pricing is per object per frame for cuboids, per frame for segmentation, with strong premiums for cross-frame tracking and fused sensor work.

A flat per-object rate quoted sight-unseen is a guess. The teams who pay 2–3x what they expected are the teams who were quoted at highway-daylight rates and billed at dense-urban-with-rain rates. The honest scoping move is to pilot on your hardest data, including the long-tail edge cases. Broader cost framework in the data annotation pricing guide.
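The arithmetic behind that gap can be sketched without any real prices. The object densities and the premium multiplier below are placeholders chosen only to show how per-object-per-frame billing scales; they are not market rates or quotes.

```python
def cuboid_label_units(objects_per_frame: float, frames: int, premium: float = 1.0) -> float:
    """Billable units for cuboid work priced per object per frame.

    premium is a hypothetical multiplier for tracking or fused-sensor work.
    """
    return objects_per_frame * frames * premium

# Placeholder scenes: a sparse highway clip versus a dense urban clip of equal length.
highway = cuboid_label_units(objects_per_frame=5, frames=1000)
urban = cuboid_label_units(objects_per_frame=40, frames=1000, premium=1.5)
print(urban / highway)  # 12.0
```

Even at an identical per-object rate, the dense clip in this made-up comparison bills at a multiple of the sparse one, which is why a pilot on easy data underestimates the contract.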

Scoping an AV / ADAS annotation contract?

Send a 30-second sample from your hardest scene — night rain, urban dense, construction. We'll deliver fused multi-camera plus LiDAR ground truth in KITTI / nuScenes / Waymo, with per-class accuracy and orientation error on a gold set. Free.

See our autonomous vehicle annotation service

Quality and QA

Standard AV-grade quality metrics — 3D IoU at 0.5/0.7 thresholds, mAP per class, orientation error (AOE/AOS), per-class mIoU for segmentation, ID-switch rate and HOTA for tracking. Layered on top of that — inter-annotator agreement on a gold set sampled from the hardest data, per-batch QA reporting with per-class breakdowns, calibration audit on every batch. General framework in the annotation QA playbook; AV-specific discipline is the per-class and per-scenario stratification of every metric.
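A small sketch of why per-class reporting matters for segmentation: the same predictions scored as overall pixel accuracy and as per-class IoU. The class names and pixel counts are invented, and the labels are flattened into plain lists to keep the example self-contained.

```python
from collections import Counter

def per_class_iou(true_labels, pred_labels):
    """IoU per class from two equal-length sequences of per-pixel labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed at this pixel
            fp[p] += 1  # the predicted class was wrongly claimed
    classes = set(true_labels) | set(pred_labels)
    return {c: tp[c] / (tp[c] + fp[c] + fn[c]) for c in classes}

# Hypothetical 100-pixel strip: 95 road pixels, 5 cyclist pixels.
true_labels = ["road"] * 95 + ["cyclist"] * 5
pred_labels = ["road"] * 98 + ["cyclist"] * 2  # 3 cyclist pixels labelled as road

accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
ious = per_class_iou(true_labels, pred_labels)
print(accuracy)         # 0.97
print(ious["cyclist"])  # 0.4
```

Overall accuracy reads 97% while the cyclist class sits at an IoU of 0.4, which is exactly the kind of problem a single headline number hides.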
