What file formats are used for 3D cuboid annotation output?

Common 3D cuboid annotation output formats include KITTI format (text files with 3D box parameters per frame), nuScenes JSON (with sensor calibration metadata), Waymo Open Dataset tfrecord format, and COCO-3D. Point cloud source data is typically stored as PCD, PLY, or binary .bin files. The choice of output format should be driven by your downstream training framework — most major AV model architectures have loaders for KITTI and nuScenes formats.

What Is 3D Cuboid Annotation and How Is It Used in Autonomous Driving?

Quick answer

3D cuboid annotation is the process of fitting a seven-parameter bounding box, defined by position (x, y, z), dimensions (length, width, height), and yaw angle, around objects in LiDAR point cloud data. In autonomous driving, cuboids train perception models to detect and localise vehicles, pedestrians, and cyclists in real-world 3D space, enabling safe trajectory planning and collision avoidance.

What 3D Cuboid Annotation Actually Is

A 3D cuboid is a rectangular box placed directly in three-dimensional point cloud space, not drawn on a camera image. Each cuboid carries seven parameters: three for position (x, y, z centre coordinates relative to the LiDAR sensor), three for physical dimensions (length along the object's forward axis, width across its lateral axis, and height), and one for heading angle (yaw, the rotation around the vertical axis). These parameters allow a perception model to answer the questions that matter for safe operation: where is this object, how far away is it, how big is it, and which way is it pointed?
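The seven parameters can be written down as a small data structure. The sketch below is illustrative only: the class name `Cuboid`, the axis convention (x forward, y left, z up, yaw of zero meaning the object faces along +x), and the example length and height values are assumptions, not something a particular tool or dataset prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class Cuboid:
    """Seven-parameter box in the sensor frame (assumed x forward, y left, z up)."""
    x: float       # centre position, metres
    y: float
    z: float
    length: float  # extent along the object's forward axis, metres
    width: float   # extent across the object's lateral axis, metres
    height: float  # vertical extent, metres
    yaw: float     # heading in radians, rotation about the vertical axis

    def corners(self):
        """Return the eight corner points as (x, y, z) tuples."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        points = []
        for dx in (self.length / 2, -self.length / 2):
            for dy in (self.width / 2, -self.width / 2):
                for dz in (self.height / 2, -self.height / 2):
                    # Rotate the local offset by yaw, then translate to the centre.
                    points.append((self.x + c * dx - s * dy,
                                   self.y + s * dx + c * dy,
                                   self.z + dz))
        return points

    def ground_range(self):
        """Horizontal distance from the sensor origin to the box centre."""
        return math.hypot(self.x, self.y)


# A car 12.4 m ahead, 1.9 m wide, heading 10 degrees off the ego path.
# The 4.5 m length and 1.5 m height are made-up values for illustration.
car = Cuboid(x=12.4, y=0.0, z=0.0, length=4.5, width=1.9, height=1.5,
             yaw=math.radians(10))
print(car.ground_range(), len(car.corners()))
```

Storing the box as centre, size, and yaw rather than as eight corners keeps the label compact and makes the heading explicit, which is the form most 3D detection formats expect.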

In contrast to 2D bounding boxes drawn on RGB camera frames, 3D cuboids operate in metric space. A car detected 12.4 metres ahead with a width of 1.9 metres and a heading 10 degrees off the ego vehicle's path provides everything a planning module needs to compute a safe overtaking manoeuvre. A 2D box on a camera image gives only pixel coordinates — the depth and real-world geometry must be inferred, introducing additional uncertainty.

The annotation task itself involves loading a frame of LiDAR returns (typically 100,000–300,000 individual range measurements per frame) into a 3D annotation tool, then placing and sizing a cuboid around each object of interest. An annotator adjusts the cuboid's seven parameters until the box tightly fits the visible point cloud returns of that object, with consistent conventions for handling partial occlusion and truncation at the sensor's field-of-view boundary.

Why AV Teams Use 3D Cuboids Instead of Other Label Types

Autonomous vehicle perception pipelines use several label types — 2D bounding boxes, semantic segmentation masks, polylines for lane markings, and 3D cuboids — but cuboids are the primary label type for object detection and tracking in 3D space. The reason is directness: a perception model trained on 3D cuboids learns to output predicted boxes in the same format the planning stack consumes. There is no depth estimation step, no monocular distance inference, no coordinate unprojection.

According to a 2024 industry analysis by McKinsey Global Institute, perception data annotation — including 3D cuboid labelling — accounts for 30–40% of total AV development cost at most teams past the prototype stage. The global autonomous vehicle market is projected to reach US$556 billion by 2030 (Statista), and the annotation workload scales linearly with the number of hours of sensor data collected: a single eight-hour data collection run with a LiDAR spinning at 10 Hz generates 288,000 frames requiring annotation.
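The frame count quoted above follows directly from run duration and sensor rate. A minimal sketch of that arithmetic, with `frames_collected` as a hypothetical helper name:

```python
def frames_collected(hours, lidar_hz):
    """Number of LiDAR frames produced by a run of the given length and scan rate."""
    seconds = hours * 3600
    return round(seconds * lidar_hz)


# Eight hours at 10 Hz: 8 * 3600 * 10
print(frames_collected(8, 10))  # 288000
```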

Our 3D cuboid annotation services cover the full object class taxonomy typical of AV programmes: vehicles (passenger, commercial, motorcycle), vulnerable road users (pedestrians, cyclists, e-scooter riders), static obstacles (traffic cones, barriers, bollards), and miscellaneous objects (shopping trolleys, animals). Each class carries its own IoU acceptance threshold and handling convention for partial scans.

The Production 3D Cuboid Annotation Workflow

A production-quality 3D cuboid annotation workflow has four distinct stages. Skipping any one of them reliably produces training data that underperforms, particularly on small or partially occluded objects.

1. Point cloud pre-processing

Ground plane removal, intensity normalisation, and intensity-range filtering are applied before annotation begins. This eliminates the ground return clutter that confuses annotators and reduces the number of points they need to work with. Pre-processing also handles LiDAR artefacts, such as multi-return ghost points and motion distortion from the rotating scan, that appear as false objects if not cleaned first.
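As a rough illustration of these three steps, the sketch below filters a list of (x, y, z, intensity) points. It makes simplifying assumptions that are labelled in the code: the ground is treated as a flat plane at a fixed height in the sensor frame (production pipelines commonly fit the plane instead, for example with RANSAC), and the default heights and intensity limits are placeholder values.

```python
def preprocess(points, ground_z=-1.6, margin=0.2, intensity_range=(0.05, 1.0)):
    """Filter (x, y, z, intensity) points before annotation.

    Assumptions (hypothetical defaults): flat ground at ground_z metres in the
    sensor frame, and raw intensities that are non-negative.
    """
    # 1. Ground removal: drop anything within `margin` of the assumed ground height.
    above_ground = [p for p in points if p[2] > ground_z + margin]
    if not above_ground:
        return []

    # 2. Intensity normalisation: scale to [0, 1] by the strongest remaining return.
    peak = max(p[3] for p in above_ground) or 1.0
    normalised = [(x, y, z, i / peak) for x, y, z, i in above_ground]

    # 3. Intensity-range filtering: keep only returns inside the gate.
    low, high = intensity_range
    return [p for p in normalised if low <= p[3] <= high]


frame = [
    (5.0, 0.0, -1.6, 100.0),  # ground return, removed in step 1
    (5.0, 0.0, -0.5, 200.0),  # strong object return, kept
    (6.0, 1.0, 0.0, 2.0),     # very weak return, removed in step 3
    (7.0, 0.0, 0.3, 50.0),    # object return, kept
]
print(preprocess(frame))
```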

2. Initial cuboid placement

Annotators place each cuboid by clicking the centroid of the object's point cluster, then adjust dimensions and heading angle. The single largest source of error at this stage is heading angle: annotators routinely misread yaw by 5–15 degrees when a vehicle is viewed from an oblique angle with sparse returns. Guidelines must specify the heading convention (front-facing positive x-axis) and provide worked examples for occluded or truncated objects.
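Heading error has to be measured on a circle, so 350° and 10° are 20° apart rather than 340°. A minimal sketch of that comparison follows; the function names and the 5° review tolerance are illustrative assumptions.

```python
import math


def yaw_error_deg(annotated_yaw, reference_yaw):
    """Absolute heading difference in degrees, wrapped to the range [0, 180]."""
    diff = (annotated_yaw - reference_yaw + math.pi) % (2 * math.pi) - math.pi
    return abs(math.degrees(diff))


def needs_heading_review(annotated_yaw, reference_yaw, tolerance_deg=5.0):
    """Flag a cuboid whose heading deviates from the reference by more than the tolerance."""
    return yaw_error_deg(annotated_yaw, reference_yaw) > tolerance_deg


print(yaw_error_deg(math.radians(350), math.radians(10)))
print(needs_heading_review(math.radians(12), math.radians(0)))
```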

3. Multi-frame consistency and tracking

Objects must carry consistent identifiers across frames — a pedestrian assigned ID 42 in frame 800 must remain ID 42 in frames 801–850, with smoothly interpolated positions. Annotation tools like Scale Sensor Fusion and SuperAnnotate 3D support track-linking; teams using CVAT or open-source tools typically build track consistency via post-processing scripts. Inconsistent object IDs are the most common reason AV multi-object tracking (MOT) metrics collapse during model evaluation.
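One common way to keep IDs and positions consistent is to annotate keyframes and fill the frames between them by interpolation. The sketch below shows linear interpolation between two keyframes for a single track; the keyframe tuple layout and the output dictionary fields are hypothetical, not the export format of any tool named above.

```python
import math


def interpolate_track(track_id, start, end):
    """Linearly interpolate a cuboid centre and yaw between two keyframes.

    `start` and `end` are (frame, x, y, z, yaw) tuples (an assumed layout).
    Every generated frame carries the same track_id.
    """
    f0, x0, y0, z0, yaw0 = start
    f1, x1, y1, z1, yaw1 = end
    if f1 <= f0:
        raise ValueError("end keyframe must come after start keyframe")

    # Take the shortest way round the circle so yaw does not spin through 360 degrees.
    dyaw = (yaw1 - yaw0 + math.pi) % (2 * math.pi) - math.pi

    frames = []
    for frame in range(f0, f1 + 1):
        t = (frame - f0) / (f1 - f0)
        frames.append({
            "track_id": track_id,
            "frame": frame,
            "x": x0 + t * (x1 - x0),
            "y": y0 + t * (y1 - y0),
            "z": z0 + t * (z1 - z0),
            "yaw": yaw0 + t * dyaw,
        })
    return frames


# Pedestrian 42 walks 5 m along x between frames 800 and 850.
track = interpolate_track(42, (800, 0.0, 0.0, 0.0, 0.0), (850, 5.0, 0.0, 0.0, 0.0))
print(len(track), track[25])
```

Linear interpolation is only a starting point: objects that accelerate or turn between keyframes still need a human correction pass on the intermediate frames.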

4. QA review and IoU validation

A separate QA annotator reviews each frame against a gold standard cuboid set. Automated IoU calculation flags any cuboid with 3D IoU below the class threshold (typically 0.70 for vehicles, 0.50 for pedestrians) for rework. Systematic dimension errors — annotators consistently undersizing vehicle widths because the door returns are sparse — are caught through per-annotator dimension distribution analysis and corrected via calibration sessions.
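For yaw-rotated boxes, 3D IoU can be computed as the overlap area of the two footprints in the bird's-eye view multiplied by the vertical overlap. The sketch below is one possible implementation using Sutherland-Hodgman polygon clipping; it assumes boxes are given as (x, y, z, length, width, height, yaw) tuples with z at the box centre, and the `needs_rework` helper and class names are hypothetical. The thresholds are the ones quoted above.

```python
import math

# QA acceptance thresholds quoted in the text above.
CLASS_THRESHOLDS = {"vehicle": 0.70, "pedestrian": 0.50}


def bev_corners(box):
    """Footprint corners in counter-clockwise order for (x, y, z, l, w, h, yaw)."""
    x, y, _, length, width, _, yaw = box
    c, s = math.cos(yaw), math.sin(yaw)
    offsets = [(length / 2, width / 2), (-length / 2, width / 2),
               (-length / 2, -width / 2), (length / 2, -width / 2)]
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in offsets]


def _side(a, b, p):
    """Positive when p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_polygon(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` against a convex CCW `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        for j, cur in enumerate(inputs):
            prev = inputs[j - 1]
            sp, sc = _side(a, b, prev), _side(a, b, cur)
            if (sp < 0) != (sc < 0):
                # The segment crosses the clip edge: add the crossing point.
                t = sp / (sp - sc)
                output.append((prev[0] + t * (cur[0] - prev[0]),
                               prev[1] + t * (cur[1] - prev[1])))
            if sc >= 0:
                output.append(cur)
    return output


def polygon_area(poly):
    """Shoelace formula; returns 0.0 for an empty polygon."""
    n = len(poly)
    total = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(total) / 2


def iou_3d(box_a, box_b):
    """Volumetric IoU of two yaw-rotated boxes."""
    footprint = polygon_area(clip_polygon(bev_corners(box_a), bev_corners(box_b)))
    top = min(box_a[2] + box_a[5] / 2, box_b[2] + box_b[5] / 2)
    bottom = max(box_a[2] - box_a[5] / 2, box_b[2] - box_b[5] / 2)
    intersection = footprint * max(0.0, top - bottom)
    union = box_a[3] * box_a[4] * box_a[5] + box_b[3] * box_b[4] * box_b[5] - intersection
    return intersection / union if union > 0 else 0.0


def needs_rework(annotated, gold, object_class):
    """True when the annotated box falls below the class threshold."""
    return iou_3d(annotated, gold) < CLASS_THRESHOLDS[object_class]


gold = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)
shifted = (2.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)
print(iou_3d(gold, shifted), needs_rework(shifted, gold, "vehicle"))
```

A pure-Python version like this is easy to audit but slow on large batches; vectorised or GPU implementations are the usual choice once the logic is validated.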

Sensor Fusion: Aligning Cuboids with Camera Images

Most production AV datasets pair LiDAR cuboids with corresponding camera annotations — 2D bounding boxes or image-space projections of the 3D cuboids — to enable multi-modal perception training. Sensor fusion annotation requires calibrated extrinsic and intrinsic parameters for each camera-LiDAR pair, plus frame-level time synchronisation to within a few milliseconds at vehicle speeds.

The most common fusion annotation task is camera-LiDAR cross-verification: after cuboids are placed in 3D, they are projected into each camera's image space. Annotators check that the projected box aligns with the visible object in the image, catching heading errors and dimension mismatches that are harder to see in sparse point cloud data. A cuboid whose projection is consistently offset from the camera object by more than 5% of the object's image width indicates a systematic error — usually a heading angle bias or a stale calibration file.
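The cross-verification step can be sketched as a pinhole projection followed by an offset check. Everything named here is an assumption for illustration: the LiDAR frame is taken as x forward, y left, z up; the camera frame as x right, y down, z forward; and the rotation `R`, translation `t`, and intrinsics are made-up values standing in for a real calibration file.

```python
def project_point(p_lidar, R, t, fx, fy, cx, cy):
    """Project a LiDAR-frame point to pixel coordinates, or None if behind the camera."""
    # Extrinsics: camera-frame point = R * p_lidar + t
    xc = sum(R[0][k] * p_lidar[k] for k in range(3)) + t[0]
    yc = sum(R[1][k] * p_lidar[k] for k in range(3)) + t[1]
    zc = sum(R[2][k] * p_lidar[k] for k in range(3)) + t[2]
    if zc <= 0:
        return None
    # Intrinsics: perspective divide, then scale and shift to pixels.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def projection_offset_ratio(projected_pixels, image_box):
    """Horizontal offset between the projected cuboid and a 2D box, as a
    fraction of the 2D box width. image_box is (x_min, y_min, x_max, y_max)."""
    us = [u for u, _ in projected_pixels]
    projected_centre = (min(us) + max(us)) / 2
    x_min, _, x_max, _ = image_box
    return abs(projected_centre - (x_min + x_max) / 2) / (x_max - x_min)


# Hypothetical forward-facing camera located at the LiDAR origin.
R = [[0, -1, 0],
     [0, 0, -1],
     [1, 0, 0]]
t = (0.0, 0.0, 0.0)
intrinsics = dict(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)

corners = [(10.0, 1.0, 0.5), (10.0, -1.0, 0.5), (10.0, 1.0, -0.5), (10.0, -1.0, -0.5)]
pixels = [project_point(c, R, t, **intrinsics) for c in corners]
ratio = projection_offset_ratio(pixels, (880.0, 490.0, 1080.0, 590.0))
print(ratio, ratio > 0.05)  # flag when the offset exceeds 5% of the object's image width
```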

For teams building models that fuse camera and LiDAR in a single architecture (for example PointPainting or BEVFusion), annotation quality in the fusion alignment step directly determines whether the camera branch improves or degrades LiDAR-only baseline performance. Our guide to 3D point cloud annotation for AV teams covers the multi-modal alignment workflow in more detail.


QA for 3D Cuboid Data: The Metrics That Separate Production from Research Grade

Three-dimensional Intersection over Union (3D IoU) is the primary QA metric for cuboid annotation quality. It measures the volumetric overlap between an annotated cuboid and a gold-standard reference cuboid. A 2023 analysis published in IEEE Transactions on Intelligent Transportation Systems found that pedestrian detection models trained on cuboid data with average 3D IoU below 0.65 had 2.4x higher false-negative rates at 25–40 m range compared to models trained on data with average IoU above 0.82 — a material safety difference for any vehicle operating near pedestrians.

Beyond per-cuboid IoU, production QA also tracks:

Object count consistency — the number of annotated objects per class should not vary by more than ±5% across frames with similar scene density
Per-class dimension distributions — car widths should cluster around 1.8–2.1 m; outliers indicate dimension errors
Track break rate — percentage of trajectories where object ID is lost mid-sequence; targets <3% for production datasets
Occluded object handling rate — percentage of partially visible objects that are correctly annotated versus skipped; should match guidelines specification
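Three of these dataset-level checks are simple enough to sketch directly. The function names, the input layouts, and the 0.3 m margin around the car-width band are assumptions for illustration; in particular, a track is treated here as broken when its frame list has a gap, which will also flag objects that legitimately leave and re-enter the scene.

```python
from statistics import median


def track_break_rate(tracks):
    """Fraction of tracks whose frame numbers contain a gap.

    `tracks` maps track ID -> list of frame numbers where that ID appears.
    """
    if not tracks:
        return 0.0
    broken = 0
    for frames in tracks.values():
        unique = sorted(set(frames))
        if unique and unique[-1] - unique[0] + 1 != len(unique):
            broken += 1
    return broken / len(tracks)


def width_outliers(widths, low=1.8, high=2.1, margin=0.3):
    """Car widths (metres) falling outside the expected band plus a margin."""
    return [w for w in widths if w < low - margin or w > high + margin]


def inconsistent_count_frames(counts, tolerance=0.05):
    """Indices of frames whose object count deviates from the median by more
    than the tolerance (frames are assumed to have similar scene density)."""
    centre = median(counts)
    if centre == 0:
        return [i for i, c in enumerate(counts) if c != 0]
    return [i for i, c in enumerate(counts) if abs(c - centre) / centre > tolerance]


tracks = {1: [0, 1, 2, 3], 2: [0, 1, 3, 4], 3: [5, 6, 7]}
print(track_break_rate(tracks))
print(width_outliers([1.85, 1.95, 1.2, 2.0, 2.9]))
print(inconsistent_count_frames([100, 101, 99, 120]))
```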

These metrics require comparison against a gold standard set, which in turn requires experienced LiDAR annotation specialists rather than general-purpose labellers. Our 3D cuboid annotation guide covers QA architecture and gold standard design in depth.

Case Study: Urban Perception Pipeline for an Australian AV Startup

An Australian autonomous last-mile delivery vehicle startup came to us with a perception dataset collected across suburban and inner-city Melbourne routes. They had been annotating internally using a small team of engineers who were proficient in LiDAR tools but had no formal annotation workflow or QA process.

Project parameters

Parameter	Detail
Dataset volume	2.4 million LiDAR frames (64-beam rotating LiDAR)
Object classes	7 classes: car, truck, cyclist, pedestrian, e-scooter, traffic cone, bus
Sensor configuration	Single roof-mounted LiDAR + 4 surround cameras (fusion task)
Timeline	6 weeks to full production throughput

The problem: Internal QA measured 3D IoU averaging 0.71 on vulnerable road user classes (pedestrians, cyclists) — below the 0.80 threshold their model training pipeline required for acceptable downstream performance. Heading angle errors on cyclists ranged from 8 to 22 degrees, and track IDs broke on 11% of trajectories longer than 30 frames. Annotation throughput was approximately 420 cuboids per annotator-hour, which was unsustainable given the scale of the collected dataset.

The intervention: We implemented a three-stage workflow: (1) ground plane removal and intensity gating pre-processing applied batch-wise before annotation; (2) two-pass annotation — initial cuboid placement followed by a dedicated heading-angle refinement pass with camera fusion cross-check; (3) automated IoU scoring against a 500-frame gold standard set, with per-annotator feedback every 200 frames.

We also introduced a cyclist-specific heading convention — derived from the visible wheel returns rather than the torso, which is often occluded — and ran a two-day calibration workshop before full-scale production began.

Before and after

Metric	Before	After (Week 6)
Mean 3D IoU (VRU classes)	0.71	0.89
Cyclist heading error range	8–22°	1–5°
Track break rate (long trajectories)	11%	2.4%
Annotation throughput (cuboids/annotator-hour)	420	1,100
Frame-to-frame ID consistency	82%	97%

The throughput gain — from 420 to 1,100 cuboids/hour — came primarily from the pre-processing step (eliminating ground clutter that annotators had previously been navigating around) and from specialised annotators who developed class-specific muscle memory over the first two weeks. The IoU gain came almost entirely from the heading convention fix and the camera fusion cross-check, which caught heading errors that were invisible in sparse point cloud views.

What 3D Cuboid Annotation Services Cost

Pricing for 3D cuboid annotation varies significantly with scene complexity, object density, and the required QA level. Highway scenarios with well-spaced vehicles and high point density are the cheapest to annotate. Dense urban intersections with overlapping pedestrians, cyclists, and partially occluded vehicles at close range are the most expensive.

Scenario type	Typical cost (AUD, per object)	Notes
Highway / rural (low density)	$0.15 – $0.28	Vehicles only, no VRUs
Suburban (mixed classes)	$0.30 – $0.50	Includes VRUs, moderate occlusion
Dense urban / intersection	$0.55 – $0.90	High occlusion, track complexity
Sensor fusion (LiDAR + camera)	Add 25–40%	Per-object cross-validation step

Volume discounts of 15–25% are standard at 500,000+ objects. Teams with existing pre-annotation models can reduce per-object cost by 35–50% by providing model predictions as a starting point for human annotators to review and correct. See our autonomous vehicle data annotation guide for a full cost breakdown across the AV sensor stack. For LiDAR-specific workflows, our LiDAR annotation guide covers format choices and QA architecture in depth.
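The table and discount figures above can be combined into a rough budget range. This is a sketch under stated assumptions rather than a quoting formula: the function and key names are hypothetical, and it assumes the fusion surcharge and the volume discount are applied multiplicatively to the per-object rate.

```python
# Per-object price ranges (AUD) taken from the table above.
RATES_AUD = {
    "highway": (0.15, 0.28),
    "suburban": (0.30, 0.50),
    "dense_urban": (0.55, 0.90),
}
FUSION_SURCHARGE = (0.25, 0.40)   # add 25-40% for sensor fusion
VOLUME_DISCOUNT = (0.15, 0.25)    # 15-25% off at 500,000+ objects
VOLUME_THRESHOLD = 500_000


def estimate_cost(scenario, objects, fusion=False):
    """Return a (low, high) cost estimate in AUD for the given object count."""
    low_rate, high_rate = RATES_AUD[scenario]
    low, high = low_rate * objects, high_rate * objects
    if fusion:
        low *= 1 + FUSION_SURCHARGE[0]
        high *= 1 + FUSION_SURCHARGE[1]
    if objects >= VOLUME_THRESHOLD:
        # Best case gets the largest discount, worst case the smallest.
        low *= 1 - VOLUME_DISCOUNT[1]
        high *= 1 - VOLUME_DISCOUNT[0]
    return low, high


print(estimate_cost("suburban", 100_000))
print(estimate_cost("dense_urban", 600_000, fusion=True))
```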

Our 3D cuboid annotation service operates on both time-and-materials and fixed-price-per-object pricing models depending on project scope. Pilot datasets of 1,000–5,000 frames are typically turned around in five to seven business days, allowing teams to validate annotation quality before committing to full-scale production. We also support autonomous vehicle annotation programmes more broadly, including lane marking, lane detection annotation, and semantic segmentation of driveable surface. LiDAR annotation services for robotics and AV teams are also available.

Frequently Asked Questions

What is 3D cuboid annotation?

3D cuboid annotation is the process of fitting a seven-parameter bounding box around objects in LiDAR point cloud data. Each cuboid specifies position (x, y, z), dimensions (length, width, height), and heading angle (yaw). In autonomous driving, cuboids train perception models to detect and localise objects in real-world 3D space.

How does 3D cuboid annotation differ from 2D bounding boxes?

A 2D bounding box is drawn on a camera image and captures only pixel location. A 3D cuboid is placed in metric 3D space and captures real-world distance, physical size, and orientation — everything a planning module needs for trajectory prediction and collision avoidance.

What 3D IoU threshold is required for autonomous vehicle annotation?

Production AV datasets typically require 3D IoU ≥ 0.70 for vehicles and ≥ 0.50 for pedestrians/cyclists as the QA acceptance threshold. Safety-critical applications often raise the vehicle threshold to 0.85 or higher.

What does 3D cuboid annotation cost per object?

Highway scenes: AUD $0.15–$0.28 per object. Dense urban: AUD $0.55–$0.90. Sensor fusion cross-validation adds 25–40%. Volume discounts of 15–25% apply at 500,000+ objects.

Can 3D cuboid annotation be automated with AI pre-labelling?

Yes — AI pre-annotation can reduce human annotation time by 40–60% on well-represented classes. The automation benefit is lowest for rare object classes and edge-case scenes, which are precisely where annotation quality matters most for model safety.

What file formats are used for 3D cuboid annotation?

Common formats include KITTI text format, nuScenes JSON, and Waymo Open Dataset tfrecord. Most major AV training frameworks have loaders for KITTI and nuScenes. Format choice should match your downstream training pipeline.
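As an illustration of the KITTI text format, each object is one whitespace-separated line: class, truncation, occlusion, observation angle, a 2D image box, 3D dimensions in height-width-length order, 3D location, and rotation about the vertical axis. Note that KITTI stores location and rotation in the camera coordinate frame, so converting to and from LiDAR-frame cuboids needs the calibration file. The parser below is a minimal sketch; the function name, the dictionary keys, and the sample line values are made up for illustration.

```python
def parse_kitti_label_line(line):
    """Parse one line of a KITTI object label file into a dictionary."""
    fields = line.split()
    if len(fields) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(fields))
    return {
        "type": fields[0],
        "truncated": float(fields[1]),
        "occluded": int(fields[2]),
        "alpha": float(fields[3]),
        "bbox": tuple(float(v) for v in fields[4:8]),        # left, top, right, bottom (pixels)
        "height": float(fields[8]),                          # metres
        "width": float(fields[9]),
        "length": float(fields[10]),
        "location": tuple(float(v) for v in fields[11:14]),  # x, y, z in camera frame
        "rotation_y": float(fields[14]),                     # radians
    }


# Illustrative line, not taken from a real dataset file.
sample = "Car 0.00 0 -1.57 614.24 181.78 727.31 284.77 1.57 1.73 4.15 1.00 1.75 13.22 1.62"
label = parse_kitti_label_line(sample)
print(label["type"], label["length"], label["location"])
```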


Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
