What is 3D cuboid annotation?

3D cuboid annotation is the process of drawing a 3D bounding box (a cuboid) around an object in image or LiDAR point-cloud data so an AI model can learn the object's real-world position, dimensions, and orientation. Unlike a flat 2D box, a cuboid encodes seven degrees of freedom: 3D center position (x, y, z), size (length, width, height), and heading angle (yaw). This lets perception systems reason about depth and how objects are facing — essential for autonomous driving and robotics.

How is 3D cuboid annotation different from 2D bounding box annotation?

A 2D bounding box only marks where an object appears on the image plane (4 values: x, y, width, height). A 3D cuboid adds depth and orientation: it places the object in metric 3D space with a heading angle, so the model knows a car is 4.5 m long, 12 m ahead, and angled to the left. 2D boxes are cheaper and fine for detection-only tasks; 3D cuboids are required whenever the system needs distance, scale, or facing direction, as in collision avoidance, path planning, and robot manipulation.

What file formats are used for 3D cuboid annotation?

The dominant formats are KITTI (camera + Velodyne LiDAR, label files with location, dimensions, and rotation_y), nuScenes (JSON with sample_annotation linking 3D boxes across a multi-sensor scene and time), and Waymo Open Dataset (protocol buffers with 7-DOF labels and tracking IDs). Many teams also use a custom JSON schema. Decide your target format before annotation begins — converting between them after the fact is lossy, especially for coordinate-frame and yaw conventions.

Do you need LiDAR for 3D cuboid annotation?

Not always, but it helps enormously. With LiDAR, annotators place cuboids directly in the point cloud where depth and dimensions are measured, which is far more accurate. Camera-only (monocular or stereo) 3D cuboid annotation is possible and common for cost-sensitive projects, but depth is estimated rather than measured, so accuracy is lower. The highest-quality datasets fuse LiDAR and camera so annotators cross-check the cuboid in both views.

How is 3D cuboid annotation quality measured?

The core metrics are 3D IoU (intersection-over-union of the annotated and gold-standard cuboid volumes, typically thresholded at 0.5 or 0.7), mean Average Precision (mAP) per class, and orientation error (the heading-angle difference, often reported as KITTI's Average Orientation Similarity or nuScenes' Average Orientation Error, AOE). Production QA also tracks translation error, scale error, and inter-annotator agreement on a gold-standard set.

How much does 3D cuboid annotation cost?

Pricing is usually per object (per cuboid) and depends on sensor setup and tracking. As a 2026 rule of thumb: single-frame image cuboids of static scenes sit at the low end of the range, while LiDAR or fused LiDAR-camera cuboids with cross-frame tracking cost more because each object must stay consistent across the sequence. Volume, occlusion density, and the number of classes drive the final rate. The honest way to scope it is a paid or free pilot on a representative sample.

What tools are used for 3D cuboid annotation?

Specialist 3D tools let annotators rotate the scene, snap cuboids to point-cloud clusters, and propagate boxes across frames. Common platforms include the open-source SUSTechPOINTS and CVAT's 3D mode, plus commercial tools such as Segments.ai, Deepen, and the 3D editors inside larger labeling suites. The tool matters less than the workflow: ground-plane fitting, one-click cuboid fitting to clusters, and interpolation across frames are the features that drive both speed and consistency.

3D Cuboid Annotation: The Complete Guide to 3D Bounding Boxes (2026)

3D cuboid annotation is the workhorse label type behind autonomous vehicles, warehouse robots, and any AI that has to operate in physical space. If your model needs to know that the car ahead is 12 metres away, 4.5 metres long, and turning into your lane — not just that “there is a car somewhere in this image” — you need cuboids, not flat boxes.

This guide covers what a 3D cuboid actually encodes, how it differs from 2D bounding boxes, the formats and tools teams use in 2026, how sensor fusion changes the workflow, and how to measure quality. It's written for ML engineers and data leads scoping a perception dataset for the first time — and for teams whose first 3D dataset came back unusable and want to know why.

What a 3D Cuboid Actually Encodes: 7 Degrees of Freedom

A 2D bounding box is four numbers — x, y, width, height — on a flat image. A 3D cuboid is a box in the real world, and it carries seven degrees of freedom (7-DOF):

Position (x, y, z): the 3D centre of the object in metres, relative to the sensor or vehicle frame.
Dimensions (length, width, height): the metric size of the object — a sedan is roughly 4.5 × 1.8 × 1.5 m.
Heading / yaw: the rotation around the vertical axis — which way the object is facing. This single value is what lets a planner know whether a pedestrian is about to step into the road or walk along it.

Most road and warehouse datasets annotate yaw only (objects sit flat on the ground), but full 9-DOF annotation adds pitch and roll for drones, aircraft, and uneven terrain. Lock this decision early: a dataset annotated yaw-only cannot be retro-fitted to full orientation without re-labelling.
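To make the seven numbers concrete, here is a minimal Python sketch of a yaw-only cuboid and the eight corners it implies. The class name, field names, and axis convention (x forward, y left, z up, yaw measured counter-clockwise from the x axis) are assumptions for illustration; each dataset format defines its own.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuboid:
    """A yaw-only (7-DOF) cuboid. Field names are illustrative, not from any dataset."""
    x: float       # centre, metres
    y: float
    z: float
    length: float  # extent along the heading direction, metres
    width: float   # extent across the heading direction, metres
    height: float  # vertical extent, metres
    yaw: float     # rotation about the vertical (z) axis, radians

    def corners(self):
        """Return the 8 corners as (x, y, z) tuples: bottom face first, then top."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        half_l, half_w, half_h = self.length / 2, self.width / 2, self.height / 2
        out = []
        for dz in (-half_h, half_h):
            for dx, dy in ((half_l, half_w), (half_l, -half_w),
                           (-half_l, -half_w), (-half_l, half_w)):
                # Rotate the box-frame offset by yaw, then translate to the centre.
                out.append((self.x + c * dx - s * dy,
                            self.y + s * dx + c * dy,
                            self.z + dz))
        return out


# A sedan-sized box 12 m ahead, resting on the z = 0 plane.
sedan = Cuboid(x=12.0, y=0.0, z=0.75, length=4.5, width=1.8, height=1.5, yaw=0.0)
corners = sedan.corners()
```

Storing centre, size, and yaw rather than eight free corners keeps the label compact and guarantees the box stays rectangular; corners are derived only when they are needed for rendering or projection.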

2D Bounding Box vs 3D Cuboid: When You Actually Need Cuboids

Cuboids cost more to annotate than 2D boxes, so the honest first question is whether you need them at all. Use this test:

Detection only (“is there a car in frame?”) → a 2D bounding box is enough.
Distance, scale, or facing direction matters (collision avoidance, path planning, robot grasping, parking) → you need a 3D cuboid.
Pixel-exact shape matters (driveable area, lane surface) → that's segmentation, a different task.

In practice, most autonomous-driving stacks use all three: 2D boxes for cheap high-recall detection, cuboids for the objects that drive control decisions, and segmentation for the road surface. The cuboid is the expensive, high-value label — spend your QA budget there.

How 3D Cuboid Annotation Works, Step by Step

A production cuboid workflow looks like this:

1. Ground-plane fitting. The annotation tool estimates the road/floor plane so cuboids snap flat to the ground. Skipping this is the most common source of tilted, low-IoU boxes.
2. Cuboid placement. The annotator fits a box to the object — in LiDAR, by snapping to the point-cloud cluster; in camera-only, by aligning edges to the visible object and estimating depth.
3. Orientation. Heading is set so the box's long axis matches the object's direction of travel. This is where amateur datasets fail: a 180° yaw flip, or a 90° error with length and width swapped, leaves the box outline unchanged, so it is invisible in a top-down quick-check but breaks the model.
4. Cross-frame tracking. For video or LiDAR sequences, each object keeps a stable track ID and the box is interpolated between keyframes, so the same vehicle is “object 14” across all 200 frames.
5. Occlusion & truncation flags. Partially hidden objects are still boxed to their full real-world extent and tagged with an occlusion level, so the model learns to reason about hidden geometry.
6. QA & adjudication. A reviewer checks 3D IoU against gold standards, orientation consistency, and track continuity before delivery.
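Step 1 can be sketched as a least-squares plane fit. This is an illustration only: the function names are hypothetical, the input is assumed to be points already known to lie on the ground, and production tools typically use a robust estimator (RANSAC, for example) so that points on vehicles and kerbs do not skew the fit.

```python
import numpy as np


def fit_ground_plane(points):
    """Least-squares fit of z = a*x + b*y + c to an (N, 3) array of ground points."""
    pts = np.asarray(points, dtype=float)
    design = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    (a, b, c), *_ = np.linalg.lstsq(design, pts[:, 2], rcond=None)
    return a, b, c


def snap_to_ground(centre_xy, height, plane):
    """Return the z of a cuboid centre whose bottom face rests on the plane."""
    a, b, c = plane
    ground_z = a * centre_xy[0] + b * centre_xy[1] + c
    return ground_z + height / 2.0


# Synthetic ground: sensor 1.7 m above a road that rises 1 cm per metre ahead.
xs, ys = np.meshgrid(np.linspace(0, 40, 9), np.linspace(-10, 10, 5))
ground = np.column_stack([xs.ravel(), ys.ravel(), 0.01 * xs.ravel() - 1.7])
plane = fit_ground_plane(ground)
centre_z = snap_to_ground((10.0, 0.0), height=1.5, plane=plane)
```

With the plane known, the annotator only has to place the box in x and y; the vertical position follows from the fit, which removes one source of tilted or floating boxes.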

The unglamorous truth: speed and quality both come from the first and fourth steps. Good ground-plane fitting and good interpolation are what let one vendor deliver the same accuracy at a fraction of another's cost.
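Keyframe interpolation from step 4 is simple for position and size, but heading needs care because angles wrap around. The sketch below assumes boxes are stored as [x, y, z, l, w, h, yaw] with yaw in radians; the function names are hypothetical, and real tools may use motion models rather than straight-line interpolation.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def interpolate_box(key_a, key_b, frame):
    """Linearly interpolate a 7-DOF box between two keyframes.

    key_a and key_b are (frame_index, [x, y, z, l, w, h, yaw]) pairs.
    """
    frame_a, box_a = key_a
    frame_b, box_b = key_b
    t = (frame - frame_a) / (frame_b - frame_a)
    # Position and size interpolate component-wise.
    out = [a + t * (b - a) for a, b in zip(box_a[:6], box_b[:6])]
    # Yaw takes the shortest arc, so 170 deg -> -170 deg turns 20 deg, not 340.
    delta = wrap_angle(box_b[6] - box_a[6])
    out.append(wrap_angle(box_a[6] + t * delta))
    return out


start = (0, [0.0, 0.0, 0.0, 4.5, 1.8, 1.5, math.radians(170)])
end = (10, [10.0, 0.0, 0.0, 4.5, 1.8, 1.5, math.radians(-170)])
middle = interpolate_box(start, end, frame=5)
```

Interpolating the raw yaw values here would sweep the box through 0° and briefly point it backwards; taking the wrapped difference keeps the intermediate heading near 180°, where the object actually is.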

Formats: KITTI, nuScenes, and Waymo

Pick your output format before the first label is drawn. Three standards dominate, and their conventions differ in ways that cause silent bugs:

KITTI: the original benchmark. Per-frame text label files with object class, 3D dimensions, location in camera coordinates, and a single rotation_y yaw. Simple, widely supported, but camera-frame coordinates trip up teams expecting LiDAR-frame.
nuScenes: a relational JSON schema where sample_annotation records link 3D boxes to a multi-sensor sample and to each other across time. Built for full 360° multi-camera + LiDAR + radar scenes and proper tracking. The de-facto choice for modern AV datasets.
Waymo Open Dataset: protocol buffers with 7-DOF labels, per-object tracking IDs, and tightly synchronised sensor data. The most rigorous, and the heaviest to tool for.

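The conversion trap: yaw sign and coordinate frame are not the same across these formats. KITTI's rotation_y is a rotation about the camera's Y axis, which points down, while nuScenes stores orientation as a quaternion in a global frame with z pointing up. A dataset that looks perfect in KITTI can land every box at the wrong heading when naively converted to nuScenes. If you might train across formats, annotate in the richest schema you expect to need (usually nuScenes-style) and convert down.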
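As an illustration of how much convention hides in one number, the sketch below parses a KITTI object label line and converts its rotation_y into a yaw about an upward z axis. The field order follows the KITTI object label layout; the function names and the sample line are made up for this example, and the yaw conversion assumes an idealised axis alignment (camera: x right, y down, z forward; LiDAR: x forward, y left, z up) rather than a real calibration file.

```python
import math

KITTI_FIELDS = 15  # class, truncation, occlusion, alpha, 4 bbox, 3 dims, 3 loc, rotation_y


def parse_kitti_label(line):
    """Parse one ground-truth line of a KITTI object label file into a dict."""
    parts = line.split()
    if len(parts) < KITTI_FIELDS:
        raise ValueError(f"expected at least {KITTI_FIELDS} fields, got {len(parts)}")
    values = [float(v) for v in parts[1:15]]
    return {
        "type": parts[0],
        "truncated": values[0],
        "occluded": int(values[1]),
        "alpha": values[2],
        "bbox": values[3:7],        # left, top, right, bottom in pixels
        "height": values[7], "width": values[8], "length": values[9],
        "location": values[10:13],  # x, y, z in the camera frame, metres
        "rotation_y": values[13],   # yaw about the camera's downward Y axis
    }


def kitti_yaw_to_lidar_yaw(rotation_y):
    """Convert rotation_y to a yaw about an upward z axis (x forward, y left).

    Assumes idealised axis alignment between camera and LiDAR; a real pipeline
    would apply the dataset's calibration instead.
    """
    yaw = -rotation_y - math.pi / 2
    return (yaw + math.pi) % (2 * math.pi) - math.pi


# Made-up sample line: a car 12 m ahead, facing away from the camera.
sample = "Car 0.00 0 -1.57 600.0 170.0 620.0 200.0 1.50 1.80 4.50 0.00 1.70 12.00 -1.57"
label = parse_kitti_label(sample)
lidar_yaw = kitti_yaw_to_lidar_yaw(label["rotation_y"])
```

Note that KITTI lists dimensions as height, width, length, and that a car driving straight ahead has rotation_y near -π/2 but a LiDAR-frame yaw near 0 under these assumptions. Copying the number across unchanged rotates every box by a quarter turn.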

Sensor Fusion: Why LiDAR + Camera Beats Either Alone

You can annotate 3D cuboids from camera images alone — depth is estimated from object size and ground contact. It's cheaper and fine for many ADAS use cases. But the gold standard fuses LiDAR point clouds with camera imagery:

LiDAR gives measured depth and dimensions — the annotator snaps the cuboid to actual returns rather than guessing.
Camera gives semantics — is that cluster a parked car or a skip bin? The image disambiguates what the point cloud alone cannot.
Cross-validation — placing the box in LiDAR and confirming it in the projected camera view catches errors neither sensor catches alone.

Multi-sensor calibration has to be correct for this to work. If the extrinsics are off, the projected cuboid won't line up with the image and annotators will “correct” good LiDAR boxes to match a miscalibrated camera — quietly poisoning the dataset.
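The effect of a calibration error is easy to estimate with a pinhole projection. The sketch below is a simplified model, not any tool's implementation: the function name is hypothetical, the extrinsics are an idealised axis swap with no translation, the intrinsics (1000-pixel focal length, 1280 × 720 image centre) are invented, and lens distortion is ignored.

```python
import numpy as np


def project_to_image(points_lidar, lidar_to_cam, intrinsics):
    """Project (N, 3) LiDAR-frame points to pixels with a pinhole model.

    lidar_to_cam is a 4x4 extrinsic matrix, intrinsics a 3x3 camera matrix.
    Returns (N, 2) pixel coordinates and a boolean mask of points in front
    of the camera.
    """
    pts = np.asarray(points_lidar, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    cam = (lidar_to_cam @ homogeneous.T).T[:, :3]
    in_front = cam[:, 2] > 0
    pixels = (intrinsics @ cam.T).T
    with np.errstate(divide="ignore", invalid="ignore"):
        pixels = pixels[:, :2] / pixels[:, 2:3]
    return pixels, in_front


# Idealised extrinsics (an assumption): LiDAR x forward, y left, z up;
# camera x right, y down, z forward; no translation between sensors.
lidar_to_cam = np.array([[0.0, -1.0, 0.0, 0.0],
                         [0.0, 0.0, -1.0, 0.0],
                         [1.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 1.0]])
intrinsics = np.array([[1000.0, 0.0, 640.0],
                       [0.0, 1000.0, 360.0],
                       [0.0, 0.0, 1.0]])

point = np.array([[10.0, 0.0, 0.0]])  # 10 m straight ahead
pixel, visible = project_to_image(point, lidar_to_cam, intrinsics)

# Simulate a 1 degree yaw error in the extrinsics and measure the pixel shift.
angle = np.radians(1.0)
yaw_error = np.array([[np.cos(angle), -np.sin(angle), 0.0, 0.0],
                      [np.sin(angle), np.cos(angle), 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0]])
shifted, _ = project_to_image(point, lidar_to_cam @ yaw_error, intrinsics)
shift_px = float(np.linalg.norm(shifted[0] - pixel[0]))
```

Under these made-up parameters, a single degree of rotational error moves the projected point by roughly 17 pixels, which is easily enough for an annotator to see a mismatch and start "fixing" a box that was correct in the point cloud.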

Measuring Quality: 3D IoU, mAP, and Orientation Error

“Looks right” is not a metric. Production 3D datasets are QA'd against:

3D IoU: volumetric overlap between the annotated cuboid and a gold-standard box. Common thresholds are 0.7 for cars and 0.5 for pedestrians and cyclists, as in the KITTI benchmark.
mAP per class: mean Average Precision, computed separately for cars, pedestrians, cyclists, etc., because small/rare classes behave very differently.
Orientation error: the heading difference, reported as Average Orientation Similarity (KITTI) or Average Orientation Error (nuScenes). This is the metric amateur datasets fail.
Translation & scale error: how far off the centre and dimensions are — the nuScenes ATE/ASE components.
Inter-annotator agreement: have multiple annotators box the same gold set and measure consistency. Below a project-defined threshold, the protocol — not the annotators — needs fixing.
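For yaw-only cuboids, 3D IoU reduces to the overlap of two rotated rectangles in the top-down view multiplied by the vertical overlap. The sketch below uses Sutherland-Hodgman polygon clipping for the footprint, which is one common choice and not necessarily what any particular benchmark toolkit does. Boxes are assumed to be (x, y, z, l, w, h, yaw) tuples with z at the box centre; all function names are hypothetical.

```python
import math


def bev_corners(x, y, length, width, yaw):
    """Counter-clockwise footprint corners of a yaw-rotated box."""
    c, s = math.cos(yaw), math.sin(yaw)
    offsets = ((length / 2, width / 2), (-length / 2, width / 2),
               (-length / 2, -width / 2), (length / 2, -width / 2))
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in offsets]


def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip a convex polygon against a convex CCW polygon."""
    def inside(p, a, b):
        # True when p lies left of (or on) the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersection(p, q, a, b):
        # Point where segment p-q crosses the infinite line through a-b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j - 1], polygon[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersection(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersection(p, q, a, b))
        if not output:
            break
    return output


def polygon_area(points):
    """Shoelace formula; returns 0 for degenerate polygons."""
    if len(points) < 3:
        return 0.0
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i - 1]
        x2, y2 = points[i]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def iou_3d(box_a, box_b):
    """3D IoU of two yaw-only boxes given as (x, y, z, l, w, h, yaw)."""
    xa, ya, za, la, wa, ha, yaw_a = box_a
    xb, yb, zb, lb, wb, hb, yaw_b = box_b
    footprint = polygon_area(clip_polygon(bev_corners(xa, ya, la, wa, yaw_a),
                                          bev_corners(xb, yb, lb, wb, yaw_b)))
    overlap_z = max(0.0, min(za + ha / 2, zb + hb / 2) - max(za - ha / 2, zb - hb / 2))
    volume = footprint * overlap_z
    return volume / (la * wa * ha + lb * wb * hb - volume)


gold = (12.0, 0.0, 0.75, 4.5, 1.8, 1.5, 0.0)
shifted = (12.5, 0.0, 0.75, 4.5, 1.8, 1.5, 0.0)          # 0.5 m too far ahead
rotated = (12.0, 0.0, 0.75, 4.5, 1.8, 1.5, math.pi / 2)  # 90 degrees off
iou_shifted = iou_3d(gold, shifted)
iou_rotated = iou_3d(gold, rotated)
```

For this sedan-sized box, a half-metre shift still scores about 0.8, while a 90° heading error drops IoU to about 0.25. IoU does not change at all when a box is flipped by 180°, which is why orientation has to be tracked as its own metric.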

Demand metric reporting on every delivery batch, not just a final-acceptance check. Quality drifts over a long project; per-batch IoU and orientation reporting catches the drift while it's cheap to fix.
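A per-batch report can be as simple as a mean heading error compared against a limit. The sketch below is one hypothetical shape for such a report: the function names, the batch labels, and the 5° default limit are placeholders, and the right threshold is project-specific.

```python
import math
from statistics import mean


def heading_error(yaw_a, yaw_b):
    """Smallest absolute difference between two yaw angles, in radians (0..pi)."""
    diff = (yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


def batch_orientation_report(batches, max_mean_error_deg=5.0):
    """Mean heading error per batch, flagged against a project-defined limit.

    batches maps a batch name to a list of (annotated_yaw, gold_yaw) pairs.
    """
    report = {}
    for name, pairs in batches.items():
        error_deg = math.degrees(mean(heading_error(a, g) for a, g in pairs))
        report[name] = {"mean_error_deg": error_deg,
                        "flagged": error_deg > max_mean_error_deg}
    return report


# Illustrative data: batch_07 contains one box flipped front-to-back.
batches = {
    "batch_01": [(0.0, 0.02), (1.57, 1.55)],
    "batch_07": [(0.0, math.pi), (0.1, 0.1)],
}
report = batch_orientation_report(batches)
```

A single flipped box is enough to push the second batch's mean error to 90°, so the drift shows up in the batch where it happens rather than at final acceptance.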

Where 3D Cuboids Get Used

Autonomous vehicles & ADAS: vehicle, pedestrian, and cyclist detection for perception and planning — the largest use case by volume.
Robotics & warehouse automation: pick-and-place, autonomous forklifts, and pallet/parcel localisation where grasp pose depends on object orientation.
Drones & aerial: obstacle and asset detection, often needing full 9-DOF orientation.
Construction & mining: heavy-equipment localisation and site safety zones.
Smart cities & traffic: intersection monitoring and flow analysis from fixed LiDAR/camera rigs.

What It Costs — and How to Scope It

3D cuboid annotation is priced per object. The cost drivers, in order of impact: whether you need cross-frame tracking (sequences cost far more than single frames), sensor setup (fused LiDAR-camera > LiDAR > camera-only), occlusion density (crowded urban scenes are slower), and the number of classes and attributes.

The only reliable way to scope a 3D project is a pilot on a representative slice of your data — ideally including your hardest scenes (night, rain, dense traffic). A vendor who quotes a firm per-object rate sight-unseen is guessing. See our data annotation pricing guide for how the per-object maths works across task types.
