What is bounding box annotation?

Bounding box annotation is drawing a rectangle around an object in an image or video frame so an AI model can learn to detect that object. Each box is four numbers — top-left x, top-left y, width, height — plus a class label. It is the most common and most cost-effective form of image annotation, and the foundation of nearly every object-detection model in production.

What is a tight bounding box vs a loose one?

A tight bounding box has its edges touching the actual object boundary — no padding, no gap. A loose box has whitespace between the box edge and the object. Tight boxes train better models because the network learns 'object pixels are inside the box, background pixels are outside'; loose boxes teach it to include background as part of the object. The rule on every project we run — boxes hug the object.

What is an oriented bounding box?

An oriented (or rotated) bounding box adds a rotation angle so the box can tilt with the object. Standard axis-aligned boxes are always upright; an oriented box can lean. This matters for aerial and satellite imagery (planes, ships, cars from above), text detection (rotated signs and labels), and any scene where objects appear at arbitrary angles. The annotation cost is moderately higher; the accuracy gain on rotated objects is large.

When should you use a bounding box vs polygon vs segmentation vs cuboid?

Bounding box — detection and counting in 2D, where rough position is enough. Polygon — when the object shape matters (logos, garments, irregular product photos). Segmentation — pixel-exact (lane surface, organ outlines, fashion try-on). 3D cuboid — when depth and orientation matter (autonomous driving, robotics). The trap teams fall into is over-spec'ing — using segmentation when boxes would be fine doubles cost for no model gain.

What formats are used for bounding box annotation?

Three dominate. COCO JSON — the standard for general object detection, used by most modern frameworks. YOLO TXT — one file per image, normalised coordinates, perfect for YOLOv5/v8/v9 training. Pascal VOC XML — older, still common in academia and some industrial pipelines. Lock the target format before labelling; mid-project conversion is annoying but doable, mid-project schema changes are painful.

How much does bounding box annotation cost?

Pricing is per box (or per image with a box-count cap). For straightforward axis-aligned boxes on clear imagery the rate is low. The cost moves with class count, occlusion density (crowded scenes are slower), oriented vs axis-aligned (oriented adds time per box), tight-box discipline (tighter takes longer), and tracking (video tracking across frames is much more than single-frame). The honest way to scope it is a pilot on your hardest data, not your easy data.

What's the difference between 2D bounding boxes and 3D cuboids?

A 2D box has four values (position and size on the image). A 3D cuboid has seven (3D position, 3D dimensions, plus heading angle) and lives in real-world coordinates. 2D is enough when you just need 'where on screen'; 3D is required when the model needs distance, scale, or direction — collision avoidance, path planning, robot grasping. They are not interchangeable for autonomous-driving perception. See our 3D cuboid annotation guide for the full picture.
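To make the four-versus-seven count concrete, here is a minimal Python sketch of what a cuboid label might look like as a record. The field names, the axis convention (x forward, y left, z up, in metres) and the `ground_range` helper are our own assumptions for illustration; every real dataset defines its own.

```python
import math
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Cuboid:
    """Hypothetical 3D cuboid label: seven numbers plus a class."""
    x: float        # centre position in metres (assumed: forward)
    y: float        # centre position in metres (assumed: left)
    z: float        # centre position in metres (assumed: up)
    length: float   # size along the heading direction, metres
    width: float    # size across the heading direction, metres
    height: float   # vertical size, metres
    yaw: float      # heading angle in radians about the vertical axis
    label: str      # class name

    def ground_range(self) -> float:
        """Horizontal distance from the sensor origin to the cuboid centre."""
        return math.hypot(self.x, self.y)


car = Cuboid(x=12.0, y=-3.0, z=0.9, length=4.5, width=1.8, height=1.5,
             yaw=0.1, label="car")

# Count the numeric fields: position (3) + dimensions (3) + heading (1).
numeric_fields = [f.name for f in fields(car) if f.type is float]
print(len(numeric_fields), round(car.ground_range(), 2))
```

A 2D box has nothing like `ground_range`: it can say where the car sits on screen, but not how many metres away it is.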

How is bounding box annotation quality measured?

Three ways. IoU (intersection over union) against a gold-standard box, usually thresholded at 0.5 for general object detection and 0.7 for vehicles and tight detection. Per-class Average Precision (AP) on a held-out set, with mAP as the mean of those per-class figures. Inter-annotator agreement on the gold set. The thing amateur dashboards skip: per-class accuracy, broken out. A 99% overall accuracy can hide a 60% accuracy on a rare critical class.

Bounding Box Annotation: What It Is, When To Use It, What It Costs (2026 Guide)

If you've ever trained an object detector, you've trained on bounding boxes. They're the workhorse of computer vision — a simple rectangle around an object, four numbers and a class. Easy to explain, easy to draw, easy to undercost, and absolutely everywhere — from the parking-camera ANPR in your local Westfield to the YOLO model behind every Aussie startup's “count something on a video” demo.

This guide is the one we wish we'd had handed to us early on — when to use a box, when not to, how to spec the work so you don't pay twice, and the things we see go wrong every quarter on incoming projects. No vendor brochure energy. Just what actually matters.

What a Bounding Box Actually Is

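A bounding box is four numbers and a class label. That's it. The four numbers are usually top-left x, top-left y, width and height; some formats store the same rectangle as two corners, or as a centre point plus size. The box says “an object of this class is somewhere in this rectangle”. The model learns to predict the rectangle and the class.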

The simplicity is the point. Boxes don't encode shape, depth, or orientation. They're a cheap, fast way of saying “here, in this part of the image” — and for most detection tasks that's genuinely all the model needs. The art is knowing when it isn't enough.
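As a rough illustration of how little a box carries, here is a minimal Python sketch of one as a record. The `Box` class and its field names are hypothetical, chosen for this example, and the coordinates are assumed to be pixels measured from the top-left of the image.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: top-left corner, size, and a class label."""
    x: float       # top-left x, pixels
    y: float       # top-left y, pixels
    width: float   # pixels
    height: float  # pixels
    label: str     # class name

    @property
    def area(self) -> float:
        return self.width * self.height

    def corners(self) -> Tuple[float, float, float, float]:
        """Same rectangle as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


car = Box(x=40, y=60, width=120, height=80, label="car")
print(car.area, car.corners())
```

Nothing in the record describes shape, depth, or rotation, which is exactly the limit the rest of this guide keeps coming back to.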

The Four Flavours of Box You'll Actually See

Lump “bounding box annotation” into one bucket and you'll quote the wrong price every time. There are four meaningfully different jobs:

Axis-aligned 2D box. The classic. Always upright, four values. The default for general object detection (COCO, OpenImages, every YOLO tutorial you've seen). Cheapest and fastest.
Tight 2D box. Same shape, but disciplined — edges hug the object, no whitespace padding. Costs a bit more per box because the annotator works to a stricter standard. Trains noticeably better models. Worth it on almost any production project.
Oriented (rotated) 2D box. Adds a rotation angle so the box tilts with the object. The right call for aerial/satellite imagery, rotated text, and any “objects at arbitrary angles” problem. Costs more per box because rotation is one more thing to get right.
3D cuboid. Not really a 2D box at all: it's a box in real-world space described by seven values (3D position, 3D dimensions, heading angle). Required when the model needs distance, scale, and heading. We covered this in depth in the 3D cuboid guide.

Picking between them — match the box to the question. “Is the car there?” — axis-aligned. “How close is the car?” — cuboid. “Where exactly is the painted lane on a rotated drone shot?” — oriented. Spending money on a more complex box than you need is one of the most common ways to blow an annotation budget.
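To see why oriented boxes earn their extra cost on angled objects, the sketch below rotates a long, thin box and compares it with the upright box needed to contain it. It is a minimal Python illustration, not any tool's implementation: the function names are ours, and the angle convention (degrees, rotation about the box centre) is an assumption, since annotation formats differ on this.

```python
import math


def obb_corners(cx, cy, w, h, angle_deg):
    """Corner points of a w-by-h box centred at (cx, cy), rotated about its centre."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in offsets]


def enclosing_box(points):
    """Smallest axis-aligned (x_min, y_min, x_max, y_max) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical ship seen from above: 100 x 20 pixels, lying at 45 degrees.
corners = obb_corners(cx=300, cy=200, w=100, h=20, angle_deg=45)
x0, y0, x1, y1 = enclosing_box(corners)

oriented_area = 100 * 20
upright_area = (x1 - x0) * (y1 - y0)
ratio = upright_area / oriented_area
print(round(ratio, 2))
```

For this made-up ship the upright box comes out several times larger than the oriented one, and all of that extra area is water.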

Tight Boxes: The Quiet Quality Lever

Here's a thing that doesn't make it into vendor brochures — the difference between a model trained on tight boxes and one trained on loose, padded boxes can be five to ten points of mAP. Same number of labels. Same data. Just whether the annotator hugged the object or splashed whitespace around it.

Why — because a loose box teaches the network that “everything inside this rectangle is the object”, which includes background pixels. The model learns to include background, then over-predicts, then drops in precision. Every dataset we audit where the box edges have a visible halo of background is a dataset that trained a weaker model than it should have. Make tight boxes a non-negotiable line in your spec.
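A quick way to get a feel for how much background a halo adds is to compare areas. The Python sketch below is a minimal illustration with hypothetical numbers: it derives the tight box from an object outline, pads it by a fixed margin, and reports what share of the padded box is pure padding. The helper names and the 10-pixel margin are our own choices.

```python
def tight_box(points):
    """Tightest (x_min, y_min, x_max, y_max) around an outline."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def pad(box, margin):
    """Grow a box by the same margin on every side (a 'loose' box)."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


# Hypothetical object outline in pixel coordinates.
outline = [(100, 50), (180, 60), (200, 120), (110, 130)]

tight = tight_box(outline)
loose = pad(tight, margin=10)
padding_share = 1 - area(tight) / area(loose)
print(tight, loose, round(padding_share, 3))
```

On this 100 x 80 pixel object, a 10-pixel halo means roughly a third of the loose box is padding. The smaller the object, the worse that share gets for the same margin.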

When to Pick Something Other Than a Box

Boxes aren't always the right tool. Three honest signals you need something else:

The object shape matters → use a polygon. Garments on a model, irregular fruit, branded logos with weird outlines. The box would include 40% background; the polygon traces what the model actually needs.
You need pixel-exact area → use segmentation. Lane surface for autonomous driving, organ outlines for medical imaging, fashion try-on masks. Boxes can't express area; segmentation does.
You need depth or orientation → use a 3D cuboid. Self-driving perception, robot grasping, drone obstacle avoidance. A 2D box gives you screen position; the cuboid gives you metres.

The other direction is just as common — using segmentation when boxes were enough. Pixel-level labelling on a “is the car there” task is double the cost for almost no model gain. We talk teams down from over-spec'ing every week.
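One way to check the first signal on your own data is to measure how much of the box a traced outline actually fills. The sketch below is a minimal Python example using the shoelace formula for polygon area; the L-shaped outline is made up, and the function names are ours.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


def box_area(points):
    """Area of the tight axis-aligned box around the same points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical L-shaped object traced as a polygon, in pixels.
outline = [(0, 0), (100, 0), (100, 40), (40, 40), (40, 100), (0, 100)]

background_share = 1 - polygon_area(outline) / box_area(outline)
print(round(background_share, 2))
```

If a sample of your objects shows a large background share, that argues for polygons. If the share is small, boxes are probably enough.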

Formats: COCO, YOLO, Pascal VOC — Pick Before You Start

COCO JSON. The current general-purpose default. Used by most modern detection frameworks, supports multi-class images, segmentation, and keypoints alongside boxes. Choose this unless you have a reason not to.
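YOLO TXT. One text file per image, one line per object: a class index followed by centre x, centre y, width and height, all normalised to 0–1 by the image size. Native for YOLOv5/v8/v9 training pipelines. Conversion to and from COCO is well-supported.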
Pascal VOC XML. Older but still alive — common in academic projects and some industrial pipelines that haven't migrated. Verbose, well-understood, supported everywhere.

All three are convertible, but the conversions are not all equally easy. Lock the target format on day one. Teams that don't can end up paying for a couple of hundred labelling hours to re-convert mid-project, money that should have stayed in the budget.
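The coordinate side of conversion is simple arithmetic, sketched below in Python. It assumes the usual conventions: COCO boxes as top-left x, y, width, height in pixels; YOLO boxes as centre x, centre y, width, height normalised by image size; Pascal VOC boxes as x_min, y_min, x_max, y_max in pixels. The function names are ours, and the sketch ignores details such as class-ID mapping and any one-pixel offset a particular tool may apply, which is usually where the real conversion effort goes.

```python
def coco_to_yolo(box, img_w, img_h):
    """(x, y, w, h) in pixels -> normalised (cx, cy, w, h)."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco(box, img_w, img_h):
    """Normalised (cx, cy, w, h) -> (x, y, w, h) in pixels."""
    cx, cy, w, h = box
    box_w, box_h = w * img_w, h * img_h
    return (cx * img_w - box_w / 2, cy * img_h - box_h / 2, box_w, box_h)


def coco_to_voc(box):
    """(x, y, w, h) -> (x_min, y_min, x_max, y_max), both in pixels."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


# Hypothetical box on a 640 x 480 image.
coco = (40, 60, 120, 80)
yolo = coco_to_yolo(coco, img_w=640, img_h=480)
voc = coco_to_voc(coco)
round_trip = yolo_to_coco(yolo, img_w=640, img_h=480)
print(yolo, voc, round_trip)
```

Note that the YOLO conversion needs the image dimensions, so a converter has to read every image (or its metadata), not just the label files.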

Quality: IoU, mAP, And The Number Vendors Don't Want You To See

For boxes, quality is measured against a gold-standard set:

IoU (Intersection over Union): how much the annotator's box overlaps the gold-standard box. Threshold 0.5 for general object detection, 0.7 for vehicles and tight-discipline projects.
AP per class: Average Precision, computed separately for each class, with mAP as the mean across classes. Cars, pedestrians, cyclists, rare classes are all reported individually, never only as the combined figure.
Inter-annotator agreement on the gold set, plus per-batch double-annotation. The maths is well-covered in our annotation QA playbook.
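For reference, IoU itself is only a few lines. The Python sketch below assumes boxes given as (x_min, y_min, x_max, y_max) in pixels; the two example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so boxes that do not overlap give an empty intersection.
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gold = (100, 100, 200, 200)    # gold-standard box
drawn = (110, 110, 210, 210)   # annotator's box, shifted 10 px on each axis

score = iou(gold, drawn)
passes_general = score >= 0.5
passes_strict = score >= 0.7
print(round(score, 3), passes_general, passes_strict)
```

A 10-pixel shift on a 100-pixel object is enough to pass the 0.5 threshold and fail the 0.7 one, which is why the threshold belongs in the spec rather than in someone's head.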

The number vendors love is “overall accuracy”. The number that actually predicts model performance is per-class accuracy on rare and edge cases. If a delivery report shows one combined number and no class breakdown, that's the vendor hiding the bit that matters. Ask. If they push back, that tells you something.
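Here is a small Python sketch of how a single combined number hides a weak class. The records are invented for illustration: each one is a class label plus whether the annotator's box matched the gold box at whatever IoU threshold the project uses.

```python
from collections import defaultdict


def per_class_accuracy(records):
    """Share of matched boxes for each class label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, matched in records:
        totals[label] += 1
        hits[label] += int(matched)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical delivery: 980 car boxes, 20 cyclist boxes.
records = (
    [("car", True)] * 978 + [("car", False)] * 2
    + [("cyclist", True)] * 12 + [("cyclist", False)] * 8
)

overall = sum(matched for _, matched in records) / len(records)
by_class = per_class_accuracy(records)
print(round(overall, 3), {k: round(v, 3) for k, v in by_class.items()})
```

The combined figure rounds to 99%, while the cyclist class sits at 60%. Only the breakdown shows it.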

What It Costs — And What You're Actually Paying For

Bounding box annotation is priced per object (or per image with a box-count cap). The real cost drivers, in order:

Box type — axis-aligned cheap, tight a bit more, oriented more again, cuboid more again.
Class count and class similarity — distinguishing sedan vs hatchback is slower than car vs truck.
Occlusion density — a crowded Sydney CBD frame at 5pm is genuinely several times slower than an empty highway frame.
Single frame vs video tracking — keeping the same object ID across 200 frames is not the same job as boxing 200 frames independently. Much more expensive.
Quality bar — 90% IoU vs 70% IoU is a different job.

Most teams get burned because they quote at the easy-frame rate and pay at the hard-frame rate. The honest scoping move is a paid or free pilot on your hardest data, not your easiest. Anything else is a guess dressed up as a quote. For broader pricing context, see our data annotation pricing guide.
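The easy-frame versus hard-frame gap is easy to underestimate, so here is a back-of-envelope Python sketch. Every number in it is hypothetical (frame counts, boxes per frame, seconds per box), and labelling hours are used as a stand-in for cost; a pilot is what replaces these guesses with measured values.

```python
def labelling_hours(frames, boxes_per_frame, seconds_per_box):
    """Total annotator hours for a batch of frames."""
    return frames * boxes_per_frame * seconds_per_box / 3600


TOTAL_FRAMES = 10_000

# Hypothetical profiles: share of the dataset, box density, time per box.
EASY = {"share": 0.7, "boxes_per_frame": 5, "seconds_per_box": 6}
HARD = {"share": 0.3, "boxes_per_frame": 40, "seconds_per_box": 15}

# Quote built as if every frame looked like the easy ones.
quoted = labelling_hours(TOTAL_FRAMES, EASY["boxes_per_frame"],
                         EASY["seconds_per_box"])

# Effort once the hard frames are counted at their own density and speed.
actual = sum(
    labelling_hours(TOTAL_FRAMES * p["share"], p["boxes_per_frame"],
                    p["seconds_per_box"])
    for p in (EASY, HARD)
)

print(round(quoted, 1), round(actual, 1), round(actual / quoted, 1))
```

With these made-up inputs the real effort is several times the easy-frame quote, and almost all of it comes from the 30% of frames that are crowded.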

Where Bounding Boxes Get Used

Autonomous driving and ADAS — vehicle, pedestrian and cyclist detection (paired with cuboids for the 3D side).
Retail and e-commerce — product detection, shelf monitoring, planogram compliance, automatic checkout.
Security and surveillance — people and vehicle counting, intrusion detection, ANPR.
Agriculture — fruit counting, livestock identification, pest detection. We covered the agriculture angle in detail in the agriculture data annotation guide.
Aerial / drone / satellite — vehicle and asset detection from above (almost always oriented boxes, not axis-aligned).
Manufacturing — defect detection, asset tracking, safety zone monitoring.
