If you've ever trained an object detector, you've trained on bounding boxes. They're the workhorse of computer vision — a simple rectangle around an object, four numbers and a class. Easy to explain, easy to draw, easy to undercost, and absolutely everywhere — from the parking-camera ANPR in your local Westfield to the YOLO model behind every Aussie startup's “count something on a video” demo.
This guide is the one we wish we'd had handed to us early on — when to use a box, when not to, how to spec the work so you don't pay twice, and the things we see go wrong every quarter on incoming projects. No vendor brochure energy. Just what actually matters.
What a Bounding Box Actually Is
A bounding box is four numbers — top-left x, top-left y, width, height — and a class label. That's it. The box says “an object of this class is somewhere in this rectangle”. The model learns to predict the rectangle and the class.
The simplicity is the point. Boxes don't encode shape, depth, or orientation. They're a cheap, fast way of saying “here, in this part of the image” — and for most detection tasks that's genuinely all the model needs. The art is knowing when it isn't enough.
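To make the "four numbers and a class" idea concrete, here is a minimal sketch of a box record and two common re-expressions of it. The `Box` type and the helper names are illustrative assumptions, not part of any particular library.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: top-left x, top-left y, width, height (pixels) plus a class."""
    x: float
    y: float
    w: float
    h: float
    label: str


def to_corners(box: Box) -> tuple:
    """Re-express the box as (x_min, y_min, x_max, y_max)."""
    return (box.x, box.y, box.x + box.w, box.y + box.h)


def to_centre(box: Box) -> tuple:
    """Re-express the box as (centre_x, centre_y, width, height)."""
    return (box.x + box.w / 2, box.y + box.h / 2, box.w, box.h)


car = Box(x=40, y=60, w=200, h=100, label="car")
print(to_corners(car))  # (40, 60, 240, 160)
print(to_centre(car))   # (140.0, 110.0, 200, 100)
```

All three forms carry the same information. Which one a tool expects is a convention, so it is worth writing down in the spec.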
The Four Flavours of Box You'll Actually See
Lump “bounding box annotation” into one bucket and you'll quote the wrong price every time. There are four meaningfully different jobs:
- Axis-aligned 2D box. The classic. Always upright, four values. The default for general object detection (COCO, OpenImages, every YOLO tutorial you've seen). Cheapest and fastest.
- Tight 2D box. Same shape, but disciplined — edges hug the object, no whitespace padding. Costs a bit more per box because the annotator works to a stricter standard. Trains noticeably better models. Worth it on almost any production project.
- Oriented (rotated) 2D box. Adds a rotation angle so the box tilts with the object. The right call for aerial/satellite imagery, rotated text, and any “objects at arbitrary angles” problem. Costs more per box because rotation is one more thing to get right.
- 3D cuboid. Not really a 2D box at all — it's a box in real-world space with 7 degrees of freedom. Required when the model needs distance, scale, and heading. We covered this in depth in the 3D cuboid guide.
Picking between them — match the box to the question. “Is the car there?” — axis-aligned. “How close is the car?” — cuboid. “Where exactly is the painted lane on a rotated drone shot?” — oriented. Spending money on a more complex box than you need is one of the most common ways to blow an annotation budget.
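A rough sketch of why oriented boxes earn their extra cost on angled objects: it computes the corners of a rotated box, then the upright box needed to enclose it. The (centre, width, height, angle in degrees) parameterisation and the function names are assumptions for illustration; annotation tools differ on angle conventions.

```python
import math


def obb_corners(cx, cy, w, h, angle_deg):
    """Corner points of a box of size w x h, rotated by angle_deg about its centre."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in half_offsets
    ]


def enclosing_aabb(corners):
    """Smallest upright box (x_min, y_min, x_max, y_max) containing all corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A long, thin object (say a truck seen from above) at 45 degrees.
corners = obb_corners(cx=0, cy=0, w=100, h=20, angle_deg=45)
x_min, y_min, x_max, y_max = enclosing_aabb(corners)
aabb_area = (x_max - x_min) * (y_max - y_min)
fill = (100 * 20) / aabb_area
print(f"object fills {fill:.1%} of the upright box")  # object fills 27.8% of the upright box
```

In this example nearly three quarters of the upright box is background, which is the situation oriented boxes are meant to avoid.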
Tight Boxes: The Quiet Quality Lever
Here's a thing that doesn't make it into vendor brochures — the difference between a model trained on tight boxes and one trained on loose, padded boxes can be five to ten points of mAP. Same number of labels. Same data. Just whether the annotator hugged the object or splashed whitespace around it.
Why — because a loose box teaches the network that “everything inside this rectangle is the object”, which includes background pixels. The model learns to include background, then over-predicts, then drops in precision. Every dataset we audit where the box edges have a visible halo of background is a dataset that trained a weaker model than it should have. Make tight boxes a non-negotiable line in your spec.
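To put a number on the halo, the sketch below pads a tight box by a fixed margin and measures how much of the padded box is background. The coordinates are made up for illustration, and the helper names are assumptions.

```python
def area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def pad(box, margin):
    """Grow a box by `margin` pixels on every side, as a loose annotation would."""
    x_min, y_min, x_max, y_max = box
    return (x_min - margin, y_min - margin, x_max + margin, y_max + margin)


tight = (100, 100, 200, 200)   # 100 x 100 px object, edges hugging it
loose = pad(tight, 10)         # same object with a 10 px halo

# The tight box sits entirely inside the loose one, so the intersection is
# the tight box and the union is the loose box.
iou_against_tight = area(tight) / area(loose)
background_share = 1 - iou_against_tight

print(f"IoU vs tight box: {iou_against_tight:.3f}")  # IoU vs tight box: 0.694
print(f"background share: {background_share:.3f}")   # background share: 0.306
```

A 10-pixel halo on a 100-pixel object already means about 30% of the labelled region is background, and the effect gets worse as objects get smaller.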
When to Pick Something Other Than a Box
Boxes aren't always the right tool. Three honest signals you need something else:
- The object shape matters → use a polygon. Garments on a model, irregular fruit, branded logos with weird outlines. The box would include 40% background; the polygon traces what the model actually needs.
- You need pixel-exact area → use segmentation. Lane surface for autonomous driving, organ outlines for medical imaging, fashion try-on masks. Boxes can't express area; segmentation does.
- You need depth or orientation → use a 3D cuboid. Self-driving perception, robot grasping, drone obstacle avoidance. A 2D box gives you screen position; the cuboid gives you metres.
The other direction is just as common — using segmentation when boxes were enough. Pixel-level labelling on a “is the car there” task is double the cost for almost no model gain. We talk teams down from over-spec'ing every week.
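One quick way to check whether shape matters for a class is to trace a few sample objects as polygons and measure how much of each bounding box is background. The sketch below uses the shoelace formula for polygon area; the triangle is a made-up example and the function names are assumptions.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def bounding_box_area(points):
    """Area of the upright box that encloses every polygon vertex."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def background_fraction(points):
    """Share of the bounding box that is not covered by the polygon."""
    return 1 - polygon_area(points) / bounding_box_area(points)


triangle = [(0, 0), (100, 0), (0, 100)]
print(f"{background_fraction(triangle):.0%} of the box is background")  # 50% of the box is background
```

If that fraction is consistently high for a class and the model needs the outline, a polygon is probably the better fit. If it is low, a box is likely enough.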
Formats: COCO, YOLO, Pascal VOC — Pick Before You Start
- COCO JSON. The current general-purpose default. Used by most modern detection frameworks, supports multi-class images, segmentation, and keypoints alongside boxes. Choose this unless you have a reason not to.
- YOLO TXT. One file per image, one line per box: class ID, then centre x, centre y, width and height, all normalised to 0–1 against the image size. Native for YOLOv5/v8/v9 training pipelines. Conversion to and from COCO is well-supported.
- Pascal VOC XML. Older but still alive — common in academic projects and some industrial pipelines that haven't migrated. Verbose, well-understood, supported everywhere.
All three are convertible, but the conversions are not equally easy. Lock the target format on day one. Teams that don't can end up paying for a couple of hundred labelling hours to re-convert mid-project, money that should have stayed in the budget.
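For the box coordinates themselves, the conversions are short. The sketch below assumes the usual conventions: COCO stores `[x, y, width, height]` in pixels from the top-left corner, YOLO stores a normalised centre plus width and height, and Pascal VOC stores `xmin, ymin, xmax, ymax` in pixels. The function names are illustrative. Real conversions also have to carry class maps, image sizes and file layout, which is where most of the effort goes.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """COCO [x, y, w, h] in pixels -> YOLO (cx, cy, w, h) normalised to 0-1."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco(bbox, img_w, img_h):
    """YOLO (cx, cy, w, h) normalised -> COCO [x, y, w, h] in pixels."""
    cx, cy, w, h = bbox
    box_w, box_h = w * img_w, h * img_h
    return (cx * img_w - box_w / 2, cy * img_h - box_h / 2, box_w, box_h)


def coco_to_voc(bbox):
    """COCO [x, y, w, h] -> VOC-style (xmin, ymin, xmax, ymax) in pixels."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


coco_box = (40, 60, 200, 100)          # made-up box in a 640 x 480 image
yolo_box = coco_to_yolo(coco_box, 640, 480)
yolo_line = "0 " + " ".join(f"{v:.6f}" for v in yolo_box)  # class ID 0 is a placeholder

print(yolo_line)              # 0 0.218750 0.229167 0.312500 0.208333
print(coco_to_voc(coco_box))  # (40, 60, 240, 160)
```

Some pipelines treat VOC pixel coordinates as 1-based rather than 0-based, so check which convention yours expects before converting in bulk.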
Quality: IoU, mAP, And The Number Vendors Don't Want You To See
For boxes, quality is measured against a gold-standard set:
- IoU (Intersection over Union) — how much the annotator's box overlaps the gold-standard box. Threshold 0.5 for general object detection, 0.7 for vehicles and tight-discipline projects.
- AP per class (and mAP overall): Average Precision is computed separately for each class, and mAP is the mean of those per-class values. Cars, pedestrians, cyclists, rare classes: all reported individually, never only as the combined figure.
- Inter-annotator agreement on the gold set, plus per-batch double-annotation. The maths is well-covered in our annotation QA playbook.
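The IoU check against a gold-standard box can be sketched in a few lines. Boxes here are assumed to be `(x_min, y_min, x_max, y_max)` tuples in pixels, and the example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


gold = (100, 100, 200, 200)
annotated = (110, 110, 210, 210)   # same size, shifted 10 px right and down

score = iou(gold, annotated)
print(f"IoU = {score:.3f}")            # IoU = 0.681
print("passes 0.5:", score >= 0.5)     # passes 0.5: True
print("passes 0.7:", score >= 0.7)     # passes 0.7: False
```

A 10-pixel shift on a 100-pixel object clears the 0.5 threshold comfortably but fails 0.7, which shows how much stricter the higher bar is in practice.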
The number vendors love is “overall accuracy”. The number that actually predicts model performance is per-class accuracy on rare and edge cases. If a delivery report shows one combined number and no class breakdown, that's the vendor hiding the bit that matters. Ask. If they push back, that tells you something.
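Here is a small sketch of how a single combined number can hide a weak class. It takes IoU scores that have already been computed against the gold set and reports the pass rate overall and per class. The records and the `pass_rates` helper are made up for illustration.

```python
from collections import defaultdict


def pass_rates(records, threshold):
    """Return (overall pass rate, {class: pass rate}) for (class, IoU) records."""
    if not records:
        return 0.0, {}
    totals = defaultdict(int)
    passed = defaultdict(int)
    for label, score in records:
        totals[label] += 1
        if score >= threshold:
            passed[label] += 1
    per_class = {label: passed[label] / totals[label] for label in totals}
    overall = sum(passed.values()) / len(records)
    return overall, per_class


# Hypothetical gold-set results: a common class done well, a rare class done badly.
records = [("car", 0.9)] * 18 + [("cyclist", 0.8), ("cyclist", 0.4)]

overall, per_class = pass_rates(records, threshold=0.7)
print(f"overall: {overall:.0%}")   # overall: 95%
for label, rate in per_class.items():
    print(f"{label}: {rate:.0%}")  # car: 100%, then cyclist: 50%
```

The combined figure looks healthy at 95% while half the cyclist boxes fail, which is exactly what a per-class breakdown is there to expose.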
What It Costs — And What You're Actually Paying For
Bounding box annotation is priced per object (or per image with a box-count cap). The real cost drivers, in order:
- Box type — axis-aligned cheap, tight a bit more, oriented more again, cuboid more again.
- Class count and class similarity — distinguishing sedan vs hatchback is slower than car vs truck.
- Occlusion density — a crowded Sydney CBD frame at 5pm is genuinely several times slower than an empty highway frame.
- Single frame vs video tracking — keeping the same object ID across 200 frames is not the same job as boxing 200 frames independently. Much more expensive.
- Quality bar — 90% IoU vs 70% IoU is a different job.
Most teams get burned because they quote at the easy-frame rate and pay at the hard-frame rate. The honest scoping move is a paid or free pilot on your hardest data, not your easiest. Anything else is a guess dressed up as a quote. For broader pricing context, see our data annotation pricing guide.
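The gap between an easy-frame quote and a realistic one is simple arithmetic once a pilot has given you timings. Every number below is a placeholder for illustration, not a real rate or benchmark, and `labelling_hours` is a hypothetical helper.

```python
def labelling_hours(frame_count, mix, seconds_per_frame):
    """Estimated labelling hours for a dataset.

    mix: {frame_kind: share of dataset}, shares summing to 1.
    seconds_per_frame: {frame_kind: measured seconds per frame from a pilot}.
    """
    total_seconds = sum(
        frame_count * share * seconds_per_frame[kind] for kind, share in mix.items()
    )
    return total_seconds / 3600


# Placeholder pilot timings: swap in your own measurements.
timings = {"easy": 20, "hard": 90}

easy_only = labelling_hours(10_000, {"easy": 1.0}, timings)
blended = labelling_hours(10_000, {"easy": 0.7, "hard": 0.3}, timings)

print(f"quoted on easy frames: {easy_only:.1f} h")  # quoted on easy frames: 55.6 h
print(f"with 30% hard frames:  {blended:.1f} h")    # with 30% hard frames:  113.9 h
```

With these placeholder figures, a 30% share of hard frames roughly doubles the work, which is why piloting on the hardest data gives the more honest estimate.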
Need bounding boxes done properly?
Free 50-image pilot in 48 hours. Tight boxes, per-class accuracy reporting, COCO / YOLO / Pascal VOC — your format. No commitment.
Where Bounding Boxes Get Used
- Autonomous driving and ADAS — vehicle, pedestrian and cyclist detection (paired with cuboids for the 3D side).
- Retail and e-commerce — product detection, shelf monitoring, planogram compliance, automatic checkout.
- Security and surveillance — people and vehicle counting, intrusion detection, ANPR.
- Agriculture — fruit counting, livestock identification, pest detection. We covered the agriculture angle in detail in the agriculture data annotation guide.
- Aerial / drone / satellite — vehicle and asset detection from above (almost always oriented boxes, not axis-aligned).
- Manufacturing — defect detection, asset tracking, safety zone monitoring.
Related Reading
- → Bounding box annotation service
- → 3D cuboid annotation guide
- → Image segmentation annotation guide
- → The annotation QA playbook
- → Polygon annotation service
- → Instance segmentation service
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.