Segmentation has had a Cinderella decade. Driveable-area detection in autonomous cars, organ outlining in medical imaging, hair and skin masks in fashion AR, every kind of “remove the background” consumer feature — all built on per-pixel labels. The model side has gotten dramatically better with DeepLabv3+, transformer-based architectures, and Segment Anything blurring the line between annotation and prediction.
The annotation side has not gotten cheaper at anywhere near the same rate. Semantic segmentation is still genuinely expensive, roughly 8 to 12 times the per-image cost of bounding boxes on production projects and 10 to 15 times for instance segmentation, and the way to get the budget right is to be honest about whether you actually need it. This guide is the honest version: what semantic segmentation is, when it's the right tool, when it isn't, what it costs, and where projects quietly go off the rails.
What Semantic Segmentation Actually Is
Semantic segmentation assigns every pixel in an image to a class. A driving scene becomes a paint-by-numbers map — road in one colour, sidewalk in another, vehicles, pedestrians, vegetation, sky, each as its own region. The model trained on it learns the area and shape of each class, not just where bounding boxes around objects sit.
The output isn't a list of objects. It's a mask the same size as the image. That mask is what feeds downstream tasks — driveable area, scene parsing, fashion try-on, organ-aware diagnostic AI. The annotation cost is real because every pixel is a labelling decision; the model benefit is real because some downstream tasks genuinely can't be solved any other way.
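To make "a mask the same size as the image" concrete, here is a minimal sketch in plain Python. The class indices and the tiny 4x6 mask are made up for illustration; a real project defines its own label map and stores the mask as an image file.

```python
from collections import Counter

# Hypothetical class indices; real projects define these in a label map.
CLASSES = {0: "road", 1: "sidewalk", 2: "vehicle", 3: "sky"}

# A 4x6 "image" worth of labels: the mask holds one class index per pixel.
mask = [
    [3, 3, 3, 3, 3, 3],
    [1, 0, 0, 2, 2, 1],
    [1, 0, 0, 2, 2, 1],
    [1, 0, 0, 0, 0, 1],
]


def class_coverage(mask, classes):
    """Return the fraction of pixels assigned to each class name."""
    counts = Counter(value for row in mask for value in row)
    total = sum(counts.values())
    return {name: counts.get(index, 0) / total for index, name in classes.items()}


coverage = class_coverage(mask, CLASSES)
for name, fraction in coverage.items():
    print(f"{name}: {fraction:.1%}")
```

Coverage figures like these (driveable area percentage, canopy cover) fall straight out of a mask, which is why area-based tasks tend to need segmentation rather than boxes.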
Semantic vs Instance vs Panoptic — The Honest Differences
- Semantic segmentation — every pixel is a class, no individual object tracking. Five overlapping cars become one “car” region. Use when class is what matters (driveable area, organ outlines, vegetation cover).
- Instance segmentation — every pixel gets a class AND an instance ID. The five cars become five separate regions you can count. Use when individual objects matter (counting, tracking, occlusion-aware perception).
- Panoptic segmentation — best of both. Stuff classes (road, sky, building) get semantic treatment; thing classes (cars, people) get instance treatment. Every pixel is covered exactly once, no overlap. The default for full-scene autonomous-driving perception today.
Pick by the question the model has to answer. “What kind of surface is here?” Semantic is fine. “How many distinct objects are here?” Instance. “Both, across the whole scene?” Panoptic. Cost climbs in that order. Under-spec'ing with semantic when instance is needed wastes the annotation; over-spec'ing with panoptic when semantic suffices can double cost for no model gain.
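The "overlapping cars become one region" point can be shown with a small sketch. It tries to recover instances from a semantic mask by flood-filling connected blobs. That is a stand-in technique chosen for illustration, not how instance annotation is produced, and the class index and toy mask are hypothetical.

```python
from collections import deque


def label_instances(semantic_mask, target_class):
    """Assign an ID to each 4-connected blob of target_class pixels.

    Returns (instance_mask, blob_count); 0 in the mask means "not this class".
    """
    height, width = len(semantic_mask), len(semantic_mask[0])
    instance_mask = [[0] * width for _ in range(height)]
    next_id = 0
    for y in range(height):
        for x in range(width):
            if semantic_mask[y][x] != target_class or instance_mask[y][x]:
                continue
            next_id += 1
            instance_mask[y][x] = next_id
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one blob
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and semantic_mask[ny][nx] == target_class
                            and not instance_mask[ny][nx]):
                        instance_mask[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_mask, next_id


CAR = 1  # hypothetical class index
# Three cars are in the scene; the two on the left overlap in the image.
semantic = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
_, blobs = label_instances(semantic, CAR)
print(blobs)  # 2 blobs, although the scene holds 3 cars
```

Once the overlapping cars are merged into one "car" region, no post-processing can reliably split them again. If the model has to count, the instance IDs need to be annotated.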
When Segmentation Beats Polygons (and When It Doesn't)
We talked about this from the polygon side in the polygon annotation guide. Here's the segmentation-side version of the same call:
- Segmentation wins when boundaries are genuinely pixel-irregular — hair, fur, vegetation, fluid organic shapes, irregular tumour edges, lane surface that needs to follow lane paint exactly.
- Segmentation wins when area or coverage is the model output — driveable area percentage, crop canopy coverage, organ volume from 2D slices.
- Polygons win when the object has a clear silhouette that 12–30 vertices can trace efficiently. On typical production ratios that is roughly 2–3x cheaper than semantic segmentation for no meaningful model loss.
- Boxes win when only object presence and rough position matter. See the bounding box guide.
The honest test we apply on every incoming segmentation brief — would a 20-vertex polygon capture this shape? If yes, polygon. If no, segmentation. If unclear, run a small pilot with both and let the downstream model performance decide.
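One cheap way to sanity-check the polygon question before a full pilot is to rasterise the candidate polygon and measure its overlap with a carefully drawn reference mask. The sketch below assumes pixel-centre sampling and an even-odd fill rule; the reference mask and the four-vertex polygon are invented for illustration, and IoU here is only a proxy for the downstream model performance that should make the final call.

```python
def point_in_polygon(px, py, polygon):
    """Even-odd ray casting test for a point against a list of (x, y) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > py) != (yj > py):  # edge straddles the horizontal ray
            x_cross = xi + (py - yi) * (xj - xi) / (yj - yi)
            if px < x_cross:
                inside = not inside
        j = i
    return inside


def rasterise(polygon, height, width):
    """Fill a binary mask by testing each pixel centre against the polygon."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
             for x in range(width)] for y in range(height)]


def binary_iou(a, b):
    """Intersection over union of two same-sized binary masks."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa | pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 1.0


# Hypothetical pixel-accurate reference: a block with one stray wisp.
reference = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
polygon = [(1, 1), (5, 1), (5, 5), (1, 5)]  # (x, y) vertices

iou = binary_iou(rasterise(polygon, 6, 6), reference)
print(f"polygon vs reference IoU: {iou:.3f}")
```

If the polygon already overlaps the reference almost completely on representative images, that is a hint the shape does not need per-pixel labelling. What counts as "almost completely" depends on the task, so any threshold is a project-level assumption.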
Formats: PNG Masks, COCO RLE, Cityscapes
- PNG class-index masks — one greyscale or RGB-encoded image per source image, each pixel value is a class index. Used by Cityscapes, ADE20K, and most academic pipelines. Simple, lossy on sub-pixel boundaries.
- COCO panoptic / segmentation JSON — masks stored either as polygons (efficient for simple shapes) or RLE (run-length-encoded bitmaps, exact). The de-facto general-purpose format.
- Mask R-CNN binary masks — per-instance binary masks alongside detection outputs. Common for instance/panoptic pipelines.
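Lock the format up front. Converting between polygons and raster masks loses precision at object edges, and converting back doesn't restore it. RLE and PNG masks are both per-pixel bitmaps, so moving between them is lossless as long as class and instance IDs are mapped consistently. If you might train across formats, annotate in the strictest one (usually COCO RLE) and convert down.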
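For readers who have not looked inside an RLE mask, here is a minimal sketch of the uncompressed COCO-style layout: pixels are read column by column, and the counts alternate between runs of 0s and runs of 1s, starting with the 0s. The helper names are made up for this example. Production pipelines normally use an existing library such as pycocotools, which also handles the compressed string form that this sketch skips.

```python
def rle_encode(mask):
    """Encode a binary mask as uncompressed COCO-style RLE.

    Pixels are read in column-major order. Counts alternate runs of 0s and
    1s, starting with the run of 0s (which may have length zero).
    """
    height, width = len(mask), len(mask[0])
    flat = [mask[y][x] for x in range(width) for y in range(height)]
    counts, current, run = [], 0, 0
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current, run = value, 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}


def rle_decode(rle):
    """Rebuild the binary mask (list of rows) from uncompressed RLE."""
    height, width = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]


mask = [
    [0, 1, 1],
    [0, 1, 0],
]
rle = rle_encode(mask)
print(rle)                      # {'size': [2, 3], 'counts': [2, 3, 1]}
print(rle_decode(rle) == mask)  # True
```

The round trip returns the same bitmap that went in. RLE is a compact way of storing a binary mask, one per class or instance, and carries no more and no less detail than the pixels themselves.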
Pricing: The 8x–15x Reality Check
Rough per-image cost ratios on production projects (varies with class density and image complexity):
- Bounding box — baseline (1x).
- Polygon (12–20 vertex) — 3–5x.
- Semantic segmentation — 8–12x.
- Instance segmentation — 10–15x.
- Panoptic segmentation — 12–18x.
The ratios shift with scene density and rare-class precision requirements. The point isn't the exact number — the point is that segmentation is the most expensive type of annotation in mainstream CV. Half the segmentation briefs we see could ship on polygons or boxes for a fraction of the cost. Worth pressure-testing before scoping. Broader context in the data annotation pricing guide.
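The ratios above turn into a budget range with one multiplication. A small sketch follows; the image count and the per-image box cost are placeholder inputs, not quoted prices.

```python
# Cost multipliers relative to a bounding box, taken from the ratios above.
RATIOS = {
    "bounding box": (1, 1),
    "polygon": (3, 5),
    "semantic": (8, 12),
    "instance": (10, 15),
    "panoptic": (12, 18),
}


def budget_range(annotation_type, images, box_cost_per_image):
    """Return (low, high) project cost for an annotation type."""
    low, high = RATIOS[annotation_type]
    base = images * box_cost_per_image
    return base * low, base * high


# Placeholder inputs: 10,000 images at 0.10 currency units per boxed image.
for name in RATIOS:
    low, high = budget_range(name, 10_000, 0.10)
    print(f"{name}: {low:,.0f} to {high:,.0f}")
```

Running the same inputs through each row shows in absolute terms what moving a brief from segmentation down to polygons or boxes would save.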
Scoping a segmentation project?
Free 25-image pilot — semantic, instance or panoptic. Per-class mIoU on our QA gold set, COCO RLE or PNG mask output, 72-hour turnaround. We'll also tell you honestly if polygons would have been enough.
See our segmentation service
Quality: mIoU per Class, or You're Hiding the Problem
Mean IoU is the standard metric for semantic segmentation, but the discipline that matters is reporting it per class, every batch. A 92% mIoU averaged across 20 classes can hide a 35% IoU on the one rare class your downstream model actually depends on. Per-class reporting is the same discipline we'd demand on any high-stakes task — see the annotation QA playbook. For instance segmentation, the equivalent is Mask AP per class at thresholds 0.5 / 0.75 / 0.5:0.95.
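A minimal sketch of per-class reporting, assuming predictions and ground truth are class-index masks of the same size. The three-class toy masks are invented, with class 2 playing the rare class.

```python
def per_class_iou(pred, truth, num_classes):
    """Return IoU per class index, or None where a class is absent from both masks."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:  # a mismatched pixel enlarges the union of both classes
                union[p] += 1
                union[t] += 1
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


truth = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 2, 2],
    [0, 0, 0, 0, 1, 1, 2, 2],
]
pred = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 2],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

ious = per_class_iou(pred, truth, num_classes=3)
for index, value in enumerate(ious):
    print(f"class {index}: IoU {value:.2f}")
print(f"mIoU: {mean_iou(ious):.2f}")
```

Here the rare class sits at 0.25 while the mean lands near 0.68. With 20 classes instead of three, the same weak class moves the average far less, which is why the per-class table has to ship with every batch.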
Where Segmentation Gets Used
- Autonomous driving and ADAS — driveable area, lane surface, free-space estimation. Cross-link with the 3D cuboid guide on the 3D side of AV perception.
- Medical imaging — organ segmentation, tumour-region masks for radiology and pathology. See the histopathology guide and ophthalmology guide for the clinical-grade specifics.
- Fashion and retail AR — garment masks for try-on, hair and skin masks for beauty filters.
- Agriculture — crop-vs-weed pixel maps for autonomous sprayers, canopy coverage from drone imagery. See the agriculture annotation guide.
- Geospatial and aerial — building footprint extraction, land cover classification.
- Industrial and manufacturing — defect segmentation, asset boundary detection.
Related Reading
- → Semantic segmentation service
- → Instance segmentation service
- → Image segmentation annotation guide
- → Polygon annotation guide
- → Bounding box annotation guide
- → The annotation QA playbook
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.