How Does Instance Segmentation Annotation Work? (Use Cases + Case Study)

Quick answer

Instance segmentation annotation gives every pixel of each annotated object both a class label and a unique instance identifier, so an AI model can distinguish, count, and interact with individual objects of the same class. Five garments on a mannequin, five cells in a biopsy slide, or five vehicles in a car park each become separately labelled regions. It is required whenever the model's output depends on telling individual objects apart, not just classifying the scene.

What Instance Segmentation Annotation Is and How It Differs from Semantic Segmentation

Instance segmentation annotation produces two pieces of information for each object pixel: the class it belongs to, and which specific instance of that class it belongs to. In a standard COCO-format output, each instance has a unique annotation ID, a category ID, a bounding box, and a pixel mask stored as a polygon or a run-length encoding (RLE).
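
As a rough illustration of that structure, the sketch below builds a minimal COCO-style record in Python for two garments of the same class in one image. The file name, IDs, and coordinates are invented for the example; only the field names follow the COCO annotation layout.

```python
import json

# Two tops in one image: same category_id, different annotation ids.
# All values below are made-up illustration data.
coco = {
    "images": [{"id": 1, "file_name": "outfit_0001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 3, "name": "top"}],
    "annotations": [
        {
            "id": 101,                      # unique per instance
            "image_id": 1,
            "category_id": 3,               # class label
            "bbox": [100, 80, 120, 200],    # [x, y, width, height]
            "segmentation": [[100, 80, 220, 80, 220, 280, 100, 280]],  # polygon x1,y1,x2,y2,...
            "area": 24000,
            "iscrowd": 0,
        },
        {
            "id": 102,
            "image_id": 1,
            "category_id": 3,
            "bbox": [200, 90, 100, 150],
            "segmentation": [[200, 90, 300, 90, 300, 240, 200, 240]],
            "area": 15000,
            "iscrowd": 0,
        },
    ],
}

print(json.dumps(coco["annotations"][0], indent=2))
```

The class is shared through category_id, while the annotation id is what separates one garment from the other.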

The distinction from semantic segmentation is structural. Semantic segmentation asks: 'What class is this pixel?' Instance segmentation asks: 'What class is this pixel, and which specific object does it belong to?' Some tasks require individual objects to be distinguished: counting people in a crowd, identifying which garment is being touched in a shopping interface, segmenting individual tumour colonies in a biopsy. For these, semantic segmentation produces one undifferentiated region per class, which cannot support counting or per-object interaction. Instance segmentation produces one region per object.
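
A toy example makes the difference concrete. The sketch below assumes two cells that touch each other: the class-only map can at best count connected regions, while the instance map carries the count directly. The grids and function names are hypothetical illustration data, not part of any library.

```python
# 0 = background. In the semantic map, 1 means "cell" for every cell.
semantic = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
# Same pixels, but each cell carries its own instance ID.
instance = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

def count_connected_regions(grid):
    """Count 4-connected foreground regions with a simple flood fill."""
    seen, regions = set(), 0
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == 0 or (r, c) in seen:
                continue
            regions += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                    if inside and grid[ny][nx] != 0 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return regions

def count_instances(grid):
    """Count distinct non-zero instance IDs."""
    return len({v for row in grid for v in row if v != 0})

semantic_count = count_connected_regions(semantic)
instance_count = count_instances(instance)
print(semantic_count, instance_count)
```

Because the two cells touch, the semantic map merges them into a single region, and only the instance map preserves the fact that there are two objects.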

Panoptic segmentation combines both: semantic segmentation for 'stuff' classes (road, sky, background) and instance segmentation for 'things' (cars, people, products). Panoptic labels cover every pixel exactly once — the most information-dense label type and the most expensive to produce.

Primary Use Cases for Instance Segmentation

Instance segmentation is required in five categories of application where distinguishing individual objects matters to the downstream output.

Retail visual search and garment separation

'Shop the look' and visual search features require isolating individual garments, accessories, or products on a model or mannequin, even when items overlap. Bounding boxes drawn around overlapping garments overlap heavily, so each crop contains pixels from neighbouring items, which degrades visual embedding quality. Instance masks for each garment, even when partially occluded, allow the embedding model to learn from clean crop regions.
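
To show how a box crop picks up neighbouring pixels, the following sketch measures what fraction of a bounding-box crop actually belongs to the target garment. The tiny instance map and the helper names (bbox_of, crop_purity) are hypothetical; a mask crop has a purity of 1.0 by construction.

```python
# Instance ID map: 0 = background, 1 = jacket, 2 = scarf draped over the jacket.
instance_map = [
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
]

def bbox_of(grid, instance_id):
    """Tight axis-aligned box (top, left, bottom, right) around one instance."""
    rows = [r for r, row in enumerate(grid) for v in row if v == instance_id]
    cols = [c for row in grid for c, v in enumerate(row) if v == instance_id]
    return min(rows), min(cols), max(rows), max(cols)

def crop_purity(grid, instance_id):
    """Fraction of pixels inside the box crop that belong to the instance."""
    top, left, bottom, right = bbox_of(grid, instance_id)
    crop = [v for row in grid[top:bottom + 1] for v in row[left:right + 1]]
    return crop.count(instance_id) / len(crop)

jacket_purity = crop_purity(instance_map, 1)
print(f"jacket box crop purity: {jacket_purity:.2f}")
```

In this toy case two of the twelve pixels in the jacket's box belong to the scarf, so the embedding for the jacket would be computed from a mixed crop.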

Medical pathology and cell-level AI

Counting individual tumour cells, distinguishing tumour cell clusters from stroma, identifying mitotic figures in a field of view, and quantifying tumour-infiltrating lymphocyte density all require instance-level annotation. Semantic segmentation that labels 'tumour' as a region cannot support cell density analysis. Instance masks per cell or cluster are required for pathology AI that outputs quantitative biomarker values.

Autonomous vehicle pedestrian and cyclist tracking

In crowded urban environments, pedestrians and cyclists regularly overlap from the camera's perspective. Instance segmentation allows each individual to be tracked across frames even when their bounding boxes overlap entirely. For AV perception models that need to predict individual trajectories — not just pedestrian density — instance masks are the label type that enables it.

Agriculture: individual plant and fruit counting

Yield estimation AI requires counting individual fruit or heads of grain per plant, not just detecting a 'fruit' region. Instance segmentation of individual apples, mangoes, or wheat heads in drone imagery allows per-plant yield estimation at a resolution that semantic segmentation cannot provide.

Industrial quality control and defect isolation

Manufacturing QC AI that detects multiple surface defects on a single product needs to count, classify, and localise each defect separately — because the pass/fail decision depends on defect count and size, not just whether any defect region exists. Instance masks per defect allow downstream rules to be applied independently to each one.
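
As a sketch of how such downstream rules might look once each defect is a separate instance, the snippet below applies a size limit per defect and a count limit per product. The defect kinds, limits, and the inspect function are assumptions made for illustration; real limits would come from the product specification.

```python
from dataclasses import dataclass

@dataclass
class Defect:
    kind: str        # class label of the instance
    area_mm2: float  # derived from the instance mask's pixel area

# Hypothetical limits, not taken from any real specification.
MAX_AREA_MM2 = {"scratch": 4.0, "dent": 1.5}
MAX_DEFECTS = 2

def inspect(defects):
    """Return 'fail' if any single defect is too large or there are too many."""
    if any(d.area_mm2 > MAX_AREA_MM2[d.kind] for d in defects):
        return "fail"
    if len(defects) > MAX_DEFECTS:
        return "fail"
    return "pass"

print(inspect([Defect("scratch", 1.0), Defect("dent", 0.5)]))
```

Neither rule can be evaluated from a single merged 'defect' region, because both depend on per-instance size and count.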

The Annotation Pipeline for Instance Segmentation

Instance segmentation annotation is more complex than bounding box or semantic segmentation annotation because it requires resolving inter-instance boundaries — where does one object end and another begin when they overlap? A well-run pipeline has five stages.

Stage 1 — Instance definition and occlusion rules

Define which part of an occluded object gets labelled — only the visible portion, or the inferred full boundary? For retail products, the visible-portion-only rule is standard (the model needs to learn from what the camera sees). For AV pedestrians, many pipelines label the inferred full silhouette to support trajectory prediction. Write this rule explicitly before annotation begins — it changes both annotator behaviour and model training semantics.

Stage 2 — Instance ordering for overlapping objects

When two objects overlap, the foreground object's mask must own the pixels in the overlap region. COCO format does not encode depth order (its iscrowd flag only marks groups that cannot be annotated individually), so production pipelines need explicit rules: annotate front-to-back, one instance at a time, with the background instance's mask completed as if the foreground object were not there.
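
One way to apply a front-to-back rule after annotation is sketched below. It assumes each mask was drawn in full and is supplied as a set of (row, column) pixels, ordered from the frontmost instance to the rearmost; the function name and data layout are hypothetical.

```python
def resolve_front_to_back(masks):
    """masks: list of (instance_id, pixel_set) ordered front to back.

    Returns visible-only pixel sets, giving each contested pixel
    to the frontmost instance that covers it.
    """
    claimed = set()
    visible = {}
    for instance_id, pixels in masks:
        visible[instance_id] = pixels - claimed  # drop pixels already owned
        claimed |= pixels
    return visible

front = {(0, 0), (0, 1), (1, 0), (1, 1)}
back = {(1, 1), (1, 2), (2, 1), (2, 2)}   # overlaps the front mask at (1, 1)

visible = resolve_front_to_back([(1, front), (2, back)])
print(sorted(visible[2]))
```

Keeping the full masks alongside the resolved ones lets a project export either the visible-only or the completed version, whichever the occlusion rule from Stage 1 requires.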

Stage 3 — AI-assisted pre-labelling

SAM 2 (Segment Anything Model 2) generates high-quality instance masks from point or box prompts with no task-specific fine-tuning. For standard object categories well-represented in its training data (people, vehicles, common products), SAM 2 pre-labelling at 0.70+ IoU reduces human annotation time by 35–55%. Pre-labels must be reviewed and corrected by annotators — not accepted passively.
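
Whether pre-labels clear an IoU bar can be checked on a calibration set where human-drawn reference masks exist. The sketch below computes mask IoU on pixel sets and sorts a pre-label into 'correct in place' or 'redraw'. The 0.70 threshold mirrors the figure quoted above; the function names and triage labels are assumptions, and no SAM 2 API is called here.

```python
REVIEW_THRESHOLD = 0.70  # from the figure quoted in the text; tune per project

def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def triage(prelabel, reference):
    """Decide whether a pre-label is worth correcting or should be redrawn."""
    if mask_iou(prelabel, reference) >= REVIEW_THRESHOLD:
        return "correct-in-place"
    return "redraw"

def block(rows, cols):
    """Helper: rectangular mask covering the given row and column ranges."""
    return {(r, c) for r in rows for c in cols}

reference = block(range(4), range(0, 4))   # 16 pixels
shifted = block(range(4), range(1, 5))     # same size, shifted one column
narrow = block(range(4), range(0, 3))      # misses one column

print(mask_iou(shifted, reference), triage(shifted, reference))
print(mask_iou(narrow, reference), triage(narrow, reference))
```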

Stage 4 — Human annotation and boundary review

Annotators trace or correct instance boundaries, assign instance IDs, and flag any images where instances are too occluded to label confidently (the iscrowd flag in COCO). Calibration annotation, in which the same images are labelled independently by multiple annotators, should run at the start of every project phase to measure per-class Mask AP against a consensus reference before production scale.

Stage 5 — QA with Mask AP stratified by occlusion level

Standard QA sampling misses quality problems on heavily occluded instances — the hardest to annotate correctly and the most consequential for tracking tasks. Stratify QA by occlusion level (none, partial, heavy) and measure Mask AP separately for each tier. Heavily occluded instances should be oversampled at 3–5× the baseline QA rate.
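
A stratified draw of this kind could be sketched as follows. The 10% baseline and the 2× multiplier for partial occlusion are assumptions for illustration; the 4× multiplier for heavy occlusion sits inside the 3–5× range mentioned above.

```python
import random

BASE_RATE = 0.10                                     # hypothetical baseline QA rate
MULTIPLIER = {"none": 1, "partial": 2, "heavy": 4}   # heavy tier within 3-5x

def draw_qa_sample(instances, seed=0):
    """instances: list of (instance_id, occlusion_tier).

    Returns a dict mapping each tier to the instance IDs selected for QA.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for tier, multiplier in MULTIPLIER.items():
        ids = [i for i, t in instances if t == tier]
        k = min(len(ids), round(len(ids) * BASE_RATE * multiplier))
        sample[tier] = rng.sample(ids, k)
    return sample

instances = (
    [(i, "none") for i in range(100)]
    + [(100 + i, "partial") for i in range(50)]
    + [(150 + i, "heavy") for i in range(20)]
)
sample = draw_qa_sample(instances)
print({tier: len(ids) for tier, ids in sample.items()})
```

Mask AP would then be computed separately on each tier's sample, so a weak result on heavily occluded instances is not averaged away by the easier tiers.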

Market Context: Instance Segmentation in Production AI

According to a 2024 Grand View Research report, AI-powered visual search adoption in e-commerce platforms grew at a CAGR of 43% from 2021 to 2024, with retailers reporting average conversion rate improvements of 25–35% in A/B tests where visual search replaced text-only product discovery. Instance segmentation is the label type that enables these features — bounding-box-based visual search systems produce significantly lower search precision on occluded or multi-item images.

A 2023 study published in the proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) benchmarked detection-versus-segmentation approaches across eight fashion retail datasets. Switching from axis-aligned bounding box detection to instance segmentation masks improved visual product search precision by an average of 29 percentage points — the improvement was most pronounced on datasets where garments routinely overlapped in catalogue images.

For computer vision teams building perception systems where instances must be tracked across frames, see our companion post on autonomous vehicle perception annotation, which covers multi-frame instance tracking in LiDAR and camera fusion pipelines.

Case Study: 30-Point Precision Gain for an Australian Fashion Retailer's Visual Search

In early 2025, an Australian fashion retailer with approximately 45,000 SKUs attempted to deploy a 'shop the look' visual search feature on their e-commerce platform. The initial implementation used a YOLO-based object detection model trained on bounding box annotations of garments. Performance in production fell well below the A/B test target.

Baseline model performance before annotation rebuild:

Visual search precision (top-5 relevant results): 54.2%
'Shop the look' click-through rate: 1.8%
Add-to-cart conversion from visual search: 0.6%
Primary failure mode: multi-item outfit images where bounding boxes overlapped by 40–70%, causing the embedding model to include pixels from adjacent garments in the product crop — resulting in poor visual similarity scores
Secondary failure mode: accessories (jewellery, bags, shoes) with bounding boxes that captured significant background or overlapping garment pixels

The root cause was clearly the label type: bounding boxes in multi-garment images overlapped, meaning product embeddings were computed from mixed-garment crops rather than clean single-garment crops. The team decided to rebuild with instance segmentation masks.

The annotation rebuild covered 62,000 images across 14 garment and accessory categories:

Phase 1 — Category taxonomy and occlusion rules (weeks 1–2)

The taxonomy was expanded to 14 categories including sub-types for tops, bottoms, outerwear, dresses/jumpsuits, footwear (3 subtypes), and accessories (4 subtypes). Occlusion rule: annotate only the visible pixel area of each garment — no inference of hidden areas. Overlap resolution rule: foreground garment receives the pixels in the overlap zone; the background garment's mask is completed as if no overlap existed (used for training the background object branch).

Phase 2 — SAM 2 pre-labelling (weeks 3–6)

SAM 2 was used with bounding-box prompts for each garment instance, generating mask candidates. Pre-labelling Mask AP at IoU=0.5 on a calibration set: 0.71 overall (0.79 for outerwear, 0.64 for accessories). Human annotators reviewed and corrected every pre-label. Acceptance rate without correction: 58% of instances. Net throughput with SAM 2 plus correction: 2.3× faster than full manual annotation.

Phase 3 — QA and delivery (weeks 7–9)

QA stratified by occlusion level: non-occluded instances sampled at 10%, partially occluded at 20%, heavily occluded at 40%. Per-category Mask AP measured against gold-set references. Two accessory sub-categories (rings, earrings) required annotator retraining after QA identified systematic background inclusion in the mask area for small accessories.

Results after annotation rebuild and model retrain:

Visual search precision (top-5): 54.2% → 84.6% — a 30.4 percentage-point improvement
'Shop the look' click-through rate: 1.8% → 4.1% — a 2.3× uplift
Add-to-cart conversion from visual search: 0.6% → 1.9% — a 3.2× uplift
Accessory category precision: 31% → 71% — the largest per-category gain, driven by clean mask crops eliminating background and adjacent-garment pixels
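
The percentage-point gain and the uplift multiples above follow directly from the before and after figures. A short check, with a hypothetical helper name, is:

```python
results = {
    "precision_top5": (54.2, 84.6),
    "click_through": (1.8, 4.1),
    "add_to_cart": (0.6, 1.9),
    "accessory_precision": (31.0, 71.0),
}

def summarise(before, after):
    """Return (percentage-point gain, uplift ratio), each rounded to 1 decimal."""
    return round(after - before, 1), round(after / before, 1)

for name, (before, after) in results.items():
    gain_pp, ratio = summarise(before, after)
    print(f"{name}: +{gain_pp} pp, {ratio}x")
```

Percentage points describe the absolute difference between two rates, while the multiple describes the relative change, which is why a 1.3-point rise in add-to-cart conversion corresponds to a 3.2× uplift.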

The annotation rebuild cost approximately AUD $78,000, compared to AUD $22,000 for the original bounding box annotation, a differential of AUD $56,000. At the retailer's scale, the 3.2× add-to-cart conversion improvement on visual search produced incremental revenue significantly exceeding that differential within the first quarter of deployment. For teams building multi-label product taxonomy systems, see our guide on product tagging and visual search annotation in e-commerce.

Cost and Throughput: What to Budget For

Instance segmentation annotation cost depends on object count per image, object complexity and occlusion level, and whether AI-assisted pre-labelling is feasible for your object categories. Per-instance cost ranges from AUD $0.15 for simple, non-occluded objects with AI assistance to AUD $0.80 for complex, partially occluded objects without AI assistance.

Per-image cost is driven by object density. A retail image with three to five non-overlapping garments: AUD $0.50–$1.50. A street scene with 15–25 pedestrians and vehicles with partial occlusion: AUD $4.00–$10.00. A pathology slide with 200+ individual cell instances: AUD $12.00–$30.00 per tile.
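
For a first budget estimate, the per-image ranges above can simply be multiplied by image volume. The sketch below does that; the scene keys and the function name are hypothetical, and the result is a rough bracket rather than a quote.

```python
# Per-image ranges in AUD, copied from the figures above.
PER_IMAGE_AUD = {
    "retail": (0.50, 1.50),
    "street": (4.00, 10.00),
    "pathology_tile": (12.00, 30.00),
}

def budget_range(scene, n_images):
    """Return (low, high) budget in AUD for n_images of the given scene type."""
    low, high = PER_IMAGE_AUD[scene]
    return n_images * low, n_images * high

low, high = budget_range("retail", 62_000)
print(f"AUD ${low:,.0f} to ${high:,.0f}")
```

For 62,000 retail images this brackets the budget between AUD $31,000 and AUD $93,000, which is consistent with the approximately AUD $78,000 reported in the case study.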

AI-assisted pre-labelling with SAM 2 reduces annotation cost by 35–55% for object categories well-represented in its training corpus (people, vehicles, garments, common objects). For highly specific object types (microscopy cells, custom industrial components, specialised medical instruments), SAM 2 pre-labels typically fall below the 0.65 IoU threshold needed for efficient human correction, and full manual annotation is faster overall. See our semantic segmentation annotation case study for a direct comparison of how different label types affect model performance and annotation budget in parallel perception projects.

How to Choose: Bounding Box, Polygon, Semantic Segmentation, or Instance Segmentation

The decision tree is straightforward when stated as a series of model requirements:

Model Requirement	Label Type	Cost Ratio
Detect where objects are (position only)	Bounding box	1×
Detect object shape (non-overlapping objects)	Polygon (15–30 vertices)	3–5×
Classify entire scene at pixel level (classes only)	Semantic segmentation	8–15×
Detect, count, and distinguish individual objects (including overlapping)	Instance segmentation	8–20×
Full scene understanding: classify background + count foreground objects	Panoptic segmentation	15–30×

The single most common mistake in segmentation project scoping: choosing instance segmentation when polygon annotation would have produced equivalent downstream model performance. Before specifying instance masks, ask: 'Could a polygon with 20–30 vertices per object train an equivalent model?' If the objects do not overlap and the model output does not depend on pixel-exact shape, the answer is usually yes — and polygon annotation is 50–60% cheaper. For a detailed breakdown of polygon versus bounding box cost-quality trade-offs, see our bounding box annotation cost and speed case study.
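
The table and the scoping question above can be read as a small decision function. The sketch below is one possible reading, not a definitive rule set; the parameter names are invented for the example, and real projects often have requirements that do not fit a handful of booleans.

```python
def choose_label_type(*, needs_individual_objects, objects_overlap,
                      needs_pixel_exact_shape, needs_shape,
                      needs_whole_scene_classes):
    """Map model requirements to the cheapest label type that satisfies them."""
    if needs_individual_objects and needs_whole_scene_classes:
        return "panoptic segmentation"
    if needs_whole_scene_classes:
        return "semantic segmentation"
    if needs_individual_objects and (objects_overlap or needs_pixel_exact_shape):
        return "instance segmentation"
    if needs_shape:
        return "polygon"
    return "bounding box"

# 'Shop the look': overlapping garments that must be separated.
shop_the_look = choose_label_type(
    needs_individual_objects=True, objects_overlap=True,
    needs_pixel_exact_shape=True, needs_shape=True,
    needs_whole_scene_classes=False,
)
print(shop_the_look)
```

The order of the checks matters: the function only reaches the cheaper label types once the requirements that force a more expensive one have been ruled out.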

Frequently Asked Questions

What is instance segmentation annotation?

Instance segmentation annotation gives every pixel of each annotated object both a class label and a unique instance ID, so AI models can distinguish, count, and track individual objects of the same class. Five vehicles in a parking scene become five separately labelled regions, each with its own mask and instance identifier. It is required for any task where counting or distinguishing individual objects matters to the model output.

How is instance segmentation different from semantic segmentation?

Semantic segmentation assigns every pixel a class but treats all objects of that class as one region. Instance segmentation gives each object pixel both a class and a unique instance ID, making each individual object a separate labelled region. Semantic segmentation is sufficient for driveable-surface detection or background classification. Instance segmentation is required when the model must count, track, or interact with individual objects such as retail products, individual pedestrians, or cells in a biopsy.

What format is used for instance segmentation annotation?

COCO JSON is the dominant standard, with per-instance polygon annotations or RLE binary masks alongside bounding boxes and category IDs. SAM 2 outputs are typically converted to COCO RLE format before training. For some medical imaging pipelines, TIFF stacks with per-instance channels are used. Lock the format to your training architecture before annotation begins.
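
For readers unfamiliar with RLE, the sketch below encodes a small binary mask into COCO-style uncompressed RLE using plain Python. It assumes the usual COCO convention of column-major pixel order with run counts that start with a run of zeros; production pipelines would normally use a library such as pycocotools rather than hand-written code like this.

```python
def encode_rle(mask):
    """mask: list of rows containing 0/1. Returns uncompressed COCO-style RLE."""
    height, width = len(mask), len(mask[0])
    # Flatten column by column (column-major order).
    flat = [mask[r][c] for c in range(width) for r in range(height)]
    counts = []
    previous, run = 0, 0   # counts begin with zeros, so a leading 1 gives a 0 run
    for value in flat:
        if value == previous:
            run += 1
        else:
            counts.append(run)
            previous, run = value, 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}

mask = [
    [0, 1],
    [0, 1],
    [1, 1],
]
rle = encode_rle(mask)
print(rle)
```

The counts alternate between background and foreground runs and always sum to height × width, which is a quick sanity check when converting masks between tools.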

When should I choose instance segmentation over bounding boxes or polygons?

Choose instance segmentation when: objects of the same class overlap (garments on a mannequin, pedestrians in a crowd, cells in a tissue section); the application uses mask shape at inference time (visual search, try-on, surgical tool tracking, precise crop estimation); or downstream rules depend on individual object properties rather than class region properties. If objects do not overlap and shape is not needed, polygons or bounding boxes are 60–90% cheaper.

How much does instance segmentation annotation cost?

Per-instance: AUD $0.15–$0.80 depending on complexity, occlusion, and AI assistance. Per-image: AUD $0.50–$1.50 for retail images with three to five garments; AUD $4.00–$10.00 for crowded street scenes; AUD $12.00–$30.00 per tile for dense cell pathology slides. SAM 2-assisted pre-labelling reduces cost by 35–55% for standard object categories.

What AI architectures use instance segmentation training data?

The main architectures are Mask R-CNN (two-stage, high accuracy), YOLO with a segmentation head (v5–v11, faster inference, good for production deployment), SOLOv2 (location-based and box-free, strong on dense scenes), and query-based models such as QueryInst, which builds on the Sparse R-CNN detector. SAM 2 is widely used for pre-labelling but is not typically used as the production segmentation backbone. Your architecture choice determines the required output format.

Neel Bennett

AI Annotation Specialist at AI Taggers

Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.
