What is multilingual audio annotation?

Multilingual audio annotation is the labelling of speech and sound data across multiple languages so AI models can transcribe, understand, and act on what was said. The unit of work is usually a transcript with extra layers — speaker tags, timestamps, intent, sentiment, code-switching markers and accent metadata — produced by native speakers of each target language. The 'multilingual' bit is the hard part, not 'audio'.

What annotation tasks are used for multilingual speech AI?

The common stack: verbatim or clean transcription, speaker diarization (who spoke when), code-switch tagging (Arabic-English in GCC, Hindi-English in India), sentiment and intent classification, named-entity recognition on the transcript, sound-event labelling for non-speech audio, and pronunciation / accent metadata. Real multilingual projects almost always combine three or four of these on the same audio.

Do you need native speakers for multilingual audio annotation?

Yes. Non-native annotators miss dialect-specific phrasing, sarcasm, code-switching boundaries and culturally loaded vocabulary, and those misses change the model's training signal. For high-stakes use cases (banking, healthcare, government) you need not just native speakers but native speakers of the target dialect: Khaleeji is not Egyptian Arabic, Cantonese is not Mandarin. Generic 'Arabic annotators' produce generic Arabic AI.

What languages and dialects are hardest to annotate well?

Anything with code-switching at scale (Arabic-English in GCC, Hindi-English in India, Spanish-English in the US Southwest), tonal languages (Mandarin, Cantonese, Vietnamese) where pitch changes the meaning of a word, low-resource languages with few qualified annotators, and dialects that diverge sharply from the written standard (Khaleeji vs MSA, colloquial vs literary Egyptian Arabic). The honest scoping move is to budget more annotator hours for these than you would for the same volume of English.

How do you handle code-switching in transcription?

Code-switching is when a speaker mixes languages mid-sentence — extremely common in GCC business audio, Indian English call centre data, and bilingual customer support across most markets. The annotation guideline has to be explicit: tag each switch with a language marker, decide on a script convention (Arabic in Arabic script even if spoken inside an English sentence), and document how to handle borrowed words that have crossed into one of the languages permanently. Without a guideline, every annotator makes different calls.

What audio annotation formats are used?

For transcription — JSON with word-level timestamps, or CTM / WebVTT for some pipelines. For diarization — RTTM or speaker-tagged JSON. For sentiment and intent — typically a per-utterance label appended to the transcript JSON. The audio itself is usually WAV (PCM) for training, sometimes FLAC for archival. Lock the schema early; mid-project conversion is annoying and the timestamps don't always survive cleanly.
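As a rough illustration of why the schema matters, here is a minimal sketch that turns speaker-tagged JSON into RTTM lines. The JSON field names (file_id, segments, speaker, start, end, lang) are our own placeholders, not a standard, and the channel is assumed to be 1. Note what the conversion loses: RTTM has no slot for the language tag, and the timestamps get rounded to milliseconds.

```python
# Hypothetical speaker-tagged transcript; the field names are illustrative only.
transcript = {
    "file_id": "call_0001",
    "segments": [
        {"speaker": "spk1", "start": 0.00, "end": 2.40, "lang": "ar"},
        {"speaker": "spk2", "start": 2.40, "end": 5.10, "lang": "en"},
    ],
}


def to_rttm(doc):
    """Render each segment as one 10-field RTTM SPEAKER line.

    Fields: type, file id, channel, onset, duration, then <NA> placeholders
    around the speaker name. The language tag is dropped on purpose, because
    RTTM has nowhere to put it.
    """
    lines = []
    for seg in doc["segments"]:
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {doc['file_id']} 1 {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return lines


rttm_lines = to_rttm(transcript)
print("\n".join(rttm_lines))
```

If the language tags matter downstream, keep the JSON as the source of truth and treat RTTM as an export.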

How is multilingual audio annotation quality measured?

For transcription — Word Error Rate (WER) per language, never just overall (an aggregate 5% WER can hide a 25% WER on one of the languages). For diarization — Diarization Error Rate (DER). For classification (sentiment, intent) — F1 per class and per language. For all of it — inter-annotator agreement on a held-out gold set, measured separately per language. The general QA discipline is the same as our annotation QA playbook, with language-stratified metrics on top.
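To make the per-language point concrete, here is a small sketch of language-stratified WER using a plain word-level edit distance. The sample sentences are toy data we made up for the illustration, and whitespace tokenisation is an assumption that will not suit every language. In annotation QA the reference is the gold transcript and the hypothesis is the annotator's output.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_language(samples):
    """samples: iterable of (lang, reference_text, hypothesis_text)."""
    errors, words = {}, {}
    for lang, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[lang] = errors.get(lang, 0) + edit_distance(ref_tokens, hyp_tokens)
        words[lang] = words.get(lang, 0) + len(ref_tokens)
    per_lang = {lang: errors[lang] / words[lang] for lang in words}
    overall = sum(errors.values()) / sum(words.values())
    return per_lang, overall


# Toy gold set: two clean English lines, one Arabic line with a dropped word.
samples = [
    ("en", "please send the report today", "please send the report today"),
    ("en", "the meeting starts at nine", "the meeting starts at nine"),
    ("ar", "شكرا جزيلا على الحضور", "شكرا على الحضور"),
]

per_lang, overall = wer_by_language(samples)
print(per_lang, round(overall, 3))
```

On this toy set the overall figure is about 7%, while the Arabic side sits at 25%, which is the pattern the aggregate number hides.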

Multilingual Audio Annotation: Speech, Transcription & Diarization Across Languages (2026)

We get a lot of incoming briefs that read like “we need multilingual transcription, Arabic and English, 1,000 hours, what's the rate?” The honest answer to that brief is always “tell us which Arabic”. Standard MSA spoken by a news presenter, Khaleeji business chat, Egyptian street-speech in a TikTok creator's footage, and Saudi WhatsApp voice notes are not the same task. And none of them is the “Arabic” that a generic transcription vendor has in mind when it quotes.

This guide is what we wish the speech-AI teams who land on our doorstep already knew — what multilingual audio annotation actually includes, what makes it harder than monolingual work, the tasks you'll actually run, and what a sensible scope looks like for MENA, APAC, and AU bilingual markets. None of it is rocket science. Almost all of it is missed by generalist annotation vendors.

The Tasks You'll Actually Run

Real multilingual audio projects mix several of these on the same recordings:

Transcription — verbatim (preserves um, uh, false starts) or clean (readable, normalised). Word-level timestamps are usually the deliverable. Verbatim costs more and is what speech models actually want to train on.
Speaker diarization — who spoke when. Two-speaker phone calls are easy; eight-speaker boardroom audio with overlap is genuinely hard and gets priced accordingly.
Code-switch tagging — marking which language each segment is in. Essential for GCC business audio, Indian and Filipino call-centre data, and any market where bilingual speech is the norm.
Sentiment and intent — per-utterance labels feeding conversational AI and contact-centre analytics. Hugely language-dependent — sarcasm, hyperbole and politeness conventions don't transfer.
Sound-event tagging — non-speech audio (laughter, baby cry, ambient noise classes). Often paired with transcription for media and accessibility AI.
Pronunciation / accent metadata — for ASR and TTS work where the model needs to handle non-native English, Singlish, Hinglish, GCC English, and so on. Tagged at the speaker level.
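Because these layers usually sit on the same recording, it helps to see them in one record. The sketch below is one possible shape, not a standard schema: every class name, field name and label value is a placeholder we chose for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float
    lang: str     # code-switch tag, e.g. "ar" or "en"


@dataclass
class Utterance:
    speaker: str                       # diarization label
    words: List[Word]                  # transcription with word-level timestamps
    intent: Optional[str] = None       # per-utterance labels
    sentiment: Optional[str] = None
    sound_events: List[str] = field(default_factory=list)


@dataclass
class Recording:
    file_id: str
    speaker_accents: Dict[str, str]    # accent metadata, tagged per speaker
    utterances: List[Utterance]


recording = Recording(
    file_id="clip_0001",
    speaker_accents={"spk1": "gcc_english"},
    utterances=[
        Utterance(
            speaker="spk1",
            words=[Word("okay", 0.0, 0.3, "en"), Word("يلا", 0.3, 0.6, "ar")],
            intent="close_meeting",
            sentiment="neutral",
            sound_events=["laughter"],
        )
    ],
)

# Languages present in the clip, derived from the word-level tags.
languages = sorted({w.lang for u in recording.utterances for w in u.words})
payload = json.dumps(asdict(recording), ensure_ascii=False, indent=2)
print(languages)
```

Keeping the language tag on the word rather than on the utterance is a design choice: it lets one utterance carry a mid-sentence switch without splitting the speaker turn.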

Why “Just Add Another Language” Falls Over

English speech AI is well-resourced — millions of hours of training data, mature evaluation benchmarks, deep pools of qualified annotators. Most other languages don't have any of that. The implication for annotation scope is concrete and underestimated:

The annotator pool is much smaller. Finding 50 qualified Khaleeji native speakers with the patience for verbatim transcription is genuinely harder than finding 50 qualified English ones. Lead time and rate both go up.
The reference is unstable. For English you have decades of ASR benchmarks. For Khaleeji, Sinhala, Tagalog or Hausa you may have one academic paper and a handful of model checkpoints. The gold standard has to be built from scratch on most projects.
The dialect spread is real. MSA “Arabic” transcription has almost no overlap with Khaleeji business conversation. Mandarin transcription does not transfer to Cantonese. Hindi is not Urdu, despite the spoken overlap.

None of this means it's impossible. It means the scoping has to be honest about the language, not the “language family”. We covered the specific MSA-vs-dialect tradeoff for Arabic AI teams in the Khaleeji vs MSA dialect strategy guide.

Code-Switching: The Default Reality In Bilingual Markets

Walk into a Dubai boardroom. The CFO greets in Arabic, presents in English, slips into Arabic for an aside to a colleague, switches back to English for the next slide. That single five-minute clip needs to be transcribed in both scripts, with language tags per segment, and the speech model needs to handle the transitions without dropping a syllable. This is normal, not unusual.

The annotation guideline has to be explicit, or every annotator quietly does something different. The rules we lock down on every code-switch project:

Script convention. Each language stays in its own script. Arabic spoken inside an English sentence is transcribed in Arabic script, not romanised.
Switch boundaries. Mark the exact word where the language changes — the model needs this signal to learn segment boundaries.
Borrowed words. Decide once. “Email” spoken in an Arabic sentence — is it English or Arabic loan? Document the call.
Mixed lexemes. “Whatsapping”, “chai latte” — fixed convention so labels are consistent across the dataset.

Get this guideline wrong and you'll re-label the dataset later. Get it right and the model handles real bilingual speech instead of the sanitised corporate version.
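Two of these rules are easy to check mechanically. The sketch below assumes each token arrives as a (text, language tag) pair, which is our own convention for the example, and uses Unicode character names as a rough test for Arabic script. Borrowed words and mixed lexemes still need a human decision and a documented list.

```python
import unicodedata


def switch_boundaries(tokens):
    """Indices of tokens whose language tag differs from the previous token."""
    return [i for i in range(1, len(tokens)) if tokens[i][1] != tokens[i - 1][1]]


def is_arabic_script(word):
    """True if every letter in the word has a Unicode name starting with ARABIC."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("ARABIC") for ch in letters
    )


def script_violations(tokens):
    """Tokens tagged as Arabic that were not written in Arabic script."""
    return [text for text, lang in tokens if lang == "ar" and not is_arabic_script(text)]


# Toy utterance: the fourth token is tagged Arabic but was romanised.
tokens = [
    ("the", "en"),
    ("meeting", "en"),
    ("بكرة", "ar"),
    ("inshallah", "ar"),
    ("at", "en"),
    ("nine", "en"),
]

boundaries = switch_boundaries(tokens)
violations = script_violations(tokens)
print(boundaries, violations)
```

A check like this can run on every batch before human QA, so reviewers spend their time on the judgment calls rather than on script slips.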

The Specific Bilingual Markets That Drive Demand

GCC business audio — Arabic-English code-switching in banking, government and large enterprise. Our Saudi banking AI guide goes deep on the KSA fintech side.
Indian English call centres — Hindi-English code-switching at scale. Massive volume, demanding accent and pronunciation work.
Filipino contact centres — Tagalog-English with strong regional variation.
Australian multicultural support — particularly Mandarin and Cantonese-English for the AU SE-Asian customer base.
UAE government and citizen services — Emirati Arabic, English, Hindi, Urdu, Tagalog. See the UAE government AI annotation guide.
Multilingual media and entertainment — transcription plus sentiment plus sound-event tagging for content moderation and accessibility.

Quality: Per-Language Metrics, Or You're Lying To Yourself

The biggest QA mistake on multilingual projects is reporting a single overall Word Error Rate. A 5% overall WER can hide a 25% WER on one of the languages, and a model trained on transcripts that are wrong on one word in four is going to fail in production for that language. The discipline is per-language WER, per-language DER and per-language F1 on every batch, with inter-annotator agreement measured separately per language (or per language pair on code-switched audio). The general framework is in the annotation QA playbook; the multilingual specifics are the per-language stratification.
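The same stratification applies to agreement. The sketch below uses Cohen's kappa between two annotators as the agreement statistic; that choice is our assumption for the example (other agreement statistics exist), and the labels are toy data.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_by_language(items):
    """items: iterable of (lang, label_from_annotator_a, label_from_annotator_b)."""
    grouped = defaultdict(lambda: ([], []))
    for lang, a, b in items:
        grouped[lang][0].append(a)
        grouped[lang][1].append(b)
    return {lang: cohen_kappa(a, b) for lang, (a, b) in grouped.items()}


# Toy sentiment labels on a gold set: full agreement in English, chance-level in Arabic.
items = [
    ("en", "pos", "pos"), ("en", "neg", "neg"), ("en", "pos", "pos"), ("en", "neg", "neg"),
    ("ar", "pos", "pos"), ("ar", "neg", "pos"), ("ar", "pos", "neg"), ("ar", "neg", "neg"),
]

kappas = kappa_by_language(items)
print(kappas)
```

Pooled across both languages, these eight items would show 75% raw agreement, which looks acceptable until the Arabic half is reported on its own.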

What It Costs — and Why Generic Per-Hour Rates Mislead

Multilingual audio is priced per audio hour or per utterance, with strong language and dialect premiums. Standard English is the cheap end; low-resource languages, dialect-specific work, and bilingual code-switching audio cost several times that rate because the qualified annotator pool is much smaller and the per-minute throughput is lower. Anyone quoting a flat “multilingual transcription rate” without naming the languages, the dialects and the level (verbatim vs clean) is giving you a number that won't survive contact with your actual data. Run a pilot on your hardest audio first. See our data annotation pricing guide for the broader framework.
