We get a lot of incoming briefs that read like “we need multilingual transcription, Arabic and English, 1,000 hours, what's the rate”. The honest answer to that brief is always “tell us which Arabic”. Modern Standard Arabic (MSA) spoken by a news presenter, Khaleeji business chat, Egyptian street speech in a TikTok creator's footage, and Saudi WhatsApp voice notes are not the same task. None of them is the “Arabic” that a generic transcription vendor will quote.
This guide is what we wish the speech-AI teams who land on our doorstep already knew — what multilingual audio annotation actually includes, what makes it harder than monolingual work, the tasks you'll actually run, and what a sensible scope looks like for MENA, APAC, and AU bilingual markets. None of it is rocket science. Almost all of it is missed by generalist annotation vendors.
The Tasks You'll Actually Run
Real multilingual audio projects mix several of these on the same recordings:
- Transcription — verbatim (preserves um, uh, false starts) or clean (readable, normalised). Word-level timestamps are usually the deliverable. Verbatim costs more and is what speech models actually want to train on.
- Speaker diarization — who spoke when. Two-speaker phone calls are easy; eight-speaker boardroom audio with overlap is genuinely hard and gets priced accordingly.
- Code-switch tagging — marking which language each segment is in. Essential for GCC business audio, Indian and Filipino call-centre data, and any market where bilingual speech is the norm.
- Sentiment and intent — per-utterance labels feeding conversational AI and contact-centre analytics. Hugely language-dependent — sarcasm, hyperbole and politeness conventions don't transfer.
- Sound-event tagging — non-speech audio (laughter, baby cry, ambient noise classes). Often paired with transcription for media and accessibility AI.
- Pronunciation / accent metadata — for ASR and TTS work where the model needs to handle non-native English, Singlish, Hinglish, GCC English, and so on. Tagged at the speaker level.
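Because these tasks usually land on the same recordings, it helps to keep their outputs in one record per segment. The sketch below is a minimal illustration, not a prescribed schema: the class and field names (`Speaker`, `Word`, `Segment`, `lang`, `sound_events`) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Speaker:
    speaker_id: str               # diarization label, e.g. "spk1"
    accent: Optional[str] = None  # speaker-level accent tag, e.g. "GCC English"


@dataclass
class Word:
    text: str     # token written in its own script
    start: float  # seconds from the start of the recording
    end: float
    lang: str     # code-switch tag, e.g. "ar" or "en"


@dataclass
class Segment:
    speaker: Speaker
    words: List[Word]
    sentiment: Optional[str] = None  # per-utterance label, if the project uses one
    intent: Optional[str] = None
    sound_events: List[str] = field(default_factory=list)  # e.g. ["laughter"]

    @property
    def start(self) -> float:
        return self.words[0].start

    @property
    def end(self) -> float:
        return self.words[-1].end

    def languages(self) -> List[str]:
        """Languages in order of first appearance within the segment."""
        seen: List[str] = []
        for word in self.words:
            if word.lang not in seen:
                seen.append(word.lang)
        return seen


segment = Segment(
    speaker=Speaker("spk1", accent="GCC English"),
    words=[
        Word("مرحبا", 0.00, 0.45, "ar"),
        Word("let's", 0.60, 0.85, "en"),
        Word("start", 0.85, 1.20, "en"),
    ],
    sound_events=["laughter"],
)
print(segment.languages(), segment.start, segment.end)
```

Keeping the language tag on each word rather than on the whole segment is a design choice for this sketch: it lets the same record carry word-level timestamps and code-switch boundaries without a second file.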
Why “Just Add Another Language” Falls Over
English speech AI is well-resourced — millions of hours of training data, mature evaluation benchmarks, deep pools of qualified annotators. Most other languages don't have any of that. The implication for annotation scope is concrete and underestimated:
- The annotator pool is much smaller. Finding 50 qualified Khaleeji native speakers with the patience for verbatim transcription is genuinely harder than finding 50 qualified English ones. Lead time and rate both go up.
- The reference is unstable. For English you have decades of ASR benchmarks. For Khaleeji, Sinhala, Tagalog or Hausa you may have one academic paper and a handful of model checkpoints. The gold standard has to be built from scratch on most projects.
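- The dialect spread is real. Annotators, guidelines and reference data built for MSA transcription carry over very little to Khaleeji business conversation. Mandarin transcription does not transfer to Cantonese. Hindi is not Urdu: the everyday spoken forms overlap heavily, but the transcripts use different scripts (Devanagari versus Perso-Arabic) and the formal vocabulary diverges.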
None of this means it's impossible. It means the scoping has to be honest about the language, not the “language family”. We covered the specific MSA-vs-dialect tradeoff for Arabic AI teams in the Khaleeji vs MSA dialect strategy guide.
Code-Switching: The Default Reality In Bilingual Markets
Walk into a Dubai boardroom. The CFO greets in Arabic, presents in English, slips into Arabic for an aside to a colleague, switches back to English for the next slide. That single five-minute clip needs to be transcribed in both scripts, with language tags per segment, and the speech model needs to handle the transitions without dropping a syllable. This is normal, not unusual.
The annotation guideline has to be explicit, or every annotator quietly does something different. The rules we lock down on every code-switch project:
- Script convention. Each language stays in its own script. Arabic spoken inside an English sentence is transcribed in Arabic script, not romanised.
- Switch boundaries. Mark the exact word where the language changes — the model needs this signal to learn segment boundaries.
- Borrowed words. Decide once. “Email” spoken in an Arabic sentence: is it tagged as English or as an Arabic loanword? Document the call.
- Mixed lexemes. Forms such as “Whatsapping” or “chai latte” get one fixed convention so labels are consistent across the dataset.
Get this guideline wrong and you'll re-label the dataset later. Get it right and the model handles real bilingual speech instead of the sanitised corporate version.
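Two of the rules above, script convention and switch boundaries, can be checked automatically before human review. The following is a rough sketch under stated assumptions: tokens arrive as `(text, language_tag)` pairs, only Arabic and English are in scope, and the `EXPECTED_SCRIPT` mapping and both function names are hypothetical. It relies on Unicode character names from the standard library, so it will not settle loanword decisions, which still need the documented guideline.

```python
import unicodedata
from typing import List, Tuple

Token = Tuple[str, str]  # (text, language tag)

# Assumed mapping from language tag to the prefix of the Unicode character name.
EXPECTED_SCRIPT = {"ar": "ARABIC", "en": "LATIN"}


def script_violations(tokens: List[Token]) -> List[int]:
    """Indices of tokens that contain a letter outside the script expected for their tag."""
    bad = []
    for index, (text, lang) in enumerate(tokens):
        expected = EXPECTED_SCRIPT[lang]
        for ch in text:
            # Punctuation, digits and combining marks are ignored; only letters are checked.
            if ch.isalpha() and not unicodedata.name(ch, "").startswith(expected):
                bad.append(index)
                break
    return bad


def switch_points(tokens: List[Token]) -> List[int]:
    """Indices of the first token after each language change."""
    return [i for i in range(1, len(tokens)) if tokens[i][1] != tokens[i - 1][1]]


tokens = [
    ("شكرا", "ar"),
    ("next", "en"),
    ("slide", "en"),
    ("shukran", "ar"),  # romanised Arabic: breaks the script convention
]
print(script_violations(tokens))
print(switch_points(tokens))
```

A check like this only flags candidates. A flagged token may be a legitimate borrowed word that the guideline deliberately keeps in its source script, so the output is a review queue rather than an error list.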
The Specific Bilingual Markets That Drive Demand
- GCC business audio — Arabic-English code-switching in banking, government and large enterprise. Our Saudi banking AI guide goes deep on the KSA fintech side.
- Indian English call centres — Hindi-English code-switching at scale. Massive volume, demanding accent and pronunciation work.
- Filipino contact centres — Tagalog-English with strong regional variation.
- Australian multicultural support — particularly Mandarin-English and Cantonese-English for Australia's East and South-East Asian customer base.
- UAE government and citizen services — Emirati Arabic, English, Hindi, Urdu, Tagalog. See the UAE government AI annotation guide.
- Multilingual media and entertainment — transcription plus sentiment plus sound-event tagging for content moderation and accessibility.
Quality: Per-Language Metrics, Or You're Lying To Yourself
The biggest QA mistake on multilingual projects is reporting a single overall Word Error Rate (WER). A 5% overall WER can hide a 25% WER on one of the languages when that language makes up a small share of the words, and a model trained on transcripts that noisy is going to fail in production for that language. The discipline is per-language WER, per-language diarization error rate (DER) and per-language F1 on the tagging tasks, reported for every batch, with inter-annotator agreement measured separately per language pair. The general framework is in the annotation QA playbook; the multilingual specifics are the per-language stratification.
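To make the arithmetic concrete, here is a minimal sketch of per-language WER stratification. It assumes each sample is a `(language, reference, hypothesis)` triple and that whitespace tokenisation is good enough for illustration; the function names and the toy sentences are invented for the example, and a real project would normally apply its own text normalisation first.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level edit distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_by_language(samples: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """WER per language tag, plus a pooled 'overall' figure for comparison."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for lang, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        err = word_errors(ref, hyp)
        for key in (lang, "overall"):
            errors[key] += err
            words[key] += len(ref)
    return {key: errors[key] / words[key] for key in words}


en_ref = "please send the report before the meeting starts today"
en_bad = "please send the report before the meeting start today"
samples = [("en", en_ref, en_ref)] * 3 + [
    ("en", en_ref, en_bad),
    ("ar", "أرسل التقرير قبل الاجتماع", "أرسل التقرير بعد الاجتماع"),
]
result = wer_by_language(samples)
for lang, wer in result.items():
    print(f"{lang}: {wer:.1%}")
```

In this toy batch English has 1 error in 36 words and Arabic has 1 error in 4, so the pooled figure is 2 errors in 40 words, which is 5% overall while the Arabic side sits at 25%. The same stratification pattern applies to DER and F1; only the per-sample metric changes.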
What It Costs, and Why Generic Per-Hour Rates Mislead
Multilingual audio is priced per audio hour or per utterance, with strong language and dialect premiums. Standard English is the cheap end; low-resource languages, dialect-specific work, and bilingual code-switching audio cost several times that rate because the qualified annotator pool is much smaller and the per-minute throughput is lower. Anyone quoting a flat “multilingual transcription rate” without naming the languages, the dialects and the level (verbatim vs clean) is giving you a number that won't survive contact with your actual data. Run a pilot on your hardest audio first. See our data annotation pricing guide for the broader framework.
Neel Bennett
AI Annotation Specialist at AI Taggers
Neel has over 8 years of experience in AI training data and machine learning operations. He specializes in helping enterprises build high-quality datasets for computer vision and NLP applications across healthcare, automotive, and retail industries.