Transcription for ASR & voice agents

We don’t just transcribe.
We structure your speech data for whatever comes next.

Image-prompted speech, call-like data, farmer conversations — we had to fix transcription for our own datasets first. That meant going beyond plain text and building three layers: a clean transcript, rich metadata, and a token-level map of how people actually speak.

You still get the formats you expect (generic transcripts, VAANI-style {} exports, subtitles), but under the hood everything is wired into a schema you can reuse for ASR, TTS, voice agents and future research.

New: transcription as a standalone, schema-first service

Why “plain” transcripts aren’t enough for speech & ML

Even with good vendors and style guides, most standard transcripts are designed for readability and reporting, not for ASR or voice agents. As soon as you try to use them for model training or evaluation across Indian languages and dialects, well-known cracks appear: dialects are flattened, code-switching is cleaned up, and noise is barely labelled.

Dialects & code-switching

Language labels stopped at “Hindi” or “English”

Bhojpuri-influenced Hindi, Maithili, Assamese Hindi, Hinglish — transcripts usually flattened them into one bucket. For captions that’s okay; for training and evaluation, we couldn’t see how models behaved on the dialects that actually matter.

Mixed speech

Code-switching was “cleaned up” for readability

English loans and switches were often normalised away or forced into one script. Great for nice-looking transcripts, but it removed exactly the signal we needed to test how models handle real Hinglish and mixed speech.

Real-world noise

Noise and overlaps weren’t consistently labelled

TV, children, traffic, cross-talk — most pipelines ignored them or used a few free-text notes. When a model failed, we couldn’t tell whether it was the language, the environment, or both.

Learning over time

QA didn’t feed back into the data model

Quality checks were one-shot: sample, correct, close project. We wanted something different — a schema and guideline set that gets sharper as we see more edge cases across projects and languages.

The three layers our transcription stack can produce

We still give you the transcript file you asked for. But behind it, your audio lives in a three-layer structure that you can plug into ASR, voice agents, TTS or analytics without starting from scratch next year.

Canonical transcript

Layer 1: ML-ready text per utterance

One clean transcript string per utterance, using consistent spelling and the target tokens your models should learn (e.g. a single canonical try rather than twenty variant spellings). This is the base layer for almost every project.

  • Standardised orthography per language where applicable
  • Light inline tags only where really needed
  • Ready for WER, intent, NLU evaluation
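To make the "ready for WER" point concrete, here is a minimal sketch of a word error rate computation over a canonical transcript and an ASR hypothesis. The strings and the `wer` function are illustrative examples, not our production tooling:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace tokens, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Canonical spellings make the comparison fair: one substitution out of seven words.
print(wer("मैं कल से ट्राई कर रहा हूँ", "मैं कल से टराई कर रहा हूँ"))
```

Because the reference uses one canonical spelling per token, a model is not penalised for failing to reproduce an arbitrary variant.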

Utterance metadata

Layer 2: Who, where and how it was spoken

When you need to go beyond a single “Hindi” label, we add per-utterance fields for language mix, dialect, region, device, channel and noise. This is what lets you slice performance by “Bihar Hindi + TV noise” instead of just “Hindi”.

  • Configurable language(s) & dialect tags
  • Optional device, channel, environment and noise bands
  • Speaker and scenario attributes where available and in scope
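The "slice by Bihar Hindi + TV noise" idea above can be sketched with plain records. The field names follow the bullets here and the values are hypothetical, not real project data:

```python
# Hypothetical per-utterance metadata records.
utterances = [
    {"id": "u1", "dialect": "hi_Bihar", "noise_tags": ["tv"], "langs_present": ["hi", "en"]},
    {"id": "u2", "dialect": "hi_Delhi", "noise_tags": [], "langs_present": ["hi"]},
    {"id": "u3", "dialect": "hi_Bihar", "noise_tags": ["tv", "children"], "langs_present": ["hi"]},
]

def slice_by(utts, dialect=None, noise_tag=None):
    """Filter utterances by optional dialect and noise-tag criteria."""
    out = []
    for u in utts:
        if dialect is not None and u["dialect"] != dialect:
            continue
        if noise_tag is not None and noise_tag not in u["noise_tags"]:
            continue
        out.append(u)
    return out

# "Bihar Hindi + TV noise" instead of just "Hindi":
bihar_tv = slice_by(utterances, dialect="hi_Bihar", noise_tag="tv")
print([u["id"] for u in bihar_tv])  # ['u1', 'u3']
```

The same filter pattern works for any combination of fields, which is what turns one dataset into many evaluation slices.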

Token-level map

Layer 3: How each word is written and understood

For projects that need it, we build a token-level table that keeps both the surface form and the canonical form: टराई → ट्राई → {try}, सीस्टम → सिस्टम → {system}, and so on. This is the layer that unlocks rich lexicons, VAANI-style exports and detailed evaluation slices.

  • One row per token (or at least per code-switched token)
  • Fields for surface_native, norm_native, lemma_en and lang per token
  • Lets you regenerate VAANI-style native {english}, build lexicons and create targeted evaluation subsets later
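As an illustration of regenerating VAANI-style native {english} text from token rows, here is a minimal sketch. The token fields mirror the dataset view later on this page; the `to_vaani` function itself is a hypothetical example, not our export pipeline:

```python
# Hypothetical token rows: surface form, canonical form, language, English lemma.
tokens = [
    {"surface_native": "सर", "lang": "hi", "norm_native": "सर", "lemma_en": None},
    {"surface_native": "टराई", "lang": "en", "norm_native": "ट्राई", "lemma_en": "try"},
    {"surface_native": "सीस्टम", "lang": "en", "norm_native": "सिस्टम", "lemma_en": "system"},
]

def to_vaani(tokens):
    """Render canonical native text, appending {english} braces after code-switched tokens."""
    parts = []
    for t in tokens:
        if t["lang"] == "en" and t["lemma_en"]:
            parts.append(f'{t["norm_native"]} {{{t["lemma_en"]}}}')
        else:
            parts.append(t["norm_native"])
    return " ".join(parts)

print(to_vaani(tokens))  # सर ट्राई {try} सिस्टम {system}
```

Because the braces view is generated from the token table rather than stored as text, the same rows can also feed lexicons or evaluation subsets.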

One schema, many ways to use the same audio

Different teams want different views: generic transcripts, VAANI formats, ASR training text, lexicons, evaluation sets, analytics. The three-layer schema lets you serve all of them without recollecting or re-annotating audio.

🎧 ASR & voice-bot transcripts
🧩 VAANI-style native {lemma} exports
📚 Lexicons & normalisation rules
📊 Dialect & noise evaluation slices
🔍 Analytics & speech insights

Designing the token schema now costs a little. It gives you optionality later: new products, new exports and better evaluations without touching the raw audio again.

How a transcription project runs with Data Taskers

We treat each project as part of your speech data pipeline, not a one-off file drop. The same stack we use for our own datasets runs behind your work too.

01

Ingest audio & specs

You share audio, languages, target use cases (ASR, agent, TTS) and any required formats (e.g. VAANI-style braces, subtitle files, JSON).

02

Design guidelines & schema

We lock the three-layer schema for your project and adapt guidelines for code-switching, dialects, noise tags and symbols.

03

Human-in-loop transcription & QA

Trained taskers work in our tooling; internal reviewers and language leads run calibration, spot checks and re-reviews.

04

Delivery & views

You get the immediate deliverable you asked for plus the structured layers for future models, evaluation and analysis.

Who this transcription layer is built for

If all you need is one-off meeting notes, we’re probably overkill. If you’re building or evaluating speech tech for messy, multilingual realities, this is for you.

ASR & voice teams

Model builders & voice AI teams

Teams training ASR, voice agents or TTS for India and the Global South who need dialect-aware, code-switched, noisy data — structured properly from day one.

Corpora & research

Corpus, evaluation & research groups

Universities, labs and consortia building national corpora, benchmarks and public datasets who care about token-level richness, not just a single transcript file.

Enterprises & govt

Enterprises & public programmes

Banks, telcos, agriculture, citizen helplines and government AI initiatives that want one speech dataset that can serve multiple internal teams over time.

What changes when you have three layers instead of one file

The text might look similar at first glance. The difference is that with the token schema, you can slice by dialect, export VAANI-style braces, or train a bot — all from the same base.

Before

Plain transcript

"सर मैं कल से ट्राई कर रहा हूँ बट सिस्टम वर्किंग नहीं है, आप प्लीज़ चेक कर दीजिए। टीवी की आवाज़ और बच्चों की बातें आ रही हैं। कॉल ड्रॉप हो गया, हेलो?... हेलो?"

After

Data Taskers dataset view

transcript_norm: "सर मैं कल से ट्राई कर रहा हूँ, बट सिस्टम वर्किंग नहीं है, आप प्लीज़ चेक कर दीजिए। …"

metadata: {
  langs_present: ["hi", "en"],
  dialect: "hi_Bihar",
  noise_tags: ["tv", "children"],
  device: "android_phone"
}

tokens: [
  { surface_native: "टराई", lemma_en: "try", lang: "en", norm_native: "ट्राई" },
  { surface_native: "सीस्टम", lemma_en: "system", lang: "en", norm_native: "सिस्टम" },
  …
]

Tell us what you’re building and we’ll suggest a schema

Share a little about your audio, languages and downstream uses. We’ll respond with a suggested three-layer schema, a rough effort estimate and a pilot plan.