Image-prompted speech, call-like data, farmer conversations — we had to fix transcription for our own datasets first. That meant going beyond plain text and building three layers: a clean transcript, rich metadata, and a token-level map of how people actually speak.
You still get the formats you expect (generic transcripts, VAANI-style brace exports, subtitles), but under the hood everything is wired into a schema you can reuse for ASR, TTS, voice agents and future research.
Even with good vendors and style guides, most standard transcripts are designed for readability and reporting, not for ASR or voice agents. As soon as you try to use them for model training or evaluation across Indian languages and dialects, well-known cracks appear: dialects are flattened, code-switching is cleaned up, and noise is barely labelled.
Bhojpuri-influenced Hindi, Maithili, Assamese Hindi, Hinglish — transcripts usually flattened them into one bucket. For captions that’s okay; for training and evaluation, we couldn’t see how models behaved on the dialects that actually matter.
English loans and switches were often normalised away or forced into one script. Great for nice-looking transcripts, but it removed exactly the signal we needed to test how models handle real Hinglish and mixed speech.
TV, children, traffic, cross-talk — most pipelines ignored them or used a few free-text notes. When a model failed, we couldn’t tell whether it was the language, the environment, or both.
Quality checks were one-shot: sample, correct, close project. We wanted something different — a schema and guideline set that gets sharper as we see more edge cases across projects and languages.
We still give you the transcript file you asked for. But behind it, your audio lives in a three-layer structure that you can plug into ASR, voice agents, TTS or analytics without starting from scratch next year.
One clean transcript string per utterance, using consistent spelling and the target tokens your models should learn (e.g. try rather than twenty variant spellings). This is the base layer for almost every project.
When you need to go beyond a single “Hindi” label, we add per-utterance fields for language mix, dialect, region, device, channel and noise. This is what lets you slice performance by “Bihar Hindi + TV noise” instead of just “Hindi”.
For projects that need it, we build a token-level table that keeps both the surface form and the canonical form: टराई → ट्राई → {try}, सीस्टम → सिस्टम → {system}, and so on. This is the layer that unlocks rich lexicons, VAANI-style exports and detailed evaluation slices.
With the token layer in place, you can map loans back to their native {english} forms, build lexicons and create targeted evaluation subsets later.
Different teams want different views: generic transcripts, VAANI formats, ASR training text, lexicons, evaluation sets, analytics. The three-layer schema lets you serve all of them without recollecting or re-annotating audio.
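As a sketch of how one token layer can serve several views, the helpers below derive a plain ASR training string and a VAANI-style brace export from the same token tuples. The {word} brace convention and the tuple layout are assumptions for illustration:

```python
def asr_text(tokens):
    """Plain training text: canonical forms only."""
    return " ".join(canonical for _, canonical, _ in tokens)

def vaani_export(tokens):
    """VAANI-style view: English loans wrapped in braces (assumed convention)."""
    parts = []
    for surface, canonical, english in tokens:
        parts.append("{" + english + "}" if english else canonical)
    return " ".join(parts)

# (surface form, canonical form, English loan target or None)
tokens = [
    ("सीस्टम", "सिस्टम", "system"),
    ("को", "को", None),
    ("टराई", "ट्राई", "try"),
    ("करो", "करो", None),
]

print(asr_text(tokens))      # सिस्टम को ट्राई करो
print(vaani_export(tokens))  # {system} को {try} करो
```

Each export format becomes a small projection function over the same data, rather than a separate annotation pass.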
Designing the token schema now costs a little. It gives you optionality later: new products, new exports and better evaluations without touching the raw audio again.
We treat each project as part of your speech data pipeline, not a one-off file drop. The same stack we use for our own datasets runs behind your work too.
You share audio, languages, target use cases (ASR, agent, TTS) and any required formats (e.g. VAANI-style braces, subtitle files, JSON).
We lock the three-layer schema for your project and adapt guidelines for code-switching, dialects, noise tags and symbols.
Trained taskers work in our tooling; internal reviewers and language leads run calibration, spot checks and re-reviews.
You get the immediate deliverable you asked for plus the structured layers for future models, evaluation and analysis.
If all you need is one-off meeting notes, we’re probably overkill. If you’re building or evaluating speech tech for messy, multilingual realities, this is for you.
Teams training ASR, voice agents or TTS for India and the Global South who need dialect-aware, code-switched, noisy data — structured properly from day one.
Universities, labs and consortia building national corpora, benchmarks and public datasets who care about token-level richness, not just a single transcript file.
Banks, telcos, agriculture, citizen helplines and government AI initiatives that want one speech dataset that can serve multiple internal teams over time.
The text might look similar at first glance. The difference is that with the token schema, you can slice by dialect, export VAANI-style braces, or train a bot — all from the same base.
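A minimal example of the slicing the metadata layer enables, using hypothetical `dialect` and `noise` fields on each utterance:

```python
utterances = [
    {"id": 1, "dialect": "Bihar Hindi", "noise": ["tv"]},
    {"id": 2, "dialect": "Hinglish",    "noise": []},
    {"id": 3, "dialect": "Bihar Hindi", "noise": ["traffic"]},
]

# Build the "Bihar Hindi + TV noise" evaluation slice
eval_slice = [u for u in utterances
              if u["dialect"] == "Bihar Hindi" and "tv" in u["noise"]]
print([u["id"] for u in eval_slice])  # [1]
```

Without the metadata layer, all three records would just be "Hindi" and this slice could not be built after the fact.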
Share a little about your audio, languages and downstream uses. We’ll respond with a suggested three-layer schema, a rough effort estimate and a pilot plan.