
Accurate datasets
for Indian Languages
Josh Talks has the largest natural conversational voice dataset
covering 120,000+ hours across 25 Indian languages

Our Story
At Josh Talks, we are driven by the goal of reducing the word error rate (WER) for Indian languages to under 5%. We arm ASR teams with the highest-accuracy datasets for Indian languages at scale.
With 10+ years of experience working with Indian languages and reach across all 19,000+ pin codes in India, Josh Talks' portfolio of products engages 200 million unique people every month. Our proprietary transcription process sets the industry benchmark, consistently delivering transcriptions with over 96% accuracy.
Why choose Josh Talks?

Diversity & Representation
Our diverse network of contributors ensures comprehensive and accurate representation of people, capturing a wide range of real-world perspectives.

Scale
Our scale allows us to deliver data at an unmatched level, meeting even the most demanding requirements.

Indic Languages
Comprehensive coverage of Indian languages, offering diverse datasets that accurately reflect the linguistic richness of the region.

Ethical Sourcing
Data sourced with integrity, ensuring privacy and ethical standards are upheld at every stage.

Quality
Our rigorous standards ensure data of the highest quality, every time.
Data Scarcity to Abundance
Our patented means of data production and annotation allows us to generate and label 5 million hours of voice data every year.
Indic Language Datasets
Train your AI models using data from the grassroots of India through Josh Talks’ network of Training Data Specialists

Multi-lingual Datasets

Multilingual and code-switching Voice Datasets

Low Resource, Rare Language and Dialect Voice Datasets

Accented English Speech Dataset

Voice Datasets in English, Hindi, Tamil, Marathi, Telugu, Bangla, Kannada, Malayalam, Punjabi, Oriya, Gujarati, and Assamese

Multi-Speaker Datasets

Multiple Speaker spontaneous conversations

Diarized speaker stems for up to 16 speakers

Multi-Speaker debate datasets

Emotion Rich Data

Emotionally Aware and Annotated Conversations

Datasets covering 9 emotions - Neutral, Angry, Happy, Sad, Fear, Anxious, Surprised, Confused, Excited

Other Datasets

Noisy and Adverse Environment Datasets

Privacy Preserving Highly Personalized Voice Datasets

Voice Datasets for Accessibility

Child Speech Datasets
Data For India, From India
Through our portfolio of products, we reach 200 million people across India, including the most remote villages, covering all districts and every one of the 19,000+ pin codes. This extensive reach offers unparalleled diversity and scale in data production, essential for driving AI progress in India.
Customers across our portfolio of products