Background

Accurate datasets for Indian Languages

Josh Talks has the largest natural conversational voice dataset, covering 120,000+ hours across 25 Indian languages.

Our Story

At Josh Talks, we are driven by the goal of reducing the word error rate (WER) for Indian languages to under 5%. We equip ASR teams with the highest-accuracy datasets for Indian languages at scale.

With 10+ years of experience working with Indian languages and reach across all 19,000+ pin codes in India, Josh Talks' portfolio of products engages 200 million unique people every month. Our proprietary transcription process sets the industry benchmark, consistently delivering transcriptions with over 96% accuracy.

Why choose Josh Talks?

Diversity & Representation

Our diverse network of contributors ensures comprehensive and accurate representation of people, capturing a wide range of real-world perspectives.

Scale

Our scale allows us to deliver data at an unmatched level, meeting even the most demanding requirements.

Indic Languages

Comprehensive coverage of Indian languages, offering diverse datasets that accurately reflect the linguistic richness of the region.

Ethical Sourcing

Data sourced with integrity, ensuring privacy and ethical standards are upheld at every stage.

Quality

Our rigorous standards ensure data of the highest quality, every time.

Data Scarcity to Abundance

Our patented data production and annotation process allows us to generate and label 5 million hours of voice data every year.

Indic Language Datasets

Train your AI models on data from the grassroots of India, collected through Josh Talks’ network of Training Data Specialists.

Multi-lingual Datasets

Multilingual and code-switching Voice Datasets

Low Resource, Rare Language and Dialect Voice Datasets

Accented English Speech Dataset

Voice Datasets in English, Hindi, Tamil, Marathi, Telugu, Bangla, Kannada, Malayalam, Punjabi, Oriya, Gujarati, Assamese

Multi-Speaker Datasets

Multi-speaker spontaneous conversations

Diarized speaker stems for up to 16 speakers

Multi-Speaker debate datasets

Emotion Rich Data

Emotionally Aware and Annotated Conversations

Datasets covering 9 emotions: Neutral, Angry, Happy, Sad, Fear, Anxious, Surprised, Confused, Excited

Other Datasets

Noisy and Adverse Environment Datasets

Privacy Preserving Highly Personalized Voice Datasets

Voice Datasets for Accessibility

Child Speech Datasets

Data For India, From India

Through our portfolio of products, we reach 200 million people across India, including the most remote villages, covering all districts and every one of the 19,000+ pin codes. This extensive reach offers unparalleled diversity and scale in data production, essential for driving AI progress in India.

Customers across our portfolio of products