Indic Vision and Language
Datasets
Improve model performance with our off-the-shelf datasets.
Language: All (14), Hindi (1), English (1), Bengali (1), Telugu (1), Haryanvi (1), Bodo (1), Bhojpuri (1), Malayalam (1), Punjabi (1), Maithili (1)
Data Type: All (14), Conversational (10), Vision (4)
Sample Rate: All (14), 48 kHz (14)
STEM Visual Reasoning Dataset
STEM visual reasoning
Off-The-Shelf
Our STEM Reasoning Dataset is a curated collection of image-based and diagram-driven tasks spanning Physics, Chemistry, and Mathematics. This dataset is designed to train and evaluate a model’s ability to perform logical deduction, symbolic matching, and visual-spatial inference across scientific disciplines.
Indian Cultural Practices in Motion (Videos) Dataset
Indian cultural video
Off-The-Shelf
Our Indian Rituals Video Dataset offers a rich collection of videos capturing live rituals and culturally significant actions from across India. This dataset is built to benchmark multimodal understanding and enable models to ground temporal reasoning within diverse cultural contexts.
Indian Cultural Scenes and Ritual Objects Image Dataset
Indian cultural image
Off-The-Shelf
Our Cultural Imagery Dataset offers a diverse and visually rich collection of still photographs capturing the essence of India’s cultural identity. This dataset features images representing festivals, traditional attire, rituals, religious symbols, and everyday cultural objects from across the country. It is curated to support applications in cultural grounding, fine-grained image classification, and multilingual captioning.
Indian Clinical Imaging and Health Visuals
Clinical Image
Off-The-Shelf
Our Medical Health Imagery Dataset is a curated collection of anonymized medical images that reflect real-world clinical contexts from across India. This dataset is designed to support the development and benchmarking of AI-powered diagnosis-support systems and vision-language models in the healthcare domain.
10,004 Hours - Maithili - Conversational Audio Dataset
Maithili
Conversational Audio
Off-The-Shelf
Our Conversational Data in Maithili offers comprehensive and authentic dialogues of Indians conversing in Maithili. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
11,472 Hours - Haryanvi - Conversational Audio Dataset
Haryanvi
Conversational Audio
Off-The-Shelf
Our Conversational Data in Haryanvi offers comprehensive and authentic dialogues of Indians conversing in Haryanvi. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
10,012 Hours - Bodo - Conversational Audio Dataset
Bodo
Conversational Audio
Off-The-Shelf
Our Conversational Data in Bodo offers comprehensive and authentic dialogues of Indians conversing in Bodo. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
11,031 Hours - Bhojpuri - Conversational Audio Dataset
Bhojpuri
Conversational Audio
Off-The-Shelf
Our Conversational Data in Bhojpuri offers comprehensive and authentic dialogues of Indians conversing in Bhojpuri. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
10,616 Hours - Bengali - Conversational Audio Dataset
Bengali
Conversational Audio
Off-The-Shelf
Our Conversational Data in Bengali offers comprehensive and authentic dialogues of Indians conversing in Bengali. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
11,001 Hours - Malayalam - Conversational Audio Dataset
Malayalam
Conversational Audio
Off-The-Shelf
Our Conversational Data in Malayalam offers comprehensive and authentic dialogues of Indians conversing in Malayalam. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.
The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.