Indic Vision and Language Datasets

Improve model performance with our off-the-shelf datasets.

Language: All (14), Hindi (1), English (1), Bengali (1), Telugu (1), Haryanvi (1), Bodo (1), Bhojpuri (1), Malayalam (1), Punjabi (1), Maithili (1)

Data Type: All (14), Conversational (10), Vision (4)

Sample Rate: All (14), 48 kHz (14)

STEM Visual Reasoning Dataset

STEM visual reasoning

Off-The-Shelf

Our STEM Reasoning Dataset is a curated collection of image-based and diagram-driven tasks spanning Physics, Chemistry, and Mathematics. This dataset is designed to train and evaluate a model’s ability to perform logical deduction, symbolic matching, and visual-spatial inference across scientific disciplines.

Indian Cultural Practices in Motion (Videos) Dataset

Indian cultural video

Off-The-Shelf

Our Indian Rituals Video Dataset offers a rich collection of videos capturing live rituals and culturally significant actions from across India. This dataset is built to benchmark multimodal understanding and enable models to ground temporal reasoning within diverse cultural contexts.

Indian Cultural Scenes and Ritual Objects Image Dataset

Indian cultural image

Off-The-Shelf

Our Cultural Imagery Dataset offers a diverse and visually rich collection of still photographs capturing the essence of India’s cultural identity. This dataset features images representing festivals, traditional attire, rituals, religious symbols, and everyday cultural objects from across the country. It is curated to support applications in cultural grounding, fine-grained image classification, and multilingual captioning.

Indian Clinical Imaging and Health Visuals

Clinical Image

Off-The-Shelf

Our Medical Health Imagery Dataset is a curated collection of anonymized medical images that reflect real-world clinical contexts from across India. This dataset is designed to support the development and benchmarking of AI-powered diagnosis-support systems and vision-language models in the healthcare domain.

10,004 Hours - Maithili - Conversational Audio Dataset

Maithili

Conversational Audio

Off-The-Shelf

Our Conversational Data in Maithili offers comprehensive and authentic dialogues of Indians conversing in Maithili. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.

11,472 Hours - Haryanvi - Conversational Audio Dataset

Haryanvi

Conversational Audio

Off-The-Shelf

Our Conversational Data in Haryanvi offers comprehensive and authentic dialogues of Indians conversing in Haryanvi. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.

10,012 Hours - Bodo - Conversational Audio Dataset

Bodo

Conversational Audio

Off-The-Shelf

Our Conversational Data in Bodo offers comprehensive and authentic dialogues of Indians conversing in Bodo. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.

11,031 Hours - Bhojpuri - Conversational Audio Dataset

Bhojpuri

Conversational Audio

Off-The-Shelf

Our Conversational Data in Bhojpuri offers comprehensive and authentic dialogues of Indians conversing in Bhojpuri. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.

10,616 Hours - Bengali - Conversational Audio Dataset

Bengali

Conversational Audio

Off-The-Shelf

Our Conversational Data in Bengali offers comprehensive and authentic dialogues of Indians conversing in Bengali. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.

11,001 Hours - Malayalam - Conversational Audio Dataset

Malayalam

Conversational Audio

Off-The-Shelf

Our Conversational Data in Malayalam offers comprehensive and authentic dialogues of Indians conversing in Malayalam. This dataset features conversations that span a wide range of topics, including daily life, business, education, and more. It includes diverse speakers from different regions of India, capturing various accents and dialects to provide a rich linguistic resource.

The data is collected from natural, spontaneous conversations to ensure authenticity, and each conversation is accurately transcribed with annotations for contextual understanding. Additionally, we offer the flexibility to tailor the topics, conversations, and scenarios according to the specific needs of your company, ensuring that the dataset aligns perfectly with your requirements.
