IndiaVLUE v0.9 PREVIEW (beta)
India Vision Language Understanding Evaluation
Introducing IndiaVLUE – A Benchmark for Cultural Understanding
IndiaVLUE is a comprehensive multimodal benchmark focused on the Indian cultural context. It goes beyond generic vision-language tasks to test whether AI models truly “see India” – understanding India’s people, traditions, symbols, and multilingual nuances in images. Recent studies have shown that even state-of-the-art vision-language models struggle with culture-specific content, excelling on everyday images but stumbling on region-specific scenarios.
The explosion of vision-language models (VLMs) exposes two India-specific evaluation gaps:
- Benchmark contamination. Public Indian images (e.g., Taj Mahal) may have leaked into training, inflating scores.
- Shallow accuracy metrics. Pure right/wrong scoring on generic datasets hides whether a model actually understands Indian context (e.g., why a bride wears red).
IndiaVLUE addresses these gaps by providing culturally rich visual tasks drawn from real Indian settings. The goal is to spur development of AI that is culturally aware and inclusive, much as benchmarks like GLUE did for language understanding. In summary, IndiaVLUE represents a significant step toward evaluating and improving AI models in diverse, real-world settings. It challenges models with everything from basic perception to complex “why” reasoning, revealing that while today’s top models are powerful, they are not infallible: they come close to human level on some tasks yet falter on others requiring deep cultural insight or precise grounding.
Dataset Collection
The IndiaVLUE dataset was created through a careful process emphasizing quality and cultural authenticity. A team of human annotators from across India – fluent in local languages and cultures – curated the visual prompts and answers to ensure they are accurate and culturally appropriate. We followed strict ethical guidelines: avoiding derogatory stereotypes, respecting privacy, and removing any potentially sensitive or harmful content. To include diverse regional contexts often missing from internet data, a participatory approach was used: community contributors helped create realistic questions and answers reflecting both well-known and lesser-known local traditions. The result is a one-of-a-kind dataset spanning all Indian regions and major languages, assembled with respect for the people and cultures portrayed. In total, IndiaVLUE comprises ~183 visual tasks, each accompanied by queries in English and detailed reference answers. By benchmarking on IndiaVLUE, researchers can be confident in the data’s integrity and its alignment with real-world Indian scenarios.
| Region-wise Split | Images |
|---|---|
| Central | 21 |
| East | 22 |
| North | 35 |
| North-East | 28 |
| South | 24 |
| West | 23 |
| Pan-India | 30 |
| Total | 183 |
Tasks Overview
IndiaVLUE consists of six complementary tasks, each targeting a different aspect of visual-language understanding in an Indian context. By evaluating models across these diverse tasks, we obtain a multi-faceted view of their abilities – from basic recognition and description to higher-level reasoning and prediction. The tasks were designed to cover both vision-centric skills (object recognition, spatial reasoning) and knowledge-centric reasoning (commonsense and cultural logic), simulating real-world AI use cases in India. All tasks are single-turn with a single image per query. The evaluation is primarily zero-shot: models are tested on these tasks without further fine-tuning, to gauge out-of-the-box performance on culturally novel inputs.
A brief overview of each task and its evaluation metric:
- Visual Question Answering (VQA) – Open-ended questions about an image’s content. Models must identify or describe visual elements or make simple inferences (e.g. “What is the man holding?”). Metric: Accuracy (% of questions answered correctly).
- Image Captioning – Generating a descriptive caption for an image. Often there are multiple salient objects or actions; a good caption should mention the important elements and their relations. Metric: CIDEr score (measures similarity to reference captions; higher is better). A minimal CIDEr scoring sketch appears after this list.
- Referring Expression Grounding – Identifying an object or person in the image given a referring phrase. The model might be asked to, for example, “Identify the woman in a red sari” or “Where is the stop sign?”. Metric: Grounding Accuracy (% of queries where the model correctly identifies or locates the referred entity).
- Cultural Knowledge QA – Answering questions that require cultural or commonsense knowledge beyond what is directly visible. These are often factual queries about the scene’s cultural elements. For example, “What is the traditional lamp called?” or “What is the mark on her forehead called?” Metric: Accuracy (% correct). (This tests a model’s knowledge of Indian cultural terms and practices.)
- Causal/“Why” Reasoning – Answering why or how questions about the image, requiring the model to explain motivations or context. These prompts probe deeper commonsense reasoning – often culturally inflected (e.g. “Why are these people throwing colored powder on each other?”). Metric: Accuracy (% of explanations that correctly match the reference reasoning). (Each “why” question has an expected reference answer.)
- Next-Step Prediction – Inferring likely future actions or events from an image. For example, “What will the person do next?” if an image shows someone blowing out candles. Models must use context and common sense to continue the scene. Metric: Accuracy (% of predictions matching the ground-truth outcome). Lower scores indicate difficulty with temporal or causal inference.
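Since captioning is scored with CIDEr, the following is a minimal scoring sketch. It assumes the widely used `pycocoevalcap` package and invented captions; the benchmark’s actual evaluation harness may differ.

```python
# Minimal CIDEr scoring sketch for the captioning task.
# Assumes the open-source pycocoevalcap package; captions are invented for illustration.
from pycocoevalcap.cider.cider import Cider

# Map each image id to its reference caption(s) and the model's candidate caption.
references = {
    "img_001": ["a woman lights a brass diya during a festival",
                "a lady lighting an oil lamp at a celebration"],
    "img_002": ["children flying kites on a rooftop during makar sankranti"],
}
candidates = {
    "img_001": ["a woman lighting an oil lamp"],
    "img_002": ["children flying kites from a rooftop"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")  # higher means closer to the references
```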
Each task has its own input-output format and evaluation criteria, as detailed above. Importantly, the Composite Achievement Score (CAS) is defined as an aggregate of all task performances – a single overall measure (the average of task scores) used for ranking models; a short sketch follows below. Higher CAS means better all-around performance.
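As a concrete illustration, the sketch below computes CAS as a plain unweighted average of per-task scores. The numbers are the Gemini 2.5 row from the leaderboard later in this report; equal weighting of tasks is an assumption consistent with the description above.

```python
# CAS as the unweighted mean of per-task scores (Gemini 2.5 row of the leaderboard).
task_scores = {
    "vqa": 62.1,
    "cultural_commonsense": 65.0,
    "captioning": 54.6,
    "why_qa": 59.0,
}

cas = sum(task_scores.values()) / len(task_scores)
print(f"CAS = {cas:.3f}")  # -> 60.175, matching the leaderboard
```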
Evaluation Methodology
Models are evaluated using a rigorous dual-assessment approach, combining automated metrics with LLM-based judgment (standing in for human review) for robust scoring. Each model’s output is assessed in two ways:
- Free-form Answer Accuracy: For tasks with a clear correct answer (VQA, cultural questions, etc.), we use a jury of two Large Language Model (LLM) judges to decide if the model’s response is correct or incorrect. Each judge independently reads the model’s answer and the ground-truth “ideal” answer, then votes Yes/No on whether the model’s answer reaches the same conclusion. The majority vote determines correctness for that query. This mimics a human evaluation while reducing bias (no single judge’s quirks dominate). Metric: Accuracy (0-1), the fraction of queries answered correctly by majority vote; 1.0 means every answer matched the expected result and 0.0 means all answers were wrong. This free-form evaluation is a strict measure – partial credit is not given; an answer must be entirely correct to count. (See the scoring sketch after this list.)
- Rubric-based Partial Scoring: To get finer insight into answer quality, especially for complex prompts, IndiaVLUE employs rubric-based evaluation. For each task, a set of yes/no sub-questions (a rubric) is defined to check key aspects of an ideal answer. For example, a “why” question’s rubric might include: (1) Did the answer correctly identify the festival? (2) Did it mention the reason for the celebration? Each model response is run through the LLM jury on each rubric question, yielding a score for each condition (Yes = 1, No = 0). We then compute the model’s Task Accuracy as the proportion of rubric checks satisfied for that prompt, and average these across all prompts to get an Average Task Accuracy (ATA), as sketched below. Metric: Average Task Accuracy, in [0,1], where higher means the model’s answers meet more of the correctness criteria on average. This provides credit for being “partially correct” – e.g. a model might identify what’s happening in an image but miss the cultural term, earning 0.5 ATA for fulfilling one of two criteria.
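To make both scoring modes concrete, here is a minimal sketch. The `Judge` callable is hypothetical (any LLM API that answers a prompt with “Yes” or “No” would do); the actual judging prompts and APIs used by IndiaVLUE are not reproduced here.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: a callable that sends a prompt to an LLM
# and returns "Yes" or "No".
Judge = Callable[[str], str]

def _majority_yes(prompt: str, judges: Sequence[Judge]) -> bool:
    """Run one Yes/No check past every judge and take the majority vote.
    With two judges this amounts to both agreeing."""
    votes = [j(prompt).strip().lower().startswith("yes") for j in judges]
    return sum(votes) > len(votes) / 2

def freeform_correct(answer: str, reference: str, judges: Sequence[Judge]) -> bool:
    """Binary free-form accuracy: does the answer reach the same conclusion
    as the ground-truth reference? No partial credit."""
    prompt = (f"Reference answer: {reference}\n"
              f"Model answer: {answer}\n"
              "Does the model answer reach the same conclusion? Reply Yes or No.")
    return _majority_yes(prompt, judges)

def rubric_score(answer: str, rubric_checks: Sequence[str], judges: Sequence[Judge]) -> float:
    """Rubric-based partial credit: fraction of yes/no rubric checks satisfied."""
    satisfied = sum(
        _majority_yes(f"Model answer: {answer}\nCheck: {check}\nReply Yes or No.", judges)
        for check in rubric_checks
    )
    return satisfied / len(rubric_checks)

def average_task_accuracy(per_prompt_scores: Sequence[float]) -> float:
    """ATA: mean rubric score over all prompts in a task."""
    return sum(per_prompt_scores) / len(per_prompt_scores)
```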
In practice, we observe that advanced models sometimes achieve similar binary accuracy, but rubric scores differentiate how completely they satisfy all aspects of the question. For instance, a model might get the right final answer (counting as correct) but omit part of the reasoning – the rubric would flag that difference. We report both metrics in the evaluation report. (Note: Image captioning is evaluated purely by automatic metrics like CIDEr, and referring expressions by ground-truth matching, but these too can be complemented by rubric checks for attributes like detail or localization correctness.)
All evaluations are repeated with three different LLM judges (OpenAI GPT-4o’s text judge, Anthropic Claude-Vision, and Google Gemini judges) for robustness. The results are averaged, and we report mean scores (with negligible variance observed among judges). This jury system and repetition ensure that no single judging bias or error unduly affects a model’s score. In addition, a small human evaluation was conducted to establish an upper bound: human volunteers answered a subset of tasks under similar conditions (with time limits and only image inputs). Their performance provides a reference point – currently, humans still outperform even the best model on IndiaVLUE, especially on cultural reasoning and fine-grained perception.
IndiaVLUE Leaderboard
The table below presents leading model results on each task and the overall Composite Achievement Score (CAS), which averages a model’s accuracy across the visual QA, cultural commonsense, captioning, and why-question reasoning tasks. Each output was graded on a 1–10 scale, where 10 means “completely correct” and 1 means “totally wrong.” After scoring, we linearly stretched that 1–10 range to 1–100 for easy comparison; higher numbers are always better. VQA, Cultural Commonsense, and Why-QA answers earn partial credit through word overlap with the reference, while caption scores come from our cultural-awareness and grounding rubric. A sketch of the rescaling and overlap credit follows below.
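The report does not spell out the exact formulas; the sketch below assumes an affine map of the 1–10 grade onto 1–100 and a simple token-overlap ratio for the word-overlap partial credit.

```python
def rescale_1_10_to_1_100(grade: float) -> float:
    """Assumed affine stretch of a 1-10 grade onto 1-100 (1 -> 1, 10 -> 100)."""
    return 1 + (grade - 1) * (99 / 9)

def word_overlap_credit(answer: str, reference: str) -> float:
    """Assumed word-overlap partial credit: fraction of reference tokens
    that also appear in the model's answer."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0

print(rescale_1_10_to_1_100(6.0))   # -> 56.0
print(word_overlap_credit("they celebrate holi with colours",
                          "people throw colours to celebrate holi"))  # -> 0.5
```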
Highlights:
As of now, Gemini 2.5 tops the leaderboard with a CAS of 60.175 ± 3.86, demonstrating strong overall performance across tasks. It shows particular strength in spatial reasoning and grounding, likely due to its advanced vision module and training that enable precise alignment between text and visual inputs.
GPT-4o follows closely with a CAS of 58.050 ± 1.11, maintaining consistently high scores in visual QA and cultural understanding. It sets a solid baseline for closed models and showcases reliable multimodal capabilities.
In the upper tier, Gemini 1.5 Flash and GPT-o3 (both scoring 56.100) also deliver robust performance, with qwen2.5vl_72b (55.850) and Pixtral-12b-2409 (50.175) rounding out the top six.
Among open-source models, llama4_16x17b (49.000 ± 0.35) and Mistral Medium 3 (48.325 ± 7.82) perform competitively, although they still trail behind their proprietary counterparts in composite scores. Claude Opus 4, despite its reputation, ranks lower with a score of 45.025 ± 4.32, suggesting weaker results on some culturally-grounded or visual reasoning tasks.
At the lower end, llava_7b (25.175 ± 1.81) struggles significantly, indicating limitations in either visual understanding or grounding capabilities.
This leaderboard reflects a continuing trend: large, closed models like Gemini and GPT-4o lead in accuracy and consistency. However, some open models—like qwen2.5vl and Mistral—are making strides, showing promise in specialized tasks such as spatial grounding or cultural alignment.
This emphasizes the value of composite evaluation over single-task metrics, as performance often varies sharply between captioning, QA, and grounded reasoning.
| Model | VQA Score | Cultural Commonsense Score | Captioning Score | Why-QA Score | CAS (Avg) |
|---|---|---|---|---|---|
| Gemini 2.5 | 62.1 | 65.0 | 54.6 | 59.0 | 60.175 |
| GPT-4o | 57.8 | 57.9 | 56.7 | 59.8 | 58.050 |
| GPT-o3 | 58.6 | 52.0 | 54.6 | 59.2 | 56.100 |
| Gemini 1.5 | 55.7 | 53.2 | 55.9 | 59.6 | 56.100 |
| qwen2.5vl_72b | 55.7 | 57.5 | 55.4 | 54.8 | 55.850 |
| Pixtral-12b-2409 | 51.7 | 50.4 | 52.1 | 46.5 | 50.175 |
| llama4_16x17b | 48.8 | 48.9 | 48.7 | 49.6 | 49.000 |
| Mistral Medium 3 | 48.8 | 42.3 | 60.9 | 41.0 | 48.325 |
| Claude Opus 4 | 46.5 | 38.4 | 54.3 | 50.1 | 45.075 |
| gemma3_27b | 39.0 | 56.7 | 43.2 | 45.0 | 45.975 |
| llava_7b | 25.2 | 27.6 | 22.5 | 25.4 | 25.175 |
Human vs. Model Performance
How do the best models compare to human experts on IndiaVLUE? In short: humans still hold a comfortable lead, especially on complex reasoning. In a controlled evaluation, human participants (fluent in the content) were able to solve ~75% of the tasks correctly on average – significantly higher than Gemini 2.5’s performance of ~60%. For cultural “why” questions, virtually all humans answered correctly with detailed reasons, whereas even GPT-4o hovered around ~60% accuracy. On referring expressions and fine-grained localization, humans achieved ~83%+, far above models that max out around ~62%. This gap highlights areas where current AI falls short of common-sense human perception. When averaged across all tasks, the human baseline CAS is roughly 74.9, versus Gemini 2.5’s 60.2.
The figure below (left) illustrates this gap: on each task, the average model’s accuracy is below human accuracy. The shortfall is most pronounced in cultural reasoning – e.g. humans almost always know why something is happening in an image given context, while models often guess incorrectly or give generic answers. It’s narrowest on straightforward VQA and captions, where years of training on web data have pushed models closer to human performance. These results underline that there is still “significant room for improvement” to reach parity with human visual reasoning. Bridging this gap will require advances in grounding, commonsense knowledge, and perhaps incorporating human-like cultural learning into models.

(Figure: Human vs. Model Performance – on average, humans outperform the evaluated models by ≈25 points in accuracy)
Example Tasks from IndiaVLUE
Data Sample
Category: Festivals & Rituals
Location: Uttar Pradesh, Central India

Correctness Rubric
Does the response state that the event is Janmashtami?
- Expected response: Yes
Does the response use visual and cultural cues (e.g., traditional dress, religious elements) to support the answer?
- Expected response: Yes
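For reference, a single benchmark record with its rubric could be serialized roughly as below; the field names and the example question are hypothetical, while the category, location, and rubric checks are taken from the sample above.

```python
import json

sample = {
    # The schema and the "question" text are illustrative assumptions;
    # category, location, and rubric values come from the data sample above.
    "category": "Festivals & Rituals",
    "location": "Uttar Pradesh, Central India",
    "question": "Which festival is being celebrated in this image?",  # illustrative
    "rubric": [
        {"check": "Does the response state that the event is Janmashtami?",
         "expected": "Yes"},
        {"check": "Does the response use visual and cultural cues (e.g., traditional "
                  "dress, religious elements) to support the answer?",
         "expected": "Yes"},
    ],
}

print(json.dumps(sample, indent=2, ensure_ascii=False))
```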
Each example above demonstrates a distinct challenge: from identifying an action, generating a description, pinpointing an object from a phrase, naming a cultural artifact, explaining a situation, to predicting an outcome. IndiaVLUE is designed to ensure models are truly multi-skilled – handling all of these in a culturally aware manner.
Model Analysis and Outlook
Our analysis of current results reveals clear strengths and weaknesses of today’s vision-language models. On the positive side, the top models exhibit remarkable capabilities: GPT-4o and Gemini can understand complex scenes, handle many visual questions at near-human accuracy, and produce detailed captions that often rival human descriptions. It is also encouraging to see open-source models like qwen2.5vl_72b performing so well – this democratizes access to high-performing multimodal AI, which is important for researchers and applications in the community.
That said, no model has mastered IndiaVLUE. Each has blind spots: GPT-4o, for example, sometimes struggles with fine-grained localization when multiple similar objects are present.
Claude-Vision is very consistent (rarely giving nonsense), but it can be too cautious and occasionally misses visual details, leading to errors of omission. All models show noticeable drops on deep cultural reasoning, such as explaining the significance of a regional ritual or answering a question that requires niche cultural knowledge. They often fall back on generic statements or get the reasoning only partially correct at best.
Vision-Language Models (VLMs), for all their computational sophistication, demonstrate a striking deficiency when it comes to understanding Indian cultural contexts. Their poorest performance was observed in categories such as festivals, rituals, symbolism, and traditional objects — core pillars of India's rich visual heritage.
Consider a quintessential Indian wedding moment: a bride, eyes gently closed, receiving chooda bangles from an elder in a deeply intimate, ceremonial gesture symbolizing blessings and familial reverence. Rather than recognizing the cultural specificity of this scene, the model defaulted to a generic interpretation — “a woman praying” — suggesting it merely latched onto the visual cue of closed eyes and invoked the most statistically common association: prayer or meditation. The nuance, emotion, and ritual context of the scene were entirely lost.
Similarly, when tested on imagery involving symbolic and mythological elements — from iconic deities to ritual artifacts — the model consistently failed to identify even widely recognized figures, often producing vague or inaccurate descriptions. This isn't a case of struggling with subtle cultural nuance; these are widely prevalent, almost archetypal representations that the models could not decode accurately.
Geographically, the performance disparity is equally telling. The model exhibits relatively higher accuracy with North Indian visual content — perhaps a byproduct of overrepresentation in its training data — but stumbles significantly when interpreting visuals from Western India. This indicates not just a lack of cultural generalization, but an uneven familiarity even within Indian regions.
The gap between the top models and humans, particularly in cultural context comprehension, reminds us that there is ample room – and need – for further research.
In essence, these failures underscore a fundamental limitation: VLMs, in their current form, are neither culturally aware nor contextually sensitive when it comes to decoding one of the world’s most visually and symbolically rich societies. They miss not just the complexity — but often, even the clarity.
Going forward, IndiaVLUE will help track progress in this area. We expect that improvements will come from several directions: (1) incorporating more culturally diverse training data (so models see examples of Indian festivals, attire, symbols, etc.), (2) using structured knowledge (e.g. knowledge graphs of cultural facts) to augment models’ understanding, and (3) engaging community annotators or feedback to fine-tune models on what they currently get wrong. The benchmark is also a tool for model developers – by analyzing failures on IndiaVLUE, one can identify if a model lacks certain perceptual skills or knowledge, and then target those areas.
In conclusion, IndiaVLUE’s impact is in pushing the vision-language field toward greater cultural and multilingual intelligence. By including content and scenarios overlooked by Western-centric datasets, it encourages AI to expand its understanding and be truly globally aware. We anticipate that leaderboard competition on IndiaVLUE – from both industry and open-source contributors – will spur the development of models with better cultural competence. The advancements here will not only improve AI’s usefulness in India but also set a precedent for building more inclusive AI for all cultures and regions.
Models Evaluated
- Gemini 2.5 (March 2025)
- GPT-4o (April 2025)
- GPT-o3 (April 2025)
- Gemini 1.5 (April 2025)
- qwen2.5vl_72b (April 2025)
- Pixtral-12b-2409 (April 2025)
- llama4_16x17b (April 2025)
- Mistral Medium 3 (April 2025)
- Claude Opus 4 (April 2025)
- gemma3_27b (April 2025)
- llava_7b (April 2025)