What is the Visual Cognition Gap between Humans and Multimodal LLMs? (2024)

Xu Cao1   Bolin Lai2   Wenqian Ye3   Yunsheng Ma4   George Heintz1
Jintai Chen1   Jianguo Cao5   James M. Rehg1
1Health Care Engineering Systems Center, University of Illinois Urbana-Champaign
2College of Computing, Georgia Institute of Technology
3Department of Computer Science, University of Virginia
4College of Engineering, Purdue University
5Department of Rehabilitation Medicine, Shenzhen Children’s Hospital
{xucao2,jrehg}@illinois.edu

Abstract

Recently, Multimodal Large Language Models (MLLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level reasoning is not well-established. One such challenge is abstract visual reasoning (AVR), the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children. Inspired by the AVR tasks in Raven's Progressive Matrices (RPM) and the Wechsler Intelligence Scale for Children (WISC), we propose a new dataset, MaRs-VQA, and a new benchmark, VCog-Bench, containing three datasets to evaluate the zero-shot AVR capability of MLLMs and to compare their performance with that reported in existing human intelligence studies. Our comparative experiments with different open-source and closed-source MLLMs on VCog-Bench reveal a gap between MLLMs and human intelligence, highlighting the visual cognitive limitations of current MLLMs. We believe that the public release of VCog-Bench, including MaRs-VQA and the inference pipeline, will drive progress toward the next generation of MLLMs with human-like visual cognition abilities. The code and datasets for our benchmark are available at GitHub.com/IrohXu/VCog-Bench

"AI is the science and engineering of making machines do tasks they have never seen and have not been prepared for beforehand." — John McCarthy [1, 2]

1 Introduction

Abstract visual reasoning (AVR) is a crucial ability in human perception and cognition, essential for nonverbal, culture-reduced intelligence measurement because it minimizes the influence of acquired knowledge and skills [3]. Common AVR problems consist of images with simple shapes governed by underlying abstract rules [4] (see Figure 1). Participants have to identify and comprehend the rules from a few provided patterns and then reason about the next pattern following the same rules. AVR ability reflects many fundamental capabilities of human intelligence, such as processing speed and working memory, that emerge in the early stages of children's neurodevelopment [5]. To quantitatively measure humans' AVR abilities, many assessment methods have been proposed as part of fluid intelligence tests. The two most famous assessments are the Wechsler Intelligence Scale for Children (WISC) [6] and Raven's Progressive Matrices (RPM) [7].

With the development of artificial intelligence algorithms, AVR tasks have emerged as an ideal testbed for investigating whether deep learning models can match or even surpass human cognitive abilities, motivating the creation of diverse problem settings and datasets [8, 4, 9, 10, 11]. However, previous research on AVR assessment applied the typical machine learning setting of finetuning models on training sets and evaluating performance on test sets [12, 13, 14]. This makes current AVR assessment an ill-posed problem, because such tests accurately reflect reasoning capability only when subjects engage without prior training, i.e., in zero-shot inference settings. Thus, establishing an AVR benchmark tailored for deep learning models remains an unsolved problem. Recently, Multimodal Large Language Models (MLLMs) have shown surprising understanding and reasoning capabilities, marking an important milestone towards Artificial General Intelligence (AGI) [8, 15]. However, current MLLMs remain inadequate on visual problems that require higher-level inductive reasoning. An example is their poor performance on the RAVEN IQ-test [16, 17], which heavily relies on AVR skills. The RAVEN IQ-test itself also has limitations, including a small dataset of only 50 samples [16], which may introduce randomness and fail to evaluate MLLMs comprehensively and robustly. Moreover, it does not include a comparative study with human performance.

To address the ill-posed AVR assessment and the deficiencies of existing cognitive testing benchmarks, we introduce VCog-Bench, a new abstract visual reasoning benchmark. The benchmark aggregates diverse visual question-answering (VQA) data from two existing AVR datasets (RAVEN [10] and CVR [20]). We additionally collect a new dataset, MaRs-VQA, with 1,440 examples, which is the largest psychologist-designed benchmark dataset for AVR assessment. We also conduct a thorough evaluation and comparison across 16 existing MLLMs (including their variants) and human performance under a zero-shot inference setting (no prior knowledge) on the three datasets. In our experiments, we observe that MLLMs with more parameters generally perform better on our benchmark, consistent with established scaling laws. However, even the largest open-source MLLMs and GPT-4o fall short of human performance on AVR tasks. Furthermore, many MLLMs exhibit a mismatch in performance between AVR tasks and other general VQA problems, which offers insight into the drawbacks of existing models. We will release our data for future studies. In conclusion, our contributions are summarized as follows:

  • We introduce a new AVR benchmark dataset, MaRs-VQA, containing 1,440 image instances designed by psychologists, making it the largest dataset for AVR evaluation.

  • We propose VCog-Bench, the most comprehensive visual cognition benchmark to date, which rigorously evaluates the AVR performance of 15 existing MLLMs under the zero-shot setting.

  • Our thorough experiments qualitatively reveal the gap between MLLMs and humans on AVR problems. We also offer additional insights into the deficiencies of MLLMs, which can inspire future investigations.

2 Related Works

Large Language Models (LLMs) for Visual Cognition

The rise of LLMs has sparked interest in exploring human-like AI in psychology and cognition [21]. Recent works tested LLMs' cognitive abilities in causal reasoning [22], abstract reasoning [23], analogical reasoning [24], systematic reasoning [25], and theory of mind [26]. These studies show that LLMs such as GPT-4 [27] succeed on most cognitive tests related to language-based reasoning. Despite this success, only limited research has examined MLLMs and visual cognition. Visual cognition is the process by which the human visual system interprets and makes inferences about a visual scene from partial information. Buschoff et al. observed that while LLMs demonstrate a basic understanding of physical laws and causal relationships, they lack deeper insight into intuitive human preferences and reasoning. Almost all existing visual cognition benchmarks focus on testing MLLMs' cognitive abilities on simple tasks [28, 29, 30] and ignore the complex abstract and logical reasoning abilities related to fluid intelligence. Therefore, new and challenging benchmarks grounded in the theory of visual cognition are needed to assess and improve AI systems' capabilities for human-like visual understanding.

Abstract Visual Reasoning

AVR is often used to measure human intelligence related to visual cognition and working memory [31, 32, 33]. Matrix reasoning and compositional visual relation reasoning are two of the most representative AVR problems, widely used in RPM [7, 34] and WISC [6, 35] to evaluate humans' ability to detect underlying conceptual relationships among visual objects and to reason about visual cues. Early research indicated that deep learning models trained on large-scale AVR datasets can solve simple matrix reasoning [36, 13, 4, 37, 38] and compositional visual relation tasks [33, 20, 39, 40], achieving human-level accuracy. Several datasets and benchmarks have also been proposed, such as PGM [9], RAVEN [10], I-RAVEN [12], RAVEN-FAIR [41], and CVR [20]. However, these works share a key limitation: they ignore the fact that humans can solve these problems through zero-shot reasoning, without explicitly learning from large-scale data. With the rise of LLMs, researchers have been keen to test whether LLMs have reached human-level abstract reasoning capabilities. Webb et al. [24] encode matrix reasoning as a symbolic problem based on human priors and show that LLMs can understand this task. Recently, several zero-shot visual reasoning datasets containing AVR samples have been proposed in the AI/ML community, such as RAVEN-IQ [16] with 50 instances, the Visual Reasoning Benchmark [42] with 241 instances in total, and ConceptARC [43] with 480 instances. However, all of them lack rigorous human experiments as a reference and rely on relatively small datasets without psychometric validation.

Vision-Language Models

Researchers have been actively investigating the utility of Vision-Language Models (VLMs) for addressing visual reasoning tasks [44, 45]. The latest VLMs are constructed from a CLIP vision encoder, a pretrained LLM, and a connecting adapter that aligns visual features with the language space [46, 47, 48, 17]. Notably, methodologies such as MiniGPT-4 [49], InstructBLIP [50], LLaVA [51], and CogVLM [52] underscore the significance of employing high-quality visual instruction tuning data. Additionally, tool-learning methods have explored the potential of integrating code generation pipelines with visual inference [53]. Nevertheless, current VLMs encounter challenges in adapting to high-resolution and visually complex images. These problems stem from the absence of a robust visual search mechanism [54], limited few-shot reasoning [55] and compositional understanding [56], and the constrained visual grounding capabilities inherent in CLIP [57].

3 Visual Cognition Benchmark (VCog-Bench)


3.1 Problem Settings

The first step in defining the zero-shot AVR problem is to build a structure that represents the relationships between input images and abstract concepts. Inspired by [58, 9, 59, 10], we formulate the structure $K$ of AVR as a combination of four components, $K = \{[r, a, o, s] \mid r \in \mathcal{R}, a \in \mathcal{A}, o \in \mathcal{O}, s \in \mathcal{S}\}$. $\mathcal{R}$ is a set of rules describing how the pattern changes along each row and column (e.g., rotating by a fixed angle or shifting by a fixed distance); $\mathcal{A}$ is a set of attributes of each pattern (e.g., color, shape, and size); $\mathcal{O}$ describes how objects are integrated within each cell (e.g., spatial location and overlap); $\mathcal{S}$ denotes a set of constraints for designing answer options (e.g., options should have minimal differences), which prevents participants from solving the AVR problems in unintended ways. Based on the structure $K$, we can design system prompts that guide an MLLM to understand AVR, as sketched below.
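As a minimal sketch (not the released benchmark code), the structure $K$ can be encoded as plain data and rendered into a system prompt; all field names and prompt wording below are illustrative assumptions.

```python
# Minimal sketch: encoding the AVR task structure K = {[r, a, o, s]} as plain
# Python data and rendering it into a system prompt. Field names and wording
# are illustrative assumptions, not the benchmark's released prompts.

MARS_VQA_STRUCTURE = {
    "rules": ["constant", "progression", "arithmetic", "distribute three"],   # R
    "attributes": ["number", "position", "shape", "size", "color"],           # A
    "object_integration": ["sub-blocks", "insideness"],                       # O
    "answer_constraints": ["minimal difference", "paired difference"],        # S
}

def build_system_prompt(structure: dict) -> str:
    """Render the structure K into a system prompt x_sys for an MLLM."""
    return (
        "You are solving an abstract visual reasoning (matrix reasoning) task.\n"
        f"Row/column rules may include: {', '.join(structure['rules'])}.\n"
        f"Cell attributes to inspect: {', '.join(structure['attributes'])}.\n"
        f"Objects in a cell may be combined by: {', '.join(structure['object_integration'])}.\n"
        f"Answer options are designed with: {', '.join(structure['answer_constraints'])}.\n"
        "Reply with the letter of the single best option."
    )

if __name__ == "__main__":
    print(build_system_prompt(MARS_VQA_STRUCTURE))
```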

For zero-shot inference, the test set contains $n$ VQA samples, denoted as $\{(\mathbf{q}_i, \mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$. $\mathbf{q}_i$ represents the question image showing the $3 \times 3$ matrix reasoning task (MaRs-VQA, RAVEN) or the context-based question description (CVR). $\mathbf{x}_i = [x_i^1, \ldots, x_i^k]$ represents the images in the option set, where $k$ is the number of options. $\mathbf{y}_i$ is the answer to the matrix reasoning question. The multimodal zero-shot inference pipeline can be formulated as:

\hat{\mathbf{y}}_i = F_{\theta}(\mathbf{q}_i, \mathbf{x}_i, \mathbf{x}_{sys}).   (1)

$\mathbf{x}_{sys}$ is the system prompt, which includes dataset-independent information about the AVR problem setting, the structure $K$ for each dataset, and requirements on the output format. $\hat{\mathbf{y}}_i$ is the prediction result. $F_{\theta}$ is an autoregressive decoder in the MLLM for answer generation. It is defined as:

P(\hat{\mathbf{y}}_i \mid \mathbf{q}_i, \mathbf{x}_i, \mathbf{x}_{sys}) = \prod_{j=1}^{L} P(\hat{\mathbf{y}}_{i,j} \mid \mathbf{q}_i, \mathbf{x}_i, \mathbf{x}_{sys}, \hat{\mathbf{y}}_{i,<j}; \theta),   (2)

where $L$ is the sequence length of the answer and $\hat{\mathbf{y}}_{i,<j}$ denotes all answer tokens before $\hat{\mathbf{y}}_{i,j}$. A sketch of this zero-shot inference call is given below.
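The following is a hedged sketch of the zero-shot inference call $\hat{\mathbf{y}}_i = F_{\theta}(\mathbf{q}_i, \mathbf{x}_i, \mathbf{x}_{sys})$ for a closed-source multi-image MLLM, assuming an OpenAI-style chat API; the model name, prompt wording, and file layout are assumptions and not the authors' released pipeline.

```python
# Sketch of the zero-shot inference call for a closed-source multi-image MLLM.
# Assumes the OpenAI chat completions API with image content parts; model name,
# prompt wording, and option labeling are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> dict:
    """Package a local image as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def zero_shot_answer(question_img: str, option_imgs: list[str], system_prompt: str) -> str:
    """Send the question image q_i and option images x_i with system prompt x_sys."""
    user_content = [{"type": "text", "text": "Question image:"}, encode_image(question_img)]
    for idx, path in enumerate(option_imgs):
        user_content.append({"type": "text", "text": f"Option {chr(65 + idx)}:"})
        user_content.append(encode_image(path))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        max_tokens=16,
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```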

Table 1: Summary of the three AVR datasets in VCog-Bench (example question and option images omitted).

| Dataset | Question | Options | Instances | Description |
| --- | --- | --- | --- | --- |
| RAVEN [10] | 3×3 matrix (image) | 8 image options per instance | rule-based generation | grayscale images; rule-based stimuli; includes human study |
| CVR [20] | find the outlier among 4 images | 4 image options per instance | rule-based generation | RGB images; rule-based stimuli; includes human study |
| MaRs-VQA | 3×3 matrix (image) | 4 image options per instance | 1,440 | RGB images; psychologist-designed stimuli; includes human study |

3.2 Benchmark Datasets

Existing AVR benchmark datasets vary in their problem settings, but all of them are limited in dataset size. To comprehensively evaluate MLLMs, our benchmark uses two well-known existing AVR benchmark datasets (RAVEN and CVR). We additionally collect a new dataset, MaRs-VQA, which is the largest psychologist-designed AVR benchmark dataset. We summarize the three datasets in Table 1.

RAVEN[10]

The RAVEN dataset is designed to probe abstract reasoning in a format similar to the Raven IQ test. Each sample in RAVEN consists of a question image with a $3 \times 3$ matrix, in which eight of the nine cells contain an abstract shape and the bottom-right cell is empty. The tasks in RAVEN contain matrices from multiple visual configurations generated by an Attributed Stochastic Image Grammar (A-SIG). All samples in RAVEN can be defined by 4 row-based relations $\mathcal{R} = \{\text{constant}, \text{progression}, \text{arithmetic}, \text{distribute three}\}$, 5 cell attributes $\mathcal{A} = \{\text{number}, \text{position}, \text{shape}, \text{size}, \text{color}\}$, and 2 relations for integrating objects within a cell $\mathcal{O} = \{\text{sub-blocks}, \text{insideness}\}$. The answer options of RAVEN do not have any constraint, i.e., $\mathcal{S} = \emptyset$. In our experiments, we use 560 cases from RAVEN to test the zero-shot performance of different MLLMs.

CVR[20]

The Compositional Visual Reasoning (CVR) dataset evaluates deep learning models using 103 unique rule-generated configurations. Each CVR sample is a single-choice outlier-detection problem with four options per question. CVR does not contain a question image, i.e., $\mathcal{R} = \emptyset$. The attributes of each cell in the option set are $\mathcal{A} = \{\text{number}, \text{position}, \text{shape}, \text{size}, \text{color}\}$, and objects within a cell are integrated by 2 relations $\mathcal{O} = \{\text{adjacent}, \text{insideness}\}$. The answer constraint of CVR is $\mathcal{S} = \{\text{minimal difference}\}$, i.e., the outlier differs only minimally from the other options. In our experiments, we use 309 cases from CVR to test the zero-shot performance of different MLLMs.

MaRs-VQA

The MaRs-VQA dataset is designed to evaluate the zero-shot abstract reasoning capabilities of MLLMs through matrix reasoning VQA tasks. All sample images in MaRs-VQA are sourced from the Matrix Reasoning Item Bank (MaRs-IB) [59], created by psychologists and comprising 18 sets of abstract reasoning questionnaires (80 instances per set) for non-verbal abstract reasoning assessment of adolescents and adults. Each item presents an incomplete $3 \times 3$ matrix of abstract shapes, requiring participants to identify relationships among the shapes. Compared to RAVEN, the matrix reasoning samples in MaRs-VQA are psychometrically validated and widely used in neurodevelopmental and neuropsychological research [60, 61, 62, 63].

In Figure 2, we demonstrate how to transform an AVR problem into a VQA task using a sample from the MaRs-VQA dataset. We define three option sets. Option Set A and Option Set B are image-based options; the key difference is that Option Set B uses the complete $3 \times 3$ images obtained by inserting each option image into the question image. Option Set B is used for visualization purposes only and is not included in our experiments. To enhance data quality, we use GPT-4o to generate language-based descriptions for each option, forming Option Set C. In the data generation process, we first manually design 10 VQA examples, which serve as the only human annotations in our data collection. These examples are then used as few-shot samples to query GPT-4o through in-context learning. The context-generation system prompt guides GPT-4o to compare all four option images and generate distinct descriptions for each one. A rough sketch of this step is given below.
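The snippet below is a rough sketch of the Option Set C generation step, not the released pipeline: GPT-4o is shown a few manually written example descriptions and asked to produce one distinct description per option image. The prompt wording and example descriptions are assumptions.

```python
# Rough sketch of Option Set C generation (not the released pipeline): GPT-4o
# is given a few manually written example descriptions and asked to produce one
# distinct description per option image. Prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_STYLE = (
    "Example description style (hypothetical):\n"
    "Option A: a large grey hexagon containing a small black triangle.\n"
    "Option B: two overlapping white circles on a dark square background.\n"
)

def describe_options(option_image_urls: list[str]) -> str:
    """option_image_urls: data URLs or https URLs of the four option images."""
    instruction = (
        FEW_SHOT_STYLE
        + "Compare the four option images and write one distinct, attribute-focused "
          "description per option (shape, color, size, position)."
    )
    content = [{"type": "text", "text": instruction}]
    for idx, url in enumerate(option_image_urls):
        content.append({"type": "text", "text": f"Option {chr(65 + idx)}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```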

Compared to the other datasets, the task structure in MaRs-VQA is more comprehensive. It is defined by 4 row-based relations $\mathcal{R} = \{\text{constant}, \text{progression}, \text{arithmetic}, \text{distribute three}\}$, 5 cell attributes $\mathcal{A} = \{\text{number}, \text{position}, \text{shape}, \text{size}, \text{color}\}$, 2 relations for objects within a cell $\mathcal{O} = \{\text{sub-blocks}, \text{insideness}\}$, and 2 answer constraints $\mathcal{S} = \{\text{minimal difference}, \text{paired difference}\}$. In our experiments, we use 480 cases (from 6 questionnaires) of MaRs-VQA to test the zero-shot performance of different MLLMs.

[Figure 3] (a) Using CoT to solve AVR problems. (b) Using GPT-4o + open-source VLMs (e.g., LLaVA [64]) to solve AVR problems.

3.3 Approaches

Different from the original settings in RAVEN, I-RAVEN, and CVR, the goal of our MLLM agent is to complete the matrix by selecting the missing cell from multiple options through zero-shot inference. To select the correct missing cell, the MLLM agent has to deduce the relationships among the other cells of the matrix and infer the missing cell from these relationships, given the problem setting.

Chain-of-Thought (CoT) for AVR

Recent progress in the NLP community has demonstrated the effectiveness of CoT reasoning for enhanced problem-solving with LLMs [65, 66]. Inspired by the improvements brought by CoT in MLLMs [67, 68, 69], object-centric relational abstraction [70, 71, 72, 23], and object-centric representation learning [73, 74, 75], we propose an object-centric CoT prompting strategy to enhance the MLLM's zero-shot performance on AVR problems. Figure 3(a) shows a schematic depiction of how to leverage CoT in AVR tasks. We use three steps to guide the MLLM to follow a human-like thought process when solving AVR tasks. The first step extracts the key patterns from each row of the question image; the MLLM predicts the row-based high-order rules $\mathcal{R}$ from this information. The second step extracts the basic attributes $\mathcal{A}$ and the inner relations $\mathcal{O}$ that integrate objects in each option image. The third step infers the answer by exclusion, taking into account the designed answer constraints $\mathcal{S}$. An illustrative prompt sketch follows.
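The prompt below illustrates the three-step object-centric CoT strategy described above; the exact wording is an assumption, not the benchmark's released prompt.

```python
# Illustrative object-centric CoT prompt following the three steps described in
# the text; exact wording is an assumption, not the benchmark's released prompt.
OBJECT_CENTRIC_COT_PROMPT = """\
Solve the matrix reasoning puzzle step by step.
Step 1: For each row of the 3x3 question image, list the key patterns and infer the
row-based rule R (constant, progression, arithmetic, or distribute three).
Step 2: For each option image, list its basic attributes A (number, position, shape,
size, color) and how its objects are integrated O (sub-blocks, insideness).
Step 3: Eliminate options that violate the inferred rules, keeping in mind the answer
constraints S (e.g., options differ only minimally), then output the remaining option's letter.
"""
```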

Table 2: Zero-shot and object-centric chain-of-thought accuracy (%) of closed-source MLLMs on MaRs-VQA, RAVEN, and CVR, with the human baseline (higher is better).

| Method | Learning | MaRs-VQA (4 options) | RAVEN (8 options) | CVR (4 options) |
| --- | --- | --- | --- | --- |
| Claude 3 Haiku [19] | zero-shot | 23.13 | 10.27 | 25.57 |
| Claude 3 Haiku [19] | chain-of-thought | 25.57 | 12.95 | 26.41 |
| Claude 3 Sonnet [19] | zero-shot | 22.92 | 10.71 | 27.83 |
| Claude 3 Sonnet [19] | chain-of-thought | 23.22 | 13.39 | 28.48 |
| Claude 3 Opus [19] | zero-shot | 20.85 | 11.61 | 26.86 |
| Claude 3 Opus [19] | chain-of-thought | 24.13 | 11.95 | 27.18 |
| GPT-4V [76] | zero-shot | 27.71 | 13.84 | 36.25 |
| GPT-4V [76] | chain-of-thought | 33.13 | 15.63 | 40.62 |
| GPT-4o [18] | zero-shot | 30.21 | 19.20 | 42.50 |
| GPT-4o [18] | chain-of-thought | 33.96 | 25.89 | 44.01 |
| Human [59, 10, 20] | - | 69.15 | 84.41 | 78.70 |

Vision-Language Models (VLMs) for AVR

In addition to using end-to-end MLLMs to solve AVR problems, another approach decomposes AVR tasks by transforming option images into language descriptions and then applying VLMs to analyze the problem [23, 77]. Figure 3(b) illustrates this pipeline. The input question image is first processed by the VLM's visual encoder, and additional alignment layers map the visual features into the language feature space. These features, along with the option descriptions extracted by GPT-4o, are sent to the LLM decoder, which integrates the information from the question image and the option descriptions to answer the VQA task. This approach leverages the strengths of both visual encoders and language models, allowing a more comprehensive analysis of AVR problems. It provides a structured way to break the problem down, potentially improving interpretability compared to end-to-end closed-source models. A sketch of this pipeline follows.
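Below is a hedged sketch of this VLM pipeline, assuming the Hugging Face LLaVA-NeXT interface: the question image goes through the VLM's visual encoder while the GPT-4o option descriptions (Option Set C) are supplied as text. The checkpoint name and prompt template are assumptions, not the authors' exact configuration.

```python
# Sketch of the pipeline in Figure 3(b), assuming the Hugging Face LLaVA-NeXT
# interface. Checkpoint name and prompt template are assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def answer_with_vlm(question_image_path: str, option_descriptions: dict[str, str]) -> str:
    """Fuse the question image with language-based options and decode an answer."""
    image = Image.open(question_image_path).convert("RGB")
    options_text = "\n".join(f"{k}: {v}" for k, v in option_descriptions.items())
    prompt = (
        "[INST] <image>\nThe 3x3 matrix above has one empty cell. "
        f"Candidate answers are described as:\n{options_text}\n"
        "Which option completes the matrix? Answer with the option letter. [/INST]"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```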

4 Experiments

4.1 Baselines

Closed-source MLLMs

We select the Claude 3 family (Haiku, Sonnet, Opus) [19], GPT-4V [76], GPT-4o [18], and Gemini Pro 1.5 [78] as the primary closed-source MLLM baselines. The Claude 3 family, GPT-4V, and GPT-4o support multi-image input, so they are tested under the more difficult setting in Table 2, i.e., the input consists of a question image and multiple option images from Option Set A in Figure 2.

Open-source VLMs

For the open-source models, we select state-of-the-art models such as InstructBLIP [50], MiniGPT-v2 [49], LLaVA-v1.6 (LLaVA-NeXT) [64], CogVLMv2 [52], Yi-VL [79], Qwen-VL [80], and InternVL [81] as the primary VLM baselines. The input to each open-source VLM is a question image and the GPT-4o-generated language-based options from Option Set C in Figure 2.

Human Baseline

The human study results in Tables 2 and 3 are taken from previously reported experiments. The human subjects for RAVEN [10] consist of college students from a subject pool maintained by the Department of Psychology, and only "easily perceptible" examples were used in that study. CVR [20] recruited 21 participants, each of whom completed 6 different tasks with 20 problem samples per task. The human study results for MaRs-IB [59] (the data source of MaRs-VQA) are more rigorous: they come from 4 age groups ($N = 659$, aged 11-33 years). The accuracies of younger adolescents, mid-adolescents, older adolescents, and adults on AVR items in MaRs-IB are 61%, 68%, 73%, and 81%, respectively. We use the average result across all groups in Tables 2 and 3.

4.2 Implementation

For the closed-source baseline models, we establish basic prompts to introduce the AVR problem setting, which serve as the system prompt for zero-shot inference. For object-centric CoT reasoning, we create specific prompts that guide the model's thought process through multiple stages, enabling step-by-step reasoning. For the open-source baseline models, we use the same system prompt settings across all models. Testing is conducted on two NVIDIA RTX 4090 GPUs for 7B-sized VLMs and four NVIDIA A100 80GB GPUs for VLMs larger than 7B. All experiments are run with three different random seeds, and the results are averaged. We evaluate the results by the accuracy of single-choice AVR problems ($\text{Acc} = \text{Correct} / \text{Total}$), consistent with other VQA benchmarks [82, 83]. A minimal sketch of this evaluation protocol follows.
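The sketch below illustrates the evaluation protocol under the stated assumptions; the prediction function is a stand-in for any of the baselines above.

```python
# Minimal sketch of the evaluation protocol: single-choice accuracy
# (Acc = Correct / Total) averaged over three random seeds.
# predict_fn is a stand-in for any baseline's inference function.
import random
from statistics import mean

def evaluate(dataset, predict_fn, seed: int) -> float:
    """dataset: iterable of (question, options, answer) tuples."""
    random.seed(seed)  # seeds any stochastic decoding inside predict_fn (assumption)
    correct = sum(predict_fn(question, options) == answer
                  for question, options, answer in dataset)
    return correct / len(dataset)

def evaluate_avg(dataset, predict_fn, seeds=(0, 1, 2)) -> float:
    """Average accuracy over the given random seeds."""
    return mean(evaluate(dataset, predict_fn, s) for s in seeds)
```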

Table 3: Accuracy (%) of open-source VLMs and closed-source MLLMs using GPT-4o-generated option descriptions, with the human baseline (higher is better).

| Method | Training Data | Model Scale | LLM Backbone | MaRs-VQA (4 options) | RAVEN (8 options) |
| --- | --- | --- | --- | --- | --- |
| InstructBLIP [50] | 129M | 7B | Vicuna-7B [84] | 10.63 | 12.05 |
| LLaVA-v1.6 [51] | 1.3M | 7B | Mistral-7B [85] | 16.88 | 14.29 |
| MiniGPT-v2 [49] | - | 8B | Llama-2-7B [86] | 26.45 | 13.39 |
| Qwen-VL [80] | 1.4B | 10B | Qwen-7B [80] | 29.58 | 16.07 |
| InstructBLIP [50] | 129M | 13B | Vicuna-13B [84] | 10.42 | 14.46 |
| CogVLMv2 [52] | 1.5B | 19B | Llama-3-8B [87] | 26.46 | 12.05 |
| InternVL 1.5 [81] | 6.0B | 26B | InternLM2-Chat-20B [88] | 22.09 | 14.73 |
| Yi-VL [79] | 100M | 34B | Yi-34B-Chat [79] | 25.21 | 19.64 |
| LLaVA-v1.6 [51] | 1.3M | 35B | Hermes-Yi-34B [79] | 34.38 | 33.93 |
| InternVL 1.2+ [81] | 6.0B | 40B | Hermes-Yi-34B [79] | 32.71 | 33.04 |
| Claude 3 Opus [19] | unpublished | unpublished | - | 33.75 | 27.68 |
| GPT-4o [18] | unpublished | unpublished | - | 37.38 | 38.84 |
| Gemini Pro 1.5 [78] | unpublished | unpublished | - | 34.79 | 42.86 |
| Human | - | - | - | 69.15 | 84.41 |

4.3 Experimental Results

In this subsection, we present the experimental results of the baselines on VCog-Bench. The results demonstrate that while some of the baseline models understand basic forms of the AVR task, they struggle with complex tasks requiring both visual working memory and multi-image reasoning capability.

We divide our experiments into two parts. The first part involves end-to-end zero-shot inference. For this experiment, we use multiple images as input, including a question image and several option images (Option Set A in Figure 2), and guide the MLLMs to decompose the problem into the predefined structure before generating answers based on all available information. We test the Claude 3 family, GPT-4V, and GPT-4o on this task, as these models support multi-image reasoning. Table 2 shows that even state-of-the-art closed-source MLLMs perform worse than humans on all AVR tasks. While object-centric CoT helps larger models achieve better performance, it does not benefit smaller models such as Claude 3 Haiku and Claude 3 Sonnet. Compared to its results on MaRs-VQA and RAVEN, GPT-4o achieves much better zero-shot and object-centric CoT results on the CVR dataset, almost matching the performance of fine-tuned ResNet-50 and ViT-small models trained with 1,000 samples [20].

In the second part of our experiments, we investigate the use of VLMs together with GPT-4o-extracted option descriptions (question image + Option Set C in Figure 2) to solve AVR problems in MaRs-VQA and RAVEN. The CVR dataset is excluded because its shapes are too complex for GPT-4o to describe accurately. As shown in Table 3, large-scale VLMs such as LLaVA-1.6-34B and InternVL-1.2-40B achieve results comparable to GPT-4o on MaRs-VQA and RAVEN. Notably, Gemini Pro 1.5 outperforms GPT-4o on the RAVEN dataset. However, their overall performance remains limited, given that children rely on both verbal reasoning and visual cognition to solve AVR problems [89, 90].

We identify three major issues after reviewing the reasoning outputs of current MLLMs in Tables 2 and 3: (1) Limited use of visual information: MLLMs cannot directly use visual features for reasoning, making them insensitive to non-verbal spatial features during CoT reasoning. This limitation is particularly evident for images that require describing the positional relations of objects; for example, it is difficult for MLLMs to distinguish the options in Figure 1 using language alone. (2) Restricted visual working memory: the visual working memory of MLLMs is limited, so visual feature information is easily lost during the text-generation reasoning process. (3) Integration challenges: even when MLLMs possess strong task-specific skills such as recognition, segmentation, and object detection, they struggle to integrate these skills into high-level visual reasoning tasks. Representative examples are presented in the Appendix.

Visualization

We also analyze the relationship between AVR accuracy and model size in Figure 4. The figure illustrates the substantial gap between MLLMs' AVR performance and that of humans, suggesting that simply increasing model size according to scaling laws will not be sufficient to bridge it. Human-level performance on AVR tasks remains distinctly higher, indicating the need for more advanced strategies beyond merely using larger models to achieve results comparable to humans.

[Figure 4: AVR accuracy versus model scale for the evaluated MLLMs and the human baseline.]

5 Discussion

In the present work, we emphasize that zero-shot AVR is a key test for validating human-level intelligence, though it is still unclear how AVR ability is acquired early in human neurodevelopment. Children, without any additional training, can provide sensible answers to AVR questions as early as age four. The long-term goal of our work is twofold. The first is to explore how close AIs or MLLMs are to human-like cognitive abilities, a question raised by François Chollet in 2019 [8]. The second is to develop an MLLM-powered AI agent that can match human-level zero-shot AVR capability. Such an agent would eventually guide vision generation models to create new AVR samples and tasks and to design new neurodevelopmental assessment tools. This will help psychologists and pediatricians explore and deconstruct how children activate such abilities in the early stages of neurodevelopment.

An open question is whether MLLMs need to achieve or surpass human-level zero-shot inference capability on AVR tasks. Addressing this issue requires new theories from cognitive science and psychology to accurately evaluate and compare human and MLLM intelligence. Unlike MLLMs, which rely on training data and domain-specific skills, human cognition develops gradually and evolves with age. Therefore, AI researchers, psychologists, and cognitive scientists must collaborate to rethink how to benchmark MLLM intelligence against human intelligence.

6 Conclusion

We introduce VCog-Bench, a publicly available zero-shot abstract visual reasoning (AVR) benchmark designed to evaluate Multimodal Large Language Models (MLLMs). The benchmark integrates two well-known AVR datasets from the AI community and includes our newly proposed MaRs-VQA dataset. We also introduce several important concepts to redefine AVR tasks, focusing on new problem structures and object-centric Chain-of-Thought (CoT) system prompts. Our findings show that current state-of-the-art MLLMs and Vision-Language Models (VLMs), such as GPT-4o, LLaVA-1.6, and InternVL, demonstrate some basic understanding of AVR tasks, but they still struggle with complex matrix reasoning. This highlights the need for further exploration and development in this area. By providing a robust benchmark, we aim to encourage further innovation and progress in zero-shot abstract visual reasoning.

References

  • [1]John McCarthy.Generality in artificial intelligence.In ACM Turing award lectures, page 1971. 2007.
  • [2]José Hernández-Orallo.Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement.Artificial Intelligence Review, 48:397–447, 2017.
  • [3] Arthur R. Jensen. The g factor: The science of mental ability. Westport, CT: Praeger, 1998.
  • [4]Mikołaj Małkiński and Jacek Mańdziuk.A review of emerging research directions in abstract visual reasoning.Information Fusion, 91:713–736, 2023.
  • [5]Dedre Gentner.Children’s performance on a spatial analogies task.Child development, pages 1034–1039, 1977.
  • [6]David Wechsler and Habuku Kodama.Wechsler intelligence scale for children, volume1.Psychological corporation New York, 1949.
  • [7]Jean Raven.Raven progressive matrices.In Handbook of nonverbal assessment, pages 223–237. Springer, 2003.
  • [8]François Chollet.On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019.
  • [9]David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap.Measuring abstract reasoning in neural networks.In International conference on machine learning, pages 511–520. PMLR, 2018.
  • [10]Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu.Raven: A dataset for relational and analogical visual reasoning.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5317–5327, 2019.
  • [11]TaylorWhittington Webb, Ishan Sinha, and Jonathan Cohen.Emergent symbols through binding in external memory.In International Conference on Learning Representations, 2020.
  • [12]Sheng Hu, Yuqing Ma, Xianglong Liu, Yanlu Wei, and Shihao Bai.Stratified rule-aware network for abstract visual reasoning.In Proceedings of the AAAI Conference on Artificial Intelligence, volume35, pages 1567–1574, 2021.
  • [13]Mikołaj Małkiński and Jacek Mańdziuk.Deep learning methods for abstract visual reasoning: A survey on raven’s progressive matrices.arXiv preprint arXiv:2201.12382, 2022.
  • [14]Kai Zhao, Chang Xu, and Bailu Si.Learning visual abstract reasoning through dual-stream networks.In Proceedings of the AAAI Conference on Artificial Intelligence, volume38, pages 16979–16988, 2024.
  • [15]Zhiliang Peng, Wenhui Wang, LiDong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei.Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023.
  • [16]Shaohan Huang, LiDong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, OwaisKhan Mohammed, Barun Patra, etal.Language is not all you need: Aligning perception with language models.Advances in Neural Information Processing Systems, 36, 2024.
  • [17]Xingyu Fu, Yushi Hu, Bangzheng Li, YuFeng, Haoyu Wang, Xudong Lin, Dan Roth, NoahA Smith, Wei-Chiu Ma, and Ranjay Krishna.Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024.
  • [18]OpenAI.Hello gpt-4o.https://openai.com/index/hello-gpt-4o, 2024.
  • [19]Anthropic.Introducing the next generation of claude.https://www.anthropic.com/news/claude-3-family, 2024.
  • [20]Aimen Zerroug, Mohit Vaishnav, Julien Colin, Sebastian Musslick, and Thomas Serre.A benchmark for compositional visual reasoning.Advances in neural information processing systems, 35:29776–29788, 2022.
  • [21]Tomer Ullman.Large language models fail on trivial alterations to theory-of-mind tasks.arXiv preprint arXiv:2302.08399, 2023.
  • [22]Marcel Binz and Eric Schulz.Using cognitive psychology to understand gpt-3.Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023.
  • [23]Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and EliasB Khalil.Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations.arXiv preprint arXiv:2305.18354, 2023.
  • [24]Taylor Webb, KeithJ Holyoak, and Hongjing Lu.Emergent analogical reasoning in large language models.Nature Human Behaviour, 7(9):1526–1541, 2023.
  • [25]Thilo Hagendorff, Sarah Fabi, and Michal Kosinski.Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in chatgpt.Nature Computational Science, 3(10):833–838, 2023.
  • [26]JamesWA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, etal.Testing theory of mind in large language models and humans.Nature Human Behaviour, pages 1–11, 2024.
  • [27]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • [28]Adam Lerer, Sam Gross, and Rob Fergus.Learning physical intuition of block towers by example.In International conference on machine learning, pages 430–438. PMLR, 2016.
  • [29]Liang Zhou, KevinA Smith, JoshuaB Tenenbaum, and Tobias Gerstenberg.Mental jenga: A counterfactual simulation model of causal judgments about physical support.Journal of Experimental Psychology: General, 152(8):2237, 2023.
  • [30]Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni.Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models.arXiv preprint arXiv:2311.09048, 2023.
  • [31]TimothyA Salthouse.Influence of working memory on adult age differences in matrix reasoning.British Journal of Psychology, 84(2):171–199, 1993.
  • [32]SusanneM Jaeggi, Barbara Studer-Luethi, Martin Buschkuehl, Yi-Fen Su, John Jonides, and WalterJ Perrig.The relationship between n-back performance and matrix reasoning—implications for training and transfer.Intelligence, 38(6):625–635, 2010.
  • [33]François Fleuret, Ting Li, Charles Dubout, EmmaK Wampler, Steven Yantis, and Donald Geman.Comparing machines and humans on a visual categorization test.Proceedings of the National Academy of Sciences, 108(43):17621–17625, 2011.
  • [34]Isabelle Soulières, Michelle Dawson, Fabienne Samson, EliseB Barbeau, CherifP Sahyoun, GaryE Strangman, ThomasA Zeffiro, and Laurent Mottron.Enhanced visual processing contributes to matrix reasoning in autism.Human brain mapping, 30(12):4082–4107, 2009.
  • [35]AlanS Kaufman, SusanEngi Raiford, and DianeL Coalson.Intelligent testing with the WISC-V.John Wiley & Sons, 2015.
  • [36]Sebastian Stabinger, David Peer, Justus Piater, and Antonio Rodríguez-Sánchez.Evaluating the progress of deep learning for visual relational concepts.Journal of Vision, 21(11):8–8, 2021.
  • [37]Jingyi Xu, Tushar Vaidya, Yufei Wu, Saket Chandra, Zhangsheng Lai, and Kai FongErnest Chong.Abstract visual reasoning: An algebraic approach for solving raven’s progressive matrices.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6715–6724, 2023.
  • [38]Mikołaj Małkiński and Jacek Mańdziuk.One self-configurable model to solve many abstract visual reasoning problems.In Proceedings of the AAAI Conference on Artificial Intelligence, volume38, pages 14297–14305, 2024.
  • [39]Bjorn Ommer and JoachimM Buhmann.Learning the compositional nature of visual objects.In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • [40]Nan Liu, Shuang Li, Yilun Du, Josh Tenenbaum, and Antonio Torralba.Learning to compose visual relations.Advances in Neural Information Processing Systems, 34:23166–23178, 2021.
  • [41]Yaniv Benny, Niv Pekar, and Lior Wolf.Scale-localized abstract reasoning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12557–12565, 2021.
  • [42]Yizhe Zhang, HeBai, Ruixiang Zhang, Jiatao Gu, Shuangfei Zhai, Josh Susskind, and Navdeep Jaitly.How far are we from intelligent visual deductive reasoning?arXiv preprint arXiv:2403.04732, 2024.
  • [43]ArseniiKirillovich Moskvichev, VictorVikram Odouard, and Melanie Mitchell.The conceptarc benchmark: Evaluating understanding and generalization in the arc domain.Transactions on Machine Learning Research, 2023.
  • [44]Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi.From recognition to cognition: Visual commonsense reasoning.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019.
  • [45]Florian Bordes, RichardYuanzhe Pang, Anurag Ajay, AlexanderC. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, etal.An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024.
  • [46]Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu.Mm-llms: Recent advances in multimodal large language models.arXiv preprint arXiv:2401.13601, 2024.
  • [47]Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, YuLiu, and Hongsheng Li.Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models.arXiv preprint arXiv:2403.16999, 2024.
  • [48]Tanmay Gupta and Aniruddha Kembhavi.Visual programming: Compositional visual reasoning without training.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
  • [49]Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.
  • [50]Wenliang Dai, Junnan Li, Dongxu Li, Anthony MengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, PascaleN Fung, and Steven Hoi.Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36, 2024.
  • [51]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.Advances in neural information processing systems, 36, 2024.
  • [52]Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, JiQi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, etal.Cogvlm: Visual expert for pretrained language models.arXiv preprint arXiv:2311.03079, 2023.
  • [53]Dídac Surís, Sachit Menon, and Carl Vondrick.Vipergpt: Visual inference via python execution for reasoning.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
  • [54]Penghao Wu and Saining Xie.V*: Guided visual search as a core mechanism in multimodal llms.IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [55]Qing Guo, Prashan Wanigasekara, Jian Zheng, JacobZhiyuan Fang, Xinwei Deng, and Chenyang Tao.How do large multimodal models really fare in classical vision few-shot challenges? a deep dive.In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
  • [56]Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou.When and why vision-language models behave like bags-of-words, and what to do about it?In The Eleventh International Conference on Learning Representations, 2022.
  • [57]Shengbang Tong, Zhuang Liu, Yuexiang Zhai, YiMa, Yann LeCun, and Saining Xie.Eyes wide shut? exploring the visual shortcomings of multimodal llms.IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [58]PatriciaA Carpenter, MarcelA Just, and Peter Shell.What one intelligence test measures: a theoretical account of the processing in the raven progressive matrices test.Psychological review, 97(3):404, 1990.
  • [59]Gabriele Chierchia, Delia Fuhrmann, LisaJ Knoll, BlancaPiera Pi-Sunyer, AshokL Sakhardande, and Sarah-Jayne Blakemore.The matrix reasoning item bank (mars-ib): novel, open-access abstract reasoning items for adolescents and adults.Royal Society open science, 6(10):190232, 2019.
  • [60]ConnorT Keating, DagmarS Fraser, Sophie Sowden, and JenniferL Cook.Differences between autistic and non-autistic adults in the recognition of anger from facial motion remain after controlling for alexithymia.Journal of autism and developmental disorders, 52(4):1855–1871, 2022.
  • [61]Samuel Zorowitz, Gabriele Chierchia, Sarah-Jayne Blakemore, and NathanielD Daw.An item response theory analysis of the matrix reasoning item bank (mars-ib).Behavior research methods, 56(3):1104–1122, 2024.
  • [62]Kate Nussenbaum, Maximilian Scheuplein, CamilleV Phaneuf, MichaelD Evans, and CatherineA Hartley.Moving developmental research online: comparing in-lab and web-based studies of model-based reinforcement learning.Collabra: Psychology, 6(1), 2020.
  • [63]MEMoses-Payne, GChierchia, and S-J Blakemore.Age-related changes in the impact of valence on self-referential processing in female adolescents and young adults.Cognitive Development, 61:101128, 2022.
  • [64]Haotian Liu, Chunyuan Li, Yuheng Li, BoLi, Yuanhan Zhang, Sheng Shen, and YongJae Lee.Llava-next: Improved reasoning, ocr, and world knowledge.https://llava-vl.github.io/blog/2024-01-30-llava-next, 2024.
  • [65]Jason Wei, YiTay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, etal.Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022.
  • [66]Takeshi Kojima, ShixiangShane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa.Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022.
  • [67]Zhuosheng Zhang, Aston Zhang, MuLi, Hai Zhao, George Karypis, and Alex Smola.Multimodal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023.
  • [68]Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang.Image-of-thought prompting for visual reasoning refinement in multimodal large language models.arXiv preprint arXiv:2405.13872, 2024.
  • [69]Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, and Jiebo Luo.Cocot: Contrastive chain-of-thought prompting for large multimodal models with multiple image inputs.arXiv preprint arXiv:2401.02582, 2024.
  • [70]Taylor Webb, ShankaSubhra Mondal, and JonathanD Cohen.Systematic visual reasoning through object-centric relational abstraction.Advances in Neural Information Processing Systems, 36, 2024.
  • [71]TaylorW Webb, StevenM Frankland, Awni Altabaa, Simon Segert, Kamesh Krishnamurthy, Declan Campbell, Jacob Russin, Tyler Giallanza, Randall O’Reilly, John Lafferty, etal.The relational bottleneck as an inductive bias for efficient abstraction.Trends in Cognitive Sciences, 2024.
  • [72]ShankaSubhra Mondal, JonathanD Cohen, and TaylorW Webb.Slot abstractors: Toward scalable abstract visual reasoning.arXiv preprint arXiv:2403.03458, 2024.
  • [73]Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, etal.Bridging the gap to real-world object-centric learning.In The Eleventh International Conference on Learning Representations, 2022.
  • [74]Andrea Dittadi, SamueleS Papa, Michele DeVita, Bernhard Schölkopf, Ole Winther, and Francesco Locatello.Generalization and robustness implications in object-centric learning.In International Conference on Machine Learning, pages 5221–5285. PMLR, 2022.
  • [75]Jindong Jiang, Fei Deng, Gautam Singh, and Sungjin Ahn.Object-centric slot diffusion.Advances in Neural Information Processing Systems, 36, 2024.
  • [76]OpenAI.Gpt-4v(ision) system card.https://openai.com/research/gpt-4v-system-card, 2023.
  • [77]Giacomo Camposampiero, Loïc Houmard, Benjamin Estermann, Joël Mathys, and Roger Wattenhofer.Abstract visual reasoning enabled by language.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2646, 2023.
  • [78]Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, etal.Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024.
  • [79]Alex Young, Bei Chen, Chao Li, Chengen Huang, GeZhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, etal.Yi: Open foundation models by 01. ai.arXiv preprint arXiv:2403.04652, 2024.
  • [80]Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023.
  • [81]Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, etal.How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024.
  • [82]Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan.Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • [83]Yuan Liu, Haodong Duan, Yuanhan Zhang, BoLi, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, etal.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023.
  • [84] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
  • [85]AlbertQ Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
  • [86]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
  • [87] Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed on April 26, 2024.
  • [88]Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, etal.Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024.
  • [89]DavidJM Kraemer, LaurenM Rosenberg, and SharonL Thompson-Schill.The neural correlates of visual and verbal cognitive styles.Journal of Neuroscience, 29(12):3792–3798, 2009.
  • [90]Selma Dündar-Coecke, Andrew Tolmie, and Anne Schlottmann.Children’s reasoning about continuous causal processes: The role of verbal and non-verbal ability.British Journal of Educational Psychology, 90(2):364–381, 2020.