COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes

Koen Kraaijveld¹, Yifan Jiang², Kaixin Ma³, Filip Ilievski¹

¹Department of Computer Science, Faculty of Science, Vrije Universiteit Amsterdam ²Information Sciences Institute, University of Southern California
³Tencent AI Lab, Bellevue, WA

Paper arXiv Code

Notebooks Dataset

Abstract

While visual question-answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem-solving also necessitates lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step taxonomy-driven methodology for instantiating task examples. Then, we develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While the SotA vision language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.

A) Arrowheads
B) Draw the long bow
C) Arrow in the quiver
D) Make one's bow

A) Back the wrong horse
B) Dark horse
C) High horse
D) Horse around

A) Under one's very eyes
B) Before one's eyes
C) Before one's time
D) Before someone's time

A) Red carpet treatment
B) Under the carpet
C) Roll out the red carpet
D) Roll in the aisles

A) Stroke of luck
B) Cross fingers for luck
C) Push one's luck
D) Down on one's luck

A) Dig one's heels in
B) Take to one's heels
C) Dig in one's heels
D) Dig heels in

A) Ride one's luck
B) Cross fingers for luck
C) Down on one's luck
D) Push one's luck

A) Go down on
B) Go up in flames
C) Go down in flames
D) Go up in smoke

A) Writing on the wall
B) Handwriting on the wall
C) Handwriting
D) Pen picture

A) Hang up one's hat
B) Hang one's hat
C) Hang over one's head
D) Tip one's hat

Introducing Visual Lateral Tasks and COLUMBUS

A visual lateral thinking task is driven by a taxonomy consisting of 18 rules organized across three categories. Each rule uniquely manipulates the appearance and visual-spatial relationships of each element (either text or an icon) in a puzzle. The three categories are as follows:
1. Individual rules define the unary characteristics of an element in a rebus. Example rules include reversing character order (direction), the text color (style), and adding arrows before the element (highlight).
2. Relational rules define the positioning between a pair of elements. We define four relational rules, placing an element beside/inside/above/outside another.
3. Modifier rules are designed to be mutually inclusive with other individual rules. Examples include repeating an element multiple times or substituting it with a phonetically similar element.

Our Approach

The pipeline to instantiate and evaluate visual lateral thinking tasks is shown below. Puzzle generation leverages the above taxonomy to create a graph representation for a puzzle answer and generate an image for the graph. Each graph is a directed, attributed graph whose nodes are elements that will be rendered into a puzzle image. The node attributes specify the rendering of that element (i.e., the individual or modifier rules that will apply to it). The edges between two nodes are annotated with an attribute that specifies their relational rule. The distractor sampling step is based on a weighted average of orthographic and semantic similarity between a puzzle’s correct answer and its visible elements.

The COLUMBUS Benchmark

COLUMBUS is generated from collections of compound words and common phrases that were either web scraped, downloaded or manually added. The benchmark comprises over 1,000 puzzles spread over two partitions: COLUMBUS-text, with puzzles that only contain text and COLUMBUS-icon, with puzzles that contain at least one icon.

Results

Overall Performance

The models we test include open- and closed-source instruction-tuned and non-instruction-tuned vision-language models. We also experiment with two structural variants of closed- source models: forward (FC) and backward chaining (BC). All models are evaluated in a zero-shot setting using standard hyperparameter values.

Generally, the closed-source and larger open-source models perform best on both partitions, while the small, non-instruction-tuned models perform near-randomly. The best model for each partition is consistently GPT-4o, which is the expected result. Yet, none of the models surpass human accuracy, with average gaps of 38.17% on COLUMBUS-text and 30.64% on COLUMBUS-icon.

Forward chaining yields a negligible effect on GPT-4o (-1.88%) and Gemini (+2.26%), averaged across both models and paritions. On the contrary, backward chaining yields an 14.1% and 22.86% drop in accuracy for GPT-4o and GPT-4o-mini, averaged across the two partitions.

Model Sensitivity to Input Information

Can models benefit from a ground-truth structured description of the puzzle provided in their input? We experiment with four prompts, each of which supplies a model with varying degrees of additional information that can be used to solve a puzzle (1 = least information, 4 = most information). For both COLUMBUS-text (left plot) COLUMBUS-icon (right plot), we generally see that adding increasingly more information results in better performance.

VLM Generation of Puzzles

Given the strong generative abilities of VLMs, a natural question arises: can they generate puzzles without our methodology? Through a user study, we find that humans overwhelmingly prefer our puzzles over ones generated by DALLE-3. An example is shown below with a puzzle from COLUMBUS (left) and a puzzle generated by DALLE-3 (right).

Rule-based Analysis

We also perform an analysis of puzzles solved per rule using GPT-4o. We find that performance is much better on relational rather individual rules, and slightly better when individual rules are combined with modifier rules. We also note that, while the GPT-4o’s performance is similar on the two partitions, specific rules are more difficult for this model when represented as text (e.g., repetition rules). In contrast, others are more challenging when presented as icons (e.g., size).

Citation (BibTeX)

If you have found COLUMBUS useful, please cite us:

@article{kraaijveld2024columbus,
      title={COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes}, 
      author={Koen Kraaijveld and Yifan Jiang and Kaixin Ma and Filip Ilievski},
      year={2024},
      eprint={2409.04053},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.04053}, 
}