Visual Grounding for Object Questions

Martin Nicolas Everaert^{* 1}, Xiruo Liu², Hiroyuki Takeda², Raja Bala², Vivek Yadav², Vidya Narayanan²

¹ EPFL (IVRL lab), ² Amazon
^*Work done during an internship at Amazon

CVPR 2026

Abstract

Current visual grounding research remains limited for practical applications, because existing tasks primarily focus on direct visual queries (e.g., “find the red car”) or reading visible text (e.g., “what is the title of this book?”), rather than supporting general questions about objects (e.g., “how comfortable are these earbuds?”). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous tasks that ground only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this task, we develop two automated data generation techniques, which serve to train a lightweight visual grounding model, and to evaluate visual grounding approaches on the resulting synthetic benchmarks, ABO-VGOQ and VizWiz-VGOQ. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SoTA visual grounding performance decreases from 52% gIoU to 37% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (segmentation of visual evidence). On our new benchmarks, our lightweight model outperforms prior models while being much smaller.

Toward More Practical Visual Grounding

Existing Visual Grounding tasks

Grounding directly visible elements

Segmentation

Open vocabulary segmentation with labels

Only handles short class names; cannot handle longer descriptions or reasoning.

Referring Expression Segmentation

Mask highlighting the black boat on the left

Limited to direct visual descriptions, not open-ended questions.

VQA Grounding / VQ Grounding

The answers to the questions are directly visible in the image.

Other existing Visual Grounding tasks

Phrase grounding
Dense captioning
Grounded captioning
Reasoning segmentation
3D / Video Grounding

All these tasks are limited to grounding directly visible elements.

Visual Grounding for Object Questions (VGOQ): a new task

Grounding visual evidence or useful context that supports, or helps in understanding, the answer to general questions about objects.

Example object images from the ABO dataset: desk front view and wood texture close-up.

Applications of Visual Grounding in e‑commerce stores require:

Handling general questions about objects, including abstract concepts (e.g., comfort, ease of use).
Identifying visual evidence or context that supports an answer, not just the answer itself (not necessarily directly visible).
Possibly leveraging multiple images of the same object instead of a single image.
Possibly leveraging object metadata, e.g., title, description, and other object attributes on top of images.

There are no datasets containing images, questions about objects, and visual grounding of evidence or context, rather than visual grounding of the answer directly.

Automated Data Generation

One key challenge of VGOQ is the lack of datasets containing images, questions about objects, and visual grounding of evidence or context rather than visual grounding of the answer directly.

As a first step toward addressing this problem, we create two automated techniques:

Transforming existing visual questions from VQA grounding datasets into general object questions → VizWiz-VGOQ.
A zero-shot pipeline that combines Claude with existing grounding models (Molmo 7B-D, Florence-2, SAM-2) to create visual grounding for generated customer questions about products → ABO-VGOQ.

We then use Claude to automatically categorize each segmentation by its evidential relationship to the object question. This allows us to train and evaluate visual grounding models on the different evidential quality categories (e.g., specific visual evidence, related context without visual evidence, etc.).
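The categorization step can be pictured as attaching a label to each grounding example and then filtering by label. The sketch below is only an illustration: the first two category names come from the text above, while `OTHER`, the `GroundingExample` fields, and the image identifiers are hypothetical and are not taken from the released datasets.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceCategory(Enum):
    # The first two values are named in the text; OTHER is a hypothetical catch-all.
    SPECIFIC_VISUAL_EVIDENCE = "specific visual evidence"
    RELATED_CONTEXT = "related context without visual evidence"
    OTHER = "other"


@dataclass
class GroundingExample:
    # Hypothetical record layout, not the actual dataset schema.
    question: str
    image_id: str
    category: EvidenceCategory


def select_by_category(examples, category):
    """Return only the examples whose grounding falls in the given category."""
    return [ex for ex in examples if ex.category is category]


examples = [
    GroundingExample("Is this product suitable for vegetarians?", "img_001",
                     EvidenceCategory.SPECIFIC_VISUAL_EVIDENCE),
    GroundingExample("Is the headboard attached or can it be removed?", "img_002",
                     EvidenceCategory.RELATED_CONTEXT),
]

evidence_only = select_by_category(examples, EvidenceCategory.SPECIFIC_VISUAL_EVIDENCE)
print([ex.image_id for ex in evidence_only])
```

Keeping the category as an explicit field makes it straightforward to build separate training or evaluation subsets per evidential quality level.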

1/ Converting VQA grounding datasets into general object questions

Examples from VizWiz-VGOQ

Object image:

Grounding:

Q: Is this seasoning suitable for people on a low-sodium diet?

The grounding highlights specific visual evidence (the "Mrs Dash" label) that supports answering the object question. Highlighting "mrs dash" is useful because this brand is specifically known for producing salt-free seasonings.

Object image:

Grounding:

Mask highlighting the Beef ravioli label

Q: Is this product suitable for vegetarians?

The grounding highlights specific visual evidence. Highlighting "beef ravioli" is useful because it shows that the product contains beef, which is meat from cattle.

2/ Combining multimodal LLMs and existing grounding models

Examples from ABO-VGOQ

Input (Object images + Metadata):

+ product listing from ABO dataset (Item name: "Red Wagon Quilted Triple Strap Velcro, Boys’ Low-Top Sneakers")

Grounding:

Q: Are the Velcro straps easy for small children to open and close by themselves?

The highlighting shows specific visual evidence that could help answer the question by focusing on the wide, substantial Velcro straps. The images reveal that the straps are broad and appear to have good gripping surfaces, which would likely make them easier for small children to manipulate.

Input (Object images + Metadata):

+ product listing from ABO dataset (Item name: "Amazon Brand - Stone & Beam Prudence Tufted King Bed, 84"W, Curious Pearl")

Grounding:

Q: Is the headboard attached or can it be removed?

The highlighting shows related context, but no visual evidence. While the highlighting correctly identifies the headboard that the customer is asking about, it does not, on its own, provide clear information about whether the headboard is attached or removable.

Lightweight Model

The data generation pipeline we developed to create the ABO-VGOQ data allows locating visual evidence / context for answering general object questions, but it relies on multiple models and is computationally expensive, making it impractical for real-time deployment.

We thus train a lightweight model (1.77M parameters), based on CLIPSeg, that combines CLIP image and text encoders with a lightweight grounding transformer.

The model takes a textual input (e.g., an object question) and one image, and outputs a segmentation map and a relevance score. The relevance score indicates how relevant the image is for answering the question, and the segmentation map highlights visual evidence or context that would support answering the question.
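To make the two outputs concrete, here is a minimal sketch of how they might be consumed when an object has several images: pick the image with the highest relevance score, then binarize its heatmap. The 0.5 threshold, the function names, and the toy 2 × 2 heatmaps are assumptions for illustration; the page does not specify any post-processing.

```python
def heatmap_to_mask(heatmap, threshold=0.5):
    """Binarize a heatmap given as a list of rows of floats in [0, 1].

    The default threshold of 0.5 is an assumption, not a value from the paper.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def most_relevant(predictions):
    """Return the (image_id, heatmap, relevance) triple with the highest relevance."""
    return max(predictions, key=lambda p: p[2])


# Hypothetical model outputs for two images of the same object.
predictions = [
    ("front_view", [[0.1, 0.2], [0.3, 0.4]], 0.35),
    ("strap_closeup", [[0.9, 0.8], [0.2, 0.1]], 0.92),
]

image_id, heatmap, relevance = most_relevant(predictions)
mask = heatmap_to_mask(heatmap)
print(image_id, mask)
```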

[Figure: model architecture. Inputs: an image (336 × 336) and a textual input (object question, visual question, referring expression, ...). Components: frozen (❄) CLIP Vision Encoder and CLIP Text Encoder; Proj. 1, Proj. 2, Proj. 3; task-specific FiLM layer; Transformer Blocks 1, 2, and 3 joined by additions (+); split into CLS token and spatial tokens; Head 1 (grounding) and Head 2 (relevance). Outputs: a segmentation heatmap (336 × 336) and a relevance score ∈ [0, 1].]
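Two of the components named in the architecture, the FiLM layer and the CLS / spatial token split, can be illustrated in a few lines. This is a generic sketch of feature-wise linear modulation (a per-channel scale and shift), not the authors' task-specific layer: in a real model the scale `gamma` and shift `beta` would be predicted from the conditioning input, whereas here they are fixed constants, and placing the CLS token first in the sequence is an assumption.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation to every token.

    features: list of tokens, each a list of C floats
    gamma, beta: per-channel scale and shift, each a list of C floats
    """
    return [[g * x + b for x, g, b in zip(token, gamma, beta)] for token in features]


def split_tokens(tokens):
    """Split a token sequence into (CLS token, spatial tokens).

    Assumes the CLS token comes first in the sequence.
    """
    return tokens[0], tokens[1:]


# Toy sequence: one CLS token followed by two spatial tokens, two channels each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gamma, beta = [2.0, 0.5], [1.0, -1.0]

modulated = film(tokens, gamma, beta)
cls_token, spatial_tokens = split_tokens(modulated)
print(cls_token, spatial_tokens)
```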

Evaluation

We evaluate various models (OFA, UnifiedIO, GLaMM, Qwen3-VL, our lightweight model) on our synthetic benchmarks (ABO-VGOQ and VizWiz-VGOQ), and on traditional VQ / VQA grounding benchmarks (VizWiz-VQA-Grounding, TextVQA-X, and Toloka benchmark).

Despite being much smaller (1.77M parameters), faster, and trained for much less time, our lightweight model outperforms most other models on the VGOQ tasks. On ABO-VGOQ, for the task of locating specific visual evidence, our model achieves 32.5–39.5% gIoU (average Intersection over Union), compared to 12.9–15.1% for a simple baseline, 11.9–17.9% for OFA, 12.3–15.6% for UnifiedIO, 19.4–22.3% for GLaMM, and 25.8–32.7% for Qwen3-VL.
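For readers unfamiliar with the metric, the sketch below computes gIoU as described above, that is, the average of the Intersection over Union values obtained for each example. The mask representation (nested lists of 0/1) and the convention that two empty masks score 1.0 are assumptions made for this illustration; the toy masks are not results from the paper.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks (lists of rows of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def giou(preds, targets):
    """Average of the per-example IoU values."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy 2 x 2 examples: the first overlaps on 1 of 2 pixels, the second matches exactly.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
print(giou(preds, targets))
```

Because each example contributes equally, small and large regions weigh the same in the average.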

Limitations and Future Work

Synthetic datasets. Both datasets are automatically generated rather than manually created, so they lack the quality and precision of human-created segmentation, and the quality of individual examples varies. However, taken as a whole, the metric (gIoU) appears to be a good indication of the model's ability to locate visual evidence or relevant context that would support answering general object questions. To the best of our knowledge, there are no existing datasets that contain images, general questions about objects, and visual grounding of the evidence / context that would support answering those questions, rather than visual grounding of the answer directly.

Subjectivity and task definition. Visual evidence assessment for object questions is inherently subjective. What constitutes valid evidence can vary with context and from person to person. For instance, for a size question, it is unclear whether an image in which size is only conveyed relative to other elements counts as valid evidence, or whether precise measurements are required. Future work could categorize question types and define what counts as valid evidence, including the distinction between direct visual evidence and visual evidence requiring external knowledge (see the Mrs Dash example above).

Citation

Please use the following BibTeX entry to cite our paper:

@inproceedings{everaert2026visual,
    title     = {{V}isual {G}rounding for {O}bject {Q}uestions},
    author    = {Everaert, Martin Nicolas and Liu, Xiruo and Takeda, Hiroyuki and Bala, Raja and Yadav, Vivek and Narayanan, Vidya},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)},
    year      = {2026},
    pages     = {11966--11975},
    url       = {https://martin-ev.github.io/vgoq}
}