Abstract

Current visual grounding research remains limited for practical applications, because existing tasks primarily focus on direct visual queries (e.g., “find the red car”) or reading visible text (e.g., “what is the title of this book?”), rather than supporting general questions about objects (e.g., “how comfortable are these earbuds?”). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous tasks that ground only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this task, we develop two automated data generation techniques, which serve to train a lightweight visual grounding model, and to evaluate visual grounding approaches on the resulting synthetic benchmarks, ABO-VGOQ and VizWiz-VGOQ. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SoTA visual grounding performance decreases from 52% gIoU to 37% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (segmentation of visual evidence). On our new benchmarks, our lightweight model outperforms prior models while being much smaller.

Toward More Practical Visual Grounding

Existing Visual Grounding tasks

Grounding directly visible elements

Segmentation

Example of traditional segmentation from the COCO dataset. Input image: a rowing boat. Output: COCO segmentation masks.
Labels: boat, person
Example of open-vocabulary segmentation from dino.txt. Input image: boats at a dock. Output: open-vocabulary segmentation with labels.
Labels: boat, field, land, pedestal, pier, seat, river, sky, water
These methods handle only short class names; they cannot handle longer descriptions or reasoning.

Referring Expression Segmentation

Example from RefCOCO. Image: street scene with cars and a fire hydrant. Mask: the car on the right.
Referring expression: car on the right side
Example from RefCOCO. Image: boats at a dock. Mask: the black boat on the left.
Referring expression: black boat on left
Limited to direct visual descriptions, not open-ended questions.

VQA Grounding / VQ Grounding

Example from VizWiz-VQA-Grounding. Image: Mrs Dash seasoning bottle. Mask: the Mrs Dash label.
Q: What is the name of this seasoning? A: mrs dash
Example from the Toloka benchmark. Image: windsurfer on water. Mask: the sail.
Q: What makes the wind move a boat?
The answers to the questions are directly visible in the image.

Other existing Visual Grounding tasks

  • Phrase grounding
  • Dense captioning
  • Grounded captioning
  • Reasoning segmentation
  • 3D / Video Grounding
All these tasks are limited to grounding directly visible elements.
Visual Grounding for Object Questions (VGOQ)

Grounding visual evidence or useful context that supports, or helps in understanding, the answer to general questions about objects

Original object images from the ABO dataset, each followed by the question topics it helps with:
Desk front view (sturdiness)
Wood texture close-up (materials)
Desk with dimensions (dimensions, space)
Desk with drawers open
Drawer detail (materials, sturdiness)
Desk in a styled space (space)
  • What are the dimensions of this desk?
  • What materials is the desk made of?
  • Will this desk fit in my space?
  • How sturdy is this wooden desk?

Applications of Visual Grounding in e‑commerce stores require:

  • Handling general questions about objects, including abstract concepts (e.g., comfort, ease of use).
  • Identifying visual evidence or context that supports an answer, rather than the answer itself, which is not necessarily directly visible.
  • Possibly leveraging multiple images of the same object instead of a single image.
  • Possibly leveraging object metadata, e.g., title, description, and other object attributes on top of images.
There are no datasets containing images, questions about objects, and visual grounding of evidence or context, rather than visual grounding of the answer directly.

Automated Data Generation

One key challenge of VGOQ is the lack of datasets containing images, questions about objects, and visual grounding of evidence or context rather than visual grounding of the answer directly.

As a first step toward addressing this problem, we create two automated techniques:

  1. Transforming existing visual questions from VQA grounding datasets into general object questions → VizWiz-VGOQ.
  2. A zero-shot pipeline that combines Claude with existing grounding models (Molmo 7B-D, Florence-2, SAM-2) to create visual grounding for generated customer questions about products → ABO-VGOQ.
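
The first technique can be pictured as a record transformation: the image and mask of a VQA grounding example stay untouched, and only the question is replaced. The following is a minimal sketch under stated assumptions, not the authors' implementation. The `GroundedExample` record, the `rephrase` callable (a stand-in for whatever produces the new question), the toy mask, and the identifier `vizwiz_0001` are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Mask = List[List[int]]  # binary mask, 1 = highlighted pixel


@dataclass(frozen=True)
class GroundedExample:
    """Hypothetical record for one VQA grounding example."""
    image_id: str
    question: str
    answer: str
    mask: Mask


def to_object_question(
    example: GroundedExample,
    rephrase: Callable[[str, str], str],
) -> GroundedExample:
    """Swap the visual question for a general object question.

    `rephrase` is an assumed stand-in for the rewriting step. It receives
    the original question and answer and returns a question for which the
    grounded region is supporting evidence rather than the answer itself.
    The image reference and the mask are carried over unchanged.
    """
    new_question = rephrase(example.question, example.answer)
    return replace(example, question=new_question)


def toy_rephrase(question: str, answer: str) -> str:
    # Hard-coded with the Mrs Dash question from VizWiz-VGOQ, for illustration.
    return "Is this seasoning suitable for people on a low-sodium diet?"


original = GroundedExample(
    image_id="vizwiz_0001",  # made-up identifier
    question="What is the name of this seasoning?",
    answer="mrs dash",
    mask=[[0, 1, 1], [0, 1, 1], [0, 0, 0]],  # toy 3x3 mask
)
converted = to_object_question(original, toy_rephrase)
```

Passing the rewriting step in as a callable keeps the sketch independent of any particular model or API.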

We then use Claude to automatically categorize each segmentation by its evidential relationship to the object question. This allows us to train and evaluate visual grounding models separately on the different evidential quality categories (e.g., specific visual evidence, related context without visual evidence, etc.).
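
Once every example carries a category, building per-category training or evaluation subsets is a simple grouping step. The sketch below is illustrative only. The `Evidence` enum uses the two category names given in the text plus an assumed `OTHER` placeholder for the remaining ones, and the `Sample` record is hypothetical.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List, NamedTuple


class Evidence(Enum):
    SPECIFIC_VISUAL_EVIDENCE = "specific visual evidence"
    RELATED_CONTEXT = "related context without visual evidence"
    OTHER = "other"  # assumed placeholder for categories not named in the text


class Sample(NamedTuple):
    """Hypothetical record: one question with its assigned category."""
    question: str
    category: Evidence


def split_by_category(samples: List[Sample]) -> Dict[Evidence, List[Sample]]:
    """Group samples by evidential category, preserving input order."""
    groups: Dict[Evidence, List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[sample.category].append(sample)
    return dict(groups)


samples = [
    Sample("Are the Velcro straps easy for small children to open?",
           Evidence.SPECIFIC_VISUAL_EVIDENCE),
    Sample("Is the headboard attached or can it be removed?",
           Evidence.RELATED_CONTEXT),
    Sample("Is this product suitable for vegetarians?",
           Evidence.SPECIFIC_VISUAL_EVIDENCE),
]
groups = split_by_category(samples)
```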

1/ Convert VQA grounding datasets into general object questions

Examples from VizWiz-VGOQ

Object image: Mrs Dash seasoning bottle
Grounding: Mask highlighting the Mrs Dash label
Q: Is this seasoning suitable for people on a low-sodium diet?

The grounding highlights specific visual evidence (the "Mrs Dash" label) that supports answering the object question. Highlighting "mrs dash" is useful because this brand is specifically known for producing salt-free seasonings.

Object image: a can of beef ravioli
Grounding: Mask highlighting the Beef ravioli label
Q: Is this product suitable for vegetarians?

The grounding highlights specific visual evidence. Highlighting "beef ravioli" is useful because it shows that the product contains beef, which is meat from cattle.

2/ Combining multimodal LLMs and existing grounding models

Examples from ABO-VGOQ

Q: Are the Velcro straps easy for small children to open and close by themselves?

The highlighting shows specific visual evidence that could help answer the question by focusing on the wide, substantial Velcro straps. The images reveal that the straps are broad and appear to have good gripping surfaces, which would likely make them easier for small children to manipulate.

Q: Is the headboard attached or can it be removed?

The highlighting shows related context, but no visual evidence. While the highlighting correctly identifies the headboard that the customer is asking about, it does not, on its own, provide clear information about whether the headboard is attached or removable.

Lightweight Model

The data generation pipeline we developed to create the ABO-VGOQ data can locate visual evidence or context for answering general object questions, but it relies on multiple models and is computationally expensive, making it impractical for real-time deployment.

We thus train a lightweight model (1.77M parameters), based on CLIPSeg, that combines CLIP image and text encoders with a lightweight grounding transformer.

The model takes as input a textual input (e.g., an object question) and one image, and outputs a segmentation map and a relevance score. The relevance score indicates how relevant the image is for answering the question, and the segmentation map highlights visual evidence or context that would support answering the question.
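
One way these two outputs could be used together, when several images of the same object are available, is to run the model once per image, keep the image with the highest relevance score, and binarize its heatmap. This is a sketch of that usage and not part of the described model. The `Prediction` record, the function names, and both thresholds (0.5) are assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Heatmap = List[List[float]]  # per-pixel scores in [0, 1]
Mask = List[List[int]]       # binary mask, 1 = highlighted pixel


class Prediction(NamedTuple):
    """Hypothetical container for the model outputs on one image."""
    image_id: str
    relevance: float
    heatmap: Heatmap


def binarize(heatmap: Heatmap, threshold: float = 0.5) -> Mask:
    """Turn a heatmap into a binary mask (threshold is an assumption)."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def best_evidence(
    predictions: List[Prediction],
    min_relevance: float = 0.5,
    threshold: float = 0.5,
) -> Optional[Tuple[str, Mask]]:
    """Return (image_id, mask) for the most relevant image.

    Returns None when there are no predictions or when even the best
    image falls below `min_relevance`.
    """
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p.relevance)
    if top.relevance < min_relevance:
        return None
    return top.image_id, binarize(top.heatmap, threshold)


# Toy 2x2 outputs for two images of the same product.
predictions = [
    Prediction("front_view", 0.2, [[0.1, 0.2], [0.0, 0.3]]),
    Prediction("close_up", 0.9, [[0.1, 0.8], [0.6, 0.7]]),
]
result = best_evidence(predictions)
```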

Figure: model architecture. Inputs: an image (336 × 336) and a textual input (object question, visual question, referring expression, ...). Components, in order: CLIP Vision Encoder and CLIP Text Encoder; three projections (Proj. 1, Proj. 2, Proj. 3); a task-specific FiLM layer; three transformer blocks (Transformer Block 1, 2, 3) joined by additions; a split into the CLS token and the spatial tokens; Head 1 (Grounding) and Head 2 (Relevance). Outputs: a segmentation heatmap (336 × 336) and a relevance score ∈ [0, 1].
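
Two of the components listed above, the FiLM layer and the split into CLS and spatial tokens, can be illustrated in a few lines of plain Python. FiLM (feature-wise linear modulation) scales and shifts each feature channel using values computed from a conditioning vector, here standing in for the text embedding. Everything else is an assumption for the sake of a runnable sketch: the CLS token coming first, a single linear layer per head, the sigmoid on both outputs, and the toy dimensions. Upsampling the token grid to 336 × 336 is left out.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Plain affine map: one output per weight row."""
    return [dot(row, x) + b for row, b in zip(weights, bias)]


def film(tokens: List[Vector], cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> List[Vector]:
    """FiLM: every token becomes gamma * token + beta, channel by channel,
    where gamma and beta are affine functions of the conditioning vector."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * t + b for g, t, b in zip(gamma, token, beta)] for token in tokens]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def heads(tokens: List[Vector], w_ground: Vector, w_rel: Vector,
          grid: Tuple[int, int]) -> Tuple[Matrix, float]:
    """Split tokens into CLS (assumed first) and spatial tokens, then apply
    a relevance head to CLS and a grounding head to each spatial token."""
    cls_token, spatial = tokens[0], tokens[1:]
    rows, cols = grid
    if len(spatial) != rows * cols:
        raise ValueError("number of spatial tokens must equal rows * cols")
    relevance = sigmoid(dot(w_rel, cls_token))
    scores = [sigmoid(dot(w_ground, token)) for token in spatial]
    heatmap = [scores[r * cols:(r + 1) * cols] for r in range(rows)]
    return heatmap, relevance


# Toy check: gamma = [1, 0] and beta = [0, 1] map the token [2, 3] to [2, 1].
modulated = film([[2.0, 3.0]], cond=[1.0, 0.0],
                 w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
                 w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.0, 1.0])

tokens = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
heatmap, relevance = heads(tokens, w_ground=[1.0, 0.0], w_rel=[0.0, 0.0], grid=(2, 2))
```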

Evaluation

We evaluate various models (OFA, UnifiedIO, GLaMM, Qwen3-VL, our lightweight model) on our synthetic benchmarks (ABO-VGOQ and VizWiz-VGOQ), and on traditional VQ / VQA grounding benchmarks (VizWiz-VQA-Grounding, TextVQA-X, and Toloka benchmark).

Despite being much smaller (1.77M parameters), faster, and trained for much less time, our lightweight model outperforms most other models on the VGOQ tasks. On ABO-VGOQ, for the task of locating specific visual evidence, our model achieves 32.5–39.5% gIoU (average Intersection over Union), compared to 12.9–15.1% for a simple baseline, 11.9–17.9% for OFA, 12.3–15.6% for UnifiedIO, 19.4–22.3% for GLaMM, and 25.8–32.7% for Qwen3-VL.
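
For reference, gIoU as described here is the mean of the per-image IoU values. The sketch below computes it on binary masks stored as nested lists. Scoring a pair of empty masks as 1.0 is an assumed convention, not something stated in the text, and the toy masks are made up.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask, 1 = highlighted pixel


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def giou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average the per-image IoU over the whole evaluation set."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equally long")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = giou(preds, targets)  # per-image IoUs are 0.5 and 1.0
```

Because every image contributes equally, small objects weigh as much as large ones, unlike a metric that pools intersections and unions over the whole dataset.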

Limitations and Future Work

Synthetic datasets. Both datasets are automatically generated rather than manually created, so they lack the quality and precision of human-created segmentations, and the quality of individual examples varies. However, taken as a whole, the metric (gIoU) appears to be a good indication of a model's ability to locate visual evidence or relevant context that would support answering general object questions. To the best of our knowledge, there are no existing datasets that contain images, general questions about objects, and visual grounding of the evidence or context that would support answering those questions, rather than visual grounding of the answer directly.

Subjectivity and task definition. Visual evidence assessment for object questions is inherently subjective. What constitutes valid evidence can vary with the context and the person. For instance, for a size question, it is unclear whether an image that conveys size only relative to other elements counts as valid evidence, or whether precise measurements are required. Future work could categorize question types and define what counts as valid evidence, including the distinction between direct visual evidence and visual evidence requiring external knowledge (see the Mrs Dash example above).

Citation

Please use the following BibTeX entry to cite our paper:

@inproceedings{everaert2026visual,
    title     = {{V}isual {G}rounding for {O}bject {Q}uestions},
    author    = {Everaert, Martin Nicolas and Liu, Xiruo and Takeda, Hiroyuki and Bala, Raja and Yadav, Vivek and Narayanan, Vidya},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)},
    year      = {2026},
    pages     = {11966-11975},
    url       = {https://martin-ev.github.io/vgoq}
}