Visual Grounding for Object Questions
Abstract
Current visual grounding research remains limited for practical applications, because existing tasks primarily focus on direct visual queries (e.g., “find the red car”) or reading visible text (e.g., “what is the title of this book?”), rather than supporting general questions about objects (e.g., “how comfortable are these earbuds?”). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous tasks that ground only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this task, we develop two automated data generation techniques, which serve to train a lightweight visual grounding model, and to evaluate visual grounding approaches on the resulting synthetic benchmarks, ABO-VGOQ and VizWiz-VGOQ. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SoTA visual grounding performance decreases from 52% gIoU to 37% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (segmentation of visual evidence). On our new benchmarks, our lightweight model outperforms prior models while being much smaller.
Video Presentation
Toward More Practical Visual Grounding
Grounding directly visible elements
Segmentation
VQA Grounding / VQ Grounding
Other existing Visual Grounding tasks
- Phrase grounding
- Dense captioning
- Grounded captioning
- Reasoning segmentation
- 3D / Video Grounding
Grounding visual evidence or useful context that supports or helps understanding the answer to general questions about objects
Applications of Visual Grounding in e‑commerce stores require:
- Handling general questions about objects, including abstract concepts (e.g., comfort, ease of use).
- Identifying visual evidence or context that supports an answer, not just the answer itself (not necessarily directly visible).
- Possibly leveraging multiple images of the same object instead of a single image.
- Possibly leveraging object metadata, e.g., title, description, and other object attributes on top of images.
Automated Data Generation
One key challenge of VGOQ is the lack of datasets containing images, questions about objects, and visual grounding of evidence or context rather than visual grounding of the answer directly.
As a first step toward addressing this problem, we create two automated techniques:
- Transforming existing visual questions from VQA grounding datasets into general object questions → VizWiz-VGOQ.
- A zero-shot pipeline using Claude and traditional grounding models (Molmo 7B-D, Florence-2, SAM-2) to create visual grounding in a zero-shot manner for generating customer questions about products → ABO-VGOQ.
We then use Claude to automatically categorize the segmentation based on its evidential relationship to the object question. This allows us to train and evaluate visual grounding models on the different evidential quality categories (e.g., specific visual evidence, related context without visual evidence, etc).
Examples from VizWiz-VGOQ
The grounding highlights specific visual evidence (the "Mrs Dash" label) that supports answering the object question. Highlighting "mrs dash" is useful because this brand is specifically known for producing salt-free seasonings.
The grounding highlights specific visual evidence. Highlighting "beef ravioli" is useful because it shows that the product contains beef, which is meat from cattle.
Examples from ABO-VGOQ
+
product listing from ABO dataset (Item name: "Red Wagon Quilted Triple Strap Velcro, Boys’ Low-Top Sneakers")
Grounding:
The highlighting shows specific visual evidence that could help answer the question by focusing on the wide, substantial Velcro straps. The images reveal that the straps are broad and appear to have good gripping surfaces, which would likely make them easier for small children to manipulate.
+
product listing from ABO dataset (Item name: "Amazon Brand - Stone & Beam Prudence Tufted King Bed, 84"W, Curious Pearl")
Grounding:
The highlighting shows related context, but no visual evidence. While the highlighting correctly identifies the headboard that the customer is asking about, it doesn't provide alone clear information about whether the headboard is attached or removable.
Lightweight Model
The data generation pipeline we developed to create the ABO-VGOQ data allows locating visual evidence / context for answering general object questions, but it relies on multiple models and is computationally expensive, making it impractical for real-time deployment.
We thus train a lightweight model (1.77M parameters), based on CLIPSeg, that combines CLIP image and text encoders with a lightweight grounding transformer.
The model takes as input a textual input (e.g., an object question) and one image, and outputs a segmentation map and a relevance score. The relevance score indicates how relevant the image is for answering the question, and the segmentation map highlights visual evidence or context that would support answering the question.
336 × 336
(object question, visual question, referring expression, ...)
336 × 336
∈ [0, 1]
Evaluation
We evaluate various models (OFA, UnifiedIO, GLaMM, Qwen3-VL, our lightweight model) on our synthetic benchmarks (ABO-VGOQ and VizWiz-VGOQ), and on traditional VQ / VQA grounding benchmarks (VizWiz-VQA-Grounding, TextVQA-X, and Toloka benchmark).
Despite being much smaller (1.77M parameters), faster, and trained for much less time, our lightweight model outperforms most other models on the VGOQ tasks. On ABO-VGOQ, for the task of locating specific visual evidence, our model achieves 32.5–39.5% gIoU (average Intersection over Union), compared to 12.9–15.1% for a simple baseline, 11.9–17.9% for OFA, 12.3–15.6% for UnifiedIO, 19.4–22.3% for GLaMM, and 25.8–32.7% for Qwen3-VL.
Limitations and Future Work
Synthetic datasets. Both datasets are automatically generated rather than manually created, lacking the quality and precision of human-created segmentation. The quality of individual examples varies. However, taken as a whole, the metric (gIoU) appears to be a good indication of the model's ability to locate visual evidence or relevant context that would support answering general object questions. To the best of our knowledge, there are no prior existing datasets that contain images, general questions about objects, and visual grounding of the evidence / context that would support answering those questions, rather than visual grounding of the answer directly.
Subjectivity and task definition. Visual evidence assessment for object questions is inherently subjective. What constitutes valid evidence can vary based on context and persons. For instance, for a size question, it is unclear whether showing an image where size is only conveyed relative to other elements counts as valid evidence, or if precise measurements are required. Future work could categorize question types and define what counts as valid evidence, including the distinction between direct visual evidence and visual evidence requiring external knowledge (see the Mrs Dash example above).
Poster
Citation
Please use the following BibTeX entry to cite our paper:
@inproceedings{everaert2026visual,
title = {{V}isual {G}rounding for {O}bject {Q}uestions},
author = {Everaert, Martin Nicolas and Liu, Xiruo and Takeda, Hiroyuki and Bala, Raja and Yadav, Vivek and Narayanan, Vidya},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)},
year = {2026},
pages = {11966-11975},
url = {https://martin-ev.github.io/vgoq}
}