ICCV 2025 - Honolulu, Hawai'i
October 20th - 8:30 AM - 6:00 PM (GMT-10)

CLVL: 6th Workshop on Closing the Loop Between Vision and Language (Decade Mark)

The Sixth Workshop on Closing the Loop Between Vision and Language, hosted at ICCV 2025.


About the Workshop

This workshop explores the intersection of Computer Vision and Natural Language Processing (NLP), focusing on joint vision-language understanding. Recent advances in large-scale multimodal pretraining—especially with transformer architectures—have driven significant progress across a range of tasks, including:

  • Visual-linguistic representation learning
  • Visual Question Answering (VQA)
  • Image and video captioning
  • Visual dialog and referring expressions
  • Vision-and-language navigation and embodied QA
  • Text-to-image generation
  • Joint video-language understanding

We also emphasize critical work on dataset and algorithmic bias, generalization, transparency, and explainability. The workshop welcomes contributions addressing both foundational advances and real-world challenges in vision-language research.

Challenges

Reliable Visual Question Answering

Visual Question Answering (VQA) is a core task in Vision and Language research. While models have achieved high accuracy on datasets like VQAv2, their reliability—such as abstaining when uncertain—remains a challenge. This task, known as “Selective Prediction,” evaluates models on their ability to be self-aware and handle out-of-distribution data or distribution shifts.
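As a rough illustration of how selective prediction is commonly scored (a minimal Python sketch, not the official challenge evaluation code; the function and threshold below are hypothetical), a model answers only when its confidence clears a threshold and is then judged by coverage (how often it answers) and risk (its error rate on the questions it chose to answer):

    # Minimal sketch of selective-prediction scoring for VQA (illustrative only).
    def coverage_and_risk(confidences, correct, threshold):
        """Answer only when confidence >= threshold; abstain otherwise.

        coverage = fraction of questions answered
        risk     = error rate among the answered questions
        """
        answered = [c >= threshold for c in confidences]
        n_answered = sum(answered)
        if n_answered == 0:
            return 0.0, 0.0  # model abstained on everything
        errors = sum(1 for a, ok in zip(answered, correct) if a and not ok)
        return n_answered / len(confidences), errors / n_answered

    # Toy example: four questions, abstain below 0.6 confidence.
    conf = [0.9, 0.4, 0.7, 0.55]
    ok = [True, False, False, True]
    print(coverage_and_risk(conf, ok, threshold=0.6))  # (0.5, 0.5)

Lowering the threshold raises coverage but typically also raises risk; the challenge rewards models whose confidence reliably separates correct from incorrect answers.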

Details
Long-Video Understanding (InfiniBench)

Understanding long videos (tens of minutes to hours) is a major challenge. InfiniBench is a benchmark with 1,000+ hours of video, 111.82K Q&A pairs, and diverse question types. It evaluates state-of-the-art multi-modal models and highlights the challenges of long video understanding.

Details
Multimodal Inconsistency Reasoning (MMIR)

Multimodal Large Language Models (MLLMs) are becoming increasingly proficient at reasoning over rich visual-textual content. However, real-world documents—like webpages, presentation slides, and posters—often contain semantic inconsistencies that current models struggle to detect. The MMIR Challenge invites participants to build models that can detect and localize inconsistencies in such layout-rich, multimodal artifacts.

Details
3D Vision and Language Challenge

This challenge, held as part of the ICCV 2025 CLVL workshop, focuses on localizing objects in a 3D scene from natural-language descriptions. Building on ScanRefer, it uses Multi3DRefer and ViGiL3D to evaluate models’ ability to identify objects in complex scenes and handle diverse linguistic prompts.
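As a rough sketch of how such grounding is typically evaluated (assuming single-object, axis-aligned 3D bounding boxes; this is not the official challenge code), accuracy is commonly reported at 3D IoU thresholds such as 0.25 and 0.5 between predicted and ground-truth boxes:

    # Illustrative Acc@IoU scoring for 3D visual grounding (axis-aligned boxes assumed).
    def iou_3d(box_a, box_b):
        """3D IoU of boxes given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
        (a_min, a_max), (b_min, b_max) = box_a, box_b

        def volume(lo, hi):
            return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])

        inter = 1.0
        for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
            overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
            if overlap <= 0:
                return 0.0  # boxes do not overlap along this axis
            inter *= overlap
        union = volume(a_min, a_max) + volume(b_min, b_max) - inter
        return inter / union

    def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
        """Fraction of predictions whose IoU with the ground truth meets the threshold."""
        hits = sum(iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
        return hits / len(gt_boxes)

    pred = [((0, 0, 0), (2, 2, 2))]
    gt = [((1, 0, 0), (3, 2, 2))]
    print(acc_at_iou(pred, gt, threshold=0.25))  # IoU = 4/12 ≈ 0.33 -> 1.0

Note that Multi3DRefer also includes prompts with zero or multiple target objects, so the actual challenge metric is more involved than this single-box sketch.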

Details

Paper submission

Call for papers

Format


The typical paper length is 4 pages, excluding references, in ICCV 2025 format. The author kit, which includes templates and guidelines for formatting submissions, can be found here.

Archival submissions: Following ICCV 2025 guidelines, only long-form papers (more than 4 and up to 8 pages, i.e., 4 < x ≤ 8) are eligible for archival submission. This aligns with the dual-submission policy, under which peer-reviewed papers exceeding 4 pages qualify as archival publications.

Non-archival submissions are limited to 4 pages.
Submitted papers will be peer-reviewed in a double-blind manner. Abstracts will be presented in the workshop poster session, and a portion of the accepted papers will be presented orally.

Paper submissions will be handled through CMT3, a conference management tool for organizing submissions and peer review. You can access it via the following link: CMT3 ICCVCLVL2025

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.


Topics


The scope of this workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). There has been growing interest in the joint understanding of vision and language in recent years, with researchers studying a multitude of tasks at this intersection. Most recently, there has also been strong interest in large-scale multi-modal pretraining, where Transformer-based architectures have established the new state of the art on a range of tasks. Topics include but are not limited to:

Example Topics of Interest
  • Novel problems in vision and language
  • Learning to solve non-visual tasks using visual cues
  • Language guided visual understanding (objects, relationships)
  • Visual dialog and question answering by visual verification
  • Visual question generation
  • Vision and Language Navigation
  • Visually grounded conversation
  • Deep learning methods for vision and language
  • Language-based image and video search/retrieval
  • Linguistic descriptions of spatial relations
  • Auto-illustration
  • Natural language grounding & learning by watching
  • Learning knowledge from the web
  • Language as a mechanism to structure and reason about visual perception
  • Language as a learning bias to aid vision in both machines and humans
  • Dialog as means of sharing knowledge about visual perception
...and more

Dates

Timeline

Archival track (will appear in the ICCV proceedings):
  • Paper submission deadline: July 7, 2025 - 11:59 PM (AoE)
  • Notification to authors: July 11, 2025
  • Camera-ready deadline: August 18, 2025
  • Workshop date: October 20, 2025

Non-archival track:
  • Paper submission deadline: August 15, 2025
  • Notification to authors: September 19, 2025
  • Camera-ready deadline: September 26, 2025
  • Workshop date: October 20, 2025

Speakers

Invited Speakers

Richard Socher

CEO, You.com; Adjunct Professor, Stanford University

Ani Kembhavi

Director of Science Strategy, Wayve AI; Affiliate Associate Professor, University of Washington

Kate Saenko

Full Professor, Boston University; AI Research Scientist, FAIR (Meta)

Georgia Gkioxari

Assistant Professor, California Institute of Technology (Caltech)

Aishwarya Agrawal

Assistant Professor, University of Montreal

Katerina Fragkiadaki

Assistant Professor, Carnegie Mellon University

Yong Jae Lee

Associate Professor, University of Wisconsin-Madison

Organizers

Workshop Organizers

Mohamed Elhoseiny

Associate Professor, King Abdullah University of Science and Technology (KAUST)

Angel Chang

Associate Professor, Simon Fraser University

Anna Rohrbach

Full Professor, TU Darmstadt

Marcus Rohrbach

Full Professor, TU Darmstadt

Xin Eric Wang

Assistant Professor, UC Santa Cruz

Krishna Kumar

Research Scientist, Adobe

Kilichbek Haydarov

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Eslam Abdelrahman

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Austin Wang

PhD Student, Simon Fraser University

Yiming Zhang

PhD Student, Simon Fraser University

Tobias Wieczorek

PhD Student, TU Darmstadt

Qianqi (Jackie) Yan

PhD Student, UC Santa Cruz

Program

Workshop Program

The workshop program is as follows:

Morning Session
  • 8:30 - 8:45 Introduction
  • 8:45 - 9:15 Invited Talk 1: Richard Socher
  • 9:15 - 10:45 Invited Talk 2: Ani Kembhavi
  • 10:45 - 11:15 Coffee Break
  • 11:15 - 11:45 Invited Talk 3: Kate Saenko
  • 11:45 - 12:00 Challenge 1: Reliable Visual Question Answering
  • 12:00 - 12:15 Challenge 2: Long-Video Understanding
  • 12:15 - 12:45 Spotlight Session (selected workshop accepted papers)
  • 12:45 - 14:15 Lunch and Posters
Afternoon Session
  • 14:15 - 14:45 Invited Talk 4: Aishwarya Agrawal
  • 14:45 - 15:15 Invited Talk 5: Georgia Gkioxari
  • 15:15 - 15:30 Challenge 3: Multimodal Inconsistency Reasoning
  • 15:30 - 16:00 Coffee Break
  • 16:00 - 16:30 Invited Talk 6: Katerina Fragkiadaki
  • 16:30 - 16:45 Challenge 4: 3D Language Grounding
  • 16:45 - 17:15 Invited Talk 7: Yong Jae Lee
  • 17:15 - 18:00 Panel discussion with invited speakers

For any questions or support, please reach out to @Mohamed Elhoseiny.