ICCV 2025 - Honolulu, Hawai'i
October 20th - 8:30 AM - 6:00 PM (GMT-10)

CLVL: 6th Workshop on Closing the Loop Between Vision and Language (Decade Mark)

The Sixth Workshop on Closing the Loop Between Vision and Language, hosted at ICCV 2025.


About the Workshop

This workshop explores the intersection of Computer Vision and Natural Language Processing (NLP), focusing on joint vision-language understanding. Recent advances in large-scale multimodal pretraining—especially with transformer architectures—have driven significant progress across a range of tasks, including:

  • Visual-linguistic representation learning
  • Visual Question Answering (VQA)
  • Image and video captioning
  • Visual dialog and referring expressions
  • Vision-and-language navigation and embodied QA
  • Text-to-image generation
  • Joint video-language understanding

We also emphasize critical work on dataset and algorithmic bias, generalization, transparency, and explainability. The workshop welcomes contributions addressing both foundational advances and real-world challenges in vision-language research.

Challenges

Reliable Visual Question Answering

Visual Question Answering (VQA) is a core task in Vision and Language research. While models have achieved high accuracy on datasets like VQAv2, their reliability—such as abstaining when uncertain—remains a challenge. This task, known as “Selective Prediction,” evaluates models on their ability to be self-aware and handle out-of-distribution data or distribution shifts.
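As a rough illustration of how selective prediction is commonly scored (a minimal Python sketch, not the official challenge evaluation code; the function and threshold below are hypothetical), a model answers only when its confidence clears a threshold and is then judged by coverage (how often it answers) and risk (its error rate on the questions it chose to answer):

    # Minimal sketch of selective-prediction scoring for VQA (illustrative only).
    def coverage_and_risk(confidences, correct, threshold):
        """Answer only when confidence >= threshold; abstain otherwise.

        coverage = fraction of questions answered
        risk     = error rate among the answered questions
        """
        answered = [c >= threshold for c in confidences]
        n_answered = sum(answered)
        if n_answered == 0:
            return 0.0, 0.0  # model abstained on everything
        errors = sum(1 for a, ok in zip(answered, correct) if a and not ok)
        return n_answered / len(confidences), errors / n_answered

    # Toy example: four questions, abstain below 0.6 confidence.
    conf = [0.9, 0.4, 0.7, 0.55]
    ok = [True, False, False, True]
    print(coverage_and_risk(conf, ok, threshold=0.6))  # (0.5, 0.5)

Lowering the threshold raises coverage but typically also raises risk; the challenge rewards models whose confidence reliably separates correct from incorrect answers.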

Details
Long-Video Understanding (InfiniBench)

Understanding long videos (tens of minutes to hours) is a major challenge. InfiniBench is a benchmark with 1,000+ hours of video, 111.82K Q&A pairs, and diverse question types. It evaluates state-of-the-art multi-modal models and highlights the challenges of long video understanding.

Details
Multimodal Inconsistency Reasoning (MMIR)

Multimodal Large Language Models (MLLMs) are becoming increasingly proficient at reasoning over rich visual-textual content. However, real-world documents—like webpages, presentation slides, and posters—often contain semantic inconsistencies that current models struggle to detect. The MMIR Challenge invites participants to build models that can detect and localize inconsistencies in such layout-rich, multimodal artifacts.

Details
3D Vision and Language Challenge

This challenge, held as part of the ICCV 2025 CLVL workshop, focuses on localizing objects in a 3D scene from natural-language descriptions. Building on ScanRefer, it uses Multi3DRefer and ViGiL3D to evaluate models’ ability to identify objects in complex scenes and handle diverse linguistic prompts.
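As a rough sketch of how such grounding is typically evaluated (assuming single-object, axis-aligned 3D bounding boxes; this is not the official challenge code), accuracy is commonly reported at 3D IoU thresholds such as 0.25 and 0.5 between predicted and ground-truth boxes:

    # Illustrative Acc@IoU scoring for 3D visual grounding (axis-aligned boxes assumed).
    def iou_3d(box_a, box_b):
        """3D IoU of boxes given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
        (a_min, a_max), (b_min, b_max) = box_a, box_b

        def volume(lo, hi):
            return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])

        inter = 1.0
        for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
            overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
            if overlap <= 0:
                return 0.0  # boxes do not overlap along this axis
            inter *= overlap
        union = volume(a_min, a_max) + volume(b_min, b_max) - inter
        return inter / union

    def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
        """Fraction of predictions whose IoU with the ground truth meets the threshold."""
        hits = sum(iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
        return hits / len(gt_boxes)

    pred = [((0, 0, 0), (2, 2, 2))]
    gt = [((1, 0, 0), (3, 2, 2))]
    print(acc_at_iou(pred, gt, threshold=0.25))  # IoU = 4/12 ≈ 0.33 -> 1.0

Note that Multi3DRefer also includes prompts with zero or multiple target objects, so the actual challenge metric is more involved than this single-box sketch.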

Details

Paper submission

Call for papers

Format


The typical paper length is 4 pages, excluding references, in ICCV 2025 format. The author kit, which includes templates and guidelines for formatting submissions, can be found here.

Archival submissions: Following ICCV 2025 guidelines, only long-form papers (more than 4 and up to 8 pages, i.e., 4 < x ≤ 8) are eligible for archival submission. This aligns with the dual-submission policy, under which peer-reviewed papers exceeding 4 pages qualify as archival publications.

Non-archival submissions are limited to 4 pages.
Submitted papers will be peer-reviewed in a double-blind manner. Abstracts will be presented in the workshop poster session, and a portion of the accepted papers will be presented orally.

Paper submissions will be handled through CMT3, a conference management tool for organizing submissions and peer review. You can access it via the following link: CMT3 ICCVCLVL2025

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.


Topics


The scope of this workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). There has been growing interest in the joint understanding of vision and language in recent years, with researchers studying a multitude of tasks at this intersection. Most recently, there has also been strong interest in large-scale multi-modal pretraining, where Transformer-based architectures have established the new state of the art on a range of tasks. Topics include but are not limited to:

Example Topics of Interest
  • Novel problems in vision and language
  • Learning to solve non-visual tasks using visual cues
  • Language guided visual understanding (objects, relationships)
  • Visual dialog and question answering by visual verification
  • Visual question generation
  • Vision and Language Navigation
  • Visually grounded conversation
  • Deep learning methods for vision and language
  • Language-based image and video search/retrieval
  • Linguistic descriptions of spatial relations
  • Auto-illustration
  • Natural language grounding & learning by watching
  • Learning knowledge from the web
  • Language as a mechanism to structure and reason about visual perception
  • Language as a learning bias to aid vision in both machines and humans
  • Dialog as means of sharing knowledge about visual perception
...and more

Dates

Timeline

Archival track (will appear in the ICCV proceedings):
  • Paper submission deadline: July 7, 2025 - 11:59 PM (AoE)
  • Notification to authors: July 11, 2025
  • Camera-ready deadline: August 18, 2025
  • Workshop date: October 20, 2025

Non-archival track:
  • Paper submission deadline: August 15, 2025
  • Notification to authors: September 19, 2025
  • Camera-ready deadline: September 26, 2025
  • Workshop date: October 20, 2025

Speakers

Invited Speakers

Richard Socher

CEO, You.com; Adjunct Professor, Stanford University

Ani Kembhavi

Director of Science Strategy, Wayve AI; Affiliate Associate Professor, University of Washington

Kate Saenko

Full Professor, Boston University; AI Research Scientist, FAIR (Meta)

Georgia Gkioxari

Assistant Professor, California Institute of Technology (Caltech)

Aishwarya Agrawal

Assistant Professor, University of Montreal

Katerina Fragkiadaki

Assistant Professor, Carnegie Mellon University

Yong Jae Lee

Associate Professor, University of Wisconsin-Madison

Organizers

Workshop Organizers

Mohamed Elhoseiny

Associate Professor, King Abdullah University of Science and Technology (KAUST)

Angel Chang

Associate Professor, Simon Fraser University

Anna Rohrbach

Full Professor, TU Darmstadt

Marcus Rohrbach

Full Professor, TU Darmstadt

Xin Eric Wang

Assistant Professor, UC Santa Cruz

Krishna Kumar

Research Scientist, Adobe

Kilichbek Haydarov

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Eslam Abdelrahman

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Austin Wang

PhD Student, Simon Fraser University

Yiming Zhang

PhD Student, Simon Fraser University

Tobias Wieczorek

PhD Student, TU Darmstadt

Qianqi (Jackie) Yan

PhD Student, UC Santa Cruz

Program

Workshop Program

The workshop program is as follows:

Morning Session
  • 8:30 - 8:45 Introduction
  • 8:45 - 9:15 Invited Talk 1: Richard Socher
  • 9:15 - 10:45 Invited Talk 2: Ani Kembhavi
  • 10:45 - 11:15 Coffee Break
  • 11:15 - 11:45 Invited Talk 3: Kate Saenko
  • 11:45 - 12:00 Challenge 1: Reliable Visual Question Answering
  • 12:00 - 12:15 Challenge 2: Long-Video Understanding
  • 12:15 - 12:45 Spotlight Session (selected workshop accepted papers)
  • 12:45 - 14:15 Lunch and Posters
Afternoon Session
  • 14:15 - 14:45 Invited Talk 4: Aishwarya Agrawal
  • 14:45 - 15:15 Invited Talk 5: Georgia Gkioxari
  • 15:15 - 15:30 Challenge 3: Multimodal Inconsistency Reasoning
  • 15:30 - 16:00 Coffee Break
  • 16:00 - 16:30 Invited Talk 6: Katerina Fragkiadaki
  • 16:30 - 16:45 Challenge 4: 3D Language Grounding
  • 16:45 - 17:15 Invited Talk 7: Yong Jae Lee
  • 17:15 - 18:00 Panel discussion with invited speakers

For any questions or support, please reach out to @Mohamed Elhoseiny.