ICCV 2025 - Honolulu, Hawai'i
October 20th - 8:30 AM to 6:00 PM (GMT-10)

CLVL: 6th Workshop on Closing the Loop Between Vision and Language (Decade Mark)

The sixth workshop on Closing the Loop Between Vision and Language, held at ICCV 2025.


About the Workshop

This workshop explores the intersection of Computer Vision and Natural Language Processing (NLP), focusing on joint vision-language understanding. Recent advances in large-scale multimodal pretraining, especially with transformer architectures, have driven significant progress across a range of tasks, including:

  • Visual-linguistic representation learning
  • Visual Question Answering (VQA)
  • Image and video captioning
  • Visual dialog and referring expressions
  • Vision-and-language navigation and embodied QA
  • Text-to-image generation
  • Joint video-language understanding

We also emphasize critical work on dataset and algorithmic bias, generalization, transparency, and explainability. The workshop welcomes contributions addressing both foundational advances and real-world challenges in vision-language research.

Challenges

Reliable Visual Question Answering

Visual Question Answering (VQA) is a core task in Vision and Language research. While models have achieved high accuracy on datasets like VQAv2, their reliability—such as abstaining when uncertain—remains a challenge. This task, known as “Selective Prediction,” evaluates models on their ability to be self-aware and handle out-of-distribution data or distribution shifts. (Organizers: Tobias Wieczorek and Marcus Rohrbach)
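
For readers unfamiliar with selective prediction, the minimal sketch below shows one common way such systems are scored: the model may abstain whenever its confidence falls below a threshold, and performance is summarized by coverage (the fraction of questions answered) and risk (the error rate on the answered questions). The data layout and function names are illustrative assumptions, not the official challenge code, whose exact metric may differ.

```python
# Minimal sketch of selective-prediction scoring for VQA (illustrative only,
# not the official challenge implementation). Each prediction carries a
# confidence; the model "abstains" whenever confidence falls below a threshold.
from dataclasses import dataclass
from typing import List

@dataclass
class Prediction:
    answer: str          # model's answer
    confidence: float    # model's self-reported confidence in [0, 1]
    correct: bool        # whether the answer matches the ground truth

def coverage_and_risk(preds: List[Prediction], threshold: float):
    """Return (coverage, risk) at a given abstention threshold.

    coverage = fraction of questions the model chooses to answer
    risk     = error rate among the answered questions
    """
    answered = [p for p in preds if p.confidence >= threshold]
    coverage = len(answered) / len(preds) if preds else 0.0
    risk = (sum(not p.correct for p in answered) / len(answered)) if answered else 0.0
    return coverage, risk

# Example: sweep thresholds to trace a risk-coverage curve.
if __name__ == "__main__":
    preds = [
        Prediction("red", 0.95, True),
        Prediction("two", 0.40, False),
        Prediction("cat", 0.80, True),
        Prediction("yes", 0.55, False),
    ]
    for t in (0.0, 0.5, 0.9):
        cov, risk = coverage_and_risk(preds, t)
        print(f"threshold={t:.1f}  coverage={cov:.2f}  risk={risk:.2f}")
```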

Details
Long-Video Understanding (InfiniBench)

Understanding long videos (tens of minutes to hours) is a major challenge. InfiniBench is a benchmark with 1,000+ hours of video, 111.82K Q&A pairs, and diverse question types. It evaluates state-of-the-art multi-modal models and highlights the challenges of long video understanding. (Organizers: Eslam Abdelrahman and Mohamed Elhoseiny)

Details
Multimodal Inconsistency Reasoning (MMIR)

MMIR evaluates models on detecting semantic mismatches in real-world multimodal artifacts. The benchmark includes 534 realistic samples with open-ended and multiple-choice settings, requiring models to identify or select elements corresponding to introduced inconsistencies. (Organizers: Jackie Yan and Xin Eric Wang)
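
To make the two evaluation settings concrete, here is a toy scoring sketch: in the multiple-choice setting the model selects the inconsistent element from given options, while in the open-ended setting it names the element directly. The data layout and the simple string-matching rule are assumptions for illustration and are not the benchmark's actual evaluation code.

```python
# Toy sketch of scoring inconsistency detection in the two MMIR settings
# described above (names and data layout are assumptions for illustration,
# not the benchmark's actual API).

def score_multiple_choice(predicted_option: str, gold_option: str) -> bool:
    """Multiple-choice setting: the model must select the inconsistent element."""
    return predicted_option.strip().lower() == gold_option.strip().lower()

def score_open_ended(predicted_element: str, gold_elements: set) -> bool:
    """Open-ended setting: the model names the mismatched element directly.
    A simple normalized string match stands in for the real matching rule."""
    return predicted_element.strip().lower() in {g.lower() for g in gold_elements}

# Two hypothetical samples: (predicted option, gold option,
#                            predicted element, acceptable gold elements)
samples = [
    ("B", "B", "price tag", {"price tag", "price label"}),
    ("A", "C", "headline", {"product image"}),
]
mc_acc = sum(score_multiple_choice(p, g) for p, g, *_ in samples) / len(samples)
oe_acc = sum(score_open_ended(pe, ge) for *_, pe, ge in samples) / len(samples)
print(f"multiple-choice accuracy: {mc_acc:.2f}, open-ended accuracy: {oe_acc:.2f}")
```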

Details
3D Language Grounding

3D language grounding focuses on localizing objects in a scene based on language descriptions. Building on ScanRefer, this challenge uses Multi3DRefer and ViGiL3D to evaluate models’ ability to identify objects in complex scenes and handle diverse linguistic prompts. (Organizers: Yiming Zhang, Austin T. Wang, and Angel X. Chang)
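
As a rough illustration of how ScanRefer-style grounding is commonly evaluated, the sketch below computes the 3D IoU between axis-aligned predicted and ground-truth boxes and reports accuracy at IoU thresholds of 0.25 and 0.5. It is a simplified, single-object sketch; the actual challenge protocol (for example, matching the multiple target objects of a Multi3DRefer query or scoring ViGiL3D prompts) may differ.

```python
# Illustrative sketch of ScanRefer-style evaluation: accuracy at 3D IoU
# thresholds for axis-aligned bounding boxes. Not the official challenge code.
import numpy as np

def box_iou_3d(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold=0.25):
    """Fraction of referring expressions whose predicted box overlaps GT above the threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: one reasonable and one poor localization.
preds = [np.array([0, 0, 0, 1, 1, 1.0]), np.array([2, 2, 2, 3, 3, 3.0])]
gts   = [np.array([0, 0, 0, 1, 1, 1.2]), np.array([0, 0, 0, 1, 1, 1.0])]
print("Acc@0.25:", accuracy_at_iou(preds, gts, 0.25))
print("Acc@0.50:", accuracy_at_iou(preds, gts, 0.50))
```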

Details

Paper submission

Call for papers

Format


The typical paper length is 4 pages, excluding references, in ICCV 2025 format. The author kit, which includes templates and guidelines for formatting submissions, can be found here.

Note: Applies only to archival papers

Following ICCV 2025 guidelines, only long-form papers (more than 4 and up to 8 pages) are permitted for archival submission. This aligns with the dual-submission policy, under which peer-reviewed papers exceeding 4 pages qualify as archival publications.
Non-archival submissions should be limited to 4 pages.

Submitted papers will be peer-reviewed in a double-blind manner. Accepted abstracts will be presented in the workshop poster session, and a portion of the accepted papers will be presented orally.

Paper submissions will be handled through CMT3, a conference management tool used for organizing paper submissions and peer reviews. You can access it via the following link: CMT3 ICCVCLVL2025

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.


Topics


The scope of this workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). Interest in the joint understanding of vision and language has grown considerably in recent years, with researchers studying a multitude of tasks at this intersection. Most recently, there has also been strong interest in large-scale multi-modal pretraining, where Transformer-based architectures have established a new state of the art on a range of tasks. Topics include but are not limited to:

Example Topics of Interest
  • Novel problems in vision and language
  • Learning to solve non-visual tasks using visual cues
  • Language guided visual understanding (objects, relationships)
  • Visual dialog and question answering by visual verification
  • Visual question generation
  • Vision and Language Navigation
  • Visually grounded conversation
  • Deep learning methods for vision and language
  • Language-based image and video search/retrieval
  • Linguistic descriptions of spatial relations
  • Auto-illustration
  • Natural language grounding & learning by watching
  • Learning knowledge from the web
  • Language as a mechanism to structure and reason about visual perception
  • Language as a learning bias to aid vision in both machines and humans
  • Dialog as means of sharing knowledge about visual perception
...and more

Dates

Timeline

Archival track (will appear in the ICCV proceedings):
  • Paper submission deadline: July 7, 2025, 11:59 PM (AoE)
  • Notification to authors: July 11, 2025
  • Camera-ready deadline: August 18, 2025
  • Workshop date: October 20, 2025

Non-archival track:
  • Paper submission deadline: August 15, 2025
  • Notification to authors: September 7, 2025
  • Camera-ready deadline: September 15, 2025
  • Workshop date: October 20, 2025

Speakers

Invited Speakers

Richard Socher

CEO, You.com; Adjunct Professor, Stanford University

Ani Kembhavi

Director of Science Strategy at Wayve AI; Affiliate Associate Professor, University of Washington

Kate Saenko

Full Professor, Boston University; AI Research Scientist at FAIR, Meta

Yejin Choi

Senior Director, NVIDIA; Professor, Stanford University

Svetlana Lazebnik

Professor, University of Illinois Urbana-Champaign

Katerina Fragkiadaki

Assistant Professor, Carnegie Mellon University

Yong Jae Lee

Associate Professor, University of Wisconsin-Madison

Organizers

Workshop Organizers

Mohamed Elhoseiny

Associate Professor, King Abdullah University of Science and Technology (KAUST)

Angel Chang

Associate Professor, Simon Fraser University

Anna Rohrbach

Full Professor, TU Darmstadt

Marcus Rohrbach

Full Professor, TU Darmstadt

Xin Eric Wang

Assistant Professor, UC Santa Cruz

Krishna Kumar

Research Scientist, Adobe

Kilichbek Haydarov

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Eslam Abdelrahman

PhD Candidate, King Abdullah University of Science and Technology (KAUST)

Austin Wang

PhD Student, Simon Fraser University

Yiming Zhang

PhD Student, Simon Fraser University

Tobias Wieczorek

PhD Student, TU Darmstadt

Qianqi (Jackie) Yan

PhD Student, UC Santa Cruz

Program

Workshop Program

The workshop program is as follows:

Morning Session
  • 8:30 - 8:45 Introduction
  • 8:45 - 9:15 Invited Talk 1: Richard Socher
  • 9:15 - 10:45 Invited Talk 2: Ani Kembhavi
  • 10:45 - 11:15 Coffee Break
  • 11:15 - 11:45 Invited Talk 3: Kate Saenko
  • 11:45 - 12:00 Challenge 1: Reliable Visual Question Answering
  • 12:00 - 12:15 Challenge 2: Long-Video Understanding
  • 12:15 - 12:45 Spotlight Session (selected workshop accepted papers)
  • 12:45 - 14:15 Lunch and Posters
Afternoon Session
  • 14:15 - 14:45 Invited Talk 4: Yejin Choi
  • 14:45 - 15:15 Invited Talk 5: Svetlana Lazebnik
  • 15:15 - 15:30 Challenge 3: Multimodal Inconsistency Reasoning
  • 15:30 - 16:00 Coffee Break
  • 16:00 - 16:30 Invited Talk 6: Katerina Fragkiadaki
  • 16:30 - 16:45 Challenge 4: 3D Language Grounding
  • 16:45 - 17:15 Invited Talk 7: Yong Jae Lee
  • 17:15 - 18:00 Panel discussion with invited speakers

For any questions or support, please reach out to Mohamed Elhoseiny.