ICCV 2023 - Paris, France
October 2nd, 8:30 AM - 6:00 PM (GMT+2)

CLVL: 5th Workshop on Closing the Loop Between Vision and Language

Fifth workshop on Closing the Loop Between Vision and Language, hosted by #ICCV2023.

Challenges

1. Scientific Figure Captioning Challenge -> link

2. Aerial Vision-and-Dialog Navigation Challenge -> link

3. Visual-Dialog Based Emotion Explanation Generation Challenge -> link

Paper submission

Call for papers


Papers are typically 4 pages, excluding references, in ICCV 2023 format. The authors' kit can be found here.

Submitted 4-page abstracts will be peer-reviewed. All accepted abstracts will be presented in the workshop poster session, and a portion will also be presented orally.

Paper submissions will be handled with CMT through the following link: CMT3 ICCVCLVL2023


This workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). Interest in the joint understanding of vision and language has grown in recent years, with researchers studying a multitude of tasks that span both fields. Most recently, there has also been substantial interest in large-scale multi-modal pretraining, where Transformer-based architectures have established a new state of the art on a range of tasks. Topics include, but are not limited to:

  • Novel problems in vision and language
  • Learning to solve non-visual tasks using visual cues
  • Language guided visual understanding (objects, relationships)
  • Visual dialog and question answering by visual verification
  • Visual question generation
  • Vision and Language Navigation
  • Visually grounded conversation
  • Visual sense disambiguation
  • Deep learning methods for vision and language
  • Visual reasoning on language problems
  • Text-to-image generation
  • Language based visual abstraction
  • Text as weak labels for image or video classification
  • Image/Video Annotation and natural language description generation
  • Transfer learning for vision and language
  • Jointly learn to parse and perceive (text+image, text+video)
  • Multimodal clustering and word sense disambiguation
  • Unstructured text search for visual content
  • Visually grounded language acquisition and understanding
  • Referring expression comprehension
  • Language-based image and video search/retrieval
  • Linguistic descriptions of spatial relations
  • Auto-illustration
  • Natural language grounding & learning by watching
  • Learning knowledge from the web
  • Language as a mechanism to structure and reason about visual perception
  • Language as a learning bias to aid vision in both machines and humans
  • Dialog as means of sharing knowledge about visual perception
  • Stories as means of abstraction
  • Understanding the relationship between language and vision in humans
  • Humanistic, subjective, or expressive vision-to-language
  • Visual storytelling
  • Generating Audio Descriptions for movies
  • Multi-sentence descriptions for images and videos
  • Visual question answering for video
  • Visual fill-in-the-blank tasks
  • Language as supervision for video understanding
  • Using dialogs and/or audio for video understanding
  • Understanding videos and plots
  • Limitations of existing vision and language datasets
  • ...

Accepted papers

Paper ID Paper Title Author Names
3 Sparse Linear Concept Discovery Models (oral) Panousis, Konstantinos*; Ienco, Dino; Marcos, Diego
7 Compositional Image Search with Progressive Vision-language Alignment and Multimodal Fusion Hu, Zhizhang*; Zhu, Xinliang; Tran, Son ; Vidal, Rene; Dhua, Arnab
8 Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups Hall, Melissa*; Gustafson, Laura; Adcock, Aaron; Misra, Ishan; Ross, Candace
11 BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification Fujii, Takuro*; Tarashima, Shuhei
12 Alignment and Generation Adapter for Efficient Video-text Understanding Fang, Han*; Yang, Zhifei; Wei, Yuhan; Zang, Xianghao; Ban, Chao; Feng, Zerun; He, Zhongjiang; Li, Yongxiang; Sun, Hao
13 LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling Ma, Kaijing; Zang, Xianghao; Feng, Zerun; Fang, Han*; Ban, Chao; Wei, Yuhan; He, Zhongjiang; Li, Yongxiang; Sun, Hao
16 Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts (oral) Engin, Deniz*; Avrithis, Yannis
19 ECO: Ensembling Context Optimization for Vision-Language Models Agnolucci, Lorenzo*; Baldrati, Alberto; Todino, Francesco; Becattini, Federico; Bertini, Marco; Del Bimbo, Alberto
20 A Cross-Dataset Study on the Brazilian Sign Language Translation Sarmento, Amanda H A*; Ponti, Moacir A
22 Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering Naik, Nandita S*; Potts, Christopher; Kreiss, Elisa
25 Explaining Vision and Language through Graphs of Events in Space and Time Masala, Mihai LLP; Cudlenco, Nicolae; Rebedea, Traian; Leordeanu, Marius*
27 Mapping Memes to Words for Multimodal Hateful Meme Classification Burbi, Giovanni; Baldrati, Alberto*; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto
28 Cross-Modal Dense Passage Retrieval for Open Knowledge Visual Question Answering Reichman, Benjamin*; Heck, Larry
29 PatFig: Generating Short and Long Captions for Patent Figures Aubakirova, Dana*; Gerdes, Kim; Liu, Lufei
30 An empirical study of the effect of video encoders on Temporal Video Grounding Meza, Ignacio A*; Rodriguez, Cristian; Marrese-Taylor, Edison; Bravo-Marquez, Felipe
31 Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP Palit, Vedant; Pandey, Rohan*; Arora, Aryaman; Liang, Paul Pu
32 Multimodal Neurons in Pretrained Text-Only Transformers (oral) Schwettmann, Sarah*; Chowdhury, Neil; Klein, Samuel J; Bau, David; Torralba, Antonio
6 VQA Therapy: Exploring Answer Differences by Visually Grounding Answers Chen, Chongyan*; Anjum, Samreen; Gurari, Danna
9 HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models Abdelrahman, Eslam Mohamed*; Sun, Pengzhan; Shen, Xiaoqian; Khan, Faizan Farooq; Li, Li Erran; Elhoseiny, Mohamed
14 In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval Shvetsova, Nina*; Kukleva, Anna; Schiele, Bernt; Kuehne, Hilde
23 Pretrained Language Models as Visual Planners for Human Assistance Patel, Dhruvesh; Eghbalzadeh, Hamid; Chen, Brian; Kamra, Nitin; Iuzzolino, Michael; Jain, Unnat; Desai, Ruta P*
33 Painter: Teaching Auto-regressive Language Models to Draw Sketches Pourreza, Reza*; Bhattacharyya, Apratim; Panchal, Sunny P; Lee, Mingu; Madan, Pulkit; Memisevic, Roland
34 LLMs as Zero-Shot Visual Reasoners: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models Zhou, Kaiwen*; Lee, Kwonjoon; Misu, Teruhisa; Wang, Xin Eric
36 Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles Ye, Shuquan*; Xie, Yujia; Chen, DongDong; Xu, Yichong; Yuan, Lu; Zhu, Chenguang; Liao, Jing
37 Zero-Shot Composed Image Retrieval with Textual Inversion Baldrati, Alberto*; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto
38 Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality (oral) Singh, Harman*; Zhang, Pengchuan; Wang, Qifan; Wang, Mengjiao; Xiong, Wenhan; Du, Jingfei; Chen, Hugo
39 Simple Token-Level Confidence Improves Caption Correctness (oral) Petryk, Suzanne*; Whitehead, Spencer; Gonzalez, Joseph; Darrell, Trevor; Rohrbach, Anna; Rohrbach, Marcus
40 DeViL: Decoding Vision features into Language Dani, Meghal*; Rio-Torto, Isabel; Alaniz, Stephan; Akata, Zeynep
41 MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge Lin, Wei*; Karlinsky, Leonid; Shvetsova, Nina; Possegger, Horst; Kozinski, Mateusz; Panda, Rameswar; Feris, Rogerio; Kuehne, Hilde; Bischof, Horst
42 Improved Probabilistic Image-Text Representations Chun, Sanghyuk*
44 Look, Remember and Reason: Visual Reasoning with Grounded Rationales Bhattacharyya, Apratim*; Panchal, Sunny P; Lee, Mingu; Pourreza, Reza; Madan, Pulkit; Memisevic, Roland
45 Visual Coherence Loss for Coherent and Visually Grounded Story Generation (oral) Hong, Xudong*; Demberg, Vera; Sayeed, Asad B; Schiele, Bernt
46 Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips R, Reshma*
47 Instruction-tuned Self-Questioning Framework for Multimodal Reasoning Jang, Youwon*; Heo, Yu-Jung; Kim, Jaeseok; Lee, Minsu; Chang, Du-Seong; Zhang, Byoung-Tak



Archival track (will appear in ICCV proceedings):
Event Date
Paper submission deadline July 25, 2023 - 11:59 PM PT
Notification to authors August 8, 2023
Camera-ready deadline August 14, 2023
Workshop date October 2, 2023
Non-archival track:
Event Date
Paper submission deadline August 15, 2023
Notification to authors September 7, 2023
Camera-ready deadline September 15, 2023
Workshop date October 2, 2023


Invited Speakers

Alane Suhr

Assistant Professor University of California, Berkeley

Andrew Zisserman

Professor University of Oxford

Ranjay Krishna

Assistant Professor University of Washington

Jiasen Lu

Research Scientist Allen Institute for AI

Karthik Narasimhan

Assistant Professor Princeton University

Shuran Song

Assistant Professor Stanford University and Columbia University

Trevor Darrell

Professor University of California, Berkeley


Workshop Organizers

Mohamed Elhoseiny

Assistant Professor King Abdullah University of Science and Technology (KAUST)

Angel Chang

Assistant Professor Simon Fraser University

Xin Eric Wang

Assistant Professor UC Santa Cruz

Gamaleldin Elsayed

Research Scientist Google Brain

Kilichbek Haydarov

PhD Student King Abdullah University of Science and Technology (KAUST)


Workshop Program

The workshop program is as follows:

  • 8:30 AM - 8:40 AM - Opening remarks
  • 8:40 AM - 9:15 AM - Ranjay Krishna (University of Washington) - TBD
  • 9:15 AM - 9:45 AM - Ting-Hao 'Kenneth' Huang (Penn State University) - Scientific Figure Captioning Challenge (Guest Challenge)
  • 9:45 AM - 10:15 AM - Kilichbek Haydarov (King Abdullah University of Science and Technology) - Visual-Dialog Based Emotion Explanation Generation Challenge
  • 10:15 AM - 10:45 AM - Coffee Break and Poster Session
  • 10:45 AM - 11:20 AM - Andrew Zisserman (University of Oxford) - Automatically Generating Audio Descriptions for Movies
  • 11:20 AM - 12:00 PM - Selected Oral Presentations
    • 11:20 AM - 11:25 AM - Sparse Linear Concept Discovery Models - Konstantinos Panousis (INRIA); Dino Ienco (INRAE); Diego Marcos (Inria)
    • 11:25 AM - 11:30 AM - Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts - Deniz Engin (INRIA)*; Yannis Avrithis (IARAI)
    • 11:30 AM - 11:35 AM - Multimodal Neurons in Pretrained Text-Only Transformers - Sarah Schwettmann (MIT)*; Neil Chowdhury (MIT); Samuel J Klein (Knowledge Futures Group); David Bau (Northeastern University); Antonio Torralba (MIT)
    • 11:35 AM - 11:40 AM - Q&A for the above three papers
    • 11:40 AM - 11:45 AM - Simple Token-Level Confidence Improves Caption Correctness - Suzanne Petryk (UC Berkeley)*; Spencer Whitehead (Meta AI); Joseph Gonzalez (U.C. Berkeley); Trevor Darrell (UC Berkeley); Anna Rohrbach (UC Berkeley); Marcus Rohrbach (Facebook AI Research)
    • 11:45 AM - 11:50 AM - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality - Harman Singh (Meta)*; Pengchuan Zhang (Meta); Qifan Wang (Meta AI); Mengjiao Wang (Facebook); Wenhan Xiong (Meta); Jingfei Du (-); Hugo Chen (Meta)
    • 11:50 AM - 11:55 AM - Visual Coherence Loss for Coherent and Visually Grounded Story Generation - Xudong Hong (Saarland University)*; Vera Demberg (Dept. of Mathematics and Computer Science, Saarland University); Asad B Sayeed (University of Gothenburg); Bernt Schiele (MPI Informatics)
    • 11:55 AM - 12:00 PM - Q&A for the above three papers
  • 12:00 PM - 1:30 PM - Lunch and Posters
  • 1:30 PM - 2:00 PM - Trevor Darrell (UC Berkeley) - Recent advances in (L)LMs for vision and language
  • 2:00 PM - 2:30 PM - Jiasen Lu (AI2) - Unified-IO 2: Multi-Modal Reentry for Vision, Language, Audio and Action
  • 2:30 PM - 3:00 PM - Shuran Song (Stanford and Columbia University) - Closing the Loop between Vision Language and Actions
  • 3:00 PM - 3:30 PM - Xin Eric Wang / Team (UC Santa Cruz) - Aerial Vision-and-Dialog Navigation Challenge
  • 3:30 PM - 4:15 PM - Coffee Break and Poster Sessions
  • 4:15 PM - 4:45 PM - Karthik Narasimhan (Princeton University) - Language Agents: Machines that can Read, Think and Act
  • 4:45 PM - 5:15 PM - Alane Suhr (UC Berkeley) - TBD
  • 5:15 PM - 6:00 PM - Panel Discussion

For any questions or support, please contact Mohamed Elhoseiny.