ICCV 2023 - Paris, France
October 2nd, 8:30 AM - 6:00 PM (GMT+2)

CLVL: 5th Workshop on Closing the Loop Between Vision and Language

Fifth workshop on Closing the Loop Between Vision and Language, hosted by #ICCV2023.

Challenges

1. Scientific Figure Captioning Challenge -> link

2. Aerial Vision-and-Dialog Navigation Challenge -> link

3. Visual-Dialog Based Emotion Explanation Generation Challenge -> link

Paper submission

Call for papers


Papers are typically 4 pages, excluding references, in ICCV 2023 format. The authors' kit can be found here.

Submitted 4-page abstracts will be peer-reviewed. All accepted abstracts will be presented in the workshop poster session, and a portion will also be presented orally.

Paper submissions will be handled with CMT through the following link: CMT3 ICCVCLVL2023


This workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). Interest in the joint understanding of vision and language has grown in recent years, with researchers studying a multitude of tasks that span both fields. Most recently, there has also been substantial interest in large-scale multi-modal pretraining, where Transformer-based architectures have established a new state of the art on a range of tasks. Topics include, but are not limited to:

  • Novel problems in vision and language
  • Learning to solve non-visual tasks using visual cues
  • Language guided visual understanding (objects, relationships)
  • Visual dialog and question answering by visual verification
  • Visual question generation
  • Vision and Language Navigation
  • Visually grounded conversation
  • Visual sense disambiguation
  • Deep learning methods for vision and language
  • Visual reasoning on language problems
  • Text-to-image generation
  • Language based visual abstraction
  • Text as weak labels for image or video classification
  • Image/Video Annotation and natural language description generation
  • Transfer learning for vision and language
  • Jointly learn to parse and perceive (text+image, text+video)
  • Multimodal clustering and word sense disambiguation
  • Unstructured text search for visual content
  • Visually grounded language acquisition and understanding
  • Referring expression comprehension
  • Language-based image and video search/retrieval
  • Linguistic descriptions of spatial relations
  • Auto-illustration
  • Natural language grounding & learning by watching
  • Learning knowledge from the web
  • Language as a mechanism to structure and reason about visual perception
  • Language as a learning bias to aid vision in both machines and humans
  • Dialog as means of sharing knowledge about visual perception
  • Stories as means of abstraction
  • Understanding the relationship between language and vision in humans
  • Humanistic, subjective, or expressive vision-to-language
  • Visual storytelling
  • Generating Audio Descriptions for movies
  • Multi-sentence descriptions for images and videos
  • Visual question answering for video
  • Visual fill-in-the-blank tasks
  • Language as supervision for video understanding
  • Using dialogs and/or audio for video understanding
  • Understanding videos and plots
  • Limitations of existing vision and language datasets
  • ...

Accepted papers

Paper ID Paper Title Author Names
3 Sparse Linear Concept Discovery Models (oral) Panousis, Konstantinos*; Ienco, Dino; Marcos, Diego
7 Compositional Image Search with Progressive Vision-language Alignment and Multimodal Fusion Hu, Zhizhang*; Zhu, Xinliang; Tran, Son ; Vidal, Rene; Dhua, Arnab
8 Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups Hall, Melissa*; Gustafson, Laura; Adcock, Aaron; Misra, Ishan; Ross, Candace
11 BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification Fujii, Takuro*; Tarashima, Shuhei
12 Alignment and Generation Adapter for Efficient Video-text Understanding Fang, Han*; Yang, Zhifei; Wei, Yuhan; Zang, Xianghao; Ban, Chao; Feng, Zerun; He, Zhongjiang; Li, Yongxiang; Sun, Hao
13 LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling Ma, Kaijing; Zang, Xianghao; Feng, Zerun; Fang, Han*; Ban, Chao; Wei, Yuhan; He, Zhongjiang; Li, Yongxiang; Sun, Hao
16 Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts (oral) Engin, Deniz*; Avrithis, Yannis
19 ECO: Ensembling Context Optimization for Vision-Language Models Agnolucci, Lorenzo*; Baldrati, Alberto; Todino, Francesco; Becattini, Federico; Bertini, Marco; Del Bimbo, Alberto
20 A Cross-Dataset Study on the Brazilian Sign Language Translation Sarmento, Amanda H A*; Ponti, Moacir A
22 Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering Naik, Nandita S*; Potts, Christopher; Kreiss, Elisa
25 Explaining Vision and Language through Graphs of Events in Space and Time Masala, Mihai LLP; Cudlenco, Nicolae; Rebedea, Traian; Leordeanu, Marius*
27 Mapping Memes to Words for Multimodal Hateful Meme Classification Burbi, Giovanni; Baldrati, Alberto*; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto
28 Cross-Modal Dense Passage Retrieval for Open Knowledge Visual Question Answering Reichman, Benjamin*; Heck, Larry
29 PatFig: Generating Short and Long Captions for Patent Figures Aubakirova, Dana*; Gerdes, Kim; Liu, Lufei
30 An empirical study of the effect of video encoders on Temporal Video Grounding Meza, Ignacio A*; Rodriguez, Cristian; Marrese-Taylor, Edison; Bravo-Marquez, Felipe
31 Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP Palit, Vedant; Pandey, Rohan*; Arora, Aryaman; Liang, Paul Pu
32 Multimodal Neurons in Pretrained Text-Only Transformers (oral) Schwettmann, Sarah*; Chowdhury, Neil; Klein, Samuel J; Bau, David; Torralba, Antonio
6 VQA Therapy: Exploring Answer Differences by Visually Grounding Answers Chen, Chongyan*; Anjum, Samreen; Gurari, Danna
9 HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models Abdelrahman, Eslam Mohamed*; Sun, Pengzhan; Shen, Xiaoqian; Khan, Faizan Farooq; Li, Li Erran; Elhoseiny, Mohamed
14 In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval Shvetsova, Nina*; Kukleva, Anna; Schiele, Bernt; Kuehne, Hilde
23 Pretrained Language Models as Visual Planners for Human Assistance Patel, Dhruvesh; Eghbalzadeh, Hamid; Chen, Brian; Kamra, Nitin; Iuzzolino, Michael; Jain, Unnat; Desai, Ruta P*
33 Painter: Teaching Auto-regressive Language Models to Draw Sketches Pourreza, Reza*; Bhattacharyya, Apratim; Panchal, Sunny P; Lee, Mingu; Madan, Pulkit; Memisevic, Roland
34 LLMs as Zero-Shot Visual Reasoners: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models Zhou, Kaiwen*; Lee, Kwonjoon; Misu, Teruhisa; Wang, Xin Eric
36 Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles Ye, Shuquan*; Xie, Yujia; Chen, DongDong; Xu, Yichong; Yuan, Lu; Zhu, Chenguang; Liao, Jing
37 Zero-Shot Composed Image Retrieval with Textual Inversion Baldrati, Alberto*; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto
38 Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality (oral) Singh, Harman*; Zhang, Pengchuan; Wang, Qifan; Wang, Mengjiao; Xiong, Wenhan; Du, Jingfei; Chen, Hugo
39 Simple Token-Level Confidence Improves Caption Correctness (oral) Petryk, Suzanne*; Whitehead, Spencer; Gonzalez, Joseph; Darrell, Trevor; Rohrbach, Anna; Rohrbach, Marcus
40 DeViL: Decoding Vision features into Language Dani, Meghal*; Rio-Torto, Isabel; Alaniz, Stephan; Akata, Zeynep
41 MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge Lin, Wei*; Karlinsky, Leonid; Shvetsova, Nina; Possegger, Horst; Kozinski, Mateusz; Panda, Rameswar; Feris, Rogerio; Kuehne, Hilde; Bischof, Horst
42 Improved Probabilistic Image-Text Representations Chun, Sanghyuk*
44 Look, Remember and Reason: Visual Reasoning with Grounded Rationales Bhattacharyya, Apratim*; Panchal, Sunny P; Lee, Mingu; Pourreza, Reza; Madan, Pulkit; Memisevic, Roland
45 Visual Coherence Loss for Coherent and Visually Grounded Story Generation (oral) Hong, Xudong*; Demberg, Vera; Sayeed, Asad B; Schiele, Bernt
46 Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips R, Reshma*
47 Instruction-tuned Self-Questioning Framework for Multimodal Reasoning Jang, Youwon*; Heo, Yu-Jung; Kim, Jaeseok; Lee, Minsu; Chang, Du-Seong; Zhang, Byoung-Tak



Archival track (will appear in ICCV proceedings):
Event Date
Paper submission deadline July 25, 2023 - 11:59 PM PT
Notification to authors August 8, 2023
Camera-ready deadline August 14, 2023
Workshop date October 2, 2023
Non-archival track:
Event Date
Paper submission deadline August 15, 2023
Notification to authors September 7, 2023
Camera-ready deadline September 15, 2023
Workshop date October 2, 2023


Invited Speakers

Alane Suhr

Assistant Professor University of California, Berkeley

Andrew Zisserman

Professor University of Oxford

Ranjay Krishna

Assistant Professor University of Washington

Jiasen Lu

Research Scientist Allen Institute for AI

Karthik Narasimhan

Assistant Professor Princeton University

Shuran Song

Assistant Professor Stanford University and Columbia University

Trevor Darrell

Professor University of California, Berkeley


Workshop Organizers

Mohamed Elhoseiny

Assistant Professor King Abdullah University of Science and Technology (KAUST)

Angel Chang

Assistant Professor Simon Fraser University

Xin Eric Wang

Assistant Professor UC Santa Cruz

Gamaleldin Elsayed

Research Scientist Google Brain

Kilichbek Haydarov

PhD Student King Abdullah University of Science and Technology (KAUST)


Workshop Program

The workshop program is as follows:

  • 8:30 AM - 8:40 AM - Opening remarks
  • 8:40 AM - 9:15 AM - Ranjay Krishna (University of Washington) - TBD
  • 9:15 AM - 9:45 AM - Ting-Hao 'Kenneth' Huang (Penn State University) - Scientific Figure Captioning Challenge (Guest Challenge)
  • 9:45 AM - 10:15 AM - Kilichbek Haydarov (King Abdullah University of Science and Technology) - Visual-Dialog Based Emotion Explanation Generation Challenge
  • 10:15 AM - 10:45 AM - Coffee Break and Poster Session
  • 10:45 AM - 11:20 AM - Andrew Zisserman (University of Oxford) - Automatically Generating Audio Descriptions for Movies
  • 11:20 AM - 12:00 PM - Selected Oral Presentations
    • 11:20 AM - 11:25 AM - Sparse Linear Concept Discovery Models - Konstantinos Panousis (INRIA); Dino Ienco (INRAE); Diego Marcos (Inria)
    • 11:25 AM - 11:30 AM - Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts - Deniz Engin (INRIA)*; Yannis Avrithis (IARAI)
    • 11:30 AM - 11:35 AM - Multimodal Neurons in Pretrained Text-Only Transformers - Sarah Schwettmann (MIT)*; Neil Chowdhury (MIT); Samuel J Klein (Knowledge Futures Group); David Bau (Northeastern University); Antonio Torralba (MIT)
    • 11:35 AM - 11:40 AM - Q&A for the above three papers
    • 11:40 AM - 11:45 AM - Simple Token-Level Confidence Improves Caption Correctness - Suzanne Petryk (UC Berkeley)*; Spencer Whitehead (Meta AI); Joseph Gonzalez (U.C. Berkeley); Trevor Darrell (UC Berkeley); Anna Rohrbach (UC Berkeley); Marcus Rohrbach (Facebook AI Research)
    • 11:45 AM - 11:50 AM - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality - Harman Singh (Meta)*; Pengchuan Zhang (Meta); Qifan Wang (Meta AI); Mengjiao Wang (Facebook); Wenhan Xiong (Meta); Jingfei Du (-); Hugo Chen (Meta)
    • 11:50 AM - 11:55 AM - Visual Coherence Loss for Coherent and Visually Grounded Story Generation - Xudong Hong (Saarland University)*; Vera Demberg (Dept. of Mathematics and Computer Science, Saarland University); Asad B Sayeed (University of Gothenburg); Bernt Schiele (MPI Informatics)
    • 11:55 AM - 12:00 PM - Q&A for the above three papers
  • 12:00 PM - 1:30 PM - Lunch and Posters
  • 1:30 PM - 2:00 PM - Trevor Darrell (UC Berkeley) - Recent advances in (L)LMs for vision and language
  • 2:00 PM - 2:30 PM - Jiasen Lu (AI2) - Unified-IO 2: Multi-Modal Reentry for Vision, Language, Audio and Action
  • 2:30 PM - 3:00 PM - Shuran Song (Stanford and Columbia University) - Closing the Loop between Vision Language and Actions
  • 3:00 PM - 3:30 PM - Xin Eric Wang / Team (UC Santa Cruz) - Aerial Vision-and-Dialog Navigation Challenge
  • 3:30 PM - 4:15 PM - Coffee Break and Poster Sessions
  • 4:15 PM - 4:45 PM - Karthik Narasimhan (Princeton University) - Language Agents: Machines that can Read, Think and Act
  • 4:45 PM - 5:15 PM - Alane Suhr (UC Berkeley) - TBD
  • 5:15 PM - 6:00 PM - Panel Discussion

For any questions or support, please contact Mohamed Elhoseiny.