Format
The typical paper length is 4 pages excluding references, in ICCV 2025 format. The author kit, which includes templates and guidelines for formatting submissions, can be found here.
Note: the following applies only to archival papers.
Following ICCV 2025 guidelines, only long-form papers (more than 4 and at most 8 pages, 4 < x ≤ 8) are permitted for archival submission.
This aligns with the dual-submission policy, under which peer-reviewed papers exceeding 4 pages qualify as archival publications.
Non-archival submissions should be limited to 4 pages.
Submitted papers will be peer-reviewed in a double-blind manner. Abstracts will be presented in the workshop poster session, and a portion of the accepted papers will be presented orally.
Paper submissions will be handled through CMT3, a conference management tool for organizing paper submissions and peer reviews. You can access it via the following link:
CMT3 ICCVCLVL2025
The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.
Topics
The scope of this workshop lies at the intersection of Computer Vision and Natural Language Processing (NLP). Interest in the joint understanding of vision and language has grown in recent years, with researchers studying a multitude of tasks at this intersection. Most recently, there has also been considerable interest in large-scale multi-modal pretraining, where Transformer-based architectures have established a new state of the art on a range of tasks. Topics include, but are not limited to:
- Novel problems in vision and language
- Learning to solve non-visual tasks using visual cues
- Language-guided visual understanding (objects, relationships)
- Visual dialog and question answering by visual verification
- Visual question generation
- Vision-and-Language Navigation
- Visually grounded conversation
- Deep learning methods for vision and language
- Language-based image and video search/retrieval
- Linguistic descriptions of spatial relations
- Auto-illustration
- Natural language grounding & learning by watching
- Learning knowledge from the web
- Language as a mechanism to structure and reason about visual perception
- Language as a learning bias to aid vision in both machines and humans
- Dialog as means of sharing knowledge about visual perception