- Authors of accepted papers will give a 5-minute talk about their work. You may either present in person or submit a video. If you choose the video option, please upload it to CMT as an .mp4 file by June 15th (11:59 PST), following the previous CVPR oral instructions.
- We'll have two paper presentation sessions: 9am - 11am and 1pm - 2pm. Each session will be a mix of in-person and video presentations. Short Q&A sessions are interspersed throughout each paper session, each covering the papers that precede it. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
- You are welcome to present a poster during the lunch and coffee breaks. Unfortunately, we are unable to offer a hybrid option for posters.
- Please also submit the camera-ready version of your paper via CMT by June 17th (11:59 PST). Papers will be available on our website.
- Looking forward to seeing you there!
Thanks, everyone, for joining us! We'll provide video recordings for the workshop in the coming days.
Schedule
9:00 - 9:05 (CDT) | Welcome | ||
9:05 - 10:30 (CDT) | Paper session #1 | Chair: Triantafyllos Afouras | |
Quantized GAN for Complex Music Generation from Dance Videos | Ye Zhu, Kyle B Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Jian Ren, Yan Yan, Sergey Tulyakov | ||
Synchronisation of Lips and Voices | Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro | ||
A Model You Can Hear: Audio Classification with Playable Prototypes | Romain Loiseau, Baptiste Bouvier, Yann Teytaut, Elliot Vincent, Mathieu Aubry, Loic Landrieu | ||
Audio-Visual Object Localization in Egocentric Videos | Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu | ||
Q&A session | |||
Audio-Visual Event Localization via Recursive Joint Co-Attention | Bin Duan, Hugo M Latapie, Gaowen Liu, Yan Yan | ||
The Sound of Motion: Multimodal horse motion estimation from video and audio | Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi | ||
Learning Sound Localization Better From Semantically Similar Samples | Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon | ||
SVTS: Scalable Video-to-Speech Synthesis - Extended Abstract | Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Bjoern W. Schuller, Maja Pantic | ||
Q&A session | |||
Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio | Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian | ||
Exploring a Probabilistic Approach to Vehicle Sound Source Localization in Urban Scenes | Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Bea Steers, Ali Abavisani, Juan P Bello | ||
SEMI: Self-supervised Exploration via Multisensory Incongruity | Ziwen Zhuang, Jianren Wang, Hang Zhao | ||
Sound Adversarial Audio-Visual Navigation | Yinfeng Yu, Changan Chen, Fuchun Sun | ||
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius | ||
Q&A session | |||
10:30 - 11:00 (CDT) | Coffee break & posters | ||
11:00 - 11:30 (CDT) | Invited talk | Arsha Nagrani | |
11:30 - 12:00 (CDT) | Invited talk | Jeannette Bohg | |
12:00 - 1:00 (CDT) | Lunch | ||
1:00 - 2:00 (CDT) | Paper session #2 | ||
How to Listen? Rethinking Visual Sound Localization | Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan P Bello | ||
Urban Sound & Sight: Dataset and benchmark for Audio-Visual Urban Scene Understanding | Magdalena Fuentes, Bea A Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, Juan P Bello | ||
On Negative Sampling for Audio-Visual Contrastive Learning from Movies | Mahdi M. Kalayeh, Shervin Ardeshir, Nagendra Kamath, Lingyi Liu, Ashok Chandrashekar | ||
Audio-visual voice separation transformer | Juan Felipe Montesinos, Venkatesh Shenoy Kadandale, Gloria Haro | ||
Q&A session | |||
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval | Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne | ||
Tap to the Beat: Cross-modal Music Beat Localization for Dancing Videos | Tianyi Ma, Yu Wu | ||
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection - Extended Abstract | Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic | ||
Visual Speech Recognition for Multiple Languages | Pingchuan Ma, Stavros Petridis, Maja Pantic | ||
Q&A session | |||
2:00 - 2:30 (CDT) | Invited talk | David Brang | |
2:30 - 3:30 (CDT) | Coffee break & posters | ||
3:30 - 4:00 (CDT) | Invited talk | Carl Vondrick | |
4:00 - 5:00 (CDT) | Invited paper talks | Chair: Ruohan Gao | |
Taming visually guided sound generation | Vladimir Iashin, Esa Rahtu | ||
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis | Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard | ||
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu | ||
Q&A session | |||
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? | Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann | ||
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation | Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou | ||
Sound and Visual Representation Learning with Multiple Pretraining Tasks | Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool | ||
Active Audio-Visual Separation of Dynamic Sound Sources | Sagnik Majumder, Ziad Al-Halah, Kristen Grauman | ||
Q&A session | |||
5:00 - 5:30 (CDT) | Invited talk | Hilde Kuehne | |
5:30 - 6:00 (CDT) | Invited talk | Pedro Morgado | |
Presentation instructions
Organizers
Andrew Owens, University of Michigan
Jiajun Wu, Stanford
Arsha Nagrani
Triantafyllos Afouras, Meta
Ruohan Gao, Stanford
Hang Zhao, Tsinghua
William Freeman, MIT/Google
Andrew Zisserman, Oxford
Kristen Grauman, UT Austin / Meta
Antonio Torralba, MIT
Jean-Charles Bazin, Meta