Thanks, everyone, for joining us! We'll provide video recordings for the workshop in the coming days.


9:00 - 9:05 (CDT) Welcome
9:05 - 10:30 (CDT) Paper session #1 (Chair: Triantafyllos Afouras)

Quantized GAN for Complex Music Generation from Dance Videos Ye Zhu, Kyle B Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Jian Ren, Yan Yan, Sergey Tulyakov
Synchronisation of Lips and Voices Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro
A Model You Can Hear: Audio Classification with Playable Prototypes Romain Loiseau, Baptiste Bouvier, Yann Teytaut, Elliot Vincent, Mathieu Aubry, Loic Landrieu
Audio-Visual Object Localization in Egocentric Videos Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Q&A session
Audio-Visual Event Localization via Recursive Joint Co-Attention Bin Duan, Hugo M Latapie, Gaowen Liu, Yan Yan
The Sound of Motion: Multimodal horse motion estimation from video and audio Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
Learning Sound Localization Better From Semantically Similar Samples Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon
SVTS: Scalable Video-to-Speech Synthesis - Extended Abstract Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Bjoern W. Schuller, Maja Pantic
Q&A session
Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
Exploring a Probabilistic Approach to Vehicle Sound Source Localization in Urban Scenes Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Bea Steers, Ali Abavisani, Juan P Bello
SEMI: Self-supervised Exploration via Multisensory Incongruity Ziwen Zhuang, Jianren Wang, Hang Zhao
Sound Adversarial Audio-Visual Navigation Yinfeng Yu, Changan Chen, Fuchun Sun
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
Q&A session
10:30 - 11:00 (CDT) Coffee break & posters
11:00 - 11:30 (CDT) Invited talk
Arsha Nagrani
11:30 - 12:00 (CDT) Invited talk
Jeannette Bohg
12:00 - 1:00 (CDT) Lunch
1:00 - 2:00 (CDT) Paper session #2

How to Listen? Rethinking Visual Sound Localization Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan P Bello
Urban Sound & Sight: Dataset and benchmark for Audio-Visual Urban Scene Understanding Magdalena Fuentes, Bea A Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, Juan P Bello
On Negative Sampling for Audio-Visual Contrastive Learning from Movies Mahdi M. Kalayeh, Shervin Ardeshir, Nagendra Kamath, Lingyi Liu, Ashok Chandrashekar
Audio-visual voice separation transformer Juan Felipe Montesinos, Venkatesh Shenoy Kadandale, Gloria Haro
Q&A session
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Tap to the Beat: Cross-modal Music Beat Localization for Dancing Videos Tianyi Ma, Yu Wu
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection - Extended Abstract Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic
Visual Speech Recognition for Multiple Languages Pingchuan Ma, Stavros Petridis, Maja Pantic
Q&A session
2:00 - 2:30 (CDT) Invited talk
David Brang
2:30 - 3:30 (CDT) Coffee break & posters
3:30 - 4:00 (CDT) Invited talk
Carl Vondrick
4:00 - 5:00 (CDT) Invited paper talks (Chair: Ruohan Gao)

Taming Visually Guided Sound Generation Vladimir Iashin, Esa Rahtu
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard
Learning to Answer Questions in Dynamic Audio-Visual Scenarios Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
Q&A session
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou
Sound and Visual Representation Learning with Multiple Pretraining Tasks Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
Active Audio-Visual Separation of Dynamic Sound Sources Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
Q&A session
5:00 - 5:30 (CDT) Invited talk
Hilde Kuehne
5:30 - 6:00 (CDT) Invited talk
Pedro Morgado

Presentation instructions

  • Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. For the latter option, please submit the video by June 15th (11:59 pm PST) to CMT, following the previous CVPR oral instructions here (upload as a .mp4 file).
  • We'll have two paper presentation sessions: 9am - 11am and 1pm - 2pm. Each session will be a mix of in-person and video presentations. Each paper session will include short Q&A slots covering the papers presented immediately before them. We'll also release recordings on our website for offline viewing, and we'll post the paper schedule in the coming weeks.
  • You are welcome to present a poster during the lunch and coffee breaks, though this is optional. Unfortunately, we are unable to offer a hybrid option for posters.
  • Please also submit the camera-ready version of your paper via CMT by June 17th (11:59 pm PST). Papers will be available on our website.
  • Looking forward to seeing you there!