There are two ways to attend:

  • If you are registered for CVPR, you can join over Zoom.
  • The workshop is also being streamed on Facebook.


9:00 - 9:05 (PST) Welcome
9:05 - 11:00 (PST) Paper session [Video] Session chairs: Arsha Nagrani and Triantafyllos Afouras

Synthetic Acoustic Image Generation for Audio-Visual Localization
Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, Vittorio Murino
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Self-Supervised Learning for Cross-Modal Retrieval based on Sound Category and Location
Tomoya Sato, Yusuke Sugano, Yoichi Sato
Estimating Individual A Cappella Voices in Music Videos with Singing Faces
Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro
Q&A session
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset - Extended Abstract
Ian A Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
Cascaded Multilingual Audio-Visual Learning from Videos - Extended Abstract
Andrew Rouditchenko, Angie W Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
End-To-End Video-To-Speech Synthesis using Generative Adversarial Networks with Multiple Critics
Rodrigo Schonburg Carrillo de Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Bjoern W. Schuller, Maja Pantic
Neural Dubber: Dubbing for Silent Videos According to Scripts
Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Q&A session
Learning Representations from Audio-Visual Spatial Alignment
Yi Li, Pedro Morgado, Nuno Vasconcelos
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie W Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Material Converter: Manipulating Materials of Visual Objects with Sound
Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention
Kranti K Parida, Siddharth Srivastava, Neeraj Matiyali, Gaurav Sharma
Q&A session
Localizing Visual Sounds the Hard Way
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Face-to-Music Translation
Chelhwon Kim, Andrew Port, Mitesh Patel
Q&A session

11:00 - 11:30 (PST) Invited talk
Justin Salamon
11:30 - 12:00 (PST) Invited talk
Chenliang Xu
12:00 - 12:30 (PST) Invited talk
Kristen Grauman
12:30 - 2:00 (PST) Invited paper talks [Video] Session chair: Ruohan Gao

[Paper] The Boombox: Visual Reconstruction from Acoustic Vibrations
Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick
[Paper] Visually Informed Binaural Audio Generation without Binaural Audios
Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
[Paper] Unsupervised Sound Localization via Iterative Contrastive Learning
Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
[Paper] See, Hear, Explore: Curiosity via Audio-Visual Association
Victoria Dean, Shubham Tulsiani, Abhinav Gupta
Q&A session
[Paper] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
[Paper] Repetitive Activity Counting by Sight and Sound
Yunhua Zhang, Ling Shao, Cees G. M. Snoek
[Paper] AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
[Paper] Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi
Q&A session

2:00 - 2:30 (PST) Invited talk
Dima Damen
2:30 - 3:00 (PST) Invited talk
Chuang Gan
3:00 - 3:30 (PST) Invited talk
John Hershey & Efthymios Tzinis
3:30 - 4:00 (PST) Invited talk
James Traer
Hearing the world with noise (and statistics)


While visual and audio data have traditionally been studied in isolation, researchers are increasingly creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound data into existing vision algorithms — is appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in the fields of speech processing and video editing.

Previous workshops: 2018, 2019, 2020

Presentation instructions

  • Authors of accepted papers can present a 5-minute (or shorter) talk about their work. Please submit the video by June 18th (11:59 PST) to CMT, following the CVPR oral instructions here (uploading as a .mp4 file).
  • We'll have a paper presentation session from 9:00am to 11:00am PST on June 20. During this session, we'll play the pre-recorded talks, with time for Q&A from authors (if they are present). We'll also release the videos on our website for offline viewing.
  • Please also submit the camera ready version of your paper via CMT by June 18th (11:59 PST). Papers will be available on our website.
  • Looking forward to seeing you there!


Andrew Owens
University of Michigan

Jiajun Wu

Arsha Nagrani

Triantafyllos Afouras

Ruohan Gao

William Freeman

Andrew Zisserman

Kristen Grauman
UT Austin / Facebook

Antonio Torralba

Jean-Charles Bazin