Attending
There are two ways to attend:
Schedule
9:00 - 9:05 (PST) | Welcome
9:05 - 11:00 (PST) | Paper session [Video] | Session chairs: Arsha Nagrani and Triantafyllos Afouras
Synthetic Acoustic Image Generation for Audio-Visual Localization | Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, Vittorio Murino
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Self-Supervised Learning for Cross-Modal Retrieval based on Sound Category and Location | Tomoya Sato, Yusuke Sugano, Yoichi Sato
Estimating Individual A Cappella Voices in Music Videos with Singing Faces | Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro
Q&A session
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset - Extended Abstract | Ian A Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
Cascaded Multilingual Audio-Visual Learning from Videos - Extended Abstract | Andrew Rouditchenko, Angie W Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
End-To-End Video-To-Speech Synthesis using Generative Adversarial Networks with Multiple Critics | Rodrigo Schonburg Carrillo de Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Bjoern W. Schuller, Maja Pantic
Neural Dubber: Dubbing for Silent Videos According to Scripts | Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Q&A session
Learning Representations from Audio-Visual Spatial Alignment | Yi Li, Pedro Morgado, Nuno Vasconcelos
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos | Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie W Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Material Converter: Manipulating Materials of Visual Objects with Sound | Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention | Kranti K Parida, Siddharth Srivastava, Neeraj Matiyali, Gaurav Sharma
Q&A session
Localizing Visual Sounds the Hard Way | Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Face-to-Music Translation | Chelhwon Kim, Andrew Port, Mitesh Patel
Q&A session
11:00 - 11:30 (PST) | Invited talk [Video]
Justin Salamon
11:30 - 12:00 (PST) | Invited talk [Video]
Chenliang Xu
12:00 - 12:30 (PST) | Invited talk [Video]
Kristen Grauman
12:30 - 2:00 (PST) | Invited paper talks [Video] | Session chair: Ruohan Gao
[Paper] | The Boombox: Visual Reconstruction from Acoustic Vibrations | Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick
[Paper] | Visually Informed Binaural Audio Generation without Binaural Audios | Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
[Paper] | Unsupervised Sound Localization via Iterative Contrastive Learning | Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
[Paper] | See, hear, explore: Curiosity via audio-visual association | Victoria Dean, Shubham Tulsiani, Abhinav Gupta
Q&A session
[Paper] | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
[Paper] | Repetitive Activity Counting by Sight and Sound | Yunhua Zhang, Ling Shao, Cees G. M. Snoek
[Paper] | AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition | Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
[Paper] | Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning | Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi
Q&A session
2:00 - 2:30 (PST) | Invited talk [Video]
Dima Damen
2:30 - 3:00 (PST) | Invited talk [Video]
Chuang Gan
3:00 - 3:30 (PST) | Invited talk [Video]
John Hershey & Efthymios Tzinis
3:30 - 4:00 (PST) | Invited talk [Video]
James Traer | Hearing the world with noise (and statistics)
Summary
Visual and audio data have traditionally been studied in isolation, but researchers are increasingly building algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.
Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision does not, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in speech processing and video editing.
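To make the self-supervised flavor of this work concrete, here is a minimal, purely illustrative PyTorch sketch (not the method of any paper listed above): two small encoders are trained with a contrastive objective so that a clip's visual embedding matches the embedding of its own soundtrack. The encoder architectures, feature sizes, and temperature are assumptions chosen only to keep the example small and runnable.

```python
# Illustrative sketch: audio-visual correspondence learning with a
# symmetric contrastive (InfoNCE-style) loss. All architectural choices
# here are placeholder assumptions, not any specific paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVContrastiveModel(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Toy video encoder over (B, 3, T, H, W) frame stacks.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Toy audio encoder over (B, 1, F, T) log-mel spectrograms.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, frames, spectrogram):
        # L2-normalize so dot products act as cosine similarities.
        v = F.normalize(self.video_encoder(frames), dim=-1)
        a = F.normalize(self.audio_encoder(spectrogram), dim=-1)
        return v, a

def av_contrastive_loss(v, a, temperature=0.07):
    # Audio and video from the same clip are positives; every other
    # pairing in the batch serves as a negative, in both directions.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example training step on random tensors standing in for real data.
model = AVContrastiveModel()
frames = torch.randn(4, 3, 8, 64, 64)      # batch of 4 short RGB clips
spectrogram = torch.randn(4, 1, 64, 100)   # matching log-mel spectrograms
v, a = model(frames, spectrogram)
loss = av_contrastive_loss(v, a)
loss.backward()
```

In practice the toy encoders would be replaced by stronger video and audio backbones, and the random tensors by real frame stacks and spectrograms extracted from the same videos.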
Previous workshops: 2018, 2019, 2020
Presentation instructions
Organizers
Andrew Owens
University of Michigan
Jiajun Wu
Stanford
Arsha Nagrani
Google
Triantafyllos Afouras
Oxford
Ruohan Gao
Stanford
William Freeman
MIT/Google
Andrew Zisserman
Oxford
Kristen Grauman
UT Austin / Facebook
Antonio Torralba
MIT
Jean-Charles Bazin
KAIST