In recent years, there have been many advances in learning from visual and auditory data. While these modalities have traditionally been studied in isolation, researchers are increasingly developing algorithms that learn from both. This has produced exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.
Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. Finally, we'll discuss how these techniques are being used to build new audio-visual applications in areas such as speech processing and video editing.
Please see last year's edition of this workshop, held at CVPR 2018.
Accepted short papers
Sound to Visual: Hierarchical Cross-Modal Talking Face Generation
Lele Chen, Haitian Zheng, Ross Maddox, Zhiyao Duan, Chenliang Xu

Audio-Visual Event Localization in the Wild
Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

Audio-Visual Interpretable and Controllable Video Captioning
Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

Reflection and Diffraction-Aware Sound Source Localization
Inkyu An, Jung-Woo Choi, Dinesh Manocha, Sung-Eui Yoon

Generating Video from Single Image and Sound
Yukitaka Tsuchiya, Takahiro Itazuri, Ryota Natsume, Shintaro Yamamoto, Takuya Kato, Shigeo Morishima

WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks
Amanda Cardoso Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto

On Attention Modules for Audio-Visual Synchronization
Naji Khosravan, Shervin Ardeshir, Rohit Puri

Grounding Spoken Words in Unlabeled Video
Angie W Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass

A Neurorobotic Experiment for Crossmodal Conflict Resolution
German Parisi, Pablo Barros, Di Fu, Sven Magg, Haiyan Wu, Xun Liu, Stefan Wermter

End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs
Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

Organizers
Andrew Owens, UC Berkeley
Jiajun Wu, MIT
William Freeman, MIT/Google
Andrew Zisserman, Oxford
Jean-Charles Bazin, KAIST
Zhengyou Zhang, Tencent
Antonio Torralba, MIT
Kristen Grauman, UT Austin / Facebook