This workshop is on Monday in Grand Ballroom B.
Directions: enter the conference center, go up the escalator to reach the 2nd floor, and turn right.

9:00 - 9:05 Welcome
Paper session 1
9:05 - 9:15 Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
Audio-Visual Event Localization in the Wild
9:15 - 9:25 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
My lips are concealed: Audio-visual speech enhancement through obstructions
9:25 - 9:55 Invited talk Ruohan Gao (UT Austin)
Learning to See and Hear with Unlabeled Video
9:55 - 10:15 Poster spotlights. The authors of each poster will have the chance to present a 2-minute lightning talk.
10:15 - 11:00 Poster presentations & coffee break. Posters will be in the same room as the rest of the workshop (Grand Ballroom B). Poster numbers: 168 - 177.
11:00 - 11:30 Invited talk Lorenzo Torresani (Dartmouth/Facebook)
Audio-Visual Learning for Reduced Supervision and Improved Efficiency
11:30 - 12:00 Invited talk Tali Dekel (Google)
From face to speech, and back to face
12:00 - 1:45 Lunch
1:45 - 2:15 Invited talk Antonio Torralba (MIT)
2:15 - 3:00 Keynote talk Josh McDermott (MIT)
Old and New Problems in Auditory Scene Analysis
3:00 - 3:30 Invited talk Jitendra Malik (UC Berkeley)
Learning Individual Styles of Conversational Gesture
3:30 - 4:00 Coffee break
4:00 - 4:30 Invited talk Aäron van den Oord (DeepMind)
Paper session 2
4:30 - 4:40 Joon Son Chung
Audio-visual speaker diarisation in the wild


In recent years, there have been many advances in learning from visual and auditory data. While traditionally these modalities have been studied in isolation, researchers have increasingly been creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. But it will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in the fields of speech processing and video editing.

See the page for last year's workshop (at CVPR 2018).

Accepted short papers

Sound to Visual: Hierarchical Cross-Modal Talking Face Generation. Lele Chen, Haitian Zheng, Ross Maddox, Zhiyao Duan, Chenliang Xu
Audio-Visual Event Localization in the Wild. Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
Audio-Visual Interpretable and Controllable Video Captioning. Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu
Reflection and Diffraction-Aware Sound Source Localization. Inkyu An, Jung-Woo Choi, Dinesh Manocha, Sung-Eui Yoon
Generating Video from Single Image and Sound. Yukitaka Tsuchiya, Takahiro Itazuri, Ryota Natsume, Shintaro Yamamoto, Takuya Kato, Shigeo Morishima
WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks. Amanda Cardoso Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto
On Attention Modules for Audio-Visual Synchronization. Naji Khosravan, Shervin Ardeshir, Rohit Puri
Grounding Spoken Words in Unlabeled Video. Angie W Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass
A Neurorobotic Experiment for Crossmodal Conflict Resolution. German Parisi, Pablo Barros, Di Fu, Sven Magg, Haiyan Wu, Xun Liu, Stefan Wermter
End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs. Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Posters should follow the CVPR 2019 format. Oral presentations should be no more than 10 minutes long (including questions). Papers are available via the Computer Vision Foundation.