In recent years, there have been many advances in learning from visual and auditory data. While these modalities have traditionally been studied in isolation, researchers are increasingly developing algorithms that learn from both. This has produced exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.
Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. Finally, we'll discuss how these techniques are being used to build new audio-visual applications in areas such as speech processing and video editing.
Please see last year's edition of this workshop, held at CVPR 2018.
Accepted short papers
Sound to Visual: Hierarchical Cross-Modal Talking Face Generation
Lele Chen, Haitian Zheng, Ross Maddox, Zhiyao Duan, Chenliang Xu

Audio-Visual Event Localization in the Wild
Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

Audio-Visual Interpretable and Controllable Video Captioning
Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

Reflection and Diffraction-Aware Sound Source Localization
Inkyu An, Jung-Woo Choi, Dinesh Manocha, Sung-Eui Yoon

Generating Video from Single Image and Sound
Yukitaka Tsuchiya, Takahiro Itazuri, Ryota Natsume, Shintaro Yamamoto, Takuya Kato, Shigeo Morishima

WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks
Amanda Cardoso Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto

On Attention Modules for Audio-Visual Synchronization
Naji Khosravan, Shervin Ardeshir, Rohit Puri

Grounding Spoken Words in Unlabeled Video
Angie W Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, James Glass

A Neurorobotic Experiment for Crossmodal Conflict Resolution
German Parisi, Pablo Barros, Di Fu, Sven Magg, Haiyan Wu, Xun Liu, Stefan Wermter

End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs
Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

Organizers
Andrew Owens, UC Berkeley
Jiajun Wu, MIT
William Freeman, MIT/Google
Andrew Zisserman, Oxford
Jean-Charles Bazin, KAIST
Zhengyou Zhang, Tencent
Antonio Torralba, MIT
Kristen Grauman, UT Austin / Facebook