In recent years, there have been many advances in learning from visual and auditory data. While traditionally these modalities have been studied in isolation, researchers have increasingly been creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, in fields such as speech processing and video editing.

Schedule

This is a half-day workshop that will take place in the afternoon on Friday, June 22, 2018.

1:30 - 1:35 Welcome
1:35 - 2:00 Paper Session 1 Ruohan Gao, Rogerio Feris, Kristen Grauman Learning to Separate Object Sounds by Watching Unlabeled Video
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg Visual to Sound: Generating Natural Sound for Videos in the Wild
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon On Learning Association of Sound Source and Visual Scenes
2:00 - 2:30 Invited Talk Antonio Torralba (MIT)
2:30 - 3:00 Invited Talk Joon Son Chung (Oxford)
3:00 - 3:30 Paper Session 2 Arsha Nagrani, Samuel Albanie, Andrew Zisserman Learnable PINs: Cross-Modal Embeddings for Person Identity
Zhoutong Zhang, Jiajun Wu, Qiujia Li, Zhengjia Huang, Joshua Tenenbaum, William Freeman Inverting Audio-Visual Simulation for Shape and Material Perception
Chiori Hori, Takaaki Hori, Gordon Wichern, Jue Wang, Teng-Yok Lee, Anoop Cherian, Tim Marks Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description
3:30 - 4:00 Afternoon Break & Posters
4:00 - 4:30 Invited Talk William Freeman (MIT/Google) & Tali Dekel (Google)
4:30 - 5:00 Invited Talk Relja Arandjelović (DeepMind)
5:00 - 5:30 Paper Session 3 Herman Kamper, Gregory Shakhnarovich, Karen Livescu Semantic speech retrieval with a visually grounded model of untranscribed speech
Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Duong, Patrick Perez, Gael Richard Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Abe Davis, Maneesh Agrawala Visual Rhythm and Beat

Accepted short papers

Learning to Separate Object Sounds by Watching Unlabeled Video Ruohan Gao (University of Texas at Austin), Rogerio Feris (IBM Research), Kristen Grauman (University of Texas at Austin)
Visual to Sound: Generating Natural Sound for Videos in the Wild Yipin Zhou (UNC-Chapel Hill), Zhaowen Wang (Adobe Research), Chen Fang (Adobe Research), Trung Bui (Adobe Research), Tamara Berg (UNC-Chapel Hill)
Fast forwarding Egocentric Videos by Listening and Watching Vinicius Furlan (Universidade Federal de Minas Gerais), Ruzena Bajcsy (UC Berkeley), Erickson Nascimento (Universidade Federal de Minas Gerais)
Learnable PINs: Cross-Modal Embeddings for Person Identity Arsha Nagrani (University of Oxford), Samuel Albanie (University of Oxford), Andrew Zisserman (University of Oxford)
The Sound of Pixels Hang Zhao (MIT), Chuang Gan (MIT), Andrew Rouditchenko (MIT), Carl Vondrick (MIT), Josh McDermott (MIT), Antonio Torralba (MIT)
On Learning Association of Sound Source and Visual Scenes Arda Senocak (KAIST), Tae-Hyun Oh (MIT CSAIL), Junsik Kim (KAIST), Ming-Hsuan Yang (University of California at Merced), In So Kweon (KAIST)
Image generation associated with music data Yue Qiu (University of Tsukuba), Hirokatsu Kataoka (National Institute of Advanced Industrial Science and Technology)
Semantic speech retrieval with a visually grounded model of untranscribed speech Herman Kamper (Stellenbosch University), Greg Shakhnarovich (Toyota Technological Institute at Chicago), Karen Livescu (Toyota Technological Institute at Chicago)
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events Sanjeel Parekh (Technicolor R&D France), Slim Essid (Telecom Paristech), Alexey Ozerov (Technicolor), Ngoc Duong (Technicolor), Patrick Perez (Technicolor), Gael Richard (Telecom Paristech)
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation Tali Dekel (Google), Miki Rubinstein (Google), Inbar Mosseri (Google), Bill Freeman (Google), Oran Lang (Google), Kevin Wilson (Google), Ariel Ephrat (HUJI), Avinatan Hassidim (Google)
The Excitement of Sports: Automatic Highlights Using Audio/Visual Cues Michele Merler (IBM Research), Dhiraj Joshi (IBM Research), Khoi-Nguyen Mac (UIUC), Quoc-Bao Nguyen (IBM Research), Stephen Hammer (IBM), John Kent (IBM), Jinjun Xiong (IBM Thomas J. Watson Research Center), Minh Do (UIUC), John Smith (IBM), Rogerio Feris (IBM Research)
A Multimodal Approach to Mapping Soundscapes Tawfiq Salem (University of Kentucky), Menghua Zhai (University of Kentucky), Scott Workman (University of Kentucky), Nathan Jacobs (University of Kentucky)
Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description Chiori Hori (Mitsubishi Electric Research Laboratories (MERL)), Takaaki Hori (MERL), Gordon Wichern (MERL), Jue Wang (MERL), Teng-Yok Lee (MERL), Anoop Cherian (MERL), Tim Marks (MERL)
Visual Rhythm and Beat Abe Davis (Stanford University), Maneesh Agrawala (Stanford University)
Inverting Audio-Visual Simulation for Shape and Material Perception Zhoutong Zhang (MIT), Jiajun Wu (MIT), Qiujia Li (MIT), Zhengjia Huang (ShanghaiTech University), Joshua Tenenbaum (MIT), Bill Freeman (MIT)
Posters should follow the CVPR 2018 format. Oral presentations should be no more than 10 minutes long (including questions). Papers are also available via the Computer Vision Foundation.

Organizers