In recent years, there have been many advances in learning from visual and auditory data. While traditionally these modalities have been studied in isolation, researchers have increasingly been creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, in fields such as speech processing and video editing.

Schedule

This is a half-day workshop that will take place in the afternoon on Friday, June 22, 2018.

1:30 - 1:35 Welcome
1:35 - 2:00 Paper Session 1 Ruohan Gao, Rogerio Feris, Kristen Grauman Learning to Separate Object Sounds by Watching Unlabeled Video
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg Visual to Sound: Generating Natural Sound for Videos in the Wild
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon On Learning Association of Sound Source and Visual Scenes
2:00 - 2:30 Invited Talk Antonio Torralba (MIT)
2:30 - 3:00 Invited Talk Joon Son Chung (Oxford)
3:00 - 3:30 Paper Session 2 Arsha Nagrani, Samuel Albanie, Andrew Zisserman Learnable PINs: Cross-Modal Embeddings for Person Identity
Zhoutong Zhang, Jiajun Wu, Qiujia Li, Zhengjia Huang, Joshua Tenenbaum, William Freeman Inverting Audio-Visual Simulation for Shape and Material Perception
Chiori Hori, Takaaki Hori, Gordon Wichern, Jue Wang, Teng-Yok Lee, Anoop Cherian, Tim Marks Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description
3:30 - 4:00 Afternoon Break & Posters
4:00 - 4:30 Invited Talk William Freeman (MIT/Google) & Tali Dekel (Google)
4:30 - 5:00 Invited Talk Relja Arandjelović (DeepMind)
5:00 - 5:30 Paper Session 3 Herman Kamper, Gregory Shakhnarovich, Karen Livescu Semantic speech retrieval with a visually grounded model of untranscribed speech
Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Duong, Patrick Perez, Gael Richard Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Abe Davis, Maneesh Agrawala Visual Rhythm and Beat

Accepted short papers

Learning to Separate Object Sounds by Watching Unlabeled Video Ruohan Gao (University of Texas at Austin), Rogerio Feris (IBM Research), Kristen Grauman (University of Texas at Austin)
Visual to Sound: Generating Natural Sound for Videos in the Wild Yipin Zhou (UNC-Chapel Hill), Zhaowen Wang (Adobe Research), Chen Fang (Adobe Research), Trung Bui (Adobe Research), Tamara Berg (UNC-Chapel Hill)
Fast forwarding Egocentric Videos by Listening and Watching Vinicius Furlan (Universidade Federal de Minas Gerais), Ruzena Bajcsy (UC Berkeley), Erickson Nascimento (Universidade Federal de Minas Gerais)
Learnable PINs: Cross-Modal Embeddings for Person Identity Arsha Nagrani (University of Oxford), Samuel Albanie (University of Oxford), Andrew Zisserman (University of Oxford)
The Sound of Pixels Hang Zhao (MIT), Chuang Gan (MIT), Andrew Rouditchenko (MIT), Carl Vondrick (MIT), Josh McDermott (MIT), Antonio Torralba (MIT)
On Learning Association of Sound Source and Visual Scenes Arda Senocak (KAIST), Tae-Hyun Oh (MIT CSAIL), Junsik Kim (KAIST), Ming-Hsuan Yang (University of California at Merced), In So Kweon (KAIST)
Image generation associated with music data Yue Qiu (University of Tsukuba), Hirokatsu Kataoka (National Institute of Advanced Industrial Science and Technology)
Semantic speech retrieval with a visually grounded model of untranscribed speech Herman Kamper (Stellenbosch University), Greg Shakhnarovich (Toyota Technological Institute at Chicago), Karen Livescu (Toyota Technological Institute at Chicago)
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events Sanjeel Parekh (Technicolor R&D France), Slim Essid (Telecom Paristech), Alexey Ozerov (Technicolor), Ngoc Duong (Technicolor), Patrick Perez (Technicolor), Gael Richard (Telecom Paristech)
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation Tali Dekel (Google), Miki Rubinstein (Google), Inbar Mosseri (Google), Bill Freeman (Google), Oran Lang (Google), Kevin Wilson (Google), Ariel Ephrat (HUJI), Avinatan Hassidim (Google)
The Excitement of Sports: Automatic Highlights Using Audio/Visual Cues Michele Merler (IBM Research), Dhiraj Joshi (IBM Research), Khoi-Nguyen Mac (UIUC), Quoc-Bao Nguyen (IBM Research), Stephen Hammer (IBM), John Kent (IBM), Jinjun Xiong (IBM Thomas J. Watson Research Center), Minh Do (UIUC), John Smith (IBM), Rogerio Feris (IBM Research)
A Multimodal Approach to Mapping Soundscapes Tawfiq Salem (University of Kentucky), Menghua Zhai (University of Kentucky), Scott Workman (University of Kentucky), Nathan Jacobs (University of Kentucky)
Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description Chiori Hori (Mitsubishi Electric Research Laboratories (MERL)), Takaaki Hori (MERL), Gordon Wichern (MERL), Jue Wang (MERL), Teng-Yok Lee (MERL), Anoop Cherian (MERL), Tim Marks (MERL)
Visual Rhythm and Beat Abe Davis (Stanford University), Maneesh Agrawala (Stanford University)
Inverting Audio-Visual Simulation for Shape and Material Perception Zhoutong Zhang (MIT), Jiajun Wu (MIT), Qiujia Li (MIT), Zhengjia Huang (ShanghaiTech University), Joshua Tenenbaum (MIT), Bill Freeman (MIT)
Posters should follow the CVPR 2018 format. Oral presentations should be no more than 10 minutes long (including questions). Papers are also available via the Computer Vision Foundation.

Organizers