Attending
There are two ways to attend:
Schedule
9:00 - 9:05 (PST) | Welcome
9:05 - 11:00 (PST) | Paper session [Video] | Session chairs: Arsha Nagrani and Triantafyllos Afouras
Synthetic Acoustic Image Generation for Audio-Visual Localization | Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, Vittorio Murino
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Self-Supervised Learning for Cross-Modal Retrieval based on Sound Category and Location | Tomoya Sato, Yusuke Sugano, Yoichi Sato
Estimating Individual A Cappella Voices in Music Videos with Singing Faces | Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro
Q&A session
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset - Extended Abstract | Ian A Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
Cascaded Multilingual Audio-Visual Learning from Videos - Extended Abstract | Andrew Rouditchenko, Angie W Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
End-To-End Video-To-Speech Synthesis using Generative Adversarial Networks with Multiple Critics | Rodrigo Schonburg Carrillo de Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Bjoern W. Schuller, Maja Pantic
Neural Dubber: Dubbing for Silent Videos According to Scripts | Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Q&A session
Learning Representations from Audio-Visual Spatial Alignment | Yi Li, Pedro Morgado, Nuno Vasconcelos
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos | Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie W Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Material Converter: Manipulating Materials of Visual Objects with Sound | Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention | Kranti K Parida, Siddharth Srivastava, Neeraj Matiyali, Gaurav Sharma
Q&A session
Localizing Visual Sounds the Hard Way | Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Face-to-Music Translation | Chelhwon Kim, Andrew Port, Mitesh Patel
Q&A session
11:00 - 11:30 (PST) | Invited talk [Video]
Justin Salamon
11:30 - 12:00 (PST) | Invited talk [Video]
Chenliang Xu
12:00 - 12:30 (PST) | Invited talk [Video]
Kristen Grauman
12:30 - 2:00 (PST) | Invited paper talks [Video] | Session chair: Ruohan Gao
[Paper] | The Boombox: Visual Reconstruction from Acoustic Vibrations | Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick
[Paper] | Visually Informed Binaural Audio Generation without Binaural Audios | Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
[Paper] | Unsupervised Sound Localization via Iterative Contrastive Learning | Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
[Paper] | See, hear, explore: Curiosity via audio-visual association | Victoria Dean, Shubham Tulsiani, Abhinav Gupta
Q&A session
[Paper] | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
[Paper] | Repetitive Activity Counting by Sight and Sound | Yunhua Zhang, Ling Shao, Cees G. M. Snoek
[Paper] | AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition | Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
[Paper] | Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning | Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi
Q&A session
2:00 - 2:30 (PST) | Invited talk [Video]
Dima Damen
2:30 - 3:00 (PST) | Invited talk [Video]
Chuang Gan
3:00 - 3:30 (PST) | Invited talk [Video]
John Hershey & Efthymios Tzinis
3:30 - 4:00 (PST) | Invited talk [Video]
James Traer | Hearing the world with noise (and statistics)
Summary
Visual and audio data have traditionally been studied in isolation, but researchers are increasingly building algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.
Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision does not, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in speech processing and video editing.
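To make the self-supervised flavor of this work concrete, here is a minimal, purely illustrative PyTorch sketch (not the method of any paper listed above): two small encoders are trained with a contrastive objective so that a clip's visual embedding matches the embedding of its own soundtrack. The encoder architectures, feature sizes, and temperature are assumptions chosen only to keep the example small and runnable.

```python
# Illustrative sketch: audio-visual correspondence learning with a
# symmetric contrastive (InfoNCE-style) loss. All architectural choices
# here are placeholder assumptions, not any specific paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVContrastiveModel(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Toy video encoder over (B, 3, T, H, W) frame stacks.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Toy audio encoder over (B, 1, F, T) log-mel spectrograms.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, frames, spectrogram):
        # L2-normalize so dot products act as cosine similarities.
        v = F.normalize(self.video_encoder(frames), dim=-1)
        a = F.normalize(self.audio_encoder(spectrogram), dim=-1)
        return v, a

def av_contrastive_loss(v, a, temperature=0.07):
    # Audio and video from the same clip are positives; every other
    # pairing in the batch serves as a negative, in both directions.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example training step on random tensors standing in for real data.
model = AVContrastiveModel()
frames = torch.randn(4, 3, 8, 64, 64)      # batch of 4 short RGB clips
spectrogram = torch.randn(4, 1, 64, 100)   # matching log-mel spectrograms
v, a = model(frames, spectrogram)
loss = av_contrastive_loss(v, a)
loss.backward()
```

In practice the toy encoders would be replaced by stronger video and audio backbones, and the random tensors by real frame stacks and spectrograms extracted from the same videos.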
Previous workshops: 2018, 2019, 2020
Presentation instructions
Organizers
Andrew Owens
University of Michigan
Jiajun Wu
Stanford
Arsha Nagrani
Google
Triantafyllos Afouras
Oxford
Ruohan Gao
Stanford
William Freeman
MIT/Google
Andrew Zisserman
Oxford
Kristen Grauman
UT Austin / Facebook
Antonio Torralba
MIT
Jean-Charles Bazin
KAIST