Sight and Sound - CVPR 2022

Thanks, everyone, for joining us! We'll provide video recordings for the workshop in the coming days.

Schedule

9:00 - 9:05 (CDT)	Welcome
9:05 - 10:30 (CDT)	Paper session #1		Chair: Triantafyllos Afouras
	Quantized GAN for Complex Music Generation from Dance Videos		Ye Zhu, Kyle B Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Jian Ren, Yan Yan, Sergey Tulyakov
	Synchronisation of Lips and Voices		Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro
	A Model You Can Hear: Audio Classification with Playable Prototypes		Romain Loiseau, Baptiste Bouvier, Yann Teytaut, Elliot Vincent, Mathieu Aubry, Loic Landrieu
	Audio-Visual Object Localization in Egocentric Videos		Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
	Q&A session
	Audio-Visual Event Localization via Recursive Joint Co-Attention		Bin Duan, Hugo M Latapie, Gaowen Liu, Yan Yan
	The Sound of Motion: Multimodal horse motion estimation from video and audio		Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
	Learning Sound Localization Better From Semantically Similar Samples		Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon
	SVTS: Scalable Video-to-Speech Synthesis - Extended Abstract		Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Bjoern W. Schuller, Maja Pantic
	Q&A session
	Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio		Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
	Exploring a Probabilistic Approach to Vehicle Sound Source Localization in Urban Scenes		Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Bea Steers, Ali Abavisani, Juan P Bello
	SEMI: Self-supervised Exploration via Multisensory Incongruity		Ziwen Zhuang, Jianren Wang, Hang Zhao
	Sound Adversarial Audio-Visual Navigation		Yinfeng Yu, Changan Chen, Fuchun Sun
	ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound		Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
	Q&A session
10:30 - 11:00 (CDT)	Coffee break & posters
11:00 - 11:30 (CDT)	Invited talk	Arsha Nagrani
11:30 - 12:00 (CDT)	Invited talk	Jeannette Bohg
12:00 - 1:00 (CDT)	Lunch
1:00 - 2:00 (CDT)	Paper session #2
	How to Listen? Rethinking Visual Sound Localization		Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan P Bello
	Urban Sound & Sight: Dataset and benchmark for Audio-Visual Urban Scene Understanding		Magdalena Fuentes, Bea A Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, Juan P Bello
	On Negative Sampling for Audio-Visual Contrastive Learning from Movies		Mahdi M. Kalayeh, Shervin Ardeshir, Kamath Nagendra, Lingyi Liu, Ashok Chandrashekar
	Audio-visual voice separation transformer		Juan Felipe Montesinos, Venkatesh Shenoy Kadandale, Gloria Haro
	Q&A session
	Everything at Once - Multi-modal Fusion Transformer for Video Retrieval		Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
	Tap to the Beat: Cross-modal Music Beat Localization for Dancing Videos		Tianyi Ma, Yu Wu
	Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection - Extended Abstract		Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic
	Visual Speech Recognition for Multiple Languages		Pingchuan Ma, Stavros Petridis, Maja Pantic
	Q&A session
2:00 - 2:30 (CDT)	Invited talk	David Brang
2:30 - 3:30 (CDT)	Coffee break & posters
3:30 - 4:00 (CDT)	Invited talk	Carl Vondrick
4:00 - 5:00 (CDT)	Invited paper talks		Chair: Ruohan Gao
	Taming visually guided sound generation		Vladimir Iashin, Esa Rahtu
	Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis		Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard
	Learning to Answer Questions in Dynamic Audio-Visual Scenarios		Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
	Q&A session
	Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
	Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation		Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou
	Sound and Visual Representation Learning with Multiple Pretraining Tasks		Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
	Active Audio-Visual Separation of Dynamic Sound Sources		Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
	Q&A session
5:00 - 5:30 (CDT)	Invited talk	Hilde Kuehne
5:30 - 6:00 (CDT)	Invited talk	Pedro Morgado

Presentation instructions

Authors of accepted papers will give a 5-minute talk about their work. You may either present in person or submit a video. If you choose to submit a video, please upload it to CMT as an .mp4 file by June 15th (11:59 PST), following the previous CVPR oral instructions here.
We'll have two paper presentation sessions: 9am - 11am and 1pm - 2pm. Each session will mix in-person and video presentations. Short Q&A sessions are spread throughout each paper session, each covering the papers that precede it. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
You are also welcome to present a poster during the lunch and coffee breaks. Unfortunately, we are unable to offer a hybrid option for posters.
Please also submit the camera-ready version of your paper via CMT by June 17th (11:59 PST). Papers will be available on our website.
Looking forward to seeing you there!