Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024
- Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. For the latter option, please upload the video to CMT as an .mp4 supplementary file, alongside the PDF of your paper, by June 9th (11:59 PST).
- We'll have a paper presentation session from 9am to 10:30am, with a mix of in-person and video presentations. Short Q&A sessions are interspersed throughout the session, each covering the papers presented before it. We'll also release recordings on our website for offline viewing, and we'll post the paper schedule in the coming weeks.
- You are also welcome to present a poster at the end of the workshop, during the lunch break. Please note that CVPR workshop posters are sometimes hosted in a different room, which may be some distance from the workshop itself.
- Please also submit the camera-ready version of your paper via CMT by June 5th (11:59 PST). Papers will be available on our website.
- Looking forward to seeing you there!
Schedule
| Time (CT) | Event |
| --- | --- |
| 9:00 | Welcome |
| 9:00 - 10:15 | Paper session (see below) |
| 10:15 - 10:30 | Coffee break |
| 10:30 - 11:00 | Invited talk: James Rehg, "Learning to Infer Audio-Visual Attention in Social Communication" |
| 11:00 - 11:30 | Invited talk: David Harwath, "Sight and Sound with Large Language Models: Applications to Video Dubbing and Spatial Sound Understanding" |
| 11:30 - 12:00 | Invited talk: Ziyang Chen, "Learning Sight and Sound through Generative Models" |
| 12:00 - 12:30 | Invited talk: Stella Yu |

Paper session (9:00 - 10:15 CT)

| Paper | Authors |
| --- | --- |
| CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment | Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne |
| Diagnosing and Treating Audio-Video Fake Detection | Marcel Klemt, Carlotta Segna, Anna Rohrbach |
| UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing | Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee |
| STM2DVG: Synthetically Trained Music to Dance Video Generation leveraging Latent Diffusion Framework | No Kap Park |
| Q&A session | |
| Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes | Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak |
| AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation | Xiaohu LIU, Sascha Hornauer, Fabien Moutarde, Jialiang Lu |
| SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation | Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji |
| Q&A session | |
| BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds | Yuto Shibata, Yusuke Oumi, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa |
| Visual Sound Source Localization: Assessing Performance with Both Positive and Negative Audio | Xavier Juanola, Giovana Morais, Gloria Haro, Magdalena Fuentes |
| VGGSounder: Audio-Visual Evaluations for Foundation Models | Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Wieland Brendel, Matthias Bethge, Almut Sophia Koepke |
| Q&A session | |
Organizers
| Name | Affiliation |
| --- | --- |
| Andrew Owens | University of Michigan |
| Jiajun Wu | Stanford |
| Arsha Nagrani | |
| Triantafyllos Afouras | Meta |
| Ruohan Gao | Meta / University of Maryland |
| Hang Zhao | Tsinghua University |
| Ziyang Chen | University of Michigan |
| William Freeman | MIT / Google |
| Andrew Zisserman | Oxford |
| Kristen Grauman | UT Austin / Meta |
| Antonio Torralba | MIT |
| Jean-Charles Bazin | Meta |



