Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024
- Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. For the latter option, please upload the video to CMT as an .mp4 supplementary file, alongside the PDF of your paper, by June 9th (11:59 PST).
- We'll have a paper presentation session from 9am to 10:30am, with a mix of in-person and video presentations. Short Q&A sessions are interspersed throughout the session, each covering the papers presented before it. We'll also release recordings on our website for offline viewing, and we'll post the paper schedule in the coming weeks.
- You are also welcome to present a poster at the end of the workshop, during the lunch break. Please note that CVPR workshop posters are sometimes hosted in a different room, which may be some distance from the workshop itself.
- Please also submit the camera-ready version of your paper via CMT by June 5th (11:59 PST). Papers will be available on our website.
- Looking forward to seeing you there!
Schedule
| Time (CT) | Event |
| --- | --- |
| 9:00 | Welcome |
| 9:00 - 10:15 | Paper session (see below) |
| 10:15 - 10:30 | Coffee break |
| 10:30 - 11:00 | Invited talk: James Rehg, "Learning to Infer Audio-Visual Attention in Social Communication" |
| 11:00 - 11:30 | Invited talk: David Harwath, "Sight and Sound with Large Language Models: Applications to Video Dubbing and Spatial Sound Understanding" |
| 11:30 - 12:00 | Invited talk: Ziyang Chen, "Learning Sight and Sound through Generative Models" |
| 12:00 - 12:30 | Invited talk: Stella Yu |

Paper session (9:00 - 10:15 CT)

| Paper | Authors |
| --- | --- |
| CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment | Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne |
| Diagnosing and Treating Audio-Video Fake Detection | Marcel Klemt, Carlotta Segna, Anna Rohrbach |
| UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing | Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee |
| STM2DVG: Synthetically Trained Music to Dance Video Generation leveraging Latent Diffusion Framework | No Kap Park |
| Q&A session | |
| Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes | Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak |
| AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation | Xiaohu LIU, Sascha Hornauer, Fabien Moutarde, Jialiang Lu |
| SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation | Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji |
| Q&A session | |
| BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds | Yuto Shibata, Yusuke Oumi, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa |
| Visual Sound Source Localization: Assessing Performance with Both Positive and Negative Audio | Xavier Juanola, Giovana Morais, Gloria Haro, Magdalena Fuentes |
| VGGSounder: Audio-Visual Evaluations for Foundation Models | Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Wieland Brendel, Matthias Bethge, Almut Sophia Koepke |
| Q&A session | |
Organizers
| Name | Affiliation |
| --- | --- |
| Andrew Owens | University of Michigan |
| Jiajun Wu | Stanford |
| Arsha Nagrani | |
| Triantafyllos Afouras | Meta |
| Ruohan Gao | Meta / University of Maryland |
| Hang Zhao | Tsinghua University |
| Ziyang Chen | University of Michigan |
| William Freeman | MIT / Google |
| Andrew Zisserman | Oxford |
| Kristen Grauman | UT Austin / Meta |
| Antonio Torralba | MIT |
| Jean-Charles Bazin | Meta |



