Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024
- Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. For the latter option, please upload the video to CMT as a supplementary .mp4 file, along with the PDF of your paper, by June 9th (11:59 PST).
- We'll have a paper presentation session from 9am to 10:30am, with a mix of in-person and video presentations. Throughout the session, there will be short Q&A slots covering the papers presented just before them. We'll also release recordings on our website for offline viewing, and we'll post the paper schedule in the coming weeks.
- You are also welcome to present a poster at the end of the workshop, during the lunch break. Please note that CVPR workshop posters are sometimes hosted in a different room, which may be some distance from the workshop itself.
- Please also submit the camera-ready version of your paper via CMT by June 5th (11:59 PST). Papers will be available on our website.
- Looking forward to seeing you there!
Schedule
9:00 (CT) | Welcome
9:00 - 10:15 (CT) | Paper session
- CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment | Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
- Diagnosing and Treating Audio-Video Fake Detection | Marcel Klemt, Carlotta Segna, Anna Rohrbach
- UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing | Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee
- STM2DVG: Synthetically Trained Music to Dance Video Generation leveraging Latent Diffusion Framework | No Kap Park
- Q&A session
- Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes | Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak
- AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation | Xiaohu LIU, Sascha Hornauer, Fabien Moutarde, Jialiang Lu
- SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation | Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji
- Q&A session
- BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds | Yuto Shibata, Yusuke Oumi, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa
- Visual Sound Source Localization: Assessing Performance with Both Positive and Negative Audio | Xavier Juanola, Giovana Morais, Gloria Haro, Magdalena Fuentes
- VGGSounder: Audio-Visual Evaluations for Foundation Models | Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Wieland Brendel, Matthias Bethge, Almut Sophia Koepke
- Q&A session
10:15 - 10:30 (CT) | Coffee break
10:30 - 11:00 (CT) | Invited talk: James Rehg, Learning to Infer Audio-Visual Attention in Social Communication
11:00 - 11:30 (CT) | Invited talk: David Harwath, Sight and Sound with Large Language Models: Applications to Video Dubbing and Spatial Sound Understanding
11:30 - 12:00 (CT) | Invited talk: Ziyang Chen, Learning Sight and Sound through Generative Models
12:00 - 12:30 (CT) | Invited talk: Stella Yu
Organizers
- Andrew Owens | University of Michigan
- Jiajun Wu | Stanford
- Arsha Nagrani
- Triantafyllos Afouras | Meta
- Ruohan Gao | Meta / University of Maryland
- Hang Zhao | Tsinghua University
- Ziyang Chen | University of Michigan
- William Freeman | MIT / Google
- Andrew Zisserman | Oxford
- Kristen Grauman | UT Austin / Meta
- Antonio Torralba | MIT
- Jean-Charles Bazin | Meta