Presentation instructions

Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024

  • Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. If you choose the video option, please upload it to CMT as an .mp4 supplementary file, along with the PDF of your paper, by June 9th (11:59 PST).
  • We'll have a paper presentation session from 9am - 10:30am, with a mix of in-person and video presentations. Short Q&A sessions throughout the paper session will cover the papers that precede them. We'll also release recordings on our website for offline viewing, and we'll post the paper schedule in the coming weeks.
  • You are also welcome to present a poster at the end of the workshop, during the lunch break. Please note that CVPR workshop posters are sometimes hosted in a different room, which may be some distance from the workshop itself.
  • Please also submit the camera-ready version of your paper via CMT by June 5th (11:59 PST). Papers will be available on our website.
  • Looking forward to seeing you there!

Schedule

9:00 (CT) Welcome
9:00 - 10:15 (CT) Paper session
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Diagnosing and Treating Audio-Video Fake Detection
Marcel Klemt, Carlotta Segna, Anna Rohrbach
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee
STM2DVG: Synthetically Trained Music to Dance Video Generation leveraging Latent Diffusion Framework
No Kap Park
Q&A session
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak
AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation
Xiaohu LIU, Sascha Hornauer, Fabien Moutarde, Jialiang Lu
SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji
Q&A session
BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds
Yuto Shibata, Yusuke Oumi, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa
Visual Sound Source Localization: Assessing Performance with Both Positive and Negative Audio
Xavier Juanola, Giovana Morais, Gloria Haro, Magdalena Fuentes
VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Wieland Brendel, Matthias Bethge, Almut Sophia Koepke
Q&A session


10:15 - 10:30 (CT) Coffee Break
10:30 - 11:00 (CT) Invited talk
James Rehg
Learning to Infer Audio-Visual Attention in Social Communication
 
11:00 - 11:30 (CT) Invited talk
David Harwath
Sight and Sound with Large Language Models: Applications to Video Dubbing and Spatial Sound Understanding
 
11:30 - 12:00 (CT) Invited talk
Ziyang Chen
Learning Sight and Sound through Generative Models
 
12:00 - 12:30 (CT) Invited talk
Stella Yu
 

Organizers


Andrew Owens
University of Michigan

Jiajun Wu
Stanford

Arsha Nagrani
Google

Triantafyllos Afouras
Meta

Ruohan Gao
Meta / University of Maryland

Hang Zhao
Tsinghua University

Ziyang Chen
University of Michigan


William Freeman
MIT/Google

Andrew Zisserman
Oxford

Kristen Grauman
UT Austin / Meta

Antonio Torralba
MIT

Jean-Charles Bazin
Meta