Presentation instructions

Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025

  • Authors of accepted papers will give a 5-minute talk about their work. You may present either in person or by video. If you choose video, please upload it to CMT as an .mp4 supplementary file, alongside the PDF of your paper, by June 2nd (11:59 PST).
  • We'll have a paper presentation session from 9am to 11:30am, with a mix of in-person and video presentations. Short Q&A sessions will be held throughout the paper session, each covering the papers presented just before it. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
  • You are also welcome to present a poster at the end of the workshop, during the lunch break. Please note that CVPR workshop posters are sometimes hosted in a different room, which may be some distance from the workshop itself.
  • Looking forward to seeing you there!

Partial Schedule

9:00 (MT) Welcome
9:00 - 10:00 (MT) Paper session #1
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
  Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James M. Rehg, Yapeng Tian
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
  Kazuya Tateishi
MMAudioReverbs: Video‑Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation
  Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi, Yuki Mitsufuji
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
  Susan Liang, Chao Huang, Filippos Bellos, Yolo Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
  Son Nguyen, Thanh Tran, Jeongsoo Choi, Nghia Huynh, Son Hy, Van Nguyen
Few-shot Acoustic Synthesis with Multimodal Flow Matching
  Amandine Brunetto
10:00 - 10:45 (MT) Coffee Break
10:45 - 11:30 (MT) Paper session #2
Precise Video-to-Audio Generation with Cross-Modal Alignment in Latent Space
  Thanh Tran, Son Nguyen, Luong Tran, Khanh Pham, Paarth Neekhara, Shehzeen Hussain, Van Nguyen
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
  Gyubin Lee, Junwon Lee, Juhan Nam
Do Audio-Visual Large Language Models Really See and Hear?
  Ramaneswaran Selvakumar, Kaousheik Jayakumar, Sakshi S, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
  Tianle Chen, Chaitanya Chakka, Arjun Akula, Xavier Thomas, Deepti Ghadiyaram
Rethinking Video-and-Text-to-Audio Generation through Multimodal Coverage
  Sangyeop Yeo, Yoojin Jang, Saad Mourafik, Arda Senocak, Jaejun Yoo

Organizers


Andrew Owens
University of Michigan

Jiajun Wu
Stanford

Arsha Nagrani
Google

Triantafyllos Afouras
Meta

Ruohan Gao
Meta / University of Maryland

Hang Zhao
Tsinghua University

Ziyang Chen
University of Michigan


William Freeman
MIT / Google

Andrew Zisserman
Oxford

Kristen Grauman
UT Austin / Meta

Antonio Torralba
MIT

Jean-Charles Bazin
Meta