Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025
- Authors of accepted papers will give a 5-minute talk about their work. You may either present in person or submit a video. If you choose the video option, please upload it to CMT as an .mp4 supplementary file, alongside the PDF of your paper, by June 2nd (11:59 PST).
- We'll have a paper presentation session from 9am - 11:30am, with a mix of in-person and video presentations. Short Q&A sessions will be held throughout the paper session, each covering the papers presented since the previous Q&A. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
- Looking forward to seeing you there!
Schedule
| 9:00 (MT) | Welcome | ||
| 9:00 - 10:00 (MT) | Paper session #1 | ||
| Omni-MMSI: Toward Identity-attributed Social Interaction Understanding | Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James M. Rehg, Yapeng Tian | ||
| MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video | Kazuya Tateishi | ||
| MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation | Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi, Yuki Mitsufuji | ||
| Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation? | Susan Liang, Chao Huang, Filippos Bellos, Yolo Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu | ||
| DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization | Son Nguyen, Thanh Tran, Jeongsoo Choi, Nghia Huynh, Son Hy, Van Nguyen | ||
| Few-shot Acoustic Synthesis with Multimodal Flow Matching | Amandine Brunetto | ||
| 10:00 - 10:45 (MT) | Coffee Break | ||
| 10:45 - 11:30 (MT) | Paper session #2 | ||
| Precise Video-to-Audio Generation with Cross-Modal Alignment in Latent Space | Thanh Tran, Son Nguyen, Luong Tran, Khanh Pham, Paarth Neekhara, Shehzeen Hussain, Van Nguyen | ||
| CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation | Gyubin Lee, Junwon Lee, Juhan Nam | ||
| Do Audio-Visual Large Language Models Really See and Hear? | Ramaneswaran Selvakumar, Kaousheik Jayakumar, Sakshi S, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha | ||
| Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs | Tianle Chen, Chaitanya Chakka, Arjun Akula, Xavier Thomas, Deepti Ghadiyaram | ||
| Rethinking Video-and-Text-to-Audio Generation through Multimodal Coverage | Sangyeop Yeo, Yoojin Jang, Saad Mourafik, Arda Senocak, Jaejun Yoo | ||
| 11:30 - 12:00 (MT) | Invited talk | Sophia Koepke |
| 12:00 - 1:30 (MT) | Lunch | ||
| 1:30 - 2:00 (MT) | Invited talk | Dinesh Manocha |
| 2:00 - 2:30 (MT) | Invited talk | Eli Shlizerman |
| 2:30 - 3:00 (MT) | Invited talk | Ruohan Gao |
| 3:00 - 3:30 (MT) | Coffee Break | ||
| 3:30 - 4:00 (MT) | Invited talk | Yake Wei |
| 4:00 - 5:00 (MT) | Panel discussion | ||
Organizers
- Andrew Owens, University of Michigan
- Jiajun Wu, Stanford
- Arsha Nagrani
- Triantafyllos Afouras, Meta
- Ruohan Gao, Meta / University of Maryland
- Hang Zhao, Tsinghua University
- Ziyang Chen, University of Michigan
- William Freeman, MIT / Google
- Andrew Zisserman, Oxford
- Kristen Grauman, UT Austin / Meta
- Antonio Torralba, MIT
- Jean-Charles Bazin, Meta




