Presentation instructions

Previous workshops: 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025

  • Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. If you choose the video option, please upload it to CMT as a .mp4 file by June 2nd (11:59 PST), as a supplementary file alongside the PDF for your paper.
  • We'll have a paper presentation session from 9am - 11:30am, with a mix of in-person and video presentations. Short Q&A sessions will be held throughout the session, each covering the papers presented just before it. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
  • Looking forward to seeing you there!

Schedule

9:00 (MT) Welcome
9:00 - 10:00 (MT) Paper session #1
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James M. Rehg, Yapeng Tian
MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
Kazuya Tateishi
MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation
Akira Takahashi, Ryosuke Sawata, Shusuke Takakahashi, Yuki Mitsufuji
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
Susan Liang, Chao Huang, Filippos Bellos, Yolo Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
Son Nguyen, Thanh Tran, Jeongsoo Choi, Nghia Huynh, Son Hy, Van Nguyen
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Amandine Brunetto
10:00 - 10:45 (MT) Coffee Break
10:45 - 11:30 (MT) Paper session #2
Precise Video-to-Audio Generation with Cross-Modal Alignment in Latent Space
Thanh Tran, Son Nguyen, Luong Tran, Khanh Pham, Paarth Neekhara, Shehzeen Hussain, Van Nguyen
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
Gyubin Lee, Junwon Lee, Juhan Nam
Do Audio-Visual Large Language Models Really See and Hear?
Ramaneswaran Selvakumar, Kaousheik Jayakumar, Sakshi S, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Tianle Chen, Chaitanya Chakka, Arjun Akula, Xavier Thomas, Deepti Ghadiyaram
Rethinking Video-and-Text-to-Audio Generation through Multimodal Coverage
Sangyeop Yeo, Yoojin Jang, Saad Mourafik, Arda Senocak, Jaejun Yoo
11:30 - 12:00 (MT) Invited talk
Sophia Koepke
12:00 - 1:30 (MT) Lunch
1:30 - 2:00 (MT) Invited talk
Dinesh Manocha
 
2:00 - 2:30 (MT) Invited talk
Eli Shlizerman
 
2:30 - 3:00 (MT) Invited talk
Ruohan Gao
 
3:00 - 3:30 (MT) Coffee Break
3:30 - 4:00 (MT) Invited talk
Yake Wei
 
4:00 - 5:00 (MT) Panel discussion

Organizers


Andrew Owens
University of Michigan

Jiajun Wu
Stanford

Arsha Nagrani
Google

Triantafyllos Afouras
Meta

Ruohan Gao
Meta / University of Maryland

Hang Zhao
Tsinghua University

Ziyang Chen
University of Michigan


William Freeman
MIT / Google

Andrew Zisserman
Oxford

Kristen Grauman
UT Austin / Meta

Antonio Torralba
MIT

Jean-Charles Bazin
Meta