In-person: The workshop is in West 207.
Virtual: For CVPR attendees, the Zoom link can be found here. The workshop will also be streamed via YouTube.

9:00 - 9:05 (PT) Welcome
9:05 - 10:00 (PT) Paper session #1

Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision Lingyu Zhu, Esa Rahtu, Hang Zhao
DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment Shentong Mo, Jing Shi, Yapeng Tian
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao
Q&A session
Audio-Visual Action Prediction with Soft-Boundary in Egocentric Videos Luchuan Song, Jing Bi, Chao Huang, Chenliang Xu
Toward an Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events Kazuki Shimada, Archontis Politis, Parthasaarathy Ariyakulam Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
Jointly Learning Visual and Auditory Speech Representations from Raw Data - Extended Abstract Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Vision Transformers are Parameter-Efficient Audio-Visual Learners Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
Q&A session
10:00 - 10:45 (PT) Posters & Coffee Break
10:45 - 11:15 (PT) Invited talk
Dima Damen
11:15 - 11:45 (PT) Invited talk
Gedas Bertasius
11:45 - 1:00 (PT) Lunch
1:00 - 2:00 (PT) Paper session #2

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels - Extended Abstract Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders - Extended Abstract Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna K Ithapu, Maja Pantic
AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation Shentong Mo, Yapeng Tian
Balanced Audiovisual Dataset for Imbalance Analysis Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu
Q&A session
Towards Robust Image-in-Audio Deep Steganography Jaume Ros, Margarita Geleta, Jordi Pons, Xavier Giro-i-Nieto
CLIPSynth: Learning Text-to-audio Synthesis from Videos Hao-Wen Dong, Gunnar A Sigurdsson, Chenyang Tao, Jiun-Yu Kao, Yu-Hsiang Lin, Anjali Narayan-Chen, Arpit Gupta, Tagyoung Chung, Jing Huang, Nanyun Peng, Wenbo Zhao
Language-Guided Music Recommendation for Video via Prompt Analogies Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
Disentangled Audio-Driven NeRF: Talking Head Generation with Detailed Identity-Specific Micro expressions Seo Young Lee, Seongsu Ha, Joonseok Lee
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen L James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yunhui Liu
Q&A session
2:00 - 2:30 (PT) Invited talk
Yapeng Tian
2:30 - 3:00 (PT) Invited talk
Kristen Grauman
3:00 - 3:30 (PT) Posters & Coffee Break
3:30 - 4:30 (PT) Invited paper talks
Chair: Ruohan Gao

Sparse in Space and Time: Audio-visual Synchronization with Trainable Selectors Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
Learning State-Aware Visual Representations from Audible Interactions Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
Q&A session
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh
Language-Guided Audio-Visual Source Separation via Trimodal Consistency Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
Q&A session

4:30 - 5:00 (PT) Invited talk
Changan Chen
5:00 - 5:30 (PT) Invited talk
Vamsi Ithapu

Presentation instructions

Previous workshops: 2018, 2019, 2020, 2021, 2022

  • Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. For the latter option, please upload the video as an .mp4 file to CMT as a supplementary file, alongside the PDF of your paper, by June 15th (11:59 PST).
  • We'll have two paper presentation sessions: 9-10am and 1-2pm. Each session will be a mix of in-person and video presentations. Each session also includes short Q&A sessions covering the papers that precede them. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
  • You are welcome to present a poster during the lunch and coffee breaks. Unfortunately, we are unable to offer a hybrid option for posters.
  • Please also submit the camera-ready version of your paper via CMT by June 10th (11:59 PST). Papers will be available on our website.
  • Looking forward to seeing you there!


Organizers

Andrew Owens
University of Michigan

Jiajun Wu

Arsha Nagrani

Triantafyllos Afouras

Ruohan Gao

Hang Zhao

William Freeman

Andrew Zisserman

Kristen Grauman
UT Austin / Meta

Antonio Torralba

Jean-Charles Bazin

On-site coordinators

Ziyang Chen
University of Michigan

Changan Chen
UT Austin