9:00 - 9:05 (PT) |
Welcome |
9:05 - 10:00 (PT) |
Paper session #1 | Chair:
|
| Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision | Lingyu Zhu, Esa Rahtu, Hang Zhao |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | Shentong Mo, Jing Shi, Yapeng Tian |
| AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis | Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu |
| Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models | Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao |
| Q&A session |
| Audio-Visual Action Prediction with Soft-Boundary in Egocentric Videos | Luchuan Song, Jing Bi, Chao Huang, Chenliang Xu |
| Toward an Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events | Kazuki Shimada, Archontis Politis, Parthasaarathy Ariyakulam Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji |
| Jointly Learning Visual and Auditory Speech Representations from Raw Data - Extended Abstract | Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic |
| Vision Transformers are Parameter-Efficient Audio-Visual Learners | Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius |
| Q&A session |
10:00 - 10:45 (PT) |
Posters & Coffee Break |
10:45 - 11:15 (PT) |
Invited talk
|
Dima Damen |
 |
11:15 - 11:45 (PT) |
Invited talk
|
Gedas Bertasius |
 |
11:45 - 1:00 (PT) |
Lunch |
1:00 - 2:00 (PT) |
Paper session #2 |
|
| Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels - Extended Abstract | Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic |
| LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders - Extended Abstract | Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna K Ithapu, Maja Pantic |
| AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation | Shentong Mo, Yapeng Tian |
| Balanced Audiovisual Dataset for Imbalance Analysis | Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu |
| Q&A session |
| Towards Robust Image-in-Audio Deep Steganography | Jaume Ros, Margarita Geleta, Jordi Pons, Xavier Giro-i-Nieto |
| CLIPSynth: Learning Text-to-audio Synthesis from Videos | Hao-Wen Dong, Gunnar A Sigurdsson, Chenyang Tao, Jiun-Yu Kao, Yu-Hsiang Lin, Anjali Narayan-Chen, Arpit Gupta, Tagyoung Chung, Jing Huang, Nanyun Peng, Wenbo Zhao |
| Language-Guided Music Recommendation for Video via Prompt Analogies | Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell |
| Disentangled Audio-Driven NeRF: Talking Head Generation with Detailed Identity-Specific Micro expressions | Seo Young Lee, Seongsu Ha, Joonseok Lee |
| Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training | Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen L James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yunhui Liu |
| Q&A session |
2:00 - 2:30 (PT) |
Invited talk
|
Yapeng Tian |
 |
2:30 - 3:00 (PT) |
Invited talk
|
Kristen Grauman |
 |
3:00 - 3:30 (PT) |
Posters & Coffee Break |
3:30 - 4:30 (PT) |
Invited paper talks | Chair: Ruohan Gao
|
| Sparse in Space and Time: Audio-visual Synchronization with Trainable Selectors | Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman |
| MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo |
| Learning State-Aware Visual Representations from Audible Interactions | Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta |
| Q&A session |
| Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations | Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu |
| Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment | Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh |
| Language-Guided Audio-Visual Source Separation via Trimodal Consistency | Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko |
| Q&A session |
4:30 - 5:00 (PT) |
Invited talk
|
Changan Chen |
 |
5:00 - 5:30 (PT) |
Invited talk
|
Vamsi Ithapu |
 |