9:00 - 9:05 (PT) |
Welcome |
9:05 - 10:00 (PT) |
Paper session #1 |
|
| Laughing Matters: Introducing Audio-Driven Laughing-Face Generation with Diffusion Models - Extended Abstract | Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic |
| Can CLIP Help Visual Sound Localization? | Sooyoung Park, Arda Senocak, Joon Son Chung |
| Learning Continual Audio-Visual Sound Separation Models | Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian |
| BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition - Extended Abstract | Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic |
| Q&A session |
| Audio-Visual Autism Behavior Recognition with Multimodal Large Language Models | Shijian Deng, Erin Kosloski, Siddhi Patel, Zeke A Barnett, Yiyang Nan, Alexander M Kaplan, Sisira Aarukapalli, William Doan, Matthew Wang, Harsh Singh, Pamela Rollins, Yapeng Tian |
| Dataset distillation for audio-visual datasets | Saksham Singh Kushwaha, Siva Sai Nagender Vasireddy, Kai Wang, Yapeng Tian |
| AVQA-CoT: When CoT Meets Question Answering in Audio-Visual Scenarios | Guangyao Li, Henghui Du, Di Hu |
| Q&A session |
10:00 - 10:30 (PT) |
Posters & Coffee Break |
10:30 - 11:15 (PT) |
Paper session #2 |
|
| ViSpeR: Multilingual Visual Speech Recognition | Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Eustache Le Bihan, Ankit Singh, Hakim Hacid |
| Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning | Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi M. Kalayeh |
| Q&A session |
| AVHuMAR: Audio-Visual Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy | Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng |
| AV-Mamba: Cross-Modality Selective State Space Models for Audio-Visual Question Answering | Ziru Huang, Jia Li, Wenjie Zhao, Yunhui Guo, Yapeng Tian |
| SparseVSR: Lightweight and Noise Robust Visual Speech Recognition - Extended Abstract | Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, Maja Pantic |
| Q&A session |
11:15 - 11:45 (PT) |
Invited talk
|
Alexander Richard |
 |
11:45 - 1:00 (PT) |
Lunch |
1:00 - 2:00 (PT) |
Invited papers | Chair: Ziyang Chen
|
| SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos | Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman |
| Q&A session |
| TIM: A Time Interval Machine for Audio-Visual Action Recognition | Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen |
| The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective | Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao |
| Q&A session |
| MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models | Sanjoy Chowdhury, Sayan Nag, Joseph K J, Balaji Vasan Srinivasan, Dinesh Manocha |
| Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos | Sagnik Majumder, Ziad Al-Halah, Kristen Grauman |
| Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman |
| Q&A session |
2:00 - 2:30 (PT) |
Invited talk
|
Ruohan Gao |
 |
2:30 - 3:00 (PT) |
Invited talk
|
Shyam Gollakota |
 |
3:00 - 3:30 (PT) |
Coffee Break |
3:30 - 4:00 (PT) |
Invited talk
|
Hilde Kuehne |
 |
4:00 - 4:30 (PT) |
Invited talk
|
Samuel Clarke |
 |
4:30 - 5:00 (PT) |
Invited talk
|
Tengda Han |
 |