9:00 - 9:05 (CDT) |
Welcome |
9:05 - 10:30 (CDT) |
Paper session #1 | Chair: Triantafyllos Afouras
|
| Quantized GAN for Complex Music Generation from Dance Videos | Ye Zhu, Kyle B Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Jian Ren, Yan Yan, Sergey Tulyakov |
| Synchronisation of Lips and Voices | Venkatesh Shenoy Kadandale, Juan Felipe Montesinos, Gloria Haro |
| A Model You Can Hear: Audio Classification with Playable Prototypes | Romain Loiseau, Baptiste Bouvier, Yann Teytaut, Elliot Vincent, Mathieu Aubry, Loic Landrieu |
| Audio-Visual Object Localization in Egocentric Videos | Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu |
| Q&A session |
| Audio-Visual Event Localization via Recursive Joint Co-Attention | Bin Duan, Hugo M Latapie, Gaowen Liu, Yan Yan |
| The Sound of Motion: Multimodal horse motion estimation from video and audio | Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi |
| Learning Sound Localization Better From Semantically Similar Samples | Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon |
| SVTS: Scalable Video-to-Speech Synthesis - Extended Abstract | Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Bjoern W. Schuller, Maja Pantic |
| Q&A session |
| Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio | Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian |
| Exploring a Probabilistic Approach to Vehicle Sound Source Localization in Urban Scenes | Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Bea Steers, Ali Abavisani, Juan P Bello |
| SEMI: Self-supervised Exploration via Multisensory Incongruity | Ziwen Zhuang, Jianren Wang, Hang Zhao |
| Sound Adversarial Audio-Visual Navigation | Yinfeng Yu, Changan Chen, Fuchun Sun |
| ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius |
| Q&A session |
10:30 - 11:00 (CDT) |
Coffee break & posters |
11:00 - 11:30 (CDT) |
Invited talk
|
Arsha Nagrani |
 |
11:30 - 12:00 (CDT) |
Invited talk
|
Jeannette Bohg |
 |
12:00 - 1:00 (CDT) |
Lunch |
1:00 - 2:00 (CDT) |
Paper session #2 |
|
| How to Listen? Rethinking Visual Sound Localization | Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan P Bello |
| Urban Sound & Sight: Dataset and benchmark for Audio-Visual Urban Scene Understanding | Magdalena Fuentes, Bea A Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, Juan P Bello |
| On Negative Sampling for Audio-Visual Contrastive Learning from Movies | Mahdi M. Kalayeh, Shervin Ardeshir, Nagendra Kamath, Lingyi Liu, Ashok Chandrashekar |
| Audio-visual voice separation transformer | Juan Felipe Montesinos, Venkatesh Shenoy Kadandale, Gloria Haro |
| Q&A session |
| Everything at Once - Multi-modal Fusion Transformer for Video Retrieval | Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne |
| Tap to the Beat: Cross-modal Music Beat Localization for Dancing Videos | Tianyi Ma, Yu Wu |
| Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection - Extended Abstract | Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic |
| Visual Speech Recognition for Multiple Languages | Pingchuan Ma, Stavros Petridis, Maja Pantic |
| Q&A session |
2:00 - 2:30 (CDT) |
Invited talk
|
David Brang |
 |
2:30 - 3:30 (CDT) |
Coffee break & posters |
3:30 - 4:00 (CDT) |
Invited talk
|
Carl Vondrick |
 |
4:00 - 5:00 (CDT) |
Invited paper talks | Chair: Ruohan Gao
|
| Taming visually guided sound generation | Vladimir Iashin, Esa Rahtu |
| Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis | Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu |
| Q&A session |
| Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices? | Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann |
| Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation | Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, Bolei Zhou |
| Sound and Visual Representation Learning with Multiple Pretraining Tasks | Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool |
| Active Audio-Visual Separation of Dynamic Sound Sources | Sagnik Majumder, Ziad Al-Halah, Kristen Grauman |
| Q&A session |
5:00 - 5:30 (CDT) |
Invited talk
|
Hilde Kuehne |
 |
5:30 - 6:00 (CDT) |
Invited talk
|
Pedro Morgado |
 |