Schedule
9:00 - 9:05 (PST) | Welcome
9:05 - 11:00 (PST) | Paper session | Session chair: Arsha Nagrani
[Paper] [Video] | A Local-to-Global Approach to Multi-modal Movie Scene Segmentation | Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin
[Paper] [Video] | Audio-Visual SfM towards 4D reconstruction under dynamic scenes | Takashi Konno, Kenji Nishida, Katsutoshi Itoyama, Kazuhiro Nakadai
[Paper] [Video] | Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle | Yapeng Tian, Di Hu, Chenliang Xu
Q&A session
[Paper] [Video] | Deep Audio Prior: Learning Sound Source Separation from a Single Audio Mixture | Yapeng Tian, Chenliang Xu, Dingzeyu Li
[Paper] [Video] | Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception | Yapeng Tian, Dingzeyu Li, Chenliang Xu
[Paper] [Video] | What comprises a good talking-head video generation? | Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu
Q&A session
[Paper] [Video] | A Two-Stage Framework for Multiple Sound-Source Localization | Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
[Paper] [Video] | BatVision with GCC-PHAT Features for Improved Sound to Vision Predictions | Jesper Christensen, Sascha A Hornauer, Stella Yu
[Paper] [Video] | Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning | Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou
Q&A session
[Paper] [Video] | Does Ambient Sound Help? - Audiovisual Crowd Counting | Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiaoxiang Zhu
[Paper] [Video] | An end-to-end approach for visual piano transcription | A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
[Paper] [Video] | Visual Self-Supervision by Facial Reconstruction for Speech Representation Learning | Abhinav Shukla, Stavros Petridis, Maja Pantic
Q&A session
11:00 - 11:30 (PST) | Invited talk [Video] | Lorenzo Torresani: Self-supervised Video Models from Sound and Speech
11:30 - 12:00 (PST) | Invited talk [Video] | Linda Smith: Sight, sounds, hands: Learning object names from the infant point of view
12:00 - 12:30 (PST) | Invited talk [Video] | Adam Finkelstein: Optical Audio Capture: Recovering Sound from Turn-of-the-century Sonorine Postcards
12:30 - 2:00 (PST) | Invited paper talks | Session chair: Ruohan Gao
[Paper] [Video] | What Makes Training Multi-Modal Classification Networks Hard? | Weiyao Wang, Du Tran, Matt Feiszli
[Paper] [Video] | Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis | K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C V Jawahar
[Paper] [Video] | Multi-modal Self-Supervision from Generalized Data Transformations | Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
Q&A session
[Paper] [Video] | VGGSound: A Large-Scale Audio-Visual Dataset | Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
[Paper] [Video] | Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds | Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
[Paper] [Video] | Epic-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition | Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
[Paper] [Video] | Telling Left From Right: Learning Spatial Correspondence of Sight and Sound | Karren Yang, Bryan Russell, Justin Salamon
Q&A session
2:00 - 2:30 (PST) | Invited talk [Video] | Doug James: Advances in Audiovisual Simulation
2:30 - 3:00 (PST) | Invited talk [Video] | David Harwath: Vision as a Rosetta Stone for Speech
3:00 - 3:30 (PST) | Invited talk [Video] | Kristen Grauman: Sights, Sounds, and 3D Spaces
Summary
In recent years, there have been many advances in learning from visual and auditory data. While these modalities have traditionally been studied in isolation, researchers are increasingly creating algorithms that learn from both. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.
Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is intuitively appealing, and this workshop will cover recent advances in this direction. But it will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, in areas such as speech processing and video editing.
Previous workshops: 2018, 2019
Presentation instructions
Organizers
Andrew Owens
University of Michigan
Jiajun Wu
Stanford
Ruohan Gao
UT Austin
Arsha Nagrani
Oxford
Hang Zhao
Waymo
William Freeman
MIT/Google
Andrew Zisserman
Oxford
Jean-Charles Bazin
KAIST
Antonio Torralba
MIT
Kristen Grauman
UT Austin / Facebook