9:00 - 9:05 (PST) Welcome
9:05 - 11:00 (PST) Paper session. Session chair: Arsha Nagrani

[Paper]  [Video] A Local-to-Global Approach to Multi-modal Movie Scene Segmentation Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin
[Paper]  [Video] Audio-Visual SfM towards 4D reconstruction under dynamic scenes Takashi Konno, Kenji Nishida, Katsutoshi Itoyama, Kazuhiro Nakadai
[Paper]  [Video] Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle Yapeng Tian, Di Hu, Chenliang Xu
Q&A session
[Paper]  [Video] Deep Audio Prior: Learning Sound Source Separation from a Single Audio Mixture Yapeng Tian, Chenliang Xu, Dingzeyu Li
[Paper]  [Video] Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception Yapeng Tian, Dingzeyu Li, Chenliang Xu
[Paper]  [Video] What comprises a good talking-head video generation? Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu
Q&A session
[Paper]  [Video] A Two-Stage Framework for Multiple Sound-Source Localization Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
[Paper]  [Video] BatVision with GCC-PHAT Features for Improved Sound to Vision Predictions Jesper Christensen, Sascha A Hornauer, Stella Yu
[Paper]  [Video] Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou
Q&A session
[Paper]  [Video] Does Ambient Sound Help? - Audiovisual Crowd Counting Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiaoxiang Zhu
[Paper]  [Video] An end-to-end approach for visual piano transcription A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
[Paper]  [Video] Visual Self-Supervision by Facial Reconstruction for Speech Representation Learning Abhinav Shukla, Stavros Petridis, Maja Pantic
Q&A session

11:00 - 11:30 (PST) Invited talk
Lorenzo Torresani
Self-supervised Video Models from Sound and Speech
11:30 - 12:00 (PST) Invited talk
Linda Smith
Sight, sounds, hands: Learning object names from the infant point of view
12:00 - 12:30 (PST) Invited talk
Adam Finkelstein
Optical Audio Capture: Recovering Sound from Turn-of-the-century Sonorine Postcards

12:30 - 2:00 (PST) Invited paper talks. Session chair: Ruohan Gao

[Paper]  [Video] What Makes Training Multi-Modal Classification Networks Hard? Weiyao Wang, Du Tran, Matt Feiszli
[Paper]  [Video] Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C V Jawahar
[Paper]  [Video] Multi-modal Self-Supervision from Generalized Data Transformations Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
Q&A session
[Paper]  [Video] VGGSound: A Large-Scale Audio-Visual Dataset Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
[Paper]  [Video] Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
[Paper]  [Video] Epic-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
[Paper]  [Video] Telling Left From Right: Learning Spatial Correspondence of Sight and Sound Karren Yang, Bryan Russell, Justin Salamon
Q&A session

2:00 - 2:30 (PST) Invited talk
Doug James
Advances in Audiovisual Simulation
2:30 - 3:00 (PST) Invited talk
David Harwath
Vision as a Rosetta Stone for Speech
3:00 - 3:30 (PST) Invited talk
Kristen Grauman
Sights, Sounds, and 3D Spaces


In recent years, there have been many advances in learning from visual and auditory data. While these modalities have traditionally been studied in isolation, researchers have increasingly been creating algorithms that learn from both. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound data into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. But it will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications in fields such as speech processing and video editing.

Previous workshops: 2018, 2019

Presentation instructions

  • Authors of accepted papers can present a talk of up to 5 minutes about their work. Please submit the video by June 13th (11:59pm PST) to CMT, following the CVPR oral instructions here (uploading as a .mp4 file).
  • We'll hold a paper presentation session from 9:00am to 11:00am PST on June 15. During this session, we'll play the pre-recorded talks, with time for Q&A with authors (if they are present). We'll also release the videos on our website for offline viewing.
  • Please also submit the camera-ready version of your paper via CMT by June 13th (11:59pm PST). Papers will be available on our website.
  • Looking forward to seeing you there!


Andrew Owens
University of Michigan

Jiajun Wu

Ruohan Gao
UT Austin

Arsha Nagrani

Hang Zhao

William Freeman

Andrew Zisserman

Jean-Charles Bazin

Antonio Torralba

Kristen Grauman
UT Austin / Facebook