Please click here to see last year's workshop (at CVPR 2018).

In recent years, there have been many advances in learning from visual and auditory data. While traditionally these modalities have been studied in isolation, researchers have increasingly been creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every video has an audio track, the prospect of learning from paired audio-visual data — either through new forms of unsupervised learning, or by simply incorporating sound into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision does not, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in the fields of speech processing and video editing.

Call for papers

We'll be taking paper submissions! Please check back in the spring for more details. We're looking for work that involves vision and sound. For example, the following topics would be in scope:
  • Lip-reading
  • Intuitive physics with sound
  • Audio-visual scene understanding
  • Sound-from-vision and vision-from-sound
  • Audio-visual self-supervised learning
  • Semi-supervised learning
  • Video-to-music alignment
  • Video editing and movie trailer generation
  • Material recognition
  • Vision-inspired audio convolutional networks
  • Sound localization
  • Audio-visual speech processing