In recent years, there have been many advances in learning from visual and auditory data. While traditionally these modalities have been studied in isolation, researchers have increasingly been creating algorithms that learn from both modalities. This has produced many exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every video has an audio track, the prospect of learning from paired audio-visual data (either with new forms of unsupervised learning, or by simply incorporating sound data into existing vision algorithms) is intuitively appealing, and this workshop will cover recent advances in this direction. It will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications in fields such as speech processing and video editing.


This is a half-day workshop that will take place in the afternoon on Friday, June 22, 2018.

1:30 - 1:35 Welcome
1:35 - 2:00 Paper Session 1
2:00 - 2:30 Invited Talk: Antonio Torralba (MIT)
2:30 - 3:00 Invited Talk: Joon Son Chung (Oxford)
3:00 - 3:30 Paper Session 2
3:30 - 4:00 Afternoon Break / Posters
4:00 - 4:30 Invited Talk: William Freeman (MIT/Google)
4:30 - 5:00 Invited Talk: Relja Arandjelović (DeepMind)
5:00 - 5:30 Paper Session 3

Call for papers

We're inviting submissions! If you're interested in presenting a poster or giving a talk, please submit a short paper to CMT by May 1st at 11:59 PST, using this LaTeX template. We encourage submissions of work that has already been accepted at other venues, as well as new, work-in-progress submissions. Papers must be at most 4 pages, including references (a 1- or 2-page extended abstract is also fine). Accepted papers will appear on this site and on the CVF website (but not in the IEEE or CVPR proceedings). Since the papers in this workshop are at most 4 pages long, they can also be submitted to next year's CVPR.
We are looking for work that involves vision and sound. For example, the following topics would be in scope:
  • Lip-reading
  • Intuitive physics with sound
  • Audio-visual scene understanding
  • Sound-from-vision and vision-from-sound
  • Audio-visual self-supervised learning
  • Semi-supervised learning
  • Video-to-music alignment
  • Video editing and movie trailer generation
  • Material recognition
  • Vision-inspired audio convolutional networks
  • Sound localization
  • Audio-visual speech processing