Sight and Sound - CVPR 2023

Schedule

In-person: The workshop is in West 207.
Virtual: For CVPR attendees, the Zoom link can be found here. The workshop will also be streamed via YouTube.

9:00 - 9:05 (PT)	Welcome
9:05 - 10:00 (PT)	Paper session #1		Chair:
	Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision		Lingyu Zhu, Esa Rahtu, Hang Zhao
	DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment		Shentong Mo, Jing Shi, Yapeng Tian
	AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis		Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
	Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models		Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao
	Q&A session
	Audio-Visual Action Prediction with Soft-Boundary in Egocentric Videos		Luchuan Song, Jing Bi, Chao Huang, Chenliang Xu
	Toward an Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events		Kazuki Shimada, Archontis Politis, Parthasaarathy Ariyakulam Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
	Jointly Learning Visual and Auditory Speech Representations from Raw Data - Extended Abstract		Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
	Vision Transformers are Parameter-Efficient Audio-Visual Learners		Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
	Q&A session
10:00 - 10:45 (PT)	Posters & Coffee Break
10:45 - 11:15 (PT)	Invited talk	Dima Damen
11:15 - 11:45 (PT)	Invited talk	Gedas Bertasius
11:45 - 1:00 (PT)	Lunch
1:00 - 2:00 (PT)	Paper session #2
	Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels - Extended Abstract		Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic
	LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders - Extended Abstract		Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna K Ithapu, Maja Pantic
	AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation		Shentong Mo, Yapeng Tian
	Balanced Audiovisual Dataset for Imbalance Analysis		Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu
	Q&A session
	Towards Robust Image-in-Audio Deep Steganography		Jaume Ros, Margarita Geleta, Jordi Pons, Xavier Giro-i-Nieto
	CLIPSynth: Learning Text-to-audio Synthesis from Videos		Hao-Wen Dong, Gunnar A Sigurdsson, Chenyang Tao, Jiun-Yu Kao, Yu-Hsiang Lin, Anjali Narayan-Chen, Arpit Gupta, Tagyoung Chung, Jing Huang, Nanyun Peng, Wenbo Zhao
	Language-Guided Music Recommendation for Video via Prompt Analogies		Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
	Disentangled Audio-Driven NeRF: Talking Head Generation with Detailed Identity-Specific Micro expressions		Seo Young Lee, Seongsu Ha, Joonseok Lee
	Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training		Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen L James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yunhui Liu
	Q&A session
2:00 - 2:30 (PT)	Invited talk	Yapeng Tian
2:30 - 3:00 (PT)	Invited talk	Kristen Grauman
3:00 - 3:30 (PT)	Posters & Coffee Break
3:30 - 4:30 (PT)	Invited paper talks		Chair: Ruohan Gao
	Sparse in Space and Time: Audio-visual Synchronization with Trainable Selectors		Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
	MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation		Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
	Learning State-Aware Visual Representations from Audible Interactions		Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
	Q&A session
	Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations		Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu
	Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment		Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh
	Language-Guided Audio-Visual Source Separation via Trimodal Consistency		Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
	Q&A session
4:30 - 5:00 (PT)	Invited talk	Changan Chen
5:00 - 5:30 (PT)	Invited talk	Vamsi Ithapu

Presentation instructions

Previous workshops: 2018, 2019, 2020, 2021, 2022

Authors of accepted papers will present a 5-minute talk about their work. You may either present in person or submit a video. If you choose to submit a video, please upload it to CMT as an .mp4 supplementary file, alongside the PDF of your paper, by June 15th (11:59 PST).
We'll have two paper presentation sessions: 9-10am and 1-2pm. Each session will be a mix of in-person and video presentations. Throughout the paper sessions, there will be short Q&A sessions for all of the papers that precede them. We'll also release recordings on our website for offline viewing. We'll post the paper schedule in the coming weeks.
You are also welcome to present a poster during the lunch and coffee breaks. Unfortunately, we are unable to offer a hybrid option for posters.
Please also submit the camera-ready version of your paper via CMT by June 10th (11:59 PST). Papers will be available on our website.
Looking forward to seeing you there!