Demonstration

Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech



This is the demonstration page of our method. Please note that, due to limitations in the training data, the lip movements in the videos may not be synchronized with the generated speech. All audio samples are generated at a 16 kHz sampling rate. This page is intended for academic research purposes only.


Application 1: Video to Audio with Context-Aware Speech

This section demonstrates the Video to Audio with Context-Aware Speech task: given a prompt speaker, a transcript, and a video, the model generates audio that matches the visual context and contains intelligible speech.

Source Video | Prompt Speaker & Transcript | BVS (Ours) | VATT [1]

Transcript: Good football team to open goals here. What a way to do it. Well, we had one goal come to this team. To make sure 16 teams left.

Transcript: Good boy. Good boy. Come on. Good boy. That's a good boy.

Transcript: So it's going to create the overlock stitch and cut off the excess fabric at the same time. It's pretty cool.

Transcript: That bass is so much bigger than you do. I mean, it's so huge. Huge. Yes! AJ, I got one thing to tell you. Don't be jealous.

Transcript: You are wrong. Our children will not live under communism. Your children will live under freedom.

Transcript: We've got the beach to ourselves! Oh, the beach? You need a tap of that. Twenty pence please. Here you are.

Transcript: Dark honey. These are honey cells here so... Yeah, there's a brood in here.

Transcript: tired of sitting around I'm here with Terry from another edge driver in truck 138 and he's fired up the barbecue he's cooking fajitas

Transcript: Is he talking? Ew! Buddy, no! He's talking.

Application 2: Immersive Audio Background Conversion

This section demonstrates the immersive audio background conversion task, where an audio recording and a video are provided, and the speech content is adapted to match the context of the given video.

Source Video | Target Audio | Converted Audio

Audio samples for our empirical finding

This section provides the audio samples used to plot Figure 3.

Ground Truth

Spectrogram GT

Reconstructed sample of audio semantic token

Spectrogram rec

Generated sample of speech SSL token

Spectrogram speech

Generated sample of the residual embedding of audio semantic token and speech SSL token

Spectrogram residual
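For readers who want to inspect the samples in this section themselves, the following is a minimal sketch of how a log-magnitude spectrogram could be computed from a 16 kHz waveform. It is not the plotting code used for Figure 3: the function name `log_spectrogram`, the FFT size, and the hop length are assumptions chosen for illustration, and a synthetic tone stands in for a downloaded sample.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the page states that all samples are generated at 16 kHz
N_FFT = 512           # assumed analysis window length (32 ms at 16 kHz)
HOP = 128             # assumed hop between frames (8 ms at 16 kHz)


def log_spectrogram(wave, n_fft=N_FFT, hop=HOP, eps=1e-10):
    """Return a log-magnitude STFT of shape (n_fft // 2 + 1, n_frames), in dB."""
    wave = np.asarray(wave, dtype=np.float64)
    if wave.size < n_fft:
        # Zero-pad very short inputs so that at least one frame exists.
        wave = np.pad(wave, (0, n_fft - wave.size))
    window = np.hanning(n_fft)
    n_frames = 1 + (wave.size - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # eps avoids log(0) on silent frames.
    return 20.0 * np.log10(magnitude + eps).T


# Synthetic stand-in for an audio sample: one second of a 440 Hz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * SAMPLE_RATE / N_FFT  # centre frequency of the strongest bin
```

Applying the same function to the ground-truth, reconstructed, speech, and residual samples would give four arrays that can be compared side by side, provided all four are loaded at the same sampling rate.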

Reference:

[1] Liu, Xiulong, Kun Su, and Eli Shlizerman. "Tell What You Hear From What You See - Video to Audio Generation Through Text." Advances in Neural Information Processing Systems 37 (2024): 101337-101366.