SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Xinlei Niu1, Jing Zhang1, Christian Walder2, Charles Patrick Martin1
1 Australian National University, Canberra, Australia
2 Google DeepMind, Montreal, Canada

Abstract

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between textual conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD.
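To make the two ingredients named in the abstract more concrete, here is a minimal, illustrative PyTorch sketch of (a) a LoRA low-rank adapter wrapped around a frozen linear layer and (b) a symmetric InfoNCE-style contrastive loss that ties text embeddings to sound-token embeddings. The names (`LoRALinear`, `contrastive_loss`) and all hyperparameters are assumptions chosen for illustration; this is not the SoundLoCD implementation, only a sketch of the general techniques.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer augmented with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # keep pretrained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the scaled low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def contrastive_loss(text_emb: torch.Tensor, sound_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired text/sound embeddings together."""
    text_emb = F.normalize(text_emb, dim=-1)
    sound_emb = F.normalize(sound_emb, dim=-1)
    logits = text_emb @ sound_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this kind of setup, only the small `lora_a`/`lora_b` matrices are updated during fine-tuning, which is what keeps the training cost low, while the contrastive term encourages generated sound representations to stay aligned with their text conditions.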

Architecture

A comparison of samples generated by SoundLoCD and DiffSound [1] (baseline) against the ground truth (source audio).

For each text description below, mel-spectrograms are shown for SoundLoCD, DiffSound, and the ground truth:

- Food is frying then a woman speaks
- Speaking following by laughing and clapping
- A woman speaks as she rubs two objects together
- A series of light horn beeps is followed by a loud steam whistle
- An engine runs then a train horn sounds
- A baby cries and fusses, a woman speaks, and a man speaks
- Children cry and people talk
- A grown man speaks as water softly runs

References

[1] Yang, Dongchao, et al. "Diffsound: Discrete Diffusion Model for Text-to-Sound Generation." IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.