SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Xinlei Niu1, Jing Zhang1, Christian Walder2, Charles Patrick Martin1
1 Australian National University, Canberra, Australia
2 Google DeepMind, Montreal, Canada
Abstract
We present SoundLoCD, a novel text-to-sound generation framework that incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be trained efficiently under limited computational resources. The integration of a contrastive learning strategy further strengthens the connection between textual conditions and the generated outputs, resulting in coherent, high-fidelity outputs. Our experiments demonstrate that SoundLoCD outperforms the baseline while requiring far fewer computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD.
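The abstract describes a contrastive learning strategy that ties the textual condition to the generated output. As a rough, non-authoritative sketch only (the paper's exact loss, encoders, pooling, and temperature are not given on this page), a symmetric InfoNCE-style objective between pooled text embeddings and pooled discrete-latent embeddings could look like the following; the names text_emb, latent_emb, and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_text_latent_loss(text_emb: torch.Tensor,
                                 latent_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Illustrative sketch, not SoundLoCD's actual implementation.

    text_emb, latent_emb: (batch, dim) pooled embeddings of the text
    condition and of the corresponding discrete latent sequence.
    """
    # Cosine similarities between every text/latent pair in the batch.
    text_emb = F.normalize(text_emb, dim=-1)
    latent_emb = F.normalize(latent_emb, dim=-1)
    logits = text_emb @ latent_emb.t() / temperature  # (batch, batch)

    # Matched pairs lie on the diagonal and are treated as positives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2l = F.cross_entropy(logits, targets)
    loss_l2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2l + loss_l2t)

Minimizing such a loss pulls each text embedding toward the latent of its own sample and pushes it away from the other samples in the batch, which is one plausible way to realise the "connection between textual conditions and the generated outputs" mentioned above.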
Architecture
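SoundLoCD is described as LoRA-based, which is how it stays trainable under limited computational resources. The sketch below shows the generic low-rank adaptation idea (a frozen linear layer augmented with a small trainable low-rank update); the rank, scaling, and choice of adapted layers are assumptions for illustration, not the paper's configuration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter sketch; not SoundLoCD's exact placement or settings."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Trainable low-rank factors A (rank x in) and B (out x rank).
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update x A^T B^T.
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

Only the low-rank factors are updated during training, so the number of trainable parameters (and hence memory and compute) is a small fraction of the full model.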
A comparison of samples generated by SoundLoCD, DiffSound [1] (the baseline), and the ground truth (source audio).
Each row pairs a text description with mel-spectrograms (and audio) of the SoundLoCD output, the DiffSound output, and the ground-truth source audio.

Text descriptions:
Food is frying then a woman speaks
Speaking following by laughing and clapping
A woman speaks as she rubs two objects together
A series of light horn beeps is followed by a loud steam whistle
An engine runs then a train horn sounds
A baby cries and fusses, a woman speaks, and a man speaks
Children cry and people talk
A grown man speaks as water softly runs
References
[1] Yang, Dongchao, et al. “Diffsound: Discrete diffusion model for text-to-sound generation.” IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).