SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model

Return index page

Additional Ablation Study

This page provides additional details and ablation study on Convex CFG scheduling used in SoundMorpher.



Background for Classifier-free Guidance (CFG).

Controllable generation can be achieved by using guidance at each sampling step in diffusion model. When a conditional and unconditional diffusion models are jointly trained, samples can be obtained by CFG [1]. In AudioLDM2 [2], the the conditional and unconditional noise esitimation becomes
\[ \hat{\epsilon}_{\theta}(z_t,t,E) := w \epsilon_\theta(z_t,t,E) + (1-w) \epsilon_\theta (z_t,t,\varnothing) \]
where \(w\) determines the guidance scale.

Convex CFG scheduling.

Followin [3], we involve a convex CFG scheduling in SoundMorpher to boost morphing quality which is defined as
\[ w_\alpha = w_{max} - 2(w_{max} - w_{min})|\alpha-0.5| \]
where \(w\) is the guidance scale, \(\alpha\) is the morph factor.\(w_{max}\) and \(w_{min}\) are predefined maximum and minimum guidance scales.

Ablation study on classifier-free guidance (CFG) scales

In this experiment, we explore impacts of CFG scales on SoundMorpher, we conduct an ablation study on music morphing task with \(N=15\) on different sets of max-min CFG scales in TABLE I. According to our experimental results, maximum scale controls correspondence quality and smoothness quality of morphed results, whereas higher maximum scale leads to a lower \(MFCCs_{\mathcal{E}}\) and higher \(CDPAM_{mean} \pm CDPA_{std}\). In contrast, minimum scale controls intermediate quality of morphed results, where higher minimum scale leads to higher \(CDPAM_{T}\).
Table1

Example 1

Source music

Spectrogram

Target music

Spectrogram




\(w_{max} = 3, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 4, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 5, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 4, w_{min} = 2\)

Spectrogram

\(w_{max} = 3, w_{min} = 5\)

Spectrogram

Example 2

Source music

Spectrogram

Target music

Spectrogram




\(w_{max} = 3, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 4, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 5, w_{min} = 1.5\)

Spectrogram

\(w_{max} = 4, w_{min} = 2\)

Spectrogram

\(w_{max} = 3, w_{min} = 5\)

Spectrogram
Return index page

Reference

[1] Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

[2] Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., ... & Plumbley, M. D. (2024). Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3] Yang, Z., Yu, Z., Xu, Z., Singh, J., Zhang, J., Campbell, D., ... & Hartley, R. (2023). Impus: Image morphing with perceptually-uniform sampling using diffusion models. arXiv preprint arXiv:2311.06792.