Latent Fourier Transform
Abstract
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative audio models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking in the latent frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at user-specified timescales. LatentFT parallels the role of the equalizer in audio production: whereas a traditional equalizer operates on audible frequencies to shape timbre, LatentFT operates on latent frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence and quality compared to baselines. We also present a technique for hearing latent frequencies in isolation, and show that different musical attributes reside in different regions of the latent spectrum. Our results show how frequency-domain control in latent space provides an intuitive, continuous frequency axis for conditioning and blending, advancing us toward more interpretable and interactive generative audio models.
Method
The Latent Fourier Transform. We encode audio into a series of latent vectors and take their Fourier transform, resulting in a latent spectrum. During training (red), this spectrum is masked randomly and used to reconstruct the input. During inference (blue), the user specifies a spectral mask, which selects features from the input at specific timescales and conditions a generative process.
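To make the FFT-mask-inverse-FFT step concrete, here is a minimal PyTorch sketch. It assumes the encoder has already produced a latent sequence of shape (batch, time, dim); the function name latent_fourier_mask, the band-based mask parameterization, and the random-band sampling scheme are illustrative assumptions, not the exact design used in LatentFT.

```python
import torch

def latent_fourier_mask(latents, keep_band=(0.0, 1.0), training=False):
    """Mask a latent sequence in the latent frequency domain.

    latents:   (batch, time, dim) latent vectors from the audio encoder
    keep_band: (low, high) fraction of the latent spectrum to keep, in [0, 1]
    training:  if True, sample a random band instead of using keep_band

    Hypothetical sketch: the encoder, decoder, and diffusion model live
    elsewhere; this only shows FFT -> spectral mask -> inverse FFT.
    """
    B, T, D = latents.shape

    # Latent spectrum: FFT along the time axis of the latent sequence.
    spectrum = torch.fft.rfft(latents, dim=1)          # (B, T//2 + 1, D)
    n_freqs = spectrum.shape[1]

    if training:
        # Random band, so the model learns to reconstruct from evidence
        # restricted to arbitrary timescales (assumed sampling scheme).
        lo, hi = torch.rand(2).sort().values.tolist()
    else:
        lo, hi = keep_band

    # Binary mask over latent frequencies; everything outside the band is zeroed.
    freqs = torch.arange(n_freqs, dtype=torch.float32) / max(n_freqs - 1, 1)
    mask = ((freqs >= lo) & (freqs <= hi)).float()     # (n_freqs,)
    masked_spectrum = spectrum * mask.view(1, n_freqs, 1)

    # Back to the latent time domain; this masked sequence is what
    # conditions reconstruction (training) or generation (inference).
    return torch.fft.irfft(masked_spectrum, n=T, dim=1)
```

At inference, the same routine would be called with a user-specified keep_band, so the generative process is conditioned only on features of the reference clip at the corresponding timescales.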
Conditional Generation
We show several qualitative examples of conditional generation. First, we play the input music clip: