The introduction of diffusion-based generative models has revolutionized the field of generative AI over the last few years, leading to rapid improvements in the quality and controllability of generated images, video, and audio. Diffusion models that operate in the latent space of a pre-trained autoencoder, termed “latent diffusion models”, significantly speed up both training and inference.
One of the main issues with generating audio using diffusion models is that they are usually trained to generate a fixed-size output. For example, an audio diffusion model trained on 30-second audio clips will only be able to generate audio in 30-second chunks. This is a problem when the training data and the desired outputs vary greatly in length, as is the case when generating full songs.
Audio diffusion models are typically trained on chunks taken at random positions from longer audio files and cropped or padded to fit the diffusion model’s training length. In the case of music, the model consequently tends to generate arbitrary sections of a song, which may start or end in the middle of a musical phrase.
We introduce Stable Audio, a latent diffusion model architecture for audio conditioned on text metadata as well as audio file duration and start time, allowing for control over the content and length of the generated audio. This additional timing conditioning allows us to generate audio of a specified length up to the training window size.
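To make the timing conditioning concrete, here is a minimal sketch of how a training example could be paired with its start time and total file duration. The function name `crop_with_timing`, the variable names `seconds_start` and `seconds_total`, and the zero-padding choice are illustrative assumptions, not the Stable Audio implementation.

```python
import random
from typing import List, Tuple


def crop_with_timing(
    samples: List[float],
    sample_rate: int,
    window_seconds: float,
    rng: random.Random,
) -> Tuple[List[float], float, float]:
    """Return a fixed-length chunk plus (seconds_start, seconds_total).

    Hypothetical helper: names and padding strategy are assumptions.
    """
    window = int(round(window_seconds * sample_rate))
    total = len(samples)

    # Files shorter than the window always start at 0 and are padded below.
    max_start = max(total - window, 0)
    start = rng.randint(0, max_start)

    chunk = samples[start:start + window]
    # Pad with silence so every training example has the same length.
    chunk = chunk + [0.0] * (window - len(chunk))

    seconds_start = start / sample_rate   # where the chunk begins in the file
    seconds_total = total / sample_rate   # duration of the whole file
    return chunk, seconds_start, seconds_total


# Toy usage: a 10-second "song" at an 8 Hz sample rate, 4-second window.
rng = random.Random(0)
song = [0.5] * (8 * 10)
chunk, seconds_start, seconds_total = crop_with_timing(song, 8, 4.0, rng)
```

Under this sketch, a model that sees both values during training can learn where a chunk sits within its source file. At generation time one would presumably pass a start time of zero and the desired total duration, then trim any trailing silence.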
Working with a heavily downsampled latent representation of audio allows for much faster inference than working with raw audio samples. Using the latest advancements in diffusion sampling techniques, our flagship Stable Audio model can render 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU.
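For a sense of scale, the short calculation below works out how many raw audio samples the figures above correspond to. It uses only the numbers stated in the text (95 seconds, stereo, 44.1 kHz); the helper `real_time_factor` is a hypothetical name introduced for illustration.

```python
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz, as stated above
DURATION_S = 95           # seconds of audio rendered
CHANNELS = 2              # stereo

samples_per_channel = SAMPLE_RATE_HZ * DURATION_S
total_samples = samples_per_channel * CHANNELS


def real_time_factor(audio_seconds: float, render_seconds: float) -> float:
    """Seconds of audio produced per second of compute (hypothetical helper)."""
    return audio_seconds / render_seconds


print(samples_per_channel)  # 4189500
print(total_samples)        # 8379000
```

Rendering in under one second therefore means producing more than 8 million raw sample values at over 95 times real-time speed.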
arXiv: https://arxiv.org/abs/2402.04825
Code: https://github.com/Stability-AI/stable-audio-tools
Metrics: https://github.com/Stability-AI/stable-audio-metrics
Demo: https://stability-ai.github.io/stable-audio-demo/