Our groundbreaking Stable Audio 2.0 model is designed to generate full tracks with coherent structure, up to 3 minutes and 10 seconds long. The new model is available for everyone to generate full tracks on our Stable Audio product.
Key features:
Stable Audio 2.0 sets itself apart from other state-of-the-art models: it can generate songs up to three minutes long, with structured compositions that include an intro, development, and outro, as well as stereo sound effects.
Below are a few examples of full-length track generations:
Lo-fi funk
Country instrumental
Pop, Pop-Electronic, Ballad, Billboard, Drum Machine, Bass, Lush Synthesizer Pads, Synthesizer Arp, Synth Bass, Vocal Sample Chops, Percussion, Honest, Heart-Felt, Melancholic, Vibe, Cool, Modern, Atmospheric, well-arranged composition, 115 BPM
Piano melody begins a melancholic journey, full orchestral climax, the swells of the orchestral instrumentals
The architecture of the Stable Audio 2.0 latent diffusion model is specifically designed to enable the generation of full tracks with coherent structures. To achieve this, we have adapted all components of the system for improved performance over long time scales. A new, highly compressed autoencoder compresses raw audio waveforms into much shorter representations. For the diffusion model, we employ a diffusion transformer (DiT), akin to that used in Stable Diffusion 3, in place of the previous U-Net, as it is more adept at manipulating data over long sequences. The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions.
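To give a rough sense of why the compression step matters for long sequences, the sketch below compares the number of raw audio samples in a 3 minute 10 second track with the number of latent frames the diffusion transformer would have to process. The sample rate and the autoencoder downsampling factor are assumptions chosen for illustration and are not stated in this post; `latent_length` is a hypothetical helper, not part of any released code.

```python
import math


def latent_length(duration_s: float, sample_rate: int, downsample: int) -> int:
    """Number of latent frames for one clip, rounding up the last partial frame."""
    n_samples = round(duration_s * sample_rate)
    return math.ceil(n_samples / downsample)


# Track length from the announcement: 3 minutes and 10 seconds.
DURATION_S = 3 * 60 + 10

# ASSUMPTIONS (illustrative, not taken from this post):
SAMPLE_RATE = 44_100   # samples per second, per channel
DOWNSAMPLE = 2_048     # raw samples folded into one latent frame

raw_samples = round(DURATION_S * SAMPLE_RATE)
latent_frames = latent_length(DURATION_S, SAMPLE_RATE, DOWNSAMPLE)

print(f"raw samples per channel: {raw_samples:,}")
print(f"latent frames:           {latent_frames:,}")
print(f"sequence shortened by:   {raw_samples / latent_frames:,.0f}x")
```

Under these assumed values, a sequence of several million samples shrinks to a few thousand frames, which is a length a transformer can attend over in full. The real model's figures may differ.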
Autoencoder diagram:
DiT diagram: