Stable Audio 2.0 Model

Stable Audio AudioSparx v2.0 Model

Detail

Our groundbreaking Stable Audio AudioSparx 2.0 model is designed to generate full tracks with coherent structure at lengths of up to 3 minutes and 10 seconds. The new model is available to everyone for generating full tracks on our Stable Audio product.

Key features:

  • Stable Audio 2.0 sets a new standard in AI-generated audio, producing high-quality full tracks with coherent musical structure up to three minutes in length at 44.1 kHz stereo.
  • The new model introduces audio-to-audio generation, allowing users to upload and transform samples using natural language prompts (see the sketch after this list).
  • Stable Audio 2.0 was trained exclusively on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.
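
To make the audio-to-audio workflow concrete, here is a minimal Python sketch that uploads a sample together with a natural-language prompt. The endpoint URL, field names, and the strength parameter are assumptions for illustration only, not the documented Stable Audio API.

```python
import requests

# Hypothetical endpoint and field names -- placeholders, not the documented API.
API_URL = "https://api.example.com/v1/stable-audio/audio-to-audio"
API_KEY = "YOUR_API_KEY"

def transform_sample(input_path: str, prompt: str, strength: float = 0.7) -> bytes:
    """Upload a sample and ask the model to transform it per the prompt.

    `strength` is an assumed knob for how far the output may drift from
    the input audio; the real service may expose different controls.
    """
    with open(input_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"audio": f},
            data={"prompt": prompt, "strength": strength},
            timeout=300,
        )
    response.raise_for_status()
    return response.content  # raw audio bytes (e.g., WAV)

if __name__ == "__main__":
    audio = transform_sample("drum_loop.wav", "lo-fi funk, dusty drums, warm bass")
    with open("transformed.wav", "wb") as out:
        out.write(audio)
```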

Full-length tracks

Examples

Stable Audio 2.0 sets itself apart from other state-of-the-art models as it can generate songs up to three minutes in length, complete with structured compositions that include an intro, development, and outro, as well as stereo sound effects.

Below are a few examples of full-length track generations, labeled with the prompts used:

  1. Lo-fi funk
  2. Country instrumental
  3. Pop, Pop-Electronic, Ballad, Billboard, Drum Machine, Bass, Lush Synthesizer Pads, Synthesizer Arp, Synth Bass, Vocal Sample Chops, Percussion, Honest, Heart-Felt, Melancholic, Vibe, Cool, Modern, Atmospheric, well-arranged composition, 115 BPM
  4. Piano melody begins a melancholic journey, full orchestral climax, the swells of the orchestral instrumentals

Research overview

The architecture of the Stable Audio 2.0 latent diffusion model is specifically designed to enable the generation of full tracks with coherent structures. To achieve this, we have adapted all components of the system for improved performance over long time scales. A new, highly compressed autoencoder compresses raw audio waveforms into much shorter representations. For the diffusion model, we employ a diffusion transformer (DiT), akin to that used in Stable Diffusion 3, in place of the previous U-Net, as it is more adept at manipulating data over long sequences. The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions.
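
To get a feel for why the highly compressed autoencoder matters, the back-of-the-envelope sketch below compares the raw sample count of a full 3:10 track with the latent sequence length the diffusion transformer actually has to model. The compression factor is an assumed, illustrative value, not a published figure.

```python
# Back-of-the-envelope sequence lengths for a full stereo track.
SAMPLE_RATE = 44_100   # Hz (44.1 kHz, per the model description)
DURATION_S = 190       # 3 minutes 10 seconds

raw_samples = SAMPLE_RATE * DURATION_S               # samples per channel
print(f"raw samples per channel: {raw_samples:,}")   # 8,379,000

# The compression factor below is an assumption for illustration;
# the actual autoencoder's downsampling ratio may differ.
COMPRESSION_FACTOR = 2048
latent_frames = raw_samples // COMPRESSION_FACTOR
print(f"latent sequence length:  {latent_frames:,}")  # 4,091
```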

[Figure: Autoencoder diagram]

[Figure: DiT diagram]
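
For intuition about how a transformer denoiser operates over a latent sequence of that length, here is a minimal, hedged PyTorch sketch. It omits the timestep and text conditioning a real DiT uses, and its dimensions are invented for illustration; it is not the Stable Audio 2.0 architecture.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Illustrative denoiser: self-attention over a latent sequence.

    Real DiT blocks add timestep/text conditioning (e.g., adaptive layer
    norm and cross-attention); this sketch keeps only the core idea that
    attention mixes information across the whole sequence, which is what
    lets the model track long-range musical structure.
    """

    def __init__(self, latent_dim: int = 64, model_dim: int = 256,
                 num_layers: int = 4, num_heads: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(model_dim, latent_dim)

    def forward(self, noisy_latents: torch.Tensor) -> torch.Tensor:
        # noisy_latents: (batch, seq_len, latent_dim)
        h = self.proj_in(noisy_latents)
        h = self.blocks(h)       # every position attends to every other
        return self.proj_out(h)  # predicted denoised latents

# A ~3-minute track compressed to a few thousand latent frames is a
# tractable sequence for attention; the raw waveform would not be.
x = torch.randn(1, 4096, 64)  # assumed latent shape, illustration only
denoised = TinyDiT()(x)
print(denoised.shape)  # torch.Size([1, 4096, 64])
```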