Stable Audio 3.0

Key Takeaways

Stable Audio 3.0 raises the bar for AI-generated audio, delivering high-quality, full tracks with coherent musical structures up to three minutes in length at 44.1 kHz stereo.
The latest model introduces audio-to-audio generation, allowing users to upload and transform samples using natural language prompts.
Stable Audio 3.0 was trained exclusively on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.
Explore the model and start creating for free on the Stable Audio website now.

Stability AI is excited to announce the launch of Stable Audio 3.0, a model that produces high-quality, full-length tracks with coherent musical structures up to three minutes long at 44.1 kHz stereo, all from a single natural language prompt.

This new version extends beyond text-to-audio and now includes audio-to-audio functionality. Users can upload audio samples and transform them into a wide array of sounds using natural language prompts. This update also enhances sound effect generation and style transfer, offering artists and musicians more flexibility, more control, and an elevated creative process.

Stable Audio 3.0 builds on the foundation laid by Stable Audio 1.0, which debuted in September 2023 as the first commercially viable AI music generation tool capable of producing high-quality 44.1 kHz music using latent diffusion technology. That first release was recognized as one of TIME’s Best Inventions of 2023.

This new model is now available for free on the Stable Audio website and will soon be accessible via API.

New Features

Our most advanced audio model yet expands the creative toolkit for artists and musicians with its enhanced functionalities. With both text-to-audio and audio-to-audio capabilities, users can generate melodies, backing tracks, stems, and sound effects.

Full-Length Tracks

Stable Audio 3.0 sets itself apart from other state-of-the-art models by generating songs up to three minutes in length. It produces structured compositions with intros, developments, and outros, along with stereo sound effects.

Audio-to-Audio Generation

Stable Audio 3.0 now supports audio file uploads, enabling users to transform ideas into fully produced samples. Our Terms of Service require that uploads be free of copyrighted material, and we use advanced content recognition to maintain compliance and prevent infringement.

Variations and Sound Effects Creation

This model expands the range of sound and audio effects that can be created, from the tapping of a keyboard to the roar of a crowd or the hum of city streets, offering new ways to elevate audio projects.

Style Transfer

This new feature seamlessly modifies newly generated or uploaded audio during the generation process. It allows customization of the output’s theme to align with a project’s specific style and tone.

Research

The architecture of the Stable Audio 3.0 latent diffusion model is specifically designed to generate full tracks with coherent structures. To achieve this, all system components have been adapted for improved performance over long time scales. A highly compressed autoencoder condenses raw audio waveforms into much shorter representations.
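To give a rough sense of why compression matters over long time scales, the arithmetic can be sketched as follows. The sample rate, channel count, and track length come from this announcement. The downsampling ratio is a hypothetical value chosen purely for illustration, since the actual ratio of the autoencoder is not stated here.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the announcement: 44.1 kHz
CHANNELS = 2              # stereo
TRACK_SECONDS = 3 * 60    # up to three minutes

# Hypothetical value for illustration only; the announcement does not
# state the autoencoder's actual downsampling ratio.
ASSUMED_DOWNSAMPLING_RATIO = 2048


def raw_samples_per_channel(seconds: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of waveform samples in one channel of a track."""
    return seconds * sample_rate_hz


def latent_length(num_samples: int, ratio: int = ASSUMED_DOWNSAMPLING_RATIO) -> int:
    """Sequence length after downsampling by `ratio`, rounded up."""
    return math.ceil(num_samples / ratio)


if __name__ == "__main__":
    samples = raw_samples_per_channel(TRACK_SECONDS)
    print(f"raw samples per channel: {samples:,}")
    print(f"raw samples, both channels: {samples * CHANNELS:,}")
    print(f"latent steps at an assumed {ASSUMED_DOWNSAMPLING_RATIO}x ratio: "
          f"{latent_length(samples):,}")
```

Under that assumed ratio, a sequence of several million samples per channel shrinks to a few thousand latent steps, which is a far more manageable length for a sequence model.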

The diffusion model uses a diffusion transformer (DiT), similar to the one used in Stable Diffusion 3, in place of the previous U-Net. Together, the autoencoder and the DiT enable the model to recognize and reproduce the large-scale structures that are crucial for high-quality compositions.

Stay tuned for the release of the research paper with additional technical details.

The Autoencoder condenses audio and reconstructs it back to its original state, capturing and reproducing essential features while filtering out less important details for more coherent generations.

A Diffusion Transformer (DiT) refines random noise into structured data incrementally, identifying intricate patterns and relationships. When combined with the Autoencoder, it gains the capability to process longer sequences, creating a deeper, more accurate interpretation from inputs.
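The idea of refining random noise incrementally can be illustrated with a toy loop. This is only a sketch of the general principle, not the sampler used by Stable Audio 3.0: `toy_denoiser` is a hypothetical stand-in that simply returns a known clean signal, whereas a trained DiT would predict it from the noisy input and the prompt. The step count and the linear update rule are likewise assumptions.

```python
import random


def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a trained model: returns the clean signal.

    A real diffusion transformer would estimate this from `noisy` alone
    (plus conditioning such as a text prompt); here we cheat and pass it in.
    """
    return list(target)


def refine(target, steps=10, seed=0):
    """Start from Gaussian noise and move toward the prediction step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for i in range(steps):
        remaining = steps - i
        predicted = toy_denoiser(x, target)
        # Close 1/remaining of the gap to the prediction on each step,
        # so the final step (remaining == 1) lands on the prediction.
        x = [xi + (pi - xi) / remaining for xi, pi in zip(x, predicted)]
    return x


if __name__ == "__main__":
    clean = [0.0, 0.5, 1.0, 0.5]
    print(refine(clean, steps=10))
```

Each pass removes part of the remaining noise rather than all of it at once, which is the sense in which the output is built up incrementally.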

Safeguards

Like its predecessor, Stable Audio version 3.0 is trained on data from AudioSparx, consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems, along with corresponding text metadata. All of AudioSparx’s artists had the option to ‘opt out’ of the model training.

To protect creator copyrights, we partner with Audible Magic and use their automatic content recognition (ACR) technology on audio uploads, enabling real-time content matching to prevent copyright infringement.
