How to use Stable Audio Open 1.0

Stable Audio Open 1.0 is a powerful model that generates variable-length stereo audio (up to 47 seconds) at 44.1kHz from text prompts. It consists of three key components: an autoencoder that compresses waveforms into manageable sequence lengths, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the autoencoder’s latent space.
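
The "up to 47 seconds" figure follows directly from the model's fixed generation window. As a quick sanity check, here is the arithmetic, assuming the sample_size of 2,097,152 samples published in the model config (treat that value as an assumption if your configuration differs):

sample_rate = 44_100      # Hz, from the model config
sample_size = 2_097_152   # 2**21 samples, the model's full generation window
print(sample_size / sample_rate)  # ≈ 47.55 s, hence "up to 47 seconds"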

Usage

This model is compatible with:

  • the stable-audio-tools library
  • the diffusers library

Using Stable Audio with stable-audio-tools

This model is designed to work with the stable-audio-tools library for inference. For example:

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)

# Set up text and timing conditioning; seconds_total sets the requested clip length
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0, 
    "seconds_total": 30
}]

# Generate stereo audio
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
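
Note that the model always fills its full ~47-second window, so when seconds_total is shorter the tail of the output is typically near-silent. Below is a minimal sketch for trimming the result to the requested length, reusing the variables above (the near-silent-tail behavior is an observation, not a documented guarantee):

# Keep only the first 30 seconds requested via seconds_total above;
# the remainder of the 47 s window is typically near-silent
seconds_total = 30
trimmed = output[:, : sample_rate * seconds_total]
torchaudio.save("output_trimmed.wav", trimmed, sample_rate)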

Using Stable Audio with diffusers

Ensure you have the latest version of the diffusers library (pip install -U diffusers) before running the following code:

import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Define the prompts
prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# Set the seed for generator
generator = torch.Generator("cuda").manual_seed(0)

# Run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_end_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# `audio` holds num_waveforms_per_prompt waveforms; take the first and
# transpose to (samples, channels) for soundfile
output = audio[0].T.float().cpu().numpy()
sf.write("hammer.wav", output, pipe.vae.sampling_rate)

Refer to the StableAudioPipeline documentation in diffusers for further optimization and usage details.
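
For example, if the float16 pipeline does not fit in GPU memory, diffusers' standard model CPU offloading can be used in place of moving the whole pipeline to CUDA. A minimal sketch (requires the accelerate package):

import torch
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)
# Keep submodules on the CPU and move each to the GPU only while it runs.
# This replaces `pipe = pipe.to("cuda")` from the example above.
pipe.enable_model_cpu_offload()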

Model Details

  • Model Type: Stable Audio Open 1.0 is a latent diffusion model based on a transformer architecture.
  • Language(s): English
  • License: Stability AI Community License.
  • Commercial License: To use this model commercially, please refer to the Stability AI License (https://stability.ai/license).
  • Research Paper: Stable Audio Open (https://arxiv.org/abs/2407.14358)

Training Dataset

  • Datasets Used: The training dataset consists of 486,492 audio recordings, including 472,618 from Freesound and 13,874 from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. The autoencoder and DiT were trained using this data, while a publicly available pre-trained T5 model (t5-base) was used for text conditioning.

Attribution

Attribution for all audio recordings used to train Stable Audio Open 1.0 can be found in this repository:

  • Freesound attribution [csv]
  • FMA attribution [csv]

Mitigations

A thorough analysis was conducted to ensure no unauthorized copyrighted music was included in the training data. Identified music samples from Freesound were processed through Audible Magic’s identification services to remove any suspected copyrighted content. The final dataset consists of 266,324 CC0, 194,840 CC-BY, and 11,454 CC Sampling+ audio recordings.

For the FMA subset, a metadata search against a large copyrighted music database (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset) was conducted, with flagged content reviewed by humans, leaving 8,967 CC-BY and 4,907 CC0 tracks.
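
One simple way such a metadata search can be implemented is a join on normalized (artist, title) pairs, with every match escalated to a human reviewer. The sketch below is illustrative only; the file and column names are hypothetical, and the real FMA metadata and the Kaggle Spotify tracks dataset each have their own schemas:

import pandas as pd

# Hypothetical filenames and columns, for illustration only
fma = pd.read_csv("fma_metadata.csv")        # e.g. columns: artist, title
catalog = pd.read_csv("spotify_tracks.csv")  # e.g. columns: artists, track_name

def norm(s: pd.Series) -> pd.Series:
    # Case-fold and strip whitespace so near-identical strings compare equal
    return s.str.lower().str.strip()

fma_keys = norm(fma["artist"]) + "|" + norm(fma["title"])
catalog_keys = set(norm(catalog["artists"]) + "|" + norm(catalog["track_name"]))

# Matches are flagged for human review, not dropped automatically
flagged = fma[fma_keys.isin(catalog_keys)]
print(f"{len(flagged)} of {len(fma)} tracks flagged for review")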

Use and Limitations

Intended Use:

  • Research and experimentation in AI-based music and audio generation.
  • Exploration of generative AI capabilities by machine learning practitioners and artists.

Out-of-Scope Use Cases:

  • The model should not be used in downstream applications without further risk evaluation.
  • It should not be used to create or disseminate hostile or alienating audio content.

Limitations:

  • Inability to generate realistic vocals.
  • Performance may vary with non-English descriptions.
  • Not all music styles and cultures are equally represented.
  • Better suited for generating sound effects and field recordings.
  • Prompt engineering might be necessary for optimal results.

Biases

The dataset may lack cultural diversity, and as a result, the model might not perform equally well across all music genres and sound effects. The generated samples will reflect biases present in the training data.

