How to use Stable Audio Open 1.0

Stable Audio Open 1.0 is a powerful model that generates variable-length stereo audio (up to 47 seconds) at 44.1kHz from text prompts. It consists of three key components: an autoencoder that compresses waveforms into manageable sequence lengths, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the autoencoder’s latent space.
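
The "up to 47 seconds" figure follows directly from the model's fixed generation window. As a quick sanity check, here is the arithmetic, assuming the sample_size of 2,097,152 samples published in the model config (treat that value as an assumption if your configuration differs):

sample_rate = 44_100      # Hz, from the model config
sample_size = 2_097_152   # 2**21 samples, the model's full generation window
print(sample_size / sample_rate)  # ≈ 47.55 s, hence "up to 47 seconds"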

Usage

This model is compatible with:

  • the stable-audio-tools library
  • the diffusers library

Using Stable Audio with stable-audio-tools

This model is designed to work with the stable-audio-tools library for inference. For example:

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)

# Set up text and timing conditioning; seconds_total sets the requested clip length
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0, 
    "seconds_total": 30
}]

# Generate stereo audio
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
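
Note that the model always fills its full ~47-second window, so when seconds_total is shorter the tail of the output is typically near-silent. Below is a minimal sketch for trimming the result to the requested length, reusing the variables above (the near-silent-tail behavior is an observation, not a documented guarantee):

# Keep only the first 30 seconds requested via seconds_total above;
# the remainder of the 47 s window is typically near-silent
seconds_total = 30
trimmed = output[:, : sample_rate * seconds_total]
torchaudio.save("output_trimmed.wav", trimmed, sample_rate)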

Using Stable Audio with diffusers

Ensure you have the latest version of the diffusers library (pip install -U diffusers) before running the following code:

import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Define the prompts
prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# Set the seed for generator
generator = torch.Generator("cuda").manual_seed(0)

# Run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_end_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# `audio` holds num_waveforms_per_prompt waveforms; take the first and
# transpose to (samples, channels) for soundfile
output = audio[0].T.float().cpu().numpy()
sf.write("hammer.wav", output, pipe.vae.sampling_rate)

Refer to the StableAudioPipeline documentation in diffusers for further optimization and usage details.
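
For example, if the float16 pipeline does not fit in GPU memory, diffusers' standard model CPU offloading can be used in place of moving the whole pipeline to CUDA. A minimal sketch (requires the accelerate package):

import torch
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)
# Keep submodules on the CPU and move each to the GPU only while it runs.
# This replaces `pipe = pipe.to("cuda")` from the example above.
pipe.enable_model_cpu_offload()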

Model Details

  • Model Type: Stable Audio Open 1.0 is a latent diffusion model based on a transformer architecture.
  • Language(s): English
  • License: Stability AI Community License.
  • Commercial License: To use this model commercially, please refer to the Stability AI License (https://stability.ai/license).
  • Research Paper: Stable Audio Open (https://arxiv.org/abs/2407.14358)

Training Dataset

  • Datasets Used: The training dataset consists of 486,492 audio recordings, including 472,618 from Freesound and 13,874 from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. The autoencoder and DiT were trained using this data, while a publicly available pre-trained T5 model (t5-base) was used for text conditioning.

Attribution

Attribution for all audio recordings used to train Stable Audio Open 1.0 can be found in this repository:

  • Freesound attribution [csv]
  • FMA attribution [csv]

Mitigations

A thorough analysis was conducted to ensure no unauthorized copyrighted music was included in the training data. Identified music samples from Freesound were processed through Audible Magic’s identification services to remove any suspected copyrighted content. The final dataset consists of 266,324 CC0, 194,840 CC-BY, and 11,454 CC Sampling+ audio recordings.

For the FMA subset, a metadata search against a large copyrighted music database (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset) was conducted, with flagged content reviewed by humans, leaving 8,967 CC-BY and 4,907 CC0 tracks.
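
One simple way such a metadata search can be implemented is a join on normalized (artist, title) pairs, with every match escalated to a human reviewer. The sketch below is illustrative only; the file and column names are hypothetical, and the real FMA metadata and the Kaggle Spotify tracks dataset each have their own schemas:

import pandas as pd

# Hypothetical filenames and columns, for illustration only
fma = pd.read_csv("fma_metadata.csv")        # e.g. columns: artist, title
catalog = pd.read_csv("spotify_tracks.csv")  # e.g. columns: artists, track_name

def norm(s: pd.Series) -> pd.Series:
    # Case-fold and strip whitespace so near-identical strings compare equal
    return s.str.lower().str.strip()

fma_keys = norm(fma["artist"]) + "|" + norm(fma["title"])
catalog_keys = set(norm(catalog["artists"]) + "|" + norm(catalog["track_name"]))

# Matches are flagged for human review, not dropped automatically
flagged = fma[fma_keys.isin(catalog_keys)]
print(f"{len(flagged)} of {len(fma)} tracks flagged for review")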

Use and Limitations

Intended Use:

  • Research and experimentation in AI-based music and audio generation.
  • Exploration of generative AI capabilities by machine learning practitioners and artists.

Out-of-Scope Use Cases:

  • The model should not be used in downstream applications without further risk evaluation.
  • It should not be used to create or disseminate hostile or alienating audio content.

Limitations:

  • Inability to generate realistic vocals.
  • Performance may vary with non-English descriptions.
  • Not all music styles and cultures are equally represented.
  • Better suited for generating sound effects and field recordings.
  • Prompt engineering might be necessary for optimal results.

Biases

The dataset may lack cultural diversity, and as a result, the model might not perform equally well across all music genres and sound effects. The generated samples will reflect biases present in the training data.

