Stable Audio Open Paper

Stability AI is excited to announce the release of the research paper for Stable Audio Open! This open-weight text-to-audio model generates high-quality stereo audio at 44.1kHz from text prompts. Well suited to synthesizing realistic sounds and field recordings, it runs on consumer-grade GPUs, making it accessible for academic and artistic use.

Details

The research paper on Stable Audio Open details the architecture and training process of Stability AI’s new open-weight text-to-audio model, which was trained on Creative Commons data. The model’s weights are available on Hugging Face and are released under the Stability AI Community License.

This license permits non-commercial use and commercial use for individuals or organizations with annual revenues of up to $1 million. For enterprise licenses, please contact support.

The model can generate high-quality stereo audio at 44.1kHz from text prompts, suitable for synthesizing realistic sounds and field recordings. It runs on consumer-grade GPUs, making it accessible for academic and artistic purposes.

Architecture

Stable Audio Open introduces a text-to-audio model with three key components:

  1. Autoencoder: Compresses waveforms into a manageable sequence length.
  2. T5-based Text Embedding: Used for text conditioning.
  3. Transformer-based Diffusion Model (DiT): Operates in the latent space of the autoencoder.

The model generates variable-length stereo audio at 44.1kHz, up to 47 seconds long. The autoencoder achieves a low latent rate of 21.5Hz, which keeps the sequence length manageable for music and general audio. Stable Audio Open is a variant of Stable Audio 2.0 trained on a different dataset (Creative Commons data). Its architecture is similar to that of Stable Audio 2.0, except that it uses T5 text conditioning instead of CLAP.
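To make the latent-rate figure concrete, the short sketch below works out what the numbers quoted above imply. It uses only the 44.1kHz sample rate, the 21.5Hz latent rate, and the 47-second maximum from the text; the function names are illustrative, and because 21.5Hz is a rounded figure, the results are approximate rather than the model's exact internal sizes.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the text.
# The function names are illustrative, not part of any Stable Audio API.

SAMPLE_RATE_HZ = 44_100   # audio sample rate stated in the text
LATENT_RATE_HZ = 21.5     # autoencoder latent rate stated in the text (rounded)
MAX_DURATION_S = 47       # maximum clip length stated in the text


def samples_per_latent(sample_rate_hz: float, latent_rate_hz: float) -> float:
    """Audio samples (per channel) summarized by one latent frame."""
    return sample_rate_hz / latent_rate_hz


def latent_frames(duration_s: float, latent_rate_hz: float) -> float:
    """Latent frames the diffusion model has to process for a clip."""
    return duration_s * latent_rate_hz


compression = samples_per_latent(SAMPLE_RATE_HZ, LATENT_RATE_HZ)
frames = latent_frames(MAX_DURATION_S, LATENT_RATE_HZ)
samples = MAX_DURATION_S * SAMPLE_RATE_HZ

print(f"samples per latent frame: {compression:.0f}")
print(f"waveform samples per channel for {MAX_DURATION_S}s: {samples}")
print(f"latent frames for {MAX_DURATION_S}s: {frames:.1f}")
```

In other words, a 47-second clip of roughly two million samples per channel shrinks to about a thousand latent frames, which is the sequence the diffusion transformer operates on.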

Training Data

Stable Audio Open was trained using nearly 500,000 recordings licensed under CC-0, CC-BY, or CC-Sampling+. The dataset consists of 472,618 recordings from Freesound and 13,874 from the Free Music Archive (FMA).
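As a quick sanity check on the "nearly 500,000" figure, the sketch below adds up the two per-source counts given above and computes each source's share. The counts come straight from the text; the variable names are illustrative, and the shares are by number of recordings, not by duration.

```python
# Recording counts as stated in the text.
recordings = {
    "Freesound": 472_618,
    "Free Music Archive": 13_874,
}

total = sum(recordings.values())

# Fraction of the dataset contributed by each source (by recording count).
shares = {source: count / total for source, count in recordings.items()}

print(f"total recordings: {total}")
for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

The total comes to 486,492 recordings, with the large majority drawn from Freesound.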

To ensure no copyrighted material was included, the content was carefully curated. Music samples in Freesound were first identified using the PANNs audio tagger. The flagged samples were then sent to Audible Magic, a content detection company, so that potentially copyrighted music could be removed from the dataset.

Use Cases

Stable Audio Open can be fine-tuned to customize audio generation, such as adapting the length of generated content or meeting the specific needs of various industries and creative projects. Users can train the model locally with A6000 GPUs. For assistance with prompting, check out some tips for Stable Audio 2.0.

Conclusions

The release of Stable Audio Open marks a significant milestone in open-source audio AI. It offers high-quality stereo sound generation at 44.1kHz, is designed to run on consumer-grade GPUs, and emphasizes data transparency. Despite limitations in areas such as speech and music generation, the model’s accessibility and performance make it a valuable tool for researchers and artists alike, pushing the boundaries of what is possible with open audio AI.
