Make-An-Audio 2
Text-to-audio generation technology based on a diffusion model
Tags: AI audio tools, AI audio processing tools
Introduction:
Make-An-Audio 2 is a text-to-audio generation technique built on a diffusion model, developed by researchers from Zhejiang University, ByteDance, and the Chinese University of Hong Kong. It improves the quality of generated audio by parsing input text with pre-trained large language models (LLMs), optimizing semantic alignment and temporal consistency. It also introduces a diffusion denoiser based on a feed-forward Transformer, which improves variable-length audio generation and strengthens the extraction of temporal information. In addition, it addresses the scarcity of temporally annotated data by using LLMs to convert large amounts of audio-label data into audio-text datasets.
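The LLM parsing step can be pictured as turning a free-form caption into structured pairs of sound events and temporal orders. The sketch below is purely illustrative: the `<event& order>` pair syntax and both helper functions are assumptions for demonstration, not the project's actual format or API.

```python
import re

def format_structured_caption(pairs):
    """Join (event, order) pairs into one structured caption string.

    `pairs` is a list like [("dog barking", "start"), ("gun shooting", "end")].
    The "<event& order>" syntax is an assumed illustration of the kind of
    structured input an LLM parser might emit for the text encoder.
    """
    return " ".join(f"<{event}& {order}>" for event, order in pairs)

def parse_structured_caption(caption):
    """Recover (event, order) pairs from a structured caption string."""
    return re.findall(r"<([^&>]+)&\s*([^>]+)>", caption)

pairs = [("dog barking", "start"), ("gun shooting", "end")]
caption = format_structured_caption(pairs)
print(caption)  # <dog barking& start> <gun shooting& end>
print(parse_structured_caption(caption) == pairs)  # True
```

Representing captions this way is what gives the model an explicit handle on *when* each sound event should occur, rather than leaving temporal order implicit in free-form text.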
Stakeholders:
This technology targets researchers and developers in the field of audio synthesis, as well as application scenarios that require high-quality text-to-audio conversion, such as automatic dubbing and audiobook production. Through these techniques, Make-An-Audio 2 can generate high-quality audio that is semantically aligned and temporally consistent with the text content, meeting the needs of these users.
Usage Scenario Examples:
- Automatically generate background sound effects and dialogue for audiobooks
- Automatically add narration and sound effects to video content
- Create virtual characters’ voices for use in games or animations
Features of the tool:
- Parses text using pre-trained large language models (LLMs) to better capture temporal information
- Introduces a structured text encoder that aids semantic alignment during the diffusion denoising process
- Designs a diffusion denoiser based on a feed-forward Transformer to improve variable-length audio generation
- Augments and transforms audio-label data with LLMs to alleviate the scarcity of temporally annotated data
- Surpasses baseline models on both objective and subjective metrics, with significant gains in temporal understanding, semantic consistency, and sound quality
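A key reason a feed-forward Transformer denoiser handles variable-length audio is that padded latent frames can simply be masked out of self-attention. The toy NumPy sketch below illustrates this mechanism with a single attention head; it is a simplified illustration under our own assumptions, not the model's actual architecture.

```python
import numpy as np

def masked_self_attention(x, lengths):
    """Single-head self-attention over a padded batch of latent sequences.

    x:       (batch, max_len, dim) padded latent frames
    lengths: true length of each sequence in the batch

    Padding positions are masked out of the attention softmax, which is
    what lets one non-autoregressive Transformer denoiser process audio
    latents of any duration. Toy illustration only.
    """
    b, t, d = x.shape
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)             # (b, t, t)
    mask = np.arange(t)[None, :] < np.array(lengths)[:, None]  # (b, t)
    scores = np.where(mask[:, None, :], scores, -1e9)          # hide padded keys
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x                                         # (b, t, dim)

# Two "audio" latent sequences of different lengths, padded to max_len=6.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 8))
out = masked_self_attention(x, lengths=[6, 3])
print(out.shape)  # (2, 6, 8)
```

Because nothing in the computation depends on a fixed sequence length, the same weights apply to a 2-second or a 30-second latent, which is the property the feature list above refers to.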
Steps for Use:
- Step 1: Prepare natural language text as input
- Step 2: Parse the text using the text encoder of Make-An-Audio 2
- Step 3: The structured text encoder aids the learning of semantic alignment
- Step 4: Generate audio using a diffusion denoiser
- Step 5: Adjust the length and time controls for the generated audio
- Step 6: Modify the structured input as needed to precisely control the time
- Step 7: Generate the final audio output
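Steps 4–7 above can be sketched as a diffusion-style sampling loop: start from Gaussian noise of the requested duration and repeatedly apply a denoiser. Everything below is a toy stand-in under stated assumptions: the update rule is simplified (not true DDPM math), and `dummy_denoise` replaces the real text-conditioned Transformer denoiser and vocoder.

```python
import numpy as np

def sample_latent(denoise_fn, num_frames, latent_dim, steps=50, seed=0):
    """Toy diffusion-style sampling loop for an audio latent.

    Starts from Gaussian noise of the requested duration (num_frames)
    and iteratively applies `denoise_fn`, mimicking how a diffusion
    denoiser produces a variable-length audio latent before a vocoder
    turns it into a waveform. Illustrative only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_frames, latent_dim))
    for step in range(steps, 0, -1):
        t = step / steps                  # noise level shrinks toward 0
        z = z - t * denoise_fn(z, t)      # simplistic update, not real DDPM math
        if step > 1:
            z += np.sqrt(t) * 0.01 * rng.standard_normal(z.shape)
    return z

# Dummy denoiser: pulls the latent toward zero (a stand-in for the
# structured-caption-conditioned Transformer denoiser).
dummy_denoise = lambda z, t: z
latent = sample_latent(dummy_denoise, num_frames=120, latent_dim=16)
print(latent.shape)  # (120, 16)
```

Note that the output duration is set simply by choosing `num_frames` at sampling time, which corresponds to the length and time controls in Step 5.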
Tool’s Tabs: Text-to-audio, diffusion model