Project repository: https://github.com/open-mmlab/Amphion
TTS: Text-to-Speech
Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
- NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
SVC: Singing Voice Conversion
- Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper.
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE-, and flow-based models. The diffusion-based architecture uses a bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model.
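To make the sampling step more concrete, here is a minimal sketch of one deterministic DDIM update (the eta = 0 case) in plain Python. It is not Amphion's implementation: the function name `ddim_step` and its arguments are hypothetical, and the noise prediction `eps_pred` is assumed to come from some trained denoising network that is not shown here.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied element-wise.

    x_t            : noisy sample at the current step (list of floats)
    eps_pred       : noise predicted by the model for x_t (assumed given)
    alpha_bar_t    : cumulative noise-schedule product at the current step
    alpha_bar_prev : cumulative noise-schedule product at the target step
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)

    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = (x - sqrt_one_minus_a_t * eps) / sqrt_a_t
        # Re-noise that estimate to the (less noisy) target step.
        x_prev.append(sqrt_a_prev * x0_pred + sqrt_one_minus_a_prev * eps)
    return x_prev
```

Because the update is deterministic, a sampler built on it can skip steps in the noise schedule, which is one reason DDIM typically needs fewer network evaluations than ancestral DDPM sampling.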
TTA: Text-to-Audio
Amphion supports TTA with a latent diffusion model. Its design is similar to AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper.
Vocoder
- Amphion supports various widely-used neural vocoders, including:
  - GAN-based vocoders: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet.
  - Flow-based vocoders: WaveGlow.
  - Diffusion-based vocoders: DiffWave.
  - Autoregressive vocoders: WaveNet, WaveRNN.
- Amphion provides the official implementation of the Multi-Scale Constant-Q Transform Discriminator. It can be used to enhance any GAN-based vocoder architecture during training while leaving the inference stage (for example, memory use and speed) unchanged.
Evaluation
Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:
- F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
- Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
- Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
- Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
- Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, WeSpeaker, and more.
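For readers unfamiliar with these metrics, the sketch below shows how a few of them are commonly defined, using only the Python standard library. It is an illustration, not Amphion's evaluation code: the function names are hypothetical, the F0 tracks and speaker embeddings are assumed to have been extracted already (for example by a pitch tracker and a speaker model), and unvoiced frames are assumed to be marked with an F0 of 0.

```python
import math


def f0_rmse(ref_f0, pred_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both tracks."""
    pairs = [(r, p) for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both tracks")
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

In practice the hypothesis transcript would come from an ASR model such as Whisper, and the embeddings from a speaker model such as RawNet3 or WeSpeaker; the arithmetic on top of them stays the same.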
Datasets
Amphion unifies the data preprocessing of open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).
📀 Installation
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion
# Install Python Packages Dependencies
sh env.sh
🐍 Usage in Python
We detail the instructions for the different tasks in the following recipes:
- Text-to-Speech (TTS)
- Singing Voice Conversion (SVC)
- Text-to-Audio (TTA)
- Vocoder
- Evaluation
🙏 Acknowledgements
- ming024's FastSpeech2 and jaywalnut310's VITS for model architecture code.
- lifeiteng's VALL-E for training pipeline and model architecture design.
- WeNet, Whisper, ContentVec, and RawNet3 for pretrained models and inference code.
- HiFi-GAN for GAN-based Vocoder's architecture design and training strategy.
- Encodec for well-organized GAN Discriminator's architecture and basic blocks.
- Latent Diffusion for model architecture design.
- TensorFlowTTS for preparing the MFA tools.
©️ License
Amphion is under the MIT License. It is free for both research and commercial use cases.
📚 Citations
Stay tuned, coming soon!