After OpenAI released GPT-4o, more and more people began to imagine having an AI "companion" of their own, and the most impressive part of it is the outstanding TTS. Before that, the mainstream open-source TTS options were XTTS v2 and Bark. Recently, a text-to-speech project called ChatTTS went viral and attracted a great deal of attention.
Project: https://github.com/2noise/ChatTTS/tree/main
Official description: ChatTTS is a text-to-speech model designed for dialogue scenarios such as LLM assistants. It supports both Chinese and English. The model was trained on 100,000+ hours of Chinese and English speech. The open-source version on HuggingFace is a 40,000-hour pre-trained model without SFT.
Features
- Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, which facilitates interactive conversations.
- Fine-grained control: the model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
- Better prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. Pretrained models are provided to support further research and development.
commit fb54155b47404dbf7f61f230b117cf36b577ffec
Code
import ChatTTS
from IPython.display import Audio

# Initialize ChatTTS and load the pretrained models
chat = ChatTTS.Chat()
chat.load_models()

# Text to synthesize (the input is a batch, i.e. a list of strings)
texts = ["作者写百草园,以“乐”为中心,以简约生动的文字,描绘了一个奇趣无穷的儿童乐园,其间穿插“美女蛇”的传说和冬天雪地捕鸟的故事,动静结合,详略得当,趣味无穷。",]

# Run inference and play the 24 kHz result inline
wavs = chat.infer(texts, use_decoder=True)
Audio(wavs[0], rate=24_000, autoplay=True)
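The inference call returns a list of 24 kHz waveform arrays, so the result can also be written to disk rather than only played inline. A minimal sketch, assuming torchaudio is installed (the filename is arbitrary):

import torch
import torchaudio
# wavs[0] is the float waveform for the first input, at the same 24 kHz rate as above
torchaudio.save("basic_output.wav", torch.from_numpy(wavs[0]), 24_000)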
Advanced usage
###################################
# Sample a speaker from Gaussian.
import torch
std, mean = torch.load('ChatTTS/asset/spk_stat.pt').chunk(2)
rand_spk = torch.randn(768) * std + mean
params_infer_code = {
    'spk_emb': rand_spk,  # add sampled speaker
    'temperature': .3,    # using custom temperature
    'top_P': 0.7,         # top P decode
    'top_K': 20,          # top K decode
}
###################################
# For sentence-level manual control,
# use oral_(0-9), laugh_(0-2), break_(0-7)
# to insert special tokens into the text to synthesize.
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'
}
wav = chat.infer("作者写百草园,以“乐”为中心,以简约生动的文字,描绘了一个奇趣无穷的儿童乐园,其间穿插“美女蛇”的传说和冬天雪地捕鸟的故事,动静结合,详略得当,趣味无穷。", params_refine_text=params_refine_text, params_infer_code=params_infer_code)
Audio(wav[0], rate=24_000, autoplay=True)
###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wav = chat.infer(text, skip_refine_text=True, params_infer_code=params_infer_code)
Audio(wav[0], rate=24_000, autoplay=True)
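Since the speaker embedding is sampled once and only passed in through params_infer_code, the same rand_spk can be reused across calls to keep a consistent voice between utterances. A minimal sketch under that assumption (the example sentences are made up):

# Reuse the sampled speaker so several utterances share one voice
more_texts = [
    "This is the first follow-up sentence.",
    "And this one should sound like the same speaker.",
]
more_wavs = chat.infer(more_texts, params_infer_code=params_infer_code)
Audio(more_wavs[0], rate=24_000, autoplay=True)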
The progress bar here can't keep up with the 4090, and the author's example doesn't include the Audio call, so I mistakenly thought it had hung... (in front of Wataru Akiyama, the 86 is just too slow)
commit 1c022eeebe577ba3651f4e568fa2dccabaf16e78
Pulled the new version and hit this:
except:
197 self.logger.log(logging.WARNING, f'Package nemo_text_processing not found! \
198 Run: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing')
--> 199 self.normalizer[lang] = partial(Normalizer(input_case='cased', lang=lang).normalize, verbose=False, punct_post_process=True)
UnboundLocalError: local variable 'Normalizer' referenced before assignment
I checked https://github.com/2noise/ChatTTS/issues/164 and ran the command from the warning: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing.
Come on, why not just update requirements.txt?
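To fail fast instead of hitting that UnboundLocalError deep inside the normalizer, the optional dependencies can be checked up front. A minimal sketch using only the standard library (the package names are the ones from the warning):

import importlib.util

# Verify the optional text-normalization dependencies before running inference
for pkg in ("pynini", "nemo_text_processing"):
    if importlib.util.find_spec(pkg) is None:
        raise RuntimeError(
            f"{pkg} is missing; run: "
            "conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing"
        )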
Code
# Import necessary libraries and configure settings
import torch
import torchaudio
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')
import ChatTTS
from IPython.display import Audio
# Initialize and load the model:
chat = ChatTTS.Chat()
chat.load_models(compile=False) # Set to True for better performance
# Define the text input for inference (Support Batching)
texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
]
# Perform inference and play the generated audio
wavs = chat.infer(texts)
Audio(wavs[0], rate=24_000, autoplay=True)
# Save the generated audio
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)
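When texts holds more than one entry, infer returns one waveform per input, so the whole batch can be written out in a loop. A minimal sketch reusing the same torchaudio call (the filename pattern is arbitrary):

# Save every waveform in the batch, one file per input text
for i, wav in enumerate(wavs):
    torchaudio.save(f"output_{i}.wav", torch.from_numpy(wav), 24000)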