Text-to-Speech with Tacotron2

Authors: Yao-Yuan Yang, Moto Hira

Overview

This tutorial shows how to build a text-to-speech pipeline using the pretrained Tacotron2 model in torchaudio.

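The text-to-speech pipeline works as follows:

1. Text preprocessing

   First, the input text is encoded into a list of symbols. In this tutorial, we use English characters as the symbols.

2. Spectrogram generation

   A spectrogram is generated from the encoded text. We use the Tacotron2 model for this step.

3. Time-domain conversion

   The last step converts the spectrogram into a waveform. The component that generates speech from a spectrogram is called a vocoder. This tutorial uses three different vocoders: WaveRNN, GriffinLim, and Nvidia's WaveGlow.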

The following figure illustrates the whole process.

https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png

torchaudio.pipelines.Tacotron2TTSBundle packages all the related components, but this tutorial also covers how the process works under the hood.

Preparation

import torch
import torchaudio

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)

2.10.0.dev20251013+cu126
2.8.0a0+1d65bbe
cuda

import IPython
import matplotlib.pyplot as plt

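Text Processing

Character-based encoding

This section explains how character-based encoding works.

The pretrained Tacotron2 model expects a specific symbol table, and torchaudio provides the matching functionality. To make the mechanism easier to understand, however, we first implement the encoding by hand.

First, we define the symbol set '_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'. Then we map each character of the input text to the index of the corresponding symbol in the table. Symbols that are not in the table are ignored.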

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)


def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]


text = "Hello world! Text to speech!"
print(text_to_sequence(text))

[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11, 31, 26, 11, 30, 27, 16, 16, 14, 19, 2]
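The inverse mapping is not part of the tutorial code, but a minimal sketch of it makes the "symbols not in the table are ignored" rule visible. The helper name `sequence_to_text` and the sample string are hypothetical, introduced only for illustration.

```python
# Same symbol table as above; a symbol's position in the string is its index.
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def text_to_sequence(text):
    # Lowercase first, then drop every character that is not in the table.
    return [look_up[s] for s in text.lower() if s in look_up]


def sequence_to_text(sequence):
    # Hypothetical helper: map indices back to their symbols.
    return "".join(symbols[i] for i in sequence)


encoded = text_to_sequence("Hello, Wörld 123")
decoded = sequence_to_text(encoded)
print(encoded)
print(repr(decoded))  # 'ö' and the digits do not survive the round trip
```

Because dropped characters leave no trace in the encoded sequence, the round trip is lossy for any text outside the symbol table.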

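As mentioned above, the symbol table and indices must match what the pretrained Tacotron2 model expects. torchaudio provides the same transform along with the pretrained model. You can instantiate and use it as follows.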

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)

tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15,  2, 11, 31, 16, 35, 31, 11,
         31, 26, 11, 30, 27, 16, 16, 14, 19,  2]])
tensor([28], dtype=torch.int32)

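Note: the output of our manual encoding matches the output of torchaudio's text processor, which means we re-implemented the library's behavior correctly. The processor accepts either a single text or a list of texts as input. When a list of texts is given, the returned lengths variable holds the valid length of each processed sequence in the output batch.

The intermediate representation can be retrieved as follows: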
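To make the role of the lengths value concrete, here is a minimal pure-Python sketch of batching several texts. The function name `encode_batch` is hypothetical, and right-padding with index 0 is an assumption made for this sketch rather than a statement about torchaudio's internals.

```python
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def encode_batch(texts, pad_index=0):
    # Encode each text, then right-pad to the longest sequence in the batch.
    # pad_index=0 is an assumption made for this sketch.
    sequences = [[look_up[s] for s in t.lower() if s in look_up] for t in texts]
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = encode_batch(["Hello world!", "Hi!"])
for row, n in zip(padded, lengths):
    # Only the first n entries of each row are real symbols.
    print(row[:n], "valid length:", n)
```

Slicing each row by its length, as in `processed[0, : lengths[0]]`, is what keeps padding out of downstream processing.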

print([processor.tokens[i] for i in processed[0, : lengths[0]]])

['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', 't', 'e', 'x', 't', ' ', 't', 'o', ' ', 's', 'p', 'e', 'e', 'c', 'h', '!']

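Spectrogram Generation

Tacotron2 is the model we use to generate a spectrogram from the encoded text. For details of the model, please refer to the paper.

Instantiating a Tacotron2 model with pretrained weights is easy. Note, however, that the input to a Tacotron2 model must be processed by the matching text processor.

torchaudio.pipelines.Tacotron2TTSBundle bundles the matching model and processor together so that the pipeline is easy to create.

For the available bundles and their usage, please refer to Tacotron2TTSBundle.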

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)


_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pth


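Note that the Tacotron2.infer method performs multinomial sampling, so spectrogram generation is random: repeated calls on the same input can produce spectrograms of different lengths.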

def plot():
    fig, ax = plt.subplots(3, 1)
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        print(spec[0].shape)
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")


plot()

torch.Size([80, 183])
torch.Size([80, 196])
torch.Size([80, 184])
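The three shapes printed above differ only in their second dimension, the number of spectrogram frames. As a rough sketch, a frame count can be converted to an approximate audio duration. The function name `frames_to_seconds` is hypothetical, and `hop_length=275` and `sample_rate=22050` are assumed values for illustration; check the actual configuration of the bundle you use.

```python
def frames_to_seconds(num_frames, hop_length, sample_rate):
    # Each spectrogram frame advances the signal by hop_length samples.
    return num_frames * hop_length / sample_rate


# Frame counts taken from the shapes printed above.
# hop_length and sample_rate are assumed values for illustration only.
for num_frames in (183, 196, 184):
    seconds = frames_to_seconds(num_frames, hop_length=275, sample_rate=22050)
    print(f"{num_frames} frames ~ {seconds:.2f} s")
```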

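Waveform Generation

Once the spectrogram is generated, the last step is to recover the waveform from it using a vocoder.

torchaudio provides vocoders based on GriffinLim and WaveRNN.

WaveRNN Vocoder

Continuing from the previous section, we can instantiate the matching WaveRNN model from the same bundle.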

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

Downloading: "https://download.pytorch.org/torchaudio/models/wavernn_10k_epochs_8bits_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/wavernn_10k_epochs_8bits_ljspeech.pth


def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


plot(waveforms, spec, vocoder.sample_rate)
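Outside a notebook, the generated audio can be written to disk instead of played inline. The following is a minimal sketch using only the Python standard library. The helper name `write_wav`, the output path, and the 440 Hz tone are hypothetical stand-ins; with the tutorial's tensors, the samples would come from something like `waveforms[0].cpu().tolist()`, and the sample rate from `vocoder.sample_rate` (22050 is assumed here).

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate):
    # Clip to [-1, 1] and scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for a generated waveform: 0.1 s of a 440 Hz tone.
sample_rate = 22050  # assumed value for this sketch
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 10)]

wav_path = os.path.join(tempfile.gettempdir(), "tts_output.wav")
write_wav(wav_path, tone, sample_rate)
print("wrote", wav_path)
```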

Griffin-Lim Vocoder

Using the Griffin-Lim vocoder works the same way as WaveRNN. Instantiate the vocoder object with the get_vocoder() method and pass it the spectrogram.

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
waveforms, lengths = vocoder(spec, spec_lengths)

Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_characters_1500_epochs_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_characters_1500_epochs_ljspeech.pth


plot(waveforms, spec, vocoder.sample_rate)

Waveglow Vocoder

Waveglow is a vocoder published by Nvidia. Its pretrained weights are published on Torch Hub, and the model can be instantiated with the torch.hub module.

# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)

/pytorch/audio/ci_env/lib/python3.11/site-packages/torch/hub.py:335: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to load(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
  warnings.warn(
Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/common.py:13: UserWarning: pytorch_quantization module not found, quantization will not be available
  warnings.warn(
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/efficientnet.py:17: UserWarning: pytorch_quantization module not found, quantization will not be available
  warnings.warn(
/pytorch/audio/ci_env/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Downloading: "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth" to /root/.cache/torch/hub/checkpoints/nvidia_waveglowpyt_fp32_20190306.pth

plot(waveforms, spec, 22050)
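The dictionary comprehension in the WaveGlow loading code removes the `module.` prefix from every checkpoint key. Such a prefix commonly appears when a model is saved while wrapped in `torch.nn.DataParallel`. Note that `str.replace` removes the substring wherever it occurs in a key; the sketch below is a stricter variant that strips it only at the start. The function name `strip_prefix` and the key names are hypothetical.

```python
def strip_prefix(state_dict, prefix="module."):
    # Remove the prefix only where it starts the key; other keys pass through.
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical key names; the values would be tensors in a real checkpoint.
checkpoint = {
    "module.upsample.weight": 1,
    "module.WN.0.start.bias": 2,
    "plain.weight": 3,
}
print(strip_prefix(checkpoint))
```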

Total running time of the script: (1 minute 16.512 seconds)
