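Compiling GPT2 using the Torch-TensorRT `torch.compile` frontend

This example illustrates how to optimize the state-of-the-art GPT2 model using the torch.compile frontend of Torch-TensorRT. Install the following dependencies before compiling.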

pip install -r requirements.txt

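GPT2 is a causal (unidirectional) transformer pretrained with a language-modeling objective on a very large corpus of text data. In this example, we use the GPT2 model available on Hugging Face and apply torch.compile to it to obtain a graph module representation of the model. Torch-TensorRT converts this graph into an optimized TensorRT engine.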

Import necessary libraries

import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM, AutoTokenizer

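Define the necessary parameters

Torch-TensorRT requires a GPU to compile the model successfully. MAX_LENGTH is the maximum length of the generated sequence in tokens: the length of the input prompt plus the number of newly generated tokens.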

MAX_LENGTH = 32
DEVICE = torch.device("cuda:0")
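
As a quick illustration of how MAX_LENGTH splits between the prompt and the newly generated tokens, here is a minimal pure-Python sketch. The helper name `new_token_budget` and the prompt length of 7 are assumptions made for illustration; the real prompt length depends on the tokenizer.

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Return how many new tokens fit once the prompt is accounted for."""
    # max_length counts prompt tokens and generated tokens together,
    # so the budget for new tokens is whatever the prompt leaves over.
    return max(0, max_length - prompt_len)


MAX_LENGTH = 32
ASSUMED_PROMPT_LEN = 7  # hypothetical value, not measured with the GPT2 tokenizer

budget = new_token_budget(ASSUMED_PROMPT_LEN, MAX_LENGTH)
print(f"Up to {budget} new tokens can be generated")
```

A prompt that is already as long as MAX_LENGTH leaves a budget of zero, which is why the helper clamps at 0.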

Model definition

We use the AutoModelForCausalLM class to load the pretrained GPT2 model from Hugging Face. Torch-TRT does not currently support kv_cache, so we set use_cache=False.

with torch.no_grad():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = (
        AutoModelForCausalLM.from_pretrained(
            "gpt2",
            pad_token_id=tokenizer.eos_token_id,
            use_cache=False,
            attn_implementation="eager",
        )
        .to(DEVICE)
        .eval()
    )

PyTorch inference

Tokenize a sample input prompt and get the PyTorch model outputs.

prompt = "I enjoy walking with my cute dog"
model_inputs = tokenizer(prompt, return_tensors="pt")
input_ids = model_inputs["input_ids"].to(DEVICE)

The generate() API of the AutoModelForCausalLM class is used for auto-regressive generation with greedy decoding.

with torch.no_grad():
    pyt_gen_tokens = model.generate(
        input_ids,
        max_length=MAX_LENGTH,
        use_cache=False,
        pad_token_id=tokenizer.eos_token_id,
    )
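
To make the greedy, cache-free decoding loop concrete, the following is a simplified pure-Python sketch of the idea. It is not the Hugging Face implementation: `greedy_generate` and `toy_logits` are hypothetical names, and the toy model stands in for GPT2 so the snippet runs without a GPU.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    max_length: int,
    eos_token_id: int,
) -> List[int]:
    """Append the highest-scoring token until max_length or EOS is reached."""
    tokens = list(input_ids)
    while len(tokens) < max_length:
        # Without a KV cache, the whole sequence is fed to the model each step.
        logits = next_token_logits(tokens)
        # Greedy decoding: pick the index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_token_id:
            break
    return tokens


def toy_logits(tokens: Sequence[int]) -> List[float]:
    """Hypothetical stand-in model over a 5-token vocabulary."""
    target = (tokens[-1] + 1) % 5  # always prefers "last token + 1"
    return [1.0 if i == target else 0.0 for i in range(5)]


generated = greedy_generate(toy_logits, [0, 1], max_length=6, eos_token_id=4)
print(generated)
```

Because every step re-feeds the full sequence, the input length grows by one token per step, which is what makes the sequence dimension dynamic during compilation.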

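Torch-TensorRT compilation and inference

The input sequence length is dynamic, so we mark it with the torch._dynamo.mark_dynamic API. We provide a (min, max) range for this dimension so that TensorRT knows in advance which values to optimize for. Usually, the upper bound would be the context length of the model. We start at min=2 because of 0/1 specialization: PyTorch treats sizes 0 and 1 as constants rather than as symbolic values.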

torch._dynamo.mark_dynamic(input_ids, 1, min=2, max=1023)
model.forward = torch.compile(
    model.forward,
    backend="tensorrt",
    dynamic=None,
    options={
        "enabled_precisions": {torch.float32},
        "disable_tf32": True,
        "min_block_size": 1,
    },
)
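
As a sanity check on the (min, max) range, one can enumerate the sequence lengths that the forward pass would see during generation and confirm that they fall inside the marked range. This is a minimal pure-Python sketch; the helper names and the prompt length of 7 are assumptions for illustration, and it does not call any Torch-TensorRT API.

```python
from typing import List


def forward_lengths(prompt_len: int, max_length: int) -> List[int]:
    """Sequence lengths fed to forward() when no KV cache is used."""
    # Each step re-feeds the whole sequence, so the input grows by one
    # token per step until the sequence reaches max_length.
    return list(range(prompt_len, max_length))


def fits_dynamic_range(lengths: List[int], min_len: int = 2, max_len: int = 1023) -> bool:
    """Check that every length lies inside the range passed to mark_dynamic."""
    return all(min_len <= n <= max_len for n in lengths)


ASSUMED_PROMPT_LEN = 7  # hypothetical value, not measured with the GPT2 tokenizer
lengths = forward_lengths(ASSUMED_PROMPT_LEN, 32)
ok = fits_dynamic_range(lengths)

# A one-token prompt would fall below min=2 on its first forward pass.
single_token_ok = fits_dynamic_range(forward_lengths(1, 32))
print(ok, single_token_ok)
```

A length outside the marked range would not be covered by the shape range the engine was built for, so checking the bounds up front can save a confusing failure later.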

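Auto-regressive generation loop for greedy decoding using the TensorRT model. Generating the first token compiles the model with TensorRT, and generating the second token triggers a recompilation (this is a current issue that will be resolved in the future).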

with torch.no_grad():
    trt_gen_tokens = model.generate(
        inputs=input_ids,
        max_length=MAX_LENGTH,
        use_cache=False,
        pad_token_id=tokenizer.eos_token_id,
    )

Decode the output sentences of PyTorch and TensorRT

print(
    "Pytorch model generated text: ",
    tokenizer.decode(pyt_gen_tokens[0], skip_special_tokens=True),
)
print("=============================")
print(
    "TensorRT model generated text: ",
    tokenizer.decode(trt_gen_tokens[0], skip_special_tokens=True),
)

The output sentences should look like this:

"""
Pytorch model generated text:  I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll
=============================
TensorRT model generated text:  I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll
"""
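
Beyond comparing the decoded text by eye, the two token sequences can be compared position by position. The helper below is a hypothetical sketch that works on plain Python lists; the sample token IDs are made up for illustration and are not real GPT2 outputs.

```python
from typing import Optional, Sequence


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence may be a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Made-up token IDs standing in for the PyTorch and TensorRT outputs.
same = first_mismatch([10, 20, 30, 40], [10, 20, 30, 40])
diverged = first_mismatch([10, 20, 30, 40], [10, 20, 31, 40])
print(same, diverged)
```

With the tensors from this example, the call would be `first_mismatch(pyt_gen_tokens[0].tolist(), trt_gen_tokens[0].tolist())`. A result of `None` means both backends generated identical tokens; small numerical differences between backends can otherwise cause greedy decoding to diverge at some position.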
