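Deploying Torch-TensorRT Programs

Once a Torch-TensorRT program has been compiled and saved, it no longer has a strict dependency on the full Torch-TensorRT library. Running a compiled program requires only the runtime. You therefore have several ways to deploy your programs besides shipping the full Torch-TensorRT compiler with your application.

Torch-TensorRT package / libtorchtrt.so

Once a program is compiled, you run it with the standard PyTorch APIs. The only requirement is that the package is imported in Python or linked in C++.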

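Runtime Library

The C++ distribution ships with libtorchtrt_runtime.so. This library contains only the components needed to run Torch-TensorRT programs. Instead of linking libtorchtrt.so or importing torch_tensorrt, you can link libtorchtrt_runtime.so into your deployment program, or load it with DL_OPEN (dlopen) or LD_PRELOAD. In Python, you can load the runtime with torch.ops.load_library("libtorchtrt_runtime.so"). You can then keep using your programs through the PyTorch API as usual.

Note

If you link libtorchtrt_runtime.so, the flags -Wl,--no-as-needed -ltorchtrt -Wl,--as-needed will likely help. Most Torch-TensorRT runtime applications have no direct symbol dependency on anything in the Torch-TensorRT runtime, so without these flags the linker may drop the library as unused.

An example of how to use libtorchtrt_runtime.so can be found here: https://github.com/pytorch/TensorRT/tree/master/examples/torchtrt_aoti_example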

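Plugin Library

If you use Torch-TensorRT as a converter to a TensorRT engine and your engine uses plugins provided by Torch-TensorRT, Torch-TensorRT ships the library libtorchtrt_plugins.so, which contains the implementations of the TensorRT plugins that Torch-TensorRT uses during compilation. Like other TensorRT plugin libraries, it can be loaded with DL_OPEN (dlopen) or LD_PRELOAD.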
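From Python, one common way to get the effect of dlopen is `ctypes.CDLL`. The sketch below is a minimal illustration under stated assumptions, not a Torch-TensorRT API: `load_shared_library` is a hypothetical helper name, and it assumes you pass a filesystem path to the library rather than a bare name resolved through the loader search path.

```python
import ctypes
import os


def load_shared_library(path: str) -> ctypes.CDLL:
    """Load a shared library into the current process (hypothetical helper)."""
    # Fail early with a clear message instead of a low-level loader error.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"shared library not found: {path}")
    # RTLD_GLOBAL makes the library's symbols visible to libraries loaded later.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

A call such as `load_shared_library("/path/to/libtorchtrt_plugins.so")` would be made once at startup, before any engine that depends on the plugins is deserialized; the path shown is a placeholder.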

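Multi-Device Safe Mode

Multi-device safe mode is a Torch-TensorRT setting that lets you choose whether the runtime checks device consistency before every inference call.

When multi-device safe mode is enabled, every inference call pays a non-negligible fixed cost, which is why the mode is now disabled by default. It is controlled through the following convenience function, which also works as a context manager.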

# Enables Multi Device Safe Mode
torch_tensorrt.runtime.set_multi_device_safe_mode(True)

# Disables Multi Device Safe Mode [Default Behavior]
torch_tensorrt.runtime.set_multi_device_safe_mode(False)

# Enables Multi Device Safe Mode, then resets the safe mode to its prior setting
with torch_tensorrt.runtime.set_multi_device_safe_mode(True):
    ...
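How a single call can act both as a plain setter and as a context manager can be shown with a small, library-independent sketch. The names `_SAFE_MODE`, `_RestorePrior`, `set_safe_mode`, and `get_safe_mode` are hypothetical and are not Torch-TensorRT's implementation; they only model the behavior described above: the call applies the new value immediately, and leaving a `with` block restores the prior value.

```python
# Hypothetical stand-in for a runtime-wide flag; not part of Torch-TensorRT.
_SAFE_MODE = False


class _RestorePrior:
    """Returned by the setter; restores the previous value on `with` exit."""

    def __init__(self, prior: bool) -> None:
        self._prior = prior

    def __enter__(self) -> "_RestorePrior":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None lets any exception from the block propagate.
        global _SAFE_MODE
        _SAFE_MODE = self._prior


def set_safe_mode(enabled: bool) -> _RestorePrior:
    """Apply the flag immediately and return a handle usable in `with`."""
    global _SAFE_MODE
    prior = _SAFE_MODE
    _SAFE_MODE = enabled
    return _RestorePrior(prior)


def get_safe_mode() -> bool:
    """Read the current value of the flag."""
    return _SAFE_MODE
```

Called as a statement, `set_safe_mode(True)` simply leaves the flag on. Used as `with set_safe_mode(True): ...`, the flag is on inside the block and returns to its earlier value afterwards.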

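TensorRT requires each engine to be associated with the CUDA context of the active thread that invokes it. If the device changes in the active thread, which can happen when engines on multiple GPUs are invoked from the same Python process, safe mode makes Torch-TensorRT display a warning and switch GPUs accordingly. If safe mode is not enabled, the engine device and the CUDA context device can end up mismatched, which may crash the program.

One way to manage multiple TRT engines on different GPUs without paying the performance cost of multi-device safe mode is to use Python threads. Each thread is responsible for all TRT engines on a single GPU, and the default CUDA device of each thread is the GPU it is responsible for (set with torch.cuda.set_device(...)). Multiple threads can then be used in the same Python script without switching CUDA contexts and incurring that overhead.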
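The thread-per-GPU technique can be sketched as follows. This is a minimal illustration, not code from Torch-TensorRT: `DeviceWorker` is a hypothetical class, and the `set_device` and `infer` callables are injected so the sketch stays independent of any GPU library. In a real deployment you would presumably pass `torch.cuda.set_device` as `set_device` and a callable that runs the compiled modules for that GPU as `infer`.

```python
import queue
import threading
from typing import Any, Callable


class DeviceWorker:
    """Runs every request for one GPU on a single dedicated thread."""

    def __init__(
        self,
        device_index: int,
        set_device: Callable[[int], None],
        infer: Callable[[Any], Any],
    ) -> None:
        self._device_index = device_index
        self._set_device = set_device
        self._infer = infer
        self._requests: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        # Select the device once; every later call on this thread reuses it.
        self._set_device(self._device_index)
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            inputs, reply = item
            try:
                reply.put((True, self._infer(inputs)))
            except Exception as exc:  # hand the error back to the caller
                reply.put((False, exc))

    def submit(self, inputs: Any) -> Any:
        """Send one request to the worker thread and wait for its result."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._requests.put((inputs, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self) -> None:
        """Stop the worker thread after pending requests finish."""
        self._requests.put(None)
        self._thread.join()
```

Creating one `DeviceWorker` per GPU and routing each request to the worker that owns the target engine keeps the device fixed per thread, so no per-call device check or switch is needed.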

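CUDA Graphs Mode

CUDA Graphs mode is a Torch-TensorRT setting that lets you choose whether the runtime uses CUDA Graphs to accelerate inference in certain cases.

CUDA Graphs can accelerate certain models by reducing kernel overhead, as documented in more detail [here](https://pytorch.ac.cn/blog/accelerating-pytorch-with-cuda-graphs/).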

# Enables Cudagraphs Mode
torch_tensorrt.runtime.set_cudagraphs_mode(True)

# Disables Cudagraphs Mode [Default Behavior]
torch_tensorrt.runtime.set_cudagraphs_mode(False)

# Enables Cudagraphs Mode, then resets the mode to its prior setting
with torch_tensorrt.runtime.enable_cudagraphs(trt_module):
    ...

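In the current implementation, using a new input shape (for example, in dynamic shape cases) causes the CUDA Graph to be re-recorded. Recording a CUDA Graph is generally not latency-intensive, and planned improvements include caching CUDA Graphs for multiple input shapes.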
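The difference between re-recording on every shape change and caching one recording per shape can be modeled with a toy that involves no CUDA at all. Both classes below are hypothetical and only count how often a recording would happen; they do not reflect how Torch-TensorRT stores or replays graphs.

```python
from typing import Optional, Set, Tuple

Shape = Tuple[int, ...]


class SingleGraphRunner:
    """Keeps one recording; a new input shape forces a re-record."""

    def __init__(self) -> None:
        self.recorded_shape: Optional[Shape] = None
        self.record_count = 0

    def run(self, shape: Shape) -> None:
        if shape != self.recorded_shape:
            self.recorded_shape = shape  # stands in for graph capture
            self.record_count += 1
        # Replay of the recorded graph would happen here.


class PerShapeCacheRunner:
    """Keeps one recording per shape, so revisited shapes only replay."""

    def __init__(self) -> None:
        self.recorded_shapes: Set[Shape] = set()
        self.record_count = 0

    def run(self, shape: Shape) -> None:
        if shape not in self.recorded_shapes:
            self.recorded_shapes.add(shape)  # stands in for graph capture
            self.record_count += 1
        # Replay of the recording for this shape would happen here.


shapes = [(8, 10), (8, 10), (1, 10), (8, 10)]
single, cached = SingleGraphRunner(), PerShapeCacheRunner()
for s in shapes:
    single.run(s)
    cached.run(s)
print(single.record_count, cached.record_count)  # prints "3 2"
```

With a workload that alternates between a few shapes, the single-recording model records on every switch, while the per-shape cache records each shape once.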

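Dynamic Output Allocation Mode

Dynamic output allocation is a Torch-TensorRT feature that lets the output buffers of TensorRT engines be allocated dynamically. It is useful for models with dynamic output shapes, especially operations whose output shape depends on the input data. Dynamic output allocation mode cannot be combined with CUDA Graphs or with the pre-allocated outputs feature. Without dynamic output allocation, the output buffer is allocated from the output shape inferred from the input size.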
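What "data-dependent shape" means can be seen in a plain-Python analogue of a nonzero operation. The helper `nonzero_indices` is a hypothetical illustration, not a Torch-TensorRT or PyTorch function: two inputs of the same size yield outputs of different lengths, so the output size cannot be inferred from the input size alone.

```python
from typing import List


def nonzero_indices(values: List[float]) -> List[int]:
    """Return the positions of non-zero entries (1-D analogue of nonzero)."""
    return [i for i, v in enumerate(values) if v != 0]


a = [0.0, 3.0, 0.0, 5.0]
b = [1.0, 2.0, 3.0, 0.0]
# Same input length, different output lengths.
print(len(nonzero_indices(a)), len(nonzero_indices(b)))  # prints "2 3"
```

For operations like this, a buffer sized from the input shape is either wasteful or wrong, which is the case dynamic output allocation addresses.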

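Dynamic output allocation is enabled in two scenarios:

1. At compile time, the model is identified as requiring dynamic output allocation for at least one TensorRT subgraph. Such models engage this runtime mode automatically (with a log message) and are incompatible with other runtime modes such as CUDA Graphs.

Converters can declare that the subgraphs they produce require the output allocator by setting requires_output_allocator=True, which forces any model that uses the converter to run in the output allocator runtime mode automatically. For example: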

@dynamo_tensorrt_converter(
    torch.ops.aten.nonzero.default,
    supports_dynamic_shapes=True,
    requires_output_allocator=True,
)
def aten_ops_nonzero(
    ctx: ConversionContext,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    ...

2. Users can manually enable dynamic output allocation mode with the torch_tensorrt.runtime.enable_output_allocator context manager.

# Enables Dynamic Output Allocation Mode, then resets the mode to its prior setting
with torch_tensorrt.runtime.enable_output_allocator(trt_module):
    ...

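Deploying Torch-TensorRT Programs without Python

AOT-Inductor

AOTInductor is a specialized version of TorchInductor designed to process exported PyTorch models, optimize them, and produce shared libraries along with other related artifacts. These compiled artifacts are built for deployment in non-Python environments, which are common for server-side inference.

Torch-TensorRT can accelerate subgraphs within AOTInductor exports in the same way it does in Python.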

dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
torch_tensorrt.save(
    dynamo_model,
    file_path=os.path.join(os.getcwd(), "model.pt2"),
    output_format="aot_inductor",
    retrace=True,
    arg_inputs=[...],
)

The artifact can then be loaded and executed in a C++ application with no Python dependency.

#include <iostream>
#include <string>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {
    // Default package path; the first command-line argument overrides it
    std::string trt_aoti_module_path = "model.pt2";

    if (argc == 2) {
        trt_aoti_module_path = argv[1];
    }

    std::cout << trt_aoti_module_path << std::endl;

    // Run everything below in inference mode
    c10::InferenceMode mode;

    // Load the TRT AOTI model package
    torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
    // Assume running on CUDA
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
    std::vector<torch::Tensor> outputs = loader.run(inputs);
    std::cout << "Result from the first inference:" << std::endl;
    std::cout << outputs << std::endl;

    // The second inference uses a different batch size and it works because we
    // specified that dimension as dynamic when compiling model.pt2.
    std::cout << "Result from the second inference:" << std::endl;
    // Assume running on CUDA
    std::cout << loader.run({torch::randn({1, 10}, at::kCUDA)}) << std::endl;

    return 0;
}

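Note: As in Python, no Torch-TensorRT APIs are called at runtime to operate the model. Additional linker flags are therefore typically required to ensure that libtorchtrt_runtime.so is not dropped by the linker as an unused dependency (see the note on -Wl,--no-as-needed above).

See //examples/torchtrt_aoti_example for a complete end-to-end demo of this workflow.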

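TorchScript

TorchScript is a legacy compiler stack for PyTorch that includes a Python-free interpreter for TorchScript programs. Torch-TensorRT has historically used it to execute models without Python. Even after the transition to TorchDynamo, the TorchScript interpreter can still be used to run PyTorch models containing TensorRT engines outside of Python.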

dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, example_inputs=[...])
torch.jit.save(ts_model, os.path.join(os.getcwd(), "model.ts"))