使用 Torch-TensorRT 编译混合精度模型¶

显式类型转换¶

考虑以下 PyTorch 模型，该模型显式将中间层转换为 FP16 运行。

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10,10)
        self.linear2 = torch.nn.Linear(10,30).half()
        self.linear3 = torch.nn.Linear(30,40)

    def forward(self, x):
        x = self.linear1(x)
        x = x.to(torch.float16)
        x = self.linear2(x)
        x = x.to(torch.float32)
        x = self.linear3(x)
        return x

如果我们使用 Torch-TensorRT 编译上述模型，层性能分析日志表明所有层都以 FP32 运行。这是因为 TensorRT 会选择能带来最佳性能的层对应的内核。

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
with torch_tensorrt.logging.debug():
    trt_gm = torch_tensorrt.dynamo.compile(ep,
                                        inputs=inputs,
                                        debug=True)

# Debug log info
# Layers:
# Name: __myl_MulSum_myl0_0, LayerType: kgen, Inputs: [ { Name: __mye116_dconst, Dimensions: [10,10], Format/Datatype: Float }, { Name: x, Dimensions: [10,1], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Float }], TacticName: __myl_MulSum_0xfa6c1858aea1b13b03f90165d7149ec6, StreamId: 0, Metadata:
# Name: __myl_AddResMulSum_myl0_1, LayerType: kgen, Inputs: [ { Name: __mye131_dconst, Dimensions: [10,30], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Float }, { Name: linear1/addmm_constant_0 _ linear1/addmm_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,10], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], TacticName: __myl_AddResMulSum_0xb3915d7ebfe48be45b6d49083479e12f, StreamId: 0, Metadata:
# Name: __myl_AddResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye146_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_AddResMulSumAdd_0xcdd0085ad25f5f45ac5fafb72acbffd6, StreamId: 0, Metadata:

为了遵守用户在模型中指定的类型（例如，在本例中，`linear2` 层以 FP16 运行），用户可以启用编译设置 `use_explicit_typing=True`。使用此选项进行编译将产生以下 TensorRT 日志

注意

如果启用 `use_explicit_typing=True`，则 `enabled_precisions` 中仅支持 torch.float32。

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
with torch_tensorrt.logging.debug():
    trt_gm = torch_tensorrt.dynamo.compile(ep,
                                        inputs=inputs,
                                        use_explicit_typing=True,
                                        debug=True)

# Debug log info
# Layers:
# Name: __myl_MulSumAddCas_myl0_0, LayerType: kgen, Inputs: [ { Name: linear1/addmm_constant_0 _ linear1/addmm_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,10], Format/Datatype: Float }, { Name: __mye112_dconst, Dimensions: [10,10], Format/Datatype: Float }, { Name: x, Dimensions: [10,1], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], TacticName: __myl_MulSumAddCas_0xacf8f5dd9be2f3e7bb09cdddeac6c936, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAddCas_myl0_1, LayerType: kgen, Inputs: [ { Name: __mye127_dconst, Dimensions: [10,30], Format/Datatype: Half }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantHalf, Dimensions: [1,30], Format/Datatype: Half }, { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], Outputs: [ { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], TacticName: __myl_ResMulSumAddCas_0x5a3b318b5a1c97b7d5110c0291481337, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye142_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_ResMulSumAdd_0x3fad91127c640fd6db771aa9cde67db0, StreamId: 0, Metadata:

现在，`linear2` 层将以 FP16 运行，如上面的日志所示。

FP32 累加¶

当设置为 `use_fp32_acc=True` 时，Torch-TensorRT 将尝试对 matmul 层使用 FP32 累加，即使输入和输出张量是 FP16。这对于对低精度累加引入的数值误差敏感的模型特别有用。

重要提示

启用 `use_fp32_acc=True` 时，**必须通过设置 `use_explicit_typing=True` 来启用显式类型转换**。如果没有 `use_explicit_typing=True`，累加类型可能无法正确遵守，您可能看不到预期的数值优势。

inputs = [torch.randn((1, 10), dtype=torch.float16).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
with torch_tensorrt.logging.debug():
    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        use_fp32_acc=True,
        use_explicit_typing=True,  # Explicit typing must be enabled
        debug=True
    )

# Debug log info
# Layers:
# Name: __myl_MulSumAddCas_myl0_0, LayerType: kgen, Inputs: [ ... ], Outputs: [ ... ], Format/Datatype: Half, Accumulation: Float
# ...

有关这些设置的更多信息，请参阅上面的显式类型转换示例。

使用 Torch-TensorRT 编译混合精度模型¶

显式类型转换¶

FP32 累加¶

文档

教程

资源