PyTorch 2 Export Quantization with X86 Backend through Inductor¶

Author: Leslie Fang, Weiwen Xia, Jiong Gong, Jerry Zhang

Prerequisites¶

Introduction¶

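This tutorial walks through the steps for using the PyTorch 2 Export Quantization flow to generate a quantized model customized for the x86 Inductor backend, and explains how to lower the quantized model into Inductor.

The PyTorch 2 Export Quantization flow uses torch.export to capture the model into a graph and performs quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage, better programmability, and a simplified user experience. TorchInductor is the new compiler backend that compiles the FX graphs generated by TorchDynamo into optimized C++/Triton kernels.

This flow of PyTorch 2 Export Quantization with Inductor supports both static and dynamic quantization. Static quantization works best for CNN models such as ResNet-50, while dynamic quantization is better suited to NLP models such as RNN and BERT. For the differences between the two quantization types, see the PyTorch quantization documentation.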
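
To make the static/dynamic distinction concrete, here is a minimal pure-Python sketch. It is not the code path used by X86InductorQuantizer; the helper names (`affine_params`, `quantize`, `dequantize`, `quantize_dynamic`) and the sample values are assumptions made for illustration. Static quantization fixes the scale and zero point ahead of time from calibration data, while dynamic quantization recomputes them from each input at run time.

```python
def affine_params(lo, hi, qmin=0, qmax=255):
    """Return (scale, zero_point) mapping the range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers on the quantization grid, clamping out-of-range values."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(q - zero_point) * scale for q in qvalues]


def quantize_dynamic(values):
    """Dynamic style: derive the parameters from the input itself."""
    params = affine_params(min(values), max(values))
    return quantize(values, *params), params


# Static style: parameters are fixed once, from (hypothetical) calibration data.
calibration = [-1.0, 0.5, 2.0]
static_params = affine_params(min(calibration), max(calibration))

new_input = [-0.5, 0.0, 4.0]  # 4.0 lies outside the calibrated range
static_back = dequantize(quantize(new_input, *static_params), *static_params)
dynamic_q, dynamic_params = quantize_dynamic(new_input)
dynamic_back = dequantize(dynamic_q, *dynamic_params)

print(static_back)   # the last value is clipped to the calibrated maximum
print(dynamic_back)  # the last value stays close to 4.0
```

The static path pays no run-time cost for computing ranges but clips inputs that fall outside what calibration saw; the dynamic path adapts to each input at the cost of extra work per call.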

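The quantization flow consists of three main steps:

Step 1: Capture the FX graph from the eager model based on the torch export mechanism.
Step 2: Apply the quantization flow to the captured FX graph: define the backend-specific quantizer, generate the prepared model with observers, calibrate the prepared model or run quantization-aware training on it, and convert the prepared model into the quantized model.
Step 3: Lower the quantized model into Inductor with the torch.compile API.

The high-level architecture of this flow looks like this: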

float_model(Python)                          Example Input
    \                                              /
     \                                            /
---------------------------------------------------------
|                         export                        |
---------------------------------------------------------
                            |
                    FX Graph in ATen
                            |            X86InductorQuantizer
                            |                 /
---------------------------------------------------------
|                      prepare_pt2e                     |
|                           |                           |
|                     Calibrate/Train                   |
|                           |                           |
|                      convert_pt2e                     |
---------------------------------------------------------
                            |
                     Quantized Model
                            |
---------------------------------------------------------
|                    Lower into Inductor                |
---------------------------------------------------------
                            |
                         Inductor

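By combining quantization in PyTorch 2 Export with TorchInductor, we get flexibility and productivity from the new quantization frontend and outstanding out-of-the-box performance from the compiler backend. In particular, on Intel 4th-generation Xeon Scalable processors (SPR, Sapphire Rapids), model performance can be improved further by leveraging Advanced Matrix Extensions (AMX).

Post Training Quantization¶

Now we will walk you through a step-by-step tutorial on how to use it with the torchvision resnet18 model for post-training quantization.

1. Capture FX Graph¶

We will start by performing the necessary imports and capturing the FX graph from the eager module.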

import torch
import torchvision.models as models
import copy
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer
from torch.export import export

# Create the Eager Model
model_name = "resnet18"
model = models.__dict__[model_name](pretrained=True)

# Set the model to eval mode
model = model.eval()

# Create the data, using the dummy data here as an example
traced_bs = 50
x = torch.randn(traced_bs, 3, 224, 224).contiguous(memory_format=torch.channels_last)
example_inputs = (x,)

# Capture the FX Graph to be quantized
with torch.no_grad():
    # Note: requires torch >= 2.6
    exported_model = export(
        model,
        example_inputs
    ).module()

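Next, we will quantize the FX Module.

2. Apply Quantization¶

After capturing the FX Module to be quantized, we import the backend quantizer for X86 CPU and configure how to quantize the model.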

quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

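Note

The default quantization configuration in X86InductorQuantizer uses 8 bits for both activations and weights.

When Vector Neural Network Instructions (VNNI) are not available, the oneDNN backend silently falls back to kernels that assume 7-bit x 8-bit multiplication. In other words, numeric saturation and accuracy issues may occur when running on a CPU without VNNI.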
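
The arithmetic behind the "7-bit x 8-bit" remark can be sketched in plain Python. This is an illustration rather than oneDNN code, and it assumes the non-VNNI kernels add adjacent unsigned-by-signed byte products into a saturating 16-bit intermediate, as pre-VNNI multiply-add instructions of the `pmaddubsw` family do. The function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def pair_dot_int16(a_u8, w_s8):
    """Multiply two (unsigned activation, signed weight) byte pairs and add them.

    Returns (saturated 16-bit result, exact result).
    """
    exact = a_u8[0] * w_s8[0] + a_u8[1] * w_s8[1]
    return saturate_int16(exact), exact


# Full 8-bit activations: the pairwise sum overflows int16 and is clamped.
full_range, exact_full = pair_dot_int16((255, 255), (127, 127))

# Activations limited to 7 bits (at most 127): the sum always fits.
seven_bit, exact_seven = pair_dot_int16((127, 127), (127, 127))

print(full_range, exact_full)    # saturated value differs from the exact sum
print(seven_bit, exact_seven)    # both values agree
```

Under this assumption, keeping activations within 7 bits guarantees that the intermediate sum stays inside the 16-bit range, which is why full 8-bit activations can lose accuracy on such hardware.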

The quantization config targets static quantization by default. To apply dynamic quantization, add the argument is_dynamic=True when getting the config.

quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_dynamic=True))

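After importing the backend-specific quantizer, we prepare the model for post-training quantization. prepare_pt2e folds BatchNorm operators into the preceding Conv2d operators and inserts observers at the appropriate places in the model.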

prepared_model = prepare_pt2e(exported_model, quantizer)
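
The BatchNorm folding mentioned above rests on a simple identity, sketched here in pure Python for a single output channel. This is the standard algebra rather than the prepare_pt2e implementation; the function names and numbers are assumptions for illustration, and a dot product stands in for one output position of a convolution.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BatchNorm parameters of one output channel into conv weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    folded_weight = [w * factor for w in weight]
    folded_bias = (bias - mean) * factor + beta
    return folded_weight, folded_bias


def conv_point(x, weight, bias):
    """Dot product standing in for one output position of a convolution."""
    return sum(xi * wi for xi, wi in zip(x, weight)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm applied to a single value."""
    return (y - mean) / math.sqrt(var + eps) * gamma + beta


x = [0.5, -1.0, 2.0]
weight, bias = [0.2, 0.4, -0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.8

# Reference: convolution followed by BatchNorm.
reference = batch_norm(conv_point(x, weight, bias), gamma, beta, mean, var)

# Folded: a single convolution with adjusted weight and bias.
folded_w, folded_b = fold_bn(weight, bias, gamma, beta, mean, var)
folded = conv_point(x, folded_w, folded_b)

print(reference, folded)  # the two agree up to floating-point rounding
```

Folding removes the separate normalization step, so only the convolution needs to be quantized.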

Now that the observers have been inserted into the model, we calibrate prepared_model. This step is needed only for static quantization.

# We use the dummy data as an example here
prepared_model(*example_inputs)

# Alternatively: user can define the dataset to calibrate
# def calibrate(model, data_loader):
#     model.eval()
#     with torch.no_grad():
#         for image, target in data_loader:
#             model(image)
# calibrate(prepared_model, data_loader_test)  # run calibration on sample data
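
What an observer does during calibration can be sketched as follows. This is a simplified min/max tracker written in pure Python; `RangeObserver` is a hypothetical name, and the observers actually inserted by the quantizer may use a different strategy (for example, histogram-based range selection).

```python
class RangeObserver:
    """Hypothetical observer: records the value range of everything it sees."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        """Update the running range and pass the data through unchanged."""
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))
        return values

    def qparams(self, qmin=0, qmax=255):
        """Turn the observed range into a scale and zero point."""
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
        return scale, zero_point


observer = RangeObserver()
# Each list stands in for the activations produced by one calibration batch.
for batch in ([0.1, 0.9, -0.3], [2.5, -1.0], [0.0, 1.7]):
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(observer.lo, observer.hi, scale, zero_point)
```

The more representative the calibration batches are of real inputs, the better the recorded range matches what the quantized model will see at inference time.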

Finally, we convert the calibrated model into a quantized model. convert_pt2e takes a calibrated model and produces a quantized model.

converted_model = convert_pt2e(prepared_model)

With these steps done, we have finished running the quantization flow and obtained the quantized model.
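
As a rough picture of the integer computation a quantized model performs, the sketch below quantizes an activation vector and a weight vector, accumulates their product in integers, and rescales the result to floating point. It is a pure-Python illustration with assumed names and values, not the kernels that Inductor generates.

```python
def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to clamped integers on the quantization grid."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def quantized_dot(x_q, x_scale, x_zp, w_q, w_scale):
    """Integer dot product with a symmetric weight (zero point 0), rescaled to float."""
    acc = sum((xq - x_zp) * wq for xq, wq in zip(x_q, w_q))  # integer accumulation
    return acc * x_scale * w_scale


x = [0.5, -0.25, 1.0]
w = [0.8, -0.3, 0.1]

x_scale, x_zp = 1.25 / 255, 51  # uint8 activation covering [-0.25, 1.0]
w_scale = 0.8 / 127             # int8 weight, symmetric around zero

x_q = quantize(x, x_scale, x_zp, 0, 255)
w_q = quantize(w, w_scale, 0, -128, 127)

approx = quantized_dot(x_q, x_scale, x_zp, w_q, w_scale)
exact = sum(a * b for a, b in zip(x, w))

print(x_q, w_q)
print(approx, exact)  # close, but not identical
```

The small gap between the two results is the quantization error that calibration tries to keep low.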

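3. Lower into Inductor¶

After we get the quantized model, we lower it further into the Inductor backend. The default Inductor wrapper generates Python code to invoke both generated kernels and external kernels. Inductor also supports a C++ wrapper that generates pure C++ code, which integrates the generated and external kernels seamlessly and reduces Python overhead. In the future, the C++ wrapper could be extended to support pure C++ deployment. For more details about the C++ wrapper, refer to the dedicated Inductor C++ Wrapper tutorial.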

# Optional: using the C++ wrapper instead of default Python wrapper
import torch._inductor.config as config
config.cpp_wrapper = True

with torch.no_grad():
    optimized_model = torch.compile(converted_model)

    # Running some benchmark
    optimized_model(*example_inputs)

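In a more advanced scenario, int8-mixed-bf16 quantization comes into play. Here, a convolution or GEMM operator produces BFloat16 output instead of Float32 when there is no subsequent quantization node. The BFloat16 tensor then propagates through the subsequent pointwise operators, reducing memory usage and potentially improving performance. Using this feature mirrors regular BFloat16 Autocast: simply wrap the script in a BFloat16 Autocast context.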

with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
    # Turn on Autocast to use int8-mixed-bf16 quantization. After lowering into Inductor CPP Backend,
    # For operators such as QConvolution and QLinear:
    # * The input data type is consistently defined as int8, attributable to the presence of a pair
    #   of quantization and dequantization nodes inserted at the input.
    # * The computation precision remains at int8.
    # * The output data type may vary, being either int8 or BFloat16, contingent on the presence
    #   of a pair of quantization and dequantization nodes at the output.
    # For non-quantizable pointwise operators, the data type will be inherited from the previous node,
    # potentially resulting in a data type of BFloat16 in this scenario.
    # For quantizable pointwise operators such as QMaxpool2D, it continues to operate with the int8
    # data type for both input and output.
    optimized_model = torch.compile(converted_model)

    # Running some benchmark
    optimized_model(*example_inputs)
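
To see why BFloat16 halves memory use relative to Float32, recall that a BFloat16 value keeps only the upper 16 bits of the Float32 bit pattern. The pure-Python sketch below rounds a number to the nearest BFloat16 value; `to_bfloat16` is a hypothetical helper for illustration and does not handle NaN or infinity.

```python
import struct


def to_bfloat16(value):
    """Round a float to the nearest BFloat16 value (round-to-nearest-even)."""
    # Reinterpret the float32 encoding of `value` as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Add a rounding bias, then drop the lower 16 bits of the mantissa.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for v in (1.0, 3.14159, 0.1):
    print(v, to_bfloat16(v))
```

Values keep the Float32 exponent range but only about three significant decimal digits, which is usually sufficient for the pointwise operators that follow a quantized convolution or GEMM.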

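Putting all of this code together gives us the toy example code. Note that because Inductor's freezing feature is not yet turned on by default, you need to run the example code with TORCHINDUCTOR_FREEZING=1.

For example: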

TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_pytorch_2_1.py
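
When the shell environment is hard to control (for example, in a notebook), the variable can also be set from Python. This is a minimal sketch that assumes Inductor reads TORCHINDUCTOR_FREEZING when its configuration is first imported, so the assignment has to happen before torch is imported in the same process.

```python
import os

# Set the variable before importing torch, unless the caller already chose a value.
os.environ.setdefault("TORCHINDUCTOR_FREEZING", "1")

# True when freezing has been requested for this process.
freezing_requested = os.environ["TORCHINDUCTOR_FREEZING"] == "1"
print(freezing_requested)

# import torch  # import torch only after the variable is set
```

Using `setdefault` leaves an explicit setting from the shell untouched, so the command-line form shown above still takes precedence.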

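With the PyTorch 2.1 release, all CNN models from the TorchBench test suite have been measured and shown to be effective compared with the Inductor FP32 inference path. Detailed benchmark numbers are given in the accompanying benchmark document.

Quantization Aware Training¶

PyTorch 2 Export Quantization-Aware Training (QAT) is now supported on X86 CPU through X86InductorQuantizer, followed by lowering the quantized model into Inductor. For a more in-depth understanding of PT2 Export QAT, we recommend the dedicated PyTorch 2 Export Quantization-Aware Training tutorial.

The PyTorch 2 Export QAT flow is largely similar to the PTQ flow.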

import torch
import torchao  # binds the `torchao` name used by move_exported_model_to_eval below
from torchao.quantization.pt2e.quantize_pt2e import (
  prepare_qat_pt2e,
  convert_pt2e,
)
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1000)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 1024),)
m = M()

# Step 1. program capture
exported_model = torch.export.export(m, example_inputs).module()
# we get a model with aten ops

# Step 2. quantization-aware training
# Use Backend Quantizer for X86 CPU
# To apply dynamic quantization, add an argument ``is_dynamic=True`` when getting the config.
quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_qat=True))
prepared_model = prepare_qat_pt2e(exported_model, quantizer)

# train omitted

converted_model = convert_pt2e(prepared_model)
# we have a model with aten ops doing integer computations when possible

# move the quantized model to eval mode, equivalent to `m.eval()`
torchao.quantization.pt2e.move_exported_model_to_eval(converted_model)

# Lower the model into Inductor
with torch.no_grad():
    optimized_model = torch.compile(converted_model)
    _ = optimized_model(*example_inputs)
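
During the omitted training step, QAT simulates quantization in the forward pass so that the model learns weights that tolerate it. The pure-Python sketch below shows only that simulation ("fake quantization": quantize, then immediately dequantize); the function name and values are assumptions for illustration, and gradient handling is left out.

```python
def fake_quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Quantize and immediately dequantize, keeping the values in floating point."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


weights = [0.30, -0.72, 0.05, 1.50]
scale = 1.0 / 127  # symmetric int8 grid covering roughly [-1, 1]

simulated = fake_quantize(weights, scale, zero_point=0)
errors = [abs(a - b) for a, b in zip(weights, simulated)]

print(simulated)
print(errors)  # in-range values move by at most scale / 2; 1.50 is clipped to 1.0
```

Because the training loss is computed on the simulated values, both the rounding error and the clipping error shown here influence how the weights are updated.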

Note that Inductor's freezing feature is not enabled by default. To use it, run the example code with TORCHINDUCTOR_FREEZING=1.

For example:

TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_qat.py

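Conclusion¶

In this tutorial, we introduced how to use Inductor with X86 CPU in PyTorch 2 Export Quantization. Users can learn how to use X86InductorQuantizer to quantize a model and lower it into Inductor on X86 CPU devices.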