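(prototype) PyTorch BackendConfig Tutorial#

Author: Andrew Or

The BackendConfig API enables developers to integrate their backends with PyTorch quantization. It currently supports only FX graph mode quantization, but support may be extended to other quantization modes in the future. In this tutorial, we demonstrate how to use this API to customize quantization support for a specific backend. For more information on the motivation and implementation details behind BackendConfig, please refer to this README.

Suppose we are backend developers and wish to integrate our backend with PyTorch's quantization APIs. Our backend consists of only two ops: quantized linear and quantized conv-relu. In this section, we walk through how to achieve this by quantizing an example model with a custom BackendConfig passed to prepare_fx and convert_fx.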

import torch
from torch.ao.quantization import (
    default_weight_observer,
    get_default_qconfig_mapping,
    MinMaxObserver,
    QConfig,
    QConfigMapping,
)
from torch.ao.quantization.backend_config import (
    BackendConfig,
    BackendPatternConfig,
    DTypeConfig,
    DTypeWithConstraints,
    ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

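1. Derive reference pattern for each quantized operator#

For quantized linear, suppose our backend expects the reference pattern [dequant - fp32_linear - quant] and lowers it into a single quantized linear op. To achieve this, we first insert quant-dequant ops before and after the float linear op, which produces the following reference model:

quant1 - [dequant1 - fp32_linear - quant2] - dequant2

Similarly, for quantized conv-relu, we wish to produce the following reference model, where the reference pattern in the square brackets will be lowered into a single quantized conv-relu op: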

quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2
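To make the reference model concrete, here is a minimal pure-Python sketch of the arithmetic behind `quant1 - [dequant1 - fp32_linear - quant2] - dequant2`, using scalar per-tensor affine quantization. The helper names, the scalar stand-in for the linear op, and the scale and zero-point values are all assumptions chosen for illustration; they are not part of the BackendConfig API.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine-quantize a float to an integer clamped to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


def fp32_linear(x, weight=0.5, bias=0.25):
    """Scalar stand-in for a float linear op (hypothetical weights)."""
    return weight * x + bias


# Hypothetical quantization parameters for quant1 (input) and quant2 (output).
in_scale, in_zp = 0.02, 0
out_scale, out_zp = 0.01, 0

x = 1.0
q1 = quantize(x, in_scale, in_zp)                 # quant1
y = fp32_linear(dequantize(q1, in_scale, in_zp))  # dequant1 - fp32_linear
q2 = quantize(y, out_scale, out_zp)               # quant2
out = dequantize(q2, out_scale, out_zp)           # dequant2
print(q1, q2, out)
```

A backend that lowers the bracketed pattern would consume `q1` and produce `q2` directly with integer arithmetic, without materializing the intermediate floats.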

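2. Set DTypeConfigs with backend constraints#

In the reference patterns above, the input dtype specified in the DTypeConfig will be passed as the dtype argument to quant1, while the output dtype will be passed as the dtype argument to quant2. If the output dtype is fp32, as in the case of dynamic quantization, then the output quant-dequant pair will not be inserted. This example also shows how to specify restrictions on the quantization and scale ranges for a particular dtype.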

quint8_with_constraints = DTypeWithConstraints(
    dtype=torch.quint8,
    quant_min_lower_bound=0,
    quant_max_upper_bound=255,
    scale_min_lower_bound=2 ** -12,
)

# Specify the dtypes passed to the quantized ops in the reference model spec
weighted_int8_dtype_config = DTypeConfig(
    input_dtype=quint8_with_constraints,
    output_dtype=quint8_with_constraints,
    weight_dtype=torch.qint8,
    bias_dtype=torch.float)
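The scale bound is easier to read next to the way a scale is derived. The sketch below is a simplified pure-Python model of how a min/max observer might compute per-tensor affine quantization parameters, with `eps` acting as a floor on the scale. The function name `minmax_qparams` is hypothetical and the logic is an assumption about the observer, not its actual implementation.

```python
def minmax_qparams(min_val, max_val, quant_min, quant_max, eps):
    """Derive (scale, zero_point) from an observed float range."""
    # Widen the range to include zero so that 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (quant_max - quant_min)
    scale = max(scale, eps)  # eps is the smallest scale the observer can emit
    zero_point = quant_min - round(min_val / scale)
    zero_point = max(quant_min, min(quant_max, zero_point))
    return scale, zero_point


# A wide activation range yields a scale well above the floor.
wide_scale, wide_zp = minmax_qparams(-1.0, 1.55, 0, 255, eps=2 ** -12)

# A very narrow range would yield a tiny scale, so the floor takes over.
tiny_scale, tiny_zp = minmax_qparams(0.0, 0.001, 0, 255, eps=2 ** -12)
print(wide_scale, wide_zp, tiny_scale, tiny_zp)
```

Under this model, requiring the observer's `eps` to be at least `scale_min_lower_bound` guarantees that the backend never receives a scale below `2 ** -12`.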

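3. Set up fusion for conv-relu#

Note that the original user model contains separate conv and relu ops, so we need to first fuse the conv and relu ops into a single conv-relu op (fp32_conv_relu), and then quantize this op in the same way as the linear op. We can set up fusion by defining a function that accepts 3 arguments, where the first is whether or not this is for QAT, and the remaining arguments refer to the individual items of the fused pattern.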

def fuse_conv2d_relu(is_qat, conv, relu):
    """Return a fused ConvReLU2d from individual conv and relu modules."""
    return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)

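4. Define the BackendConfig#

Now that we have all the necessary pieces, we go ahead and define our BackendConfig. Here we use different observers (will be renamed) for the input and output of the linear op, so the quantization params passed to the two quantize ops (quant1 and quant2) will be different. This is commonly the case for weighted ops like linear and conv.

For the conv-relu op, the observation type is the same. However, we need two BackendPatternConfigs to support this op, one for fusion and one for quantization. For both conv-relu and linear, we use the DTypeConfig defined above.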

linear_config = BackendPatternConfig() \
    .set_pattern(torch.nn.Linear) \
    .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
    .add_dtype_config(weighted_int8_dtype_config) \
    .set_root_module(torch.nn.Linear) \
    .set_qat_module(torch.ao.nn.qat.Linear) \
    .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)

# For fusing Conv2d + ReLU into ConvReLU2d
# No need to set observation type and dtype config here, since we are not
# inserting quant-dequant ops in this step yet
conv_relu_config = BackendPatternConfig() \
    .set_pattern((torch.nn.Conv2d, torch.nn.ReLU)) \
    .set_fused_module(torch.ao.nn.intrinsic.ConvReLU2d) \
    .set_fuser_method(fuse_conv2d_relu)

# For quantizing ConvReLU2d
fused_conv_relu_config = BackendPatternConfig() \
    .set_pattern(torch.ao.nn.intrinsic.ConvReLU2d) \
    .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
    .add_dtype_config(weighted_int8_dtype_config) \
    .set_root_module(torch.nn.Conv2d) \
    .set_qat_module(torch.ao.nn.intrinsic.qat.ConvReLU2d) \
    .set_reference_quantized_module(torch.ao.nn.quantized.reference.Conv2d)

backend_config = BackendConfig("my_backend") \
    .set_backend_pattern_config(linear_config) \
    .set_backend_pattern_config(conv_relu_config) \
    .set_backend_pattern_config(fused_conv_relu_config)
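One way to picture the three pattern configs is as a lookup table keyed by module pattern. The toy matcher below is only an illustration under that assumption: it works on a flat list of type names and is not how FX graph mode quantization matches patterns internally. The names `PATTERNS` and `match_patterns` are hypothetical.

```python
# Toy table mirroring the BackendConfig above: keys are sequences of module
# type names, values describe what the backend does with a match.
PATTERNS = {
    ("Linear",): "quantize",
    ("Conv2d", "ReLU"): "fuse into ConvReLU2d",
    ("ConvReLU2d",): "quantize",
}


def match_patterns(layers):
    """Greedily match the longest configured pattern at each position."""
    longest = max(len(pattern) for pattern in PATTERNS)
    actions, i = [], 0
    while i < len(layers):
        for size in range(longest, 0, -1):
            window = tuple(layers[i:i + size])
            if window in PATTERNS:
                actions.append((window, PATTERNS[window]))
                i += len(window)
                break
        else:
            # No configured pattern starts here, so the op stays in fp32.
            actions.append(((layers[i],), "leave in fp32"))
            i += 1
    return actions


conv_relu_actions = match_patterns(["Linear", "Conv2d", "ReLU", "Sigmoid"])
conv_bn_relu_actions = match_patterns(
    ["Linear", "Conv2d", "BatchNorm2d", "ReLU", "Sigmoid"]
)
for action in conv_relu_actions + conv_bn_relu_actions:
    print(action)
```

In this toy model, a standalone `Conv2d` or a `Conv2d, BatchNorm2d, ReLU` sequence has no entry in the table, so those modules are left untouched.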

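5. Set up QConfigMapping that satisfies the backend constraints#

In order to use the ops defined above, the user must define a QConfig that satisfies the constraints specified in the DTypeConfig. For more detail, see the documentation for DTypeConfig. We will then use this QConfig for all the modules used in the patterns we wish to quantize.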

# Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)

# Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
# (Conv2d, ReLU), must have the same QConfig
qconfig_mapping = QConfigMapping() \
    .set_object_type(torch.nn.Linear, qconfig) \
    .set_object_type(torch.nn.Conv2d, qconfig) \
    .set_object_type(torch.nn.BatchNorm2d, qconfig) \
    .set_object_type(torch.nn.ReLU, qconfig)
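The check implied by these constraints can be sketched in plain Python. The `Constraints` class and `satisfies` helper below are hypothetical stand-ins written for this illustration. They assume that a QConfig is accepted only if its observer's `quant_min`, `quant_max`, and `eps` all fall within the backend's bounds.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    """Hypothetical stand-in for the bounds set on `quint8_with_constraints`."""
    quant_min_lower_bound: int
    quant_max_upper_bound: int
    scale_min_lower_bound: float


def satisfies(c, quant_min, quant_max, eps):
    """Return True if an observer's settings fall within the backend bounds."""
    return (
        quant_min >= c.quant_min_lower_bound
        and quant_max <= c.quant_max_upper_bound
        and eps >= c.scale_min_lower_bound
    )


backend = Constraints(0, 255, 2 ** -12)

# Settings of the activation observer defined in this section.
tutorial_ok = satisfies(backend, 0, 127, 2 ** -12)

# A hypothetical observer whose eps is smaller than the backend allows.
small_eps_ok = satisfies(backend, 0, 255, 2 ** -20)
print(tutorial_ok, small_eps_ok)
```

A quantized range of [0, 127] is narrower than the permitted [0, 255], which is why the observer above passes even though it does not use the full range.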

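6. Quantize the model through prepare and convert#

Finally, we quantize the model by passing the BackendConfig we defined into prepare and convert. This produces a quantized linear module and a fused quantized conv-relu module.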

class MyModel(torch.nn.Module):
    def __init__(self, use_bn: bool):
        super().__init__()
        self.linear = torch.nn.Linear(10, 3)
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)
        self.relu = torch.nn.ReLU()
        self.sigmoid = torch.nn.Sigmoid()
        self.use_bn = use_bn

    def forward(self, x):
        x = self.linear(x)
        x = self.conv(x)
        if self.use_bn:
            x = self.bn(x)
        x = self.relu(x)
        x = self.sigmoid(x)
        return x

example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=False)
prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
prepared(*example_inputs)  # calibrate
converted = convert_fx(prepared, backend_config=backend_config)

>>> print(converted)

GraphModule(
  (linear): QuantizedLinear(in_features=10, out_features=3, scale=0.012136868201196194, zero_point=67, qscheme=torch.per_tensor_affine)
  (conv): QuantizedConvReLU2d(3, 3, kernel_size=(3, 3), stride=(1, 1), scale=0.0029353597201406956, zero_point=0)
  (sigmoid): Sigmoid()
)

def forward(self, x):
    linear_input_scale_0 = self.linear_input_scale_0
    linear_input_zero_point_0 = self.linear_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0, linear_input_zero_point_0, torch.quint8);  x = linear_input_scale_0 = linear_input_zero_point_0 = None
    linear = self.linear(quantize_per_tensor);  quantize_per_tensor = None
    conv = self.conv(linear);  linear = None
    dequantize_2 = conv.dequantize();  conv = None
    sigmoid = self.sigmoid(dequantize_2);  dequantize_2 = None
    return sigmoid
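The shape of this converted graph follows from lowering each bracketed reference pattern. As a toy illustration (not the actual lowering pass), the sketch below treats the reference model as a flat list of op names and collapses every `dequant, fp32_op, quant` window into a single quantized op. The op names and the `lower` helper are assumptions made for this example.

```python
# Hypothetical mapping from fp32 ops to the backend's quantized ops.
FP32_TO_QUANTIZED = {
    "linear": "quantized_linear",
    "conv_relu": "quantized_conv_relu",
}


def lower(ops):
    """Replace each [dequant, fp32_op, quant] window with one quantized op."""
    lowered, i = [], 0
    while i < len(ops):
        window = ops[i:i + 3]
        if (
            len(window) == 3
            and window[0] == "dequant"
            and window[1] in FP32_TO_QUANTIZED
            and window[2] == "quant"
        ):
            lowered.append(FP32_TO_QUANTIZED[window[1]])
            i += 3
        else:
            lowered.append(ops[i])
            i += 1
    return lowered


# Reference model: linear and conv_relu are each wrapped in quant-dequant ops.
reference = [
    "quant", "dequant", "linear", "quant",
    "dequant", "conv_relu", "quant", "dequant", "sigmoid",
]
lowered = lower(reference)
print(lowered)
```

The result keeps a single quantize op at the input and a single dequantize op before sigmoid, mirroring the `forward` method printed above.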

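(7. Experiment with faulty BackendConfig setups)#

As an experiment, here we modify the model to use conv-bn-relu instead of conv-relu (by constructing MyModel with use_bn=True), but keep the same BackendConfig, which doesn't know how to quantize conv-bn-relu. As a result, only linear is quantized, while conv-bn-relu is neither fused nor quantized.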

>>> print(converted)

GraphModule(
  (linear): QuantizedLinear(in_features=10, out_features=3, scale=0.015307803638279438, zero_point=95, qscheme=torch.per_tensor_affine)
  (conv): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1))
  (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
)

def forward(self, x):
    linear_input_scale_0 = self.linear_input_scale_0
    linear_input_zero_point_0 = self.linear_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0, linear_input_zero_point_0, torch.quint8);  x = linear_input_scale_0 = linear_input_zero_point_0 = None
    linear = self.linear(quantize_per_tensor);  quantize_per_tensor = None
    dequantize_1 = linear.dequantize();  linear = None
    conv = self.conv(dequantize_1);  dequantize_1 = None
    bn = self.bn(conv);  conv = None
    relu = self.relu(bn);  bn = None
    sigmoid = self.sigmoid(relu);  relu = None
    return sigmoid

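As another experiment, here we keep the conv-bn-relu model but pass the default QConfigMapping (get_default_qconfig_mapping()) to prepare_fx. The default QConfigMapping doesn't satisfy the dtype constraints specified in the backend. As a result, nothing is quantized, since the QConfigs are simply ignored.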

>>> print(converted)

GraphModule(
  (linear): Linear(in_features=10, out_features=3, bias=True)
  (conv): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1))
  (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
)

def forward(self, x):
    linear = self.linear(x);  x = None
    conv = self.conv(linear);  linear = None
    bn = self.bn(conv);  conv = None
    relu = self.relu(bn);  bn = None
    sigmoid = self.sigmoid(relu);  relu = None
    return sigmoid

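Built-in BackendConfigs#

PyTorch quantization supports a few built-in native BackendConfigs under the torch.ao.quantization.backend_config namespace:

get_fbgemm_backend_config: for server target settings.
get_qnnpack_backend_config: for mobile and edge device target settings; also supports XNNPACK quantized ops.
get_native_backend_config (default): a BackendConfig that supports the union of the operator patterns supported in the FBGEMM and QNNPACK BackendConfigs.

There are also other BackendConfigs under development (e.g. for TensorRT and x86), but these are still mostly experimental at the moment. If the user wishes to integrate a new, custom backend with PyTorch's quantization API, they may define their own BackendConfigs using the same set of APIs used to define the natively supported ones, as in the example above.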

Further Reading#

How BackendConfig is used in FX graph mode quantization: pytorch/pytorch

Motivation and implementation details behind BackendConfig: pytorch/pytorch

Early design of BackendConfig: pytorch/rfcs