XNNPACK Backend

The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. XNNPACK is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs.

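Features

Wide operator support on Arm and x86 CPUs, available on any modern mobile phone.
Support for a wide variety of quantization schemes and quantized operators.
Supports fp32 and fp16 activations.
Supports 8-bit quantization.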

Target Requirements

ARM64 (Android, iOS, macOS, Linux, and Windows).
ARMv7 with NEON (Android).
ARMv6 with VFPv2 (Linux).
x86 and x86-64, up to AVX512 (Windows, Linux, Android).

Development Requirements

The XNNPACK delegate does not introduce any development system requirements beyond those required by the core ExecuTorch runtime.

Using the XNNPACK Backend

To target the XNNPACK backend during export and lowering, pass an instance of XnnpackPartitioner to to_edge_transform_and_lower. The example below demonstrates this process using the MobileNet V2 model from torchvision.

import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

et_program = to_edge_transform_and_lower(
    torch.export.export(mobilenet_v2, sample_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("mv2_xnnpack.pte", "wb") as file:
    et_program.write_to_file(file)

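Partitioner API

The XNNPACK partitioner API allows you to configure which parts of the model are delegated to XNNPACK. Passing an XnnpackPartitioner instance with no additional parameters runs as much of the model as possible on the XNNPACK backend. This is the most common use case. For advanced use cases, the partitioner exposes the following options through its constructor:

configs: Controls which operators are delegated to XNNPACK. By default, all available operators are delegated. For the full list of available operator configs, see ../config/__init__.py.
config_precisions: Filters operators by data type. By default, all precisions are delegated. Can be one or more of ConfigPrecisionType.FP32, ConfigPrecisionType.STATIC_QUANT, or ConfigPrecisionType.DYNAMIC_QUANT. See ConfigPrecisionType.
per_op_mode: If true, emits a separate delegate call for each operator. This is an advanced option intended to reduce memory overhead in some cases, at the cost of a small amount of runtime overhead. Defaults to false.
verbose: If true, prints additional information during lowering.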
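
As an illustration of the constructor options, the following sketch builds a partitioner that delegates only fp32 operators. The import path for ConfigPrecisionType and the helper name build_fp32_only_partitioner are assumptions made for this example; check them against the installed ExecuTorch version.

```python
def build_fp32_only_partitioner(verbose=False):
    """Return a partitioner that delegates only fp32 operators to XNNPACK.

    The helper name is hypothetical. The imports are deferred so that this
    module can be loaded on machines without ExecuTorch installed.
    """
    # Assumed module path for ConfigPrecisionType; verify for your version.
    from executorch.backends.xnnpack.partition.config.xnnpack_config import (
        ConfigPrecisionType,
    )
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )

    return XnnpackPartitioner(
        config_precisions=ConfigPrecisionType.FP32,  # skip quantized operators
        per_op_mode=False,  # default: group operators into larger delegate calls
        verbose=verbose,  # print extra information during lowering
    )
```

The returned object can be passed as partitioner=[build_fp32_only_partitioner()] to to_edge_transform_and_lower, in place of the default XnnpackPartitioner().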

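Testing the Model

After generating the XNNPACK-delegated .pte file, you can test the model from Python using the ExecuTorch runtime Python bindings. This is useful for sanity-checking the model and evaluating numerical accuracy. For more information, see Testing the Model.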
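
A minimal sketch of such a check is shown below. It assumes the executorch.runtime Python bindings are installed, that the program exposes a method named "forward", and that the model returns a single tensor. The helper name max_abs_difference is hypothetical.

```python
def max_abs_difference(pte_path, eager_model, sample_inputs):
    """Run a .pte file and an eager model on the same inputs.

    Returns the largest absolute difference between the two outputs.
    Assumes a single tensor output and a method named "forward".
    """
    # Deferred imports: torch and executorch are heavy optional dependencies.
    import torch
    from executorch.runtime import Runtime

    runtime = Runtime.get()
    program = runtime.load_program(pte_path)
    method = program.load_method("forward")
    et_output = method.execute(list(sample_inputs))[0]

    with torch.no_grad():
        eager_output = eager_model(*sample_inputs)

    return (et_output - eager_output).abs().max().item()
```

With the MobileNet V2 export from earlier in this page, a call would look like max_abs_difference("mv2_xnnpack.pte", mobilenet_v2, sample_inputs). What counts as an acceptable difference depends on the model and on whether it is quantized.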

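Quantization

The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use XNNPACKQuantizer. Quantizers are backend-specific, which means that XNNPACKQuantizer is configured to quantize models so that they use the quantized operators offered by the XNNPACK library.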

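Supported Quantization Schemes

The XNNPACK delegate supports the following quantization schemes:

8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
- Supports both static and dynamic activations.
- Supports per-channel and per-tensor schemes.
- Supports the linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.

Weight-only quantization is not currently supported on XNNPACK.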
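
To make the difference between per-tensor and per-channel schemes concrete, here is a small pure-Python sketch of symmetric 8-bit weight quantization. It is illustrative arithmetic only, not the code XNNPACKQuantizer or XNNPACK actually runs; the function names and the [-127, 127] range are assumptions chosen for the example.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude to qmax (zero point is 0)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_symmetric(values, scale, qmin=-127, qmax=127):
    """Round to the nearest step and clamp to the signed 8-bit range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def row_errors(weight_rows, per_channel):
    """Largest reconstruction error of each row (output channel)."""
    tensor_scale = symmetric_scale([v for row in weight_rows for v in row])
    errors = []
    for row in weight_rows:
        # Per-channel: one scale per row. Per-tensor: one shared scale.
        scale = symmetric_scale(row) if per_channel else tensor_scale
        restored = [q * scale for q in quantize_symmetric(row, scale)]
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


# Two output channels whose magnitudes differ by a factor of 100.
weights = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]
per_tensor_errors = row_errors(weights, per_channel=False)
per_channel_errors = row_errors(weights, per_channel=True)
```

With a single shared scale, the small-magnitude channel is rounded on a grid sized for the large channel, so its error is much higher than with its own per-channel scale. The large channel is unaffected because it determines the shared scale in both cases.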

8-bit Quantization using the PT2E Flow

To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:

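1. Create an instance of the XNNPACKQuantizer class and set the quantization parameters.
2. Use torch.export.export_for_training to prepare for quantization.
3. Call prepare_pt2e to prepare the model for quantization.
4. For static quantization, run the prepared model with representative samples to calibrate the activation ranges of the quantized tensors.
5. Call convert_pt2e to quantize the model.
6. Export and lower the model using the standard flow.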

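The output of convert_pt2e is a PyTorch model that can be exported and lowered using the normal flow. Because it is a regular PyTorch model, standard PyTorch techniques can also be used to evaluate the accuracy of the quantized model.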

import torch
import torchvision.models as models
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

qparams = get_symmetric_quantization_config(is_per_channel=True) # (1)
quantizer = XNNPACKQuantizer()
quantizer.set_global(qparams)

training_ep = torch.export.export_for_training(model, sample_inputs).module() # (2)
prepared_model = prepare_pt2e(training_ep, quantizer) # (3)

for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
    prepared_model(cal_sample) # (4) Calibrate

quantized_model = convert_pt2e(prepared_model) # (5)

et_program = to_edge_transform_and_lower( # (6)
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
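
The calibration step can be pictured with a small pure-Python sketch: an observer records the minimum and maximum activation seen across the calibration samples, then derives a scale and zero point for asymmetric 8-bit quantization. This is a simplified illustration, not the observer that prepare_pt2e actually inserts; the class name and the [-128, 127] range are assumptions for the example.

```python
class MinMaxObserver:
    """Hypothetical observer that tracks the running activation range."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        """Update the running range with one calibration sample."""
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin=-128, qmax=127):
        """Return (scale, zero_point) for asymmetric quantization."""
        # Extend the range to include 0.0 so that zero is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin)
        if scale == 0.0:
            scale = 1.0
        zero_point = round(qmin - lo / scale)
        return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to clamped 8-bit integers."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


observer = MinMaxObserver()
for sample in ([0.0, 0.5, 2.0], [0.1, 5.1, 1.0]):  # stand-ins for real activations
    observer.observe(sample)

scale, zero_point = observer.qparams()
quantized = quantize([0.0, 5.1], scale, zero_point)
```

Because the range is fixed from whatever the calibration samples contain, unrepresentative samples (such as the random tensor in the example above) give ranges that may clip or waste resolution on real inputs.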

For more information, see PyTorch 2 Export Post Training Quantization.

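LLM Quantization with quantize_

The XNNPACK backend also supports quantizing models with the torchao quantize_ API. This is most commonly used for LLMs, which require more advanced quantization. Because quantize_ is not backend-aware, it is important to use a config that is compatible with CPU/XNNPACK.

Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity).
Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity).

A simple example follows. A more detailed tutorial, including accuracy evaluation on popular LLM benchmarks, is available in the torchao documentation.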

import torch
from torchao.quantization.granularity import PerGroup, PerAxis
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    quantize_,
)

# eager_model is assumed to be an existing torch.nn.Module (for example, an LLM).

# Quantize embeddings with 8-bits, per channel
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
quantize_(
    eager_model,
    embedding_config,
    lambda m, fqn: isinstance(m, torch.nn.Embedding),  # only touch embeddings
)


# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantize_(eager_model, linear_config)
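
To show what a PerGroup(32) granularity means for 4-bit weights, the following pure-Python sketch quantizes one weight row with a separate symmetric scale for every group of 32 values. It illustrates the idea only and is not torchao's implementation; the function names and the [-8, 7] range are assumptions for the example.

```python
def quantize_per_group(row, group_size=32, qmin=-8, qmax=7):
    """Quantize a row to signed 4-bit integers, one scale per group."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q_row, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        q_row.extend(max(qmin, min(qmax, round(v / scale))) for v in group)
    return q_row, scales


def dequantize_per_group(q_row, scales, group_size=32):
    """Reconstruct approximate real values from integers and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_row)]


# One row of 64 weights: a small-magnitude group followed by a large one.
row = [0.01 * ((i % 7) - 3) for i in range(32)] + [1.0 * ((i % 7) - 3) for i in range(32)]
q_row, scales = quantize_per_group(row, group_size=32)
restored = dequantize_per_group(q_row, scales, group_size=32)
```

Smaller groups track local weight magnitudes more closely but store more scales. PerAxis granularity, as used for the embedding above, corresponds to one scale per slice along the given axis.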

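Runtime Integration

To run the model on-device, use the standard ExecuTorch runtime APIs. For more information, see Running on Device.

The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass -DEXECUTORCH_BUILD_XNNPACK=ON when configuring the CMake build to compile the XNNPACK backend.

To link against the backend, add the xnnpack_backend CMake target as a build dependency, or link directly against libxnnpack_backend. Because the backend uses static registration, it may be necessary to link with whole-archive. This can typically be done by passing "$<LINK_LIBRARY:WHOLE_ARCHIVE,xnnpack_backend>" to target_link_libraries.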

# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
    extension_module_static
    extension_tensor
    optimized_native_cpu_ops_lib
    xnnpack_backend)

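No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend.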