Exporting Custom LLMs¶
If you have your own PyTorch model that is an LLM, this guide will show you how to manually export and lower it to ExecuTorch, with many of the same optimizations covered in the previous export_llm guide.
This example uses Karpathy's nanoGPT, a minimal implementation of GPT-2 124M. The guide applies to other language models as well, as ExecuTorch is model-agnostic.
Exporting to ExecuTorch (Basic)¶
Exporting converts the PyTorch model into a format that can run efficiently on consumer devices.
For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary.
Using curl:
curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O
curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O
Or, using wget:
wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json
There are two steps to convert the model into a format optimized for standalone execution. First, use PyTorch's export function to convert the PyTorch model into a portable, platform-independent intermediate representation. Then use ExecuTorch's to_edge and to_executorch methods to prepare the model for on-device execution. This creates a .pte file which can be loaded by a desktop or mobile application at runtime.
Create a file called export_nanogpt.py with the following contents:
# export_nanogpt.py
import torch
from executorch.exir import EdgeCompileConfig, to_edge
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export, export_for_training
from model import GPT
# Load the model.
model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )
# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.ac.cn/executorch/main/concepts#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
)
# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()
# Save the ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)
To export, run the script with python export_nanogpt.py (or python3, depending on your environment). It will generate a nanogpt.pte file in the current directory.
For more information, see Exporting to ExecuTorch and torch.export.
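As a quick sanity check before integrating the file into an application, you can load and run the generated nanogpt.pte from Python. The snippet below is a minimal sketch, not part of the original flow: it assumes the ExecuTorch Python runtime bindings (the executorch.runtime module) are available in your installation, and the 64-token prompt length is arbitrary.
# sanity_check_nanogpt.py -- illustrative sketch only.
# Assumes the executorch.runtime Python bindings are installed.
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("nanogpt.pte")
method = program.load_method("forward")

# 64 random token ids; the sequence length must stay within the dynamic-shape
# bounds declared at export time (up to model.config.block_size).
tokens = torch.randint(0, 100, (1, 64), dtype=torch.long)
outputs = method.execute([tokens])
print("Output shape:", outputs[0].shape)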
Backend Delegation¶
While ExecuTorch provides a portable, cross-platform implementation of all operators, it also provides specialized backends for a number of different targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend.
Because optimizations are specific to a given backend, each .pte file is specific to the backend(s) targeted at export. To support multiple devices, such as XNNPACK acceleration for Android and Core ML for iOS, export a separate PTE file for each backend, as sketched below.
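One way to do this is to run the same lowering flow once per target and write a separate file each time. The sketch below is illustrative only: it uses the XnnpackPartitioner and the to_edge_transform_and_lower() call introduced in the rest of this section, and assumes traced_model was produced by torch.export as shown earlier; partitioners for other backends (Core ML, Vulkan, and so on) would be imported from their respective executorch.backends packages.
# Illustrative sketch: one .pte file per target backend.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Map each backend name to the partitioner(s) used to lower for it. Only
# XNNPACK is shown; other entries would use that backend's partitioner.
backends = {
    "xnnpack": [XnnpackPartitioner()],
}

for name, partitioners in backends.items():
    et_program = to_edge_transform_and_lower(
        traced_model, partitioner=partitioners
    ).to_executorch()
    with open(f"nanogpt_{name}.pte", "wb") as file:
        file.write(et_program.buffer)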
To delegate a model to a specific backend during export, ExecuTorch uses the to_edge_transform_and_lower() function. This function takes the exported program from torch.export and a backend-specific partitioner object. The partitioner identifies the parts of the computation graph that can be optimized by the target backend. Within to_edge_transform_and_lower(), the exported program is converted to an edge dialect program. The partitioner then delegates compatible graph sections to the backend for acceleration and optimization. Any portions of the graph that are not delegated are executed by ExecuTorch's default operator implementations.
To delegate the exported model to a specific backend, we first need to import its partitioner as well as its edge compile config from the ExecuTorch codebase, then call to_edge_transform_and_lower.
Here is an example of how to delegate nanoGPT to the XNNPACK backend (for instance, if you are deploying to an Android phone):
# export_nanogpt.py
# Load partitioner for Xnnpack backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
# Model to be delegated to specific backend should use specific edge compile config
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
import torch
from torch.export import export
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export_for_training
from model import GPT
# Load the nanoGPT model.
model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)
# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.ac.cn/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)
# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
# Convert the model into a runnable ExecuTorch program.
# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
edge_config = get_xnnpack_edge_compile_config()
# Convert to an edge program and delegate the exported model to the XNNPACK
# backend by invoking to_edge_transform_and_lower with the XNNPACK partitioner.
edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config)
et_program = edge_manager.to_executorch()
# Save the Xnnpack-delegated ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)
Quantization¶
Quantization refers to a set of techniques for running computations and storing tensors using lower-precision types. Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and a reduction in memory usage. There are many ways to quantize a model, varying in the amount of pre-processing required, the data types used, and the impact on model accuracy and performance.
Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship large models on consumer electronics. In particular, large language models such as Llama2 may require quantizing model weights to 4 bits or less.
Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) API for this purpose. This example targets the XNNPACK delegate for CPU acceleration and therefore needs to use the XNNPACK-specific quantizer. Targeting a different backend will require using the corresponding quantizer.
To use 8-bit integer dynamic quantization with the XNNPACK delegate, call prepare_pt2e, calibrate the model by running it with representative inputs, and then call convert_pt2e. This updates the computation graph to use quantized operators where available.
# export_nanogpt.py
from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
# Use dynamic, per-channel quantization.
xnnpack_quant_config = get_symmetric_quantization_config(
    is_per_channel=True, is_dynamic=True
)
xnnpack_quantizer = XNNPACKQuantizer()
xnnpack_quantizer.set_global(xnnpack_quant_config)
m = export_for_training(model, example_inputs).module()
# Annotate the model for quantization. This prepares the model for calibration.
m = prepare_pt2e(m, xnnpack_quantizer)
# Calibrate the model using representative inputs. This allows the quantization
# logic to determine the expected range of values in each tensor.
m(*example_inputs)
# Perform the actual quantization.
m = convert_pt2e(m, fold_quantize=False)
DuplicateDynamicQuantChainPass()(m)
traced_model = export(m, example_inputs)
Additionally, add or update the to_edge_transform_and_lower() call to use XnnpackPartitioner. This instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)
edge_config = get_xnnpack_edge_compile_config()
# Convert to edge dialect and lower to XNNPack.
edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config)
et_program = edge_manager.to_executorch()
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)
For more information, see Quantization in ExecuTorch.
Profiling and Debugging¶
After lowering a model by calling to_edge_transform_and_lower(), you may want to see what was delegated and what was not. ExecuTorch provides utility methods to give insight into the delegation. You can use this information to understand the underlying computation and diagnose potential performance issues. Model authors can also use this information to structure the model in a way that is compatible with the target backend.
Visualizing the Delegation¶
The get_delegation_info() method provides a summary of what happened to the model after the to_edge_transform_and_lower() call:
from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate
# ... After call to to_edge_transform_and_lower(), but before to_executorch()
graph_module = edge_manager.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)
print(delegation_info.get_summary())
df = delegation_info.get_operator_delegation_dataframe()
print(tabulate(df, headers="keys", tablefmt="fancy_grid"))
For nanoGPT targeting the XNNPACK backend, you might see something like the following (note that the numbers below are for illustration purposes only and actual values may vary):
Total delegated subgraphs: 145
Number of delegated nodes: 350
Number of non-delegated nodes: 760
|    | op_type                 | # in_delegated_graphs | # in_non_delegated_graphs |
|----|-------------------------|-----------------------|---------------------------|
| 0  | aten__softmax_default   | 12                    | 0                         |
| 1  | aten_add_tensor         | 37                    | 0                         |
| 2  | aten_addmm_default      | 48                    | 0                         |
| 3  | aten_any_dim            | 0                     | 12                        |
| …  | …                       | …                     | …                         |
| 25 | aten_view_copy_default  | 96                    | 122                       |
| …  | …                       | …                     | …                         |
| 30 | Total                   | 350                   | 760                       |
From the table, the operator aten_view_copy_default appears 96 times in delegated graphs and 122 times in non-delegated graphs. For a more detailed view, use the format_delegated_graph() method to get a formatted string printout of the whole graph, or use print_delegated_graph() to print it directly.
from executorch.exir.backend.utils import format_delegated_graph
graph_module = edge_manager.exported_program().graph_module
print(format_delegated_graph(graph_module))
For a large model, this can produce a lot of output. Consider using "Control+F" or "Command+F" to find the operator you are interested in (for example, "aten_view_copy_default"), and observe which instances are not under lowered graphs.
In the fragment of the nanoGPT output below, you can see that a transformer module has been delegated to XNNPACK while the where operator has not.
%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {})
%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144]
backend_id: XnnpackBackend
lowered graph():
%p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight]
%p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias]
%getitem : [num_users=1] = placeholder[target=getitem]
%sym_size : [num_users=2] = placeholder[target=sym_size]
%aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {})
%aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {})
%aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {})
%aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {})
return [aten_view_copy_default_1]
Further Model Analysis and Debugging¶
Through the ExecuTorch Developer Tools, users can profile model execution, obtain timing information for each operator in the model, debug model numerics, and more.
An ETRecord is an artifact generated at export time that contains the model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, but with an ETRecord you can also link each event to the type of operator being executed, the module hierarchy, and the stack trace of the original PyTorch source code. For more information, see the ETRecord documentation.
In your export script, after calling to_edge() and to_executorch(), call generate_etrecord() with the EdgeProgramManager from to_edge() and the ExecuTorchProgramManager from to_executorch(). Make sure to copy the EdgeProgramManager, as the call to to_edge_transform_and_lower() mutates the graph in place.
# export_nanogpt.py
import copy
from executorch.devtools import generate_etrecord
# Make the deep copy immediately after the call to to_edge()
edge_manager_copy = copy.deepcopy(edge_manager)
# ...
# Generate ETRecord right after to_executorch()
etrecord_path = "etrecord.bin"
generate_etrecord(etrecord_path, edge_manager_copy, et_program)
Run the export script, and the ETRecord will be generated as etrecord.bin.
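Once an ETRecord exists, it is typically paired with an ETDump collected from a profiled run of the model in order to inspect timing and other events. The snippet below is a minimal sketch, assuming an ETDump file named etdump.etdp has already been produced by the runtime (collecting ETDumps is covered in the Developer Tools documentation):
# Illustrative sketch, not part of export_nanogpt.py. Assumes a profiled run
# has already produced an ETDump file named "etdump.etdp".
from executorch.devtools import Inspector

# With the ETRecord attached, each profiling event is linked back to the
# operator type, module hierarchy, and original PyTorch source.
inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
inspector.print_data_tabular()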
To learn more about the ExecuTorch Developer Tools, see the Introduction to the ExecuTorch Developer Tools.