评价此页

★ ★ ★ ★ ★

recipes/recipes/profiler_recipe

在 Google Colab 中运行

注意

转到末尾下载完整的示例代码。

PyTorch Profiler#

创建于：2021年1月29日 | 最后更新：2025年7月09日 | 最后验证：未经验证

作者： Shivam Raikundalia

本教程解释了如何使用 PyTorch profiler 来测量模型算子的时间和内存消耗。

简介#

PyTorch 包含一个简单的 profiler API，当用户需要确定模型中最耗时的算子时非常有用。

在本教程中，我们将使用一个简单的 Resnet 模型来演示如何使用 profiler 分析模型性能。

先决条件#

torch >= 2.3.0

设置#

使用以下命令安装 torch 和 torchvision

pip install torch torchvision

步骤#

导入所有必要的库
实例化一个简单的 Resnet 模型
使用 profiler 分析执行时间
使用 profiler 分析内存消耗
使用追踪功能
检查堆栈跟踪
使用 profiler 分析长时间运行的任务

1. 导入所有必要的库#

在本教程中，我们将使用 torch、torchvision.models 和 profiler 模块。

import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity, record_function

2. 实例化一个简单的 Resnet 模型#

让我们创建一个 Resnet 模型实例并为其准备输入。

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

3. 使用 profiler 分析执行时间#

PyTorch profiler 通过上下文管理器启用，并接受许多参数，其中一些最有用的包括：

activities - 要分析的活动列表
- ProfilerActivity.CPU - PyTorch 算子、TorchScript 函数和用户定义的代码标签（参见下面的 record_function）；
- ProfilerActivity.CUDA - 设备上的 CUDA 核函数；
- ProfilerActivity.XPU - 设备上的 XPU 核函数；
record_shapes - 是否记录算子输入的形状；
profile_memory - 是否报告模型张量占用的内存量；

注意：在使用 CUDA 时，profiler 还会显示主机上发生的 CUDA 事件的运行时。

让我们看看如何使用 profiler 来分析执行时间。

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

请注意，我们可以使用 record_function 上下文管理器，使用用户提供的名称（在上面的示例中，model_inference 用作标签）来标记任意代码范围。

Profiler 允许检查在包装了 profiler 上下文管理器的代码范围执行期间调用了哪些算子。如果同一时间有多个 profiler 范围处于活动状态（例如，在并行的 PyTorch 线程中），每个 profiler 上下文管理器仅跟踪其相应范围内的算子。Profiler 还会自动分析使用 torch.jit._fork 启动的异步任务，以及（在反向传播的情况下）使用 backward() 调用启动的反向传播算子。

让我们打印出上述执行的统计信息。

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

输出将如下所示（省略了一些列）：

---------------------------------  ------------  ------------  ------------  ------------
                             Name      Self CPU     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------
                  model_inference       5.509ms      57.503ms      57.503ms             1
                     aten::conv2d     231.000us      31.931ms       1.597ms            20
                aten::convolution     250.000us      31.700ms       1.585ms            20
               aten::_convolution     336.000us      31.450ms       1.573ms            20
         aten::mkldnn_convolution      30.838ms      31.114ms       1.556ms            20
                 aten::batch_norm     211.000us      14.693ms     734.650us            20
     aten::_batch_norm_impl_index     319.000us      14.482ms     724.100us            20
          aten::native_batch_norm       9.229ms      14.109ms     705.450us            20
                       aten::mean     332.000us       2.631ms     125.286us            21
                     aten::select       1.668ms       2.292ms       8.988us           255
---------------------------------  ------------  ------------  ------------  ------------
Self CPU time total: 57.549m

这里我们看到，正如预期的那样，大部分时间花在卷积（特别是对于使用 MKL-DNN 支持编译的 PyTorch 中的 mkldnn_convolution）上。注意自 CPU 时间和 CPU 时间之间的区别——算子可以调用其他算子，自 CPU 时间不包括在子算子调用中花费的时间，而总 CPU 时间则包含它。你可以选择按自 CPU 时间排序，方法是将 sort_by="self_cpu_time_total" 传递给 table 调用。

要获得更精细的结果并包含算子输入形状，请传递 group_by_input_shape=True（注意：这需要使用 record_shapes=True 运行 profiler）。

print(
    prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=10
    )
)

输出可能如下所示（省略了一些列）：

---------------------------------  ------------  -------------------------------------------
                             Name     CPU total                                 Input Shapes
---------------------------------  ------------  -------------------------------------------
                  model_inference      57.503ms                                           []
                     aten::conv2d       8.008ms      [5,64,56,56], [64,64,3,3], [], ..., []]
                aten::convolution       7.956ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
               aten::_convolution       7.909ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
         aten::mkldnn_convolution       7.834ms     [[5,64,56,56], [64,64,3,3], [], ..., []]
                     aten::conv2d       6.332ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
                aten::convolution       6.303ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
               aten::_convolution       6.273ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
         aten::mkldnn_convolution       6.233ms    [[5,512,7,7], [512,512,3,3], [], ..., []]
                     aten::conv2d       4.751ms  [[5,256,14,14], [256,256,3,3], [], ..., []]
---------------------------------  ------------  -------------------------------------------
Self CPU time total: 57.549ms

请注意 aten::convolution 出现两次，具有不同的输入形状。

Profiler 还可以用于分析在 GPU 上执行的模型的性能：用户可以切换 CPU、CUDA 和 XPU。

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    device = "cuda"
    activities += [ProfilerActivity.CUDA]
elif torch.xpu.is_available():
    device = "xpu"
    activities += [ProfilerActivity.XPU]
else:
    print(
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
    )
    import sys

    sys.exit(0)

sort_by_keyword = device + "_time_total"

model = models.resnet18().to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by=sort_by_keyword, row_limit=10))

（注意：第一次使用 CUDA 性能分析可能会带来额外的开销。）

生成的表格输出（省略了一些列）：

-------------------------------------------------------  ------------  ------------
                                                   Name     Self CUDA    CUDA total
-------------------------------------------------------  ------------  ------------
                                        model_inference       0.000us      11.666ms
                                           aten::conv2d       0.000us      10.484ms
                                      aten::convolution       0.000us      10.484ms
                                     aten::_convolution       0.000us      10.484ms
                             aten::_convolution_nogroup       0.000us      10.484ms
                                      aten::thnn_conv2d       0.000us      10.484ms
                              aten::thnn_conv2d_forward      10.484ms      10.484ms
void at::native::im2col_kernel<float>(long, float co...       3.844ms       3.844ms
                                      sgemm_32x32x32_NN       3.206ms       3.206ms
                                  sgemm_32x32x32_NN_vec       3.093ms       3.093ms
-------------------------------------------------------  ------------  ------------
Self CPU time total: 23.015ms
Self CUDA time total: 11.666ms

（注意：第一次使用 XPU 性能分析可能会带来额外的开销。）

生成的表格输出（省略了一些列）：

------------------------------  ------------  ------------  ------------  ------------  ------------
                          Name      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls
------------------------------  ------------  ------------  ------------  ------------  ------------
               model_inference       0.000us         0.00%       2.567ms       2.567ms             1
                  aten::conv2d       0.000us         0.00%       1.871ms      93.560us            20
             aten::convolution       0.000us         0.00%       1.871ms      93.560us            20
            aten::_convolution       0.000us         0.00%       1.871ms      93.560us            20
aten::convolution_overrideable       1.871ms        72.89%       1.871ms      93.560us            20
                      gen_conv       1.484ms        57.82%       1.484ms      74.216us            20
              aten::batch_norm       0.000us         0.00%     432.640us      21.632us            20
  aten::_batch_norm_impl_index       0.000us         0.00%     432.640us      21.632us            20
       aten::native_batch_norm     432.640us        16.85%     432.640us      21.632us            20
                  conv_reorder     386.880us        15.07%     386.880us       6.448us            60
------------------------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 712.486ms
Self XPU time total: 2.567ms

请注意输出中的设备上核函数（例如，CUDA 的 sgemm_32x32x32_NN 或 XPU 的 gen_conv）。

4. 使用 profiler 分析内存消耗#

PyTorch profiler 还可以显示在模型算子执行期间分配（或释放）的内存量（由模型的张量使用）。在下面的输出中，“self”内存对应于算子分配（释放）的内存，不包括对其他算子的子调用。要启用内存分析功能，请传递 profile_memory=True。

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profile(
    activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

# (omitting some columns)
# ---------------------------------  ------------  ------------  ------------
#                              Name       CPU Mem  Self CPU Mem    # of Calls
# ---------------------------------  ------------  ------------  ------------
#                       aten::empty      94.79 Mb      94.79 Mb           121
#     aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
#                       aten::addmm      19.53 Kb      19.53 Kb             1
#               aten::empty_strided         572 b         572 b            25
#                     aten::resize_         240 b         240 b             6
#                         aten::abs         480 b         240 b             4
#                         aten::add         160 b         160 b            20
#               aten::masked_select         120 b         112 b             1
#                          aten::ne         122 b          53 b             6
#                          aten::eq          60 b          30 b             2
# ---------------------------------  ------------  ------------  ------------
# Self CPU time total: 53.064ms

print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))

输出可能如下所示（省略了一些列）：

---------------------------------  ------------  ------------  ------------
                             Name       CPU Mem  Self CPU Mem    # of Calls
---------------------------------  ------------  ------------  ------------
                      aten::empty      94.79 Mb      94.79 Mb           121
                 aten::batch_norm      47.41 Mb           0 b            20
     aten::_batch_norm_impl_index      47.41 Mb           0 b            20
          aten::native_batch_norm      47.41 Mb           0 b            20
                     aten::conv2d      47.37 Mb           0 b            20
                aten::convolution      47.37 Mb           0 b            20
               aten::_convolution      47.37 Mb           0 b            20
         aten::mkldnn_convolution      47.37 Mb           0 b            20
                 aten::max_pool2d      11.48 Mb           0 b             1
    aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
---------------------------------  ------------  ------------  ------------
Self CPU time total: 53.064ms

5. 使用追踪功能#

性能分析结果可以输出为 .json 追踪文件：追踪 CUDA 或 XPU 核函数。用户可以切换 CPU、CUDA 和 XPU。

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    device = "cuda"
    activities += [ProfilerActivity.CUDA]
elif torch.xpu.is_available():
    device = "xpu"
    activities += [ProfilerActivity.XPU]
else:
    print(
        "Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices"
    )
    import sys

    sys.exit(0)

model = models.resnet18().to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)

with profile(activities=activities) as prof:
    model(inputs)

prof.export_chrome_trace("trace.json")

您可以在 Chrome 追踪查看器（chrome://tracing）中检查被分析的算子和 CUDA/XPU 核函数的序列。

6. 检查堆栈跟踪#

Profiler 可用于分析 Python 和 TorchScript 的堆栈跟踪。

sort_by_keyword = "self_" + device + "_time_total"

with profile(
    activities=activities,
    with_stack=True,
    experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
) as prof:
    model(inputs)

# Print aggregated stats
print(prof.key_averages(group_by_stack_n=5).table(sort_by=sort_by_keyword, row_limit=2))

输出可能如下所示（省略了一些列）：

-------------------------  -----------------------------------------------------------
                     Name  Source Location
-------------------------  -----------------------------------------------------------
aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
                           .../torch/nn/modules/conv.py(443): forward
                           .../torch/nn/modules/module.py(1051): _call_impl
                           .../site-packages/torchvision/models/resnet.py(63): forward
                           .../torch/nn/modules/module.py(1051): _call_impl
aten::thnn_conv2d_forward  .../torch/nn/modules/conv.py(439): _conv_forward
                           .../torch/nn/modules/conv.py(443): forward
                           .../torch/nn/modules/module.py(1051): _call_impl
                           .../site-packages/torchvision/models/resnet.py(59): forward
                           .../torch/nn/modules/module.py(1051): _call_impl
-------------------------  -----------------------------------------------------------
Self CPU time total: 34.016ms
Self CUDA time total: 11.659ms

请注意 torchvision/models/resnet.py 脚本中的两个卷积和两个调用点。

（警告：堆栈跟踪会增加额外的性能分析开销。）

7. 使用 profiler 分析长时间运行的任务#

PyTorch profiler 提供了额外的 API 来处理长时间运行的任务（例如训练循环）。追踪所有执行可能会很慢并导致非常大的追踪文件。为避免这种情况，请使用可选参数：

schedule - 指定一个函数，该函数接受一个整数参数（步数）作为输入，并返回一个供 profiler 执行的操作。使用此参数的最佳方式是使用 torch.profiler.schedule 辅助函数，它可以为您生成一个计划；
on_trace_ready - 指定一个函数，该函数接受 profiler 的引用作为输入，并在每次新的追踪准备好时由 profiler 调用。

为了说明 API 的工作原理，让我们首先考虑使用 torch.profiler.schedule 辅助函数的以下示例。

from torch.profiler import schedule

my_schedule = schedule(skip_first=10, wait=5, warmup=1, active=3, repeat=2)

Profiler 假定长时间运行的任务由从零开始编号的步数组成。上面的示例定义了以下 profiler 操作序列：

参数 skip_first 告诉 profiler 它应该忽略前 10 步（skip_first 的默认值为零）；
在前 skip_first 步之后，profiler 开始执行 profiler 周期；
每个周期包含三个阶段：
- 空闲（wait=5 步），在此阶段 profiler 不活跃；
- 预热（warmup=1 步），在此阶段 profiler 开始追踪，但结果会被丢弃；此阶段用于丢弃在追踪开始时由 profiler 获得的样本，因为它们通常会因额外的开销而产生偏差；
- 主动追踪（active=3 步），在此阶段 profiler 追踪并记录数据；
可选的 repeat 参数指定周期的最大数量。默认情况下（值为零），只要任务运行，profiler 就会执行周期。

因此，在上面的示例中，profiler 将跳过前 15 步，花费接下来的 1 步进行预热，主动记录接下来的 3 步，跳过另外 5 步，花费接下来的 1 步进行预热，再主动记录 3 步。由于指定了 repeat=2 参数值，profiler 将在第一个周期完成后停止记录。

在每个周期的结束时，profiler 会调用指定的 on_trace_ready 函数，并将 profiler 本身作为参数传递。此函数用于处理新的追踪——通过获取表格输出或将输出保存到磁盘作为追踪文件。

要向 profiler 发送信号表示下一个步骤已开始，请调用 prof.step() 函数。当前的 profiler 步数存储在 prof.step_num 中。

以下示例展示了如何将上述所有概念用于 CUDA 和 XPU 核函数。

sort_by_keyword = "self_" + device + "_time_total"


def trace_handler(p):
    output = p.key_averages().table(sort_by=sort_by_keyword, row_limit=10)
    print(output)
    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")


with profile(
    activities=activities,
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    on_trace_ready=trace_handler,
) as p:
    for idx in range(8):
        model(inputs)
        p.step()

了解更多#

查看以下教程，以继续您的学习：

PyTorch Profiler#

简介#

先决条件#

设置#

步骤#

1. 导入所有必要的库#

2. 实例化一个简单的 Resnet 模型#

3. 使用 profiler 分析执行时间#

4. 使用 profiler 分析内存消耗#

5. 使用追踪功能#

6. 检查堆栈跟踪#

7. 使用 profiler 分析长时间运行的任务#

了解更多#

文档

教程

资源