
torch.compile End-to-End Tutorial

Author: William Wen

torch.compile is the new way to speed up your PyTorch code! torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes.

This tutorial covers an end-to-end example of training and evaluating a real model with torch.compile. For a gentle introduction to torch.compile, please see the introduction to torch.compile tutorial.
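
As a quick refresher before the full example, here is a minimal sketch of what applying torch.compile looks like; the toy function fn below is purely illustrative and not part of this tutorial's model code.

import torch

# A toy function, purely for illustration.
def fn(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)

compiled_fn = torch.compile(fn)  # compilation happens lazily, on the first call

x = torch.randn(8)
# The compiled function computes the same result as the original; the first call
# pays the compilation cost, and later calls run the optimized kernel.
print(torch.allclose(fn(x), compiled_fn(x)))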

Required pip dependencies

  • torch >= 2.0

  • torchvision

What you will learn
  • How to apply torch.compile to a real model

  • torch.compile speedups on a real model

  • torch.compile's first few iterations are expected to be slower due to compilation overhead

Prerequisites
# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.

import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning:

GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.

Let's demonstrate how using torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.

Before we begin, we need to define some utility functions.

# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000


# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )


N_ITERS = 10

from torchvision.models import densenet121


def init_model():
    return densenet121().cuda()

First, let's compare inference.

Note that, in the call to torch.compile, we pass the additional mode argument, which we will discuss below.
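
As an aside, the functional form of the API is equivalent for our purposes; a minimal sketch (the tutorial itself uses the Module.compile() method below, and alternative_model_opt is just an illustrative name):

# Minimal sketch of the functional form: torch.compile returns a wrapped module
# that compiles on first use, instead of compiling the module in place.
alternative_model_opt = torch.compile(init_model(), mode="reduce-overhead")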

model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3604090576171875
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning:

Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.ac.cn/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning:

TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.

compile: 51.42688671875
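
As an optional aside prompted by the TF32 warning above: enabling TensorFloat-32 matrix multiplications can improve float32 matmul throughput on Ampere-or-newer GPUs at a small precision cost. It is not required for the rest of this tutorial.

# Optional: allow TF32 for float32 matmuls, as the warning above suggests.
torch.set_float32_matmul_precision("high")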

Notice that torch.compile takes a lot longer to complete compared to eager. This is because torch.compile compiles the model into optimized kernels as it executes. In our example, the structure of the model doesn't change, so recompilation is not needed. If we run the optimized model a few more times, we should therefore see a significant improvement compared to eager.
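
If you want to confirm that no recompilation happens across iterations, one option is to turn on recompile logging. Below is a minimal sketch assuming the recompiles option of torch._logging.set_logs (setting the TORCH_LOGS="recompiles" environment variable works similarly); the toy function is again purely illustrative.

import torch._logging

# Log a message whenever torch.compile decides to recompile a function.
torch._logging.set_logs(recompiles=True)


@torch.compile
def toy(x):
    return x.sin()


toy(torch.randn(8, device="cuda"))   # first call: compiles
toy(torch.randn(8, device="cuda"))   # same shape/dtype: reuses the compiled code
toy(torch.randn(16, device="cuda"))  # new shape: may trigger a logged recompile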

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")

print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.01820876884460449
eager eval time 1: 0.016675840377807616
eager eval time 2: 0.016416767120361327
eager eval time 3: 0.01638400077819824
eager eval time 4: 0.016457696914672852
eager eval time 5: 0.016348159790039063
eager eval time 6: 0.016328704833984374
eager eval time 7: 0.016314367294311523
eager eval time 8: 0.01641472053527832
eager eval time 9: 0.01641164779663086
~~~~~~~~~~
compile eval time 0: 0.061233150482177735
compile eval time 1: 0.007819263935089112
compile eval time 2: 0.008339455604553223
compile eval time 3: 0.007483391761779785
compile eval time 4: 0.007483359813690186
compile eval time 5: 0.007465983867645264
compile eval time 6: 0.0074670081138610836
compile eval time 7: 0.0074670081138610836
compile eval time 8: 0.007468031883239746
compile eval time 9: 0.0074700798988342285
~~~~~~~~~~
(eval) eager median: 0.016413184165954588, compile median: 0.007476719856262207, speedup: 2.1952386182033488x
~~~~~~~~~~

And indeed, we can see that running our model with torch.compile results in a significant speedup. The speedup comes mainly from reducing Python overhead and GPU read/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, then the bottleneck will be GPU compute and the observed speedup may be less significant.
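
To see this effect on your own hardware, here is a rough experiment sketch that reuses model, model_opt, generate_data, and timed from above. The batch sizes are arbitrary, and each new shape is warmed up a few times first so that recompilation and CUDA graph recording are not included in the timing.

for b in (4, 16, 64):
    data = generate_data(b)[0]
    with torch.no_grad():
        for _ in range(3):  # warm up: recompile and record CUDA graphs for the new shape
            model_opt(data)
        _, t_eager = timed(lambda: model(data))
        _, t_compiled = timed(lambda: model_opt(data))
    print(f"batch {b}: eager {t_eager:.4f}s, compiled {t_compiled:.4f}s")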

You may also see different speedup results depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize the speedup. You can read more about modes here.
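
A sketch of how you might compare modes on your own model is shown below; the mode strings are the ones accepted by torch.compile, and "max-autotune" can take considerably longer to compile.

for mode in ("default", "reduce-overhead", "max-autotune"):
    candidate = init_model()
    candidate.compile(mode=mode)
    data = generate_data(16)[0]
    with torch.no_grad():
        for _ in range(3):  # warm up: compile and run any CUDA graph warm-up iterations
            candidate(data)
        _, t = timed(lambda: candidate(data))
    print(f"mode={mode}: {t:.4f}s per iteration")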

You may also notice that the second time we run our model with torch.compile is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs.

Now, let's compare training.

model = init_model()
opt = torch.optim.Adam(model.parameters())


def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()


eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.2882539367675781
eager train time 1: 0.05161676788330078
eager train time 2: 0.049276927947998046
eager train time 3: 0.05065420913696289
eager train time 4: 0.8006707153320313
eager train time 5: 0.05070438385009766
eager train time 6: 0.05034195327758789
eager train time 7: 0.05022825622558594
eager train time 8: 0.050223102569580076
eager train time 9: 0.05043302536010742
~~~~~~~~~~
compile train time 0: 151.00690625
compile train time 1: 2.915029052734375
compile train time 2: 0.02395030403137207
compile train time 3: 0.021402624130249022
compile train time 4: 0.020746240615844725
compile train time 5: 0.02069811248779297
compile train time 6: 0.020706304550170897
compile train time 7: 0.020715520858764647
compile train time 8: 0.02070425605773926
compile train time 9: 0.020745216369628908
~~~~~~~~~~
(train) eager median: 0.05054361724853516, compile median: 0.020745728492736815, speedup: 2.436338510177203x
~~~~~~~~~~

Again, we can see that torch.compile takes longer on the first iteration, as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.

We note that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen on the TorchInductor performance dashboard.

Conclusion

In this tutorial, we applied torch.compile to training and inference on a real model, demonstrating the speedups.

Importantly, we noted that the first few iterations of a compiled model are slower than eager mode due to compilation overhead, but subsequent iterations are expected to show speedups.

For a gentle introduction to torch.compile, please see the introduction to torch.compile tutorial.

To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, check out the torch.compile programming model.

We hope that you will give torch.compile a try!

Total running time of the script: (3 minutes 29.786 seconds)