Note: Go to the end to download the full example code.
torch.compile End-to-End Tutorial#
Author: William Wen
torch.compile is the latest method to speed up your PyTorch code! torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes. A minimal usage sketch follows the prerequisites below.

This tutorial covers an end-to-end example of training and evaluating a real model with torch.compile. For a gentle introduction to torch.compile, please see the torch.compile introduction tutorial.
Required pip dependencies:

- torch >= 2.0
- torchvision

What you will learn:

- How to apply torch.compile to a real model
- The speedups torch.compile can achieve on a real model
- That the first few iterations of torch.compile are expected to be slower due to compilation overhead
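To make "minimal code changes" concrete, here is a tiny, CPU-runnable sketch (an aside, not part of the original tutorial) of the two usage patterns this tutorial relies on: wrapping a regular Python function with torch.compile, and compiling an nn.Module in place via its .compile() method. The function fn and the small Linear module are made up purely for illustration.

import torch

# Sketch: the two torch.compile usage patterns used later in this tutorial.
def fn(x):
    return torch.sin(x) + torch.cos(x)

compiled_fn = torch.compile(fn)  # compile a regular Python function

mod = torch.nn.Linear(8, 8)
mod.compile()  # compile an nn.Module in place

x = torch.randn(4, 8)
print(compiled_fn(x).shape, mod(x).shape)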
# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.

import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning:
GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
Let's demonstrate how using torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.

Before we start, we need to define some utility functions.
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000
# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )
N_ITERS = 10

from torchvision.models import densenet121


def init_model():
    return densenet121().cuda()
First, let's compare inference.

Note that in the call to torch.compile, we have the additional mode argument, which we will discuss below.
model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3604090576171875
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning:
Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.ac.cn/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning:
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
compile: 51.42688671875
Notice that torch.compile takes a lot longer to complete compared to eager mode. This is because torch.compile compiles the model into optimized kernels as it executes. In our example, the structure of the model does not change, so recompilation is not needed. If we run our optimized model several more times, we should therefore see a significant improvement compared to eager.
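As an aside (not part of the original tutorial), if you want to confirm that repeated runs with the same input shapes reuse the compiled kernels, you can turn on torch.compile's recompile logging with torch._logging.set_logs. The sketch below compiles a fresh model (so it does not disturb model_opt used in the timing runs) and reuses the init_model and generate_data helpers defined above; exact log messages vary across PyTorch versions.

# Sketch: observing (re)compilation via recompile logs.
torch._logging.set_logs(recompiles=True)

demo_model = init_model()
demo_model.compile()  # default mode is enough for this demonstration

with torch.no_grad():
    demo_model(generate_data(16)[0])  # first call: compiles
    demo_model(generate_data(16)[0])  # same shapes: compiled code is reused
    demo_model(generate_data(8)[0])   # new batch size: a recompile is logged

torch._logging.set_logs(recompiles=False)  # restore default logging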
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.01820876884460449
eager eval time 1: 0.016675840377807616
eager eval time 2: 0.016416767120361327
eager eval time 3: 0.01638400077819824
eager eval time 4: 0.016457696914672852
eager eval time 5: 0.016348159790039063
eager eval time 6: 0.016328704833984374
eager eval time 7: 0.016314367294311523
eager eval time 8: 0.01641472053527832
eager eval time 9: 0.01641164779663086
~~~~~~~~~~
compile eval time 0: 0.061233150482177735
compile eval time 1: 0.007819263935089112
compile eval time 2: 0.008339455604553223
compile eval time 3: 0.007483391761779785
compile eval time 4: 0.007483359813690186
compile eval time 5: 0.007465983867645264
compile eval time 6: 0.0074670081138610836
compile eval time 7: 0.0074670081138610836
compile eval time 8: 0.007468031883239746
compile eval time 9: 0.0074700798988342285
~~~~~~~~~~
(eval) eager median: 0.016413184165954588, compile median: 0.007476719856262207, speedup: 2.1952386182033488x
~~~~~~~~~~
And indeed, we can see that running our model with torch.compile results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU read/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, then the bottleneck will be GPU compute and the observed speedup may be less significant.
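As an aside (not part of the original tutorial), one way to see where the time goes is to profile a single eager step and a single already-warmed-up compiled step with torch.profiler. Because model_opt was compiled with "reduce-overhead", its CUDA graph replay may show up as a single graph launch rather than individual kernels, and the exact tables will vary by GPU and PyTorch version.

from torch.profiler import ProfilerActivity, profile

# Sketch: compare where time is spent in one eager step vs. one compiled step.
inp = generate_data(16)[0]
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof_eager:
        model(inp)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof_compiled:
        model_opt(inp)

print(prof_eager.key_averages().table(sort_by="cuda_time_total", row_limit=10))
print(prof_compiled.key_averages().table(sort_by="cuda_time_total", row_limit=10))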
You may also see different speedup results depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize the speedup. You can read more about modes here.
You may also notice that the second time we run our model with torch.compile is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warmup iterations for CUDA graphs.
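If you want to experiment with modes on your own model, a straightforward approach (an aside, not part of the original tutorial) is to compile a fresh copy of the model under each mode and reuse the timing pattern from above, discarding the first few iterations so that compilation and CUDA graph warmup do not skew the medians. The mode strings below are the documented torch.compile modes; "max-autotune" can take considerably longer to compile.

# Sketch: comparing torch.compile modes using the helpers defined above.
for mode in ("default", "reduce-overhead", "max-autotune"):
    m = init_model()
    m.compile(mode=mode)
    times = []
    with torch.no_grad():
        for _ in range(N_ITERS):
            inp = generate_data(16)[0]
            _, t = timed(lambda: m(inp))
            times.append(t)
    # Ignore the first few iterations, which include compilation and warmup.
    print(f"{mode}: median of remaining iterations = {np.median(times[3:])}")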
Now, let's compare training.
model = init_model()
opt = torch.optim.Adam(model.parameters())


def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()


eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.2882539367675781
eager train time 1: 0.05161676788330078
eager train time 2: 0.049276927947998046
eager train time 3: 0.05065420913696289
eager train time 4: 0.8006707153320313
eager train time 5: 0.05070438385009766
eager train time 6: 0.05034195327758789
eager train time 7: 0.05022825622558594
eager train time 8: 0.050223102569580076
eager train time 9: 0.05043302536010742
~~~~~~~~~~
compile train time 0: 151.00690625
compile train time 1: 2.915029052734375
compile train time 2: 0.02395030403137207
compile train time 3: 0.021402624130249022
compile train time 4: 0.020746240615844725
compile train time 5: 0.02069811248779297
compile train time 6: 0.020706304550170897
compile train time 7: 0.020715520858764647
compile train time 8: 0.02070425605773926
compile train time 9: 0.020745216369628908
~~~~~~~~~~
(train) eager median: 0.05054361724853516, compile median: 0.020745728492736815, speedup: 2.436338510177203x
~~~~~~~~~~
Again, we can see that torch.compile takes longer in the first iteration, as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.

We remark that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.
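If you are curious how much of that first iteration was spent compiling, one option (an aside, and an internal utility whose output format may change between releases) is torch._dynamo.utils.compile_times, which reports the accumulated time spent in each compilation phase:

import torch._dynamo.utils

# Sketch: print a per-phase breakdown of accumulated compilation time.
print(torch._dynamo.utils.compile_times(repr="str"))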
Conclusion#
In this tutorial, we applied torch.compile to training and inference on a real model, demonstrating the speedups it provides.

Importantly, note that the first few iterations of a compiled model are slower than eager mode due to compilation overhead, but subsequent iterations are expected to be faster.

For a gentle introduction to torch.compile, please see the torch.compile introduction tutorial.

To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, check out the torch.compile programming model.

We hope that you will give torch.compile a try!
Total running time of the script: (3 minutes 29.786 seconds)