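torch.compile End-to-End Tutorial
Author: William Wen
torch.compile is the new way to speed up your PyTorch code! It makes PyTorch code run faster by just-in-time (JIT) compiling it into optimized kernels, while requiring minimal code changes.
This tutorial walks through an end-to-end example of training and evaluating a real model with torch.compile. For a gentle introduction to torch.compile, see the introduction to torch.compile tutorial.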
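Required pip dependencies:
torch >= 2.0
torchvision

This tutorial covers:
How to apply torch.compile to a real model
The speedups torch.compile achieves on a real model
Why the first few iterations under torch.compile are usually slower (compilation overhead)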
# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.
import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning: GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
warnings.warn(
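Let's demonstrate how torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.
Before we start, we need to define some utility functions.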
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time() reports milliseconds, so convert to seconds.
    return result, start.elapsed_time(end) / 1000


# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )


N_ITERS = 10

from torchvision.models import densenet121


def init_model():
    return densenet121().cuda()
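The `timed` helper above relies on CUDA events, so it only works when a CUDA device is present. As a minimal sketch for readers who want to time code on other devices, the following wall-clock variant takes an optional synchronization callback. The name `timed_wall` and its `sync` parameter are hypothetical and not part of the original tutorial; wall-clock timing is also coarser than CUDA events.

```python
import time


def timed_wall(fn, sync=None):
    """Return (result, seconds) for fn(), using a wall-clock timer.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) that is called before starting and before
    stopping the clock, so that asynchronous device work is included.
    """
    if sync is not None:
        sync()  # make sure earlier queued work does not leak into the timing
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn() to finish
    return result, time.perf_counter() - start


# CPU-only usage: no synchronization is needed.
result, seconds = timed_wall(lambda: sum(range(1000)))
print(f"result={result}, elapsed={seconds:.6f}s")
```

On a CUDA machine the call would look like `timed_wall(lambda: model(inp), sync=torch.cuda.synchronize)`, which mirrors what `timed` does with events.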
First, let's compare inference.
Note that the call to torch.compile passes an additional mode argument, which we discuss below.
model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.34809857177734377
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:320: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
compile: 52.91118359375
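Notice that torch.compile takes far longer than eager mode to complete this first call. That is because torch.compile compiles the model into optimized kernels as it executes. In our example the structure of the model does not change, so no recompilation is needed. If we run the optimized model several more times, we should therefore see a significant improvement over eager mode.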
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.017992704391479493
eager eval time 1: 0.017092607498168946
eager eval time 2: 0.01656831932067871
eager eval time 3: 0.016517120361328123
eager eval time 4: 0.016464895248413085
eager eval time 5: 0.016475135803222657
eager eval time 6: 0.016488447189331054
eager eval time 7: 0.016507904052734376
eager eval time 8: 0.016446464538574217
eager eval time 9: 0.016450559616088867
~~~~~~~~~~
compile eval time 0: 0.08503398132324219
compile eval time 1: 0.008704000473022461
compile eval time 2: 0.008994815826416015
compile eval time 3: 0.00807423973083496
compile eval time 4: 0.008157183647155761
compile eval time 5: 0.008086527824401855
compile eval time 6: 0.008127488136291505
compile eval time 7: 0.008078335762023926
compile eval time 8: 0.008053759574890136
compile eval time 9: 0.008030207633972167
~~~~~~~~~~
(eval) eager median: 0.016498175621032715, compile median: 0.00810700798034668, speedup: 2.035051113928619x
~~~~~~~~~~
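Indeed, running the model with torch.compile gives a significant speedup. The speedup comes mainly from reducing Python overhead and GPU reads/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, the bottleneck is likely GPU compute and the observed speedup may be less significant.
You may see different speedups depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to get the largest speedup. You can read more about modes in the torch.compile documentation.
You may also notice that the second run of the model with torch.compile is noticeably slower than the later runs, although it is still much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs.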
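The slower warm-up iteration is one reason the tutorial code summarizes timings with a median rather than a mean. As a small illustration, the sketch below recomputes both statistics from the compiled evaluation times printed above, using only the Python standard library. The variable names are chosen for this example, and the eager median is copied from the reported output.

```python
from statistics import mean, median

# Compiled evaluation times (seconds) copied from the output above.
compile_eval_times = [
    0.08503398132324219,   # warm-up iteration for CUDA graphs
    0.008704000473022461,
    0.008994815826416015,
    0.00807423973083496,
    0.008157183647155761,
    0.008086527824401855,
    0.008127488136291505,
    0.008078335762023926,
    0.008053759574890136,
    0.008030207633972167,
]
# Eager median (seconds) copied from the reported "(eval)" summary line.
eager_eval_median = 0.016498175621032715

compile_median = median(compile_eval_times)
compile_mean = mean(compile_eval_times)

# The single slow warm-up run inflates the mean but barely moves the median.
speedup_by_median = eager_eval_median / compile_median
speedup_by_mean = eager_eval_median / compile_mean

print(f"compile median: {compile_median:.6f}s, mean: {compile_mean:.6f}s")
print(f"speedup using median: {speedup_by_median:.2f}x")
print(f"speedup using mean:   {speedup_by_mean:.2f}x")
```

With only ten samples, one warm-up outlier is enough to make a mean-based speedup look far smaller than the steady-state behavior, which is what the median captures.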
Now, let's compare training.
model = init_model()
opt = torch.optim.Adam(model.parameters())


def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()


eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.3411742858886719
eager train time 1: 0.05094604873657226
eager train time 2: 0.04867686462402344
eager train time 3: 0.04928204727172852
eager train time 4: 0.048527359008789066
eager train time 5: 0.0485294075012207
eager train time 6: 0.04844543838500977
eager train time 7: 0.048551902770996094
eager train time 8: 0.04838809585571289
eager train time 9: 0.04867686462402344
~~~~~~~~~~
compile train time 0: 160.207609375
compile train time 1: 2.549231689453125
compile train time 2: 0.022395904541015626
compile train time 3: 0.020785152435302736
compile train time 4: 0.020130815505981444
compile train time 5: 0.020139007568359374
compile train time 6: 0.02016147232055664
compile train time 7: 0.020136959075927736
compile train time 8: 0.020143104553222657
compile train time 9: 0.020135936737060548
~~~~~~~~~~
(train) eager median: 0.04861438369750977, compile median: 0.020152288436889647, speedup: 2.412350530301016x
~~~~~~~~~~
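Again, torch.compile takes longer in the first iteration because it must compile the model, but in subsequent iterations we see a significant speedup over eager mode.
Note that the speedup numbers in this tutorial are for demonstration purposes only. Official speedup values can be found on the TorchInductor performance dashboard.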
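To put the compilation overhead in perspective, one can estimate how many steady-state iterations are needed before the compiled training loop has repaid its slow start. The following is a back-of-the-envelope sketch, not part of the original tutorial: the helper `break_even_iters` is hypothetical, and the inputs are the training timings from the single run printed above, so the result is only indicative.

```python
import math


def break_even_iters(warmup_times, eager_iter, compiled_iter):
    """Estimate how many steady-state compiled iterations repay the warm-up.

    warmup_times: seconds taken by the slow initial compiled iterations.
    eager_iter / compiled_iter: steady-state seconds per iteration.
    """
    saving = eager_iter - compiled_iter
    if saving <= 0:
        raise ValueError("compiled iterations must be faster than eager ones")
    # Extra time spent during warm-up, relative to running those same
    # iterations in eager mode.
    overhead = sum(warmup_times) - len(warmup_times) * eager_iter
    return max(0, math.ceil(overhead / saving))


# Training numbers from the output above: the two slow compiled iterations,
# followed by the reported eager and compiled medians.
n = break_even_iters(
    warmup_times=[160.207609375, 2.549231689453125],
    eager_iter=0.04861438369750977,
    compiled_iter=0.020152288436889647,
)
print(f"break-even after about {n} steady-state iterations")
```

For these numbers the break-even point is in the thousands of iterations, so the compiled version pays off for long training runs rather than for a handful of steps.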
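Conclusion
In this tutorial, we applied torch.compile to training and inference on a real model and demonstrated the resulting speedups.
Importantly, the first few iterations of a compiled model are slower than eager mode because of compilation overhead, but subsequent iterations are expected to be faster.
For a gentle introduction to torch.compile, see the introduction to torch.compile tutorial.
To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, see the torch.compile programming model.
We hope that you will give torch.compile a try!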
Total running time of the script: (3 minutes 38.957 seconds)