(beta) Compiling the Optimizer with torch.compile

Created On: Jan 24, 2024 | Last Updated: Jan 29, 2024 | Last Verified: Nov 05, 2024

Author: Michael Lazos

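The optimizer is a key algorithm for training any deep learning model. Because it is responsible for updating every model parameter, it can often become the bottleneck in training performance for large models. In this recipe, we will apply torch.compile to the optimizer and observe the resulting GPU performance improvement.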

Note

This tutorial requires PyTorch 2.2.0 or later.

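Model Setup

For this example, we'll use a simple sequence of linear layers. Since we are only benchmarking the optimizer, the choice of model doesn't matter, because optimizer performance is a function of the number of parameters.

Depending on what machine you are using, your exact results may vary.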

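import torch

# Ten bias-free 1024x1024 linear layers, created directly on the GPU.
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")
output = model(input)
# Run a backward pass so every parameter has a .grad for the optimizer to use.
output.sum().backward()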
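Since optimizer cost is described above as a function of the number of parameters, it can help to see how many parameters this model actually has. The following is a minimal sketch, not part of the original recipe: it rebuilds the same architecture on the CPU (an assumption made here so it runs without a GPU) under the hypothetical name `model_cpu`, and counts its parameters.

```python
import torch

# Same architecture as the tutorial's model, but built on the CPU
# (an assumption made here so the sketch runs without a GPU).
model_cpu = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False) for _ in range(10)]
)

# Number of parameter tensors and total number of scalar parameters.
num_tensors = sum(1 for _ in model_cpu.parameters())
num_params = sum(p.numel() for p in model_cpu.parameters())

print(f"parameter tensors: {num_tensors}")
print(f"total parameters: {num_params}")
```

Each layer contributes 1024 x 1024 weights and no bias, so the total is 10 x 1,048,576 = 10,485,760 parameters spread across 10 tensors. This is the amount of work each optimizer step has to cover.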

Setting up and running the optimizer benchmark

In this example, we'll use the Adam optimizer and create a helper function that wraps step() in torch.compile().

Note

torch.compile is only supported on CUDA devices with compute capability >= 7.0.

# exit cleanly if we are on a device that doesn't support torch.compile
if torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)


opt = torch.optim.Adam(model.parameters(), lr=0.01)


@torch.compile(fullgraph=False)
def fn():
    opt.step()


# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark


def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6


# Warmup runs to compile the function
for _ in range(5):
    fn()

eager_runtime = benchmark_torch_function_in_microseconds(opt.step)
compiled_runtime = benchmark_torch_function_in_microseconds(fn)

assert eager_runtime > compiled_runtime

print(f"eager runtime: {eager_runtime}us")
print(f"compiled runtime: {compiled_runtime}us")

Sample Results:

eager runtime: 747.2437149845064us
compiled runtime: 392.07384741178us
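To put the sample numbers in perspective, the short sketch below computes the speedup they imply. It is only arithmetic on the two sample timings reported above; the variable names are illustrative, and results on other hardware will differ.

```python
# Sample timings copied from the results above, in microseconds.
eager_runtime_us = 747.2437149845064
compiled_runtime_us = 392.07384741178

# Ratio of eager to compiled time, and the relative reduction in step time.
speedup = eager_runtime_us / compiled_runtime_us
reduction_pct = (1 - compiled_runtime_us / eager_runtime_us) * 100

print(f"speedup: {speedup:.2f}x")
print(f"step time reduced by {reduction_pct:.1f}%")
```

For this particular run, the compiled optimizer step is about 1.9x faster than the eager one, which cuts the step time nearly in half.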

See Also

For an in-depth technical overview, see

Compiling the optimizer with PT2