Engine Caching
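As model sizes increase, so does the cost of compilation. With AOT (ahead-of-time) methods such as torch_tensorrt.dynamo.compile, this cost is paid up front. However, if the weights change, the session ends, or you are using a JIT (just-in-time) method such as torch.compile, graphs are recompiled whenever they are invalidated, and the cost is paid again each time. Engine caching mitigates this cost by saving built engines to disk and reusing them where possible. This tutorial demonstrates how to use engine caching with TensorRT in PyTorch, where reusing previously built TensorRT engines can significantly speed up subsequent model compilations.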
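We will explore two approaches:
- Using torch_tensorrt.dynamo.compile
- Using torch.compile with the TensorRT backend
The example uses a pre-trained ResNet18 model and shows the differences between compiling without caching, compiling with caching enabled, and compiling while reusing cached engines.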
import os
from typing import Dict, Optional
import numpy as np
import torch
import torch_tensorrt as torch_trt
import torchvision.models as models
from torch_tensorrt.dynamo._defaults import TIMING_CACHE_PATH
from torch_tensorrt.dynamo._engine_cache import BaseEngineCache
np.random.seed(0)
torch.manual_seed(0)
model = models.resnet18(pretrained=True).eval().to("cuda")
enabled_precisions = {torch.float}
min_block_size = 1
use_python_runtime = False
def remove_timing_cache(path=TIMING_CACHE_PATH):
    if os.path.exists(path):
        os.remove(path)
Engine Caching for JIT Compilation
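The primary goal of engine caching is to speed up JIT workflows. torch.compile provides a great deal of flexibility in model construction, which makes it a good first tool to try when looking to accelerate a workflow. Historically, however, the cost of compilation, and of recompilation in particular, has been a barrier to entry for many users. Before engine caching was added, if a subgraph was invalidated for any reason, that graph was rebuilt from scratch. Now, when engines are built with cache_built_engines=True, they are saved to disk and tied to a hash of their corresponding PyTorch subgraph. If a subgraph with the same hash comes up in a subsequent compilation, whether in the current session or a new one, the cache pulls the built engine and refits its weights, which can reduce compilation time by orders of magnitude. For this reason, in order to insert a new engine into the cache (i.e. cache_built_engines=True), the engine must be refittable (immutable_weights=False). See Refitting Torch-TensorRT Programs with New Weights for more details.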
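The lookup described above can be illustrated with a toy model. The sketch below is not Torch-TensorRT's implementation: `ToyEngineCache`, `structure_hash`, and the list-based "graph" and "weights" are hypothetical stand-ins chosen only to show why a weight-agnostic key lets a rebuilt model with new weights hit the cache and be refitted instead of rebuilt.

```python
import hashlib
from typing import Dict, List

# Hypothetical stand-ins: a "graph" is a list of op names, "weights" a list of floats.
Graph = List[str]
Engine = Dict[str, object]


def structure_hash(graph: Graph) -> str:
    """Hash only the graph structure, so different weights map to the same key."""
    return hashlib.sha256("|".join(graph).encode("utf-8")).hexdigest()


class ToyEngineCache:
    def __init__(self) -> None:
        self._store: Dict[str, Engine] = {}
        self.builds = 0  # number of expensive builds, tracked for illustration

    def compile(self, graph: Graph, weights: List[float]) -> Engine:
        key = structure_hash(graph)
        cached = self._store.get(key)
        if cached is None:
            # Cache miss: pay the full build cost once and remember the result.
            self.builds += 1
            cached = {"ops": list(graph), "weights": None}
            self._store[key] = cached
        # Hit or miss, return a copy with the caller's weights swapped in ("refit").
        engine = dict(cached)
        engine["weights"] = list(weights)
        return engine


cache = ToyEngineCache()
first = cache.compile(["conv", "relu", "fc"], [0.1, 0.2])
second = cache.compile(["conv", "relu", "fc"], [0.9, 0.8])  # same structure, new weights
third = cache.compile(["conv", "fc"], [0.5])  # different structure, new build
print(cache.builds)  # 2
```

Only two builds happen for three compilations, because the second call differs from the first in weights alone.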
def torch_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration measures the compilation time without engine caching.
    # The 2nd and 3rd iterations measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # Remove the timing cache and reset dynamo so that only engine caching is measured
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile()
Engine Caching for AOT Compilation
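As with the JIT workflow, AOT workflows can benefit from engine caching. When the same architecture or a common subgraph is recompiled, the cache pulls the previously built engine and refits its weights.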
def dynamo_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    example_inputs = (torch.randn((100, 3, 224, 224)).to("cuda"),)
    # Mark dim 0 of the inputs as dynamic
    batch = torch.export.Dim("batch", min=1, max=200)
    exp_program = torch.export.export(
        model, args=example_inputs, dynamic_shapes={"x": {0: batch}}
    )

    # The 1st iteration measures the compilation time without engine caching.
    # The 2nd and 3rd iterations measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100 + i, 3, 224, 224)).to("cuda")]
        remove_timing_cache()  # remove the timing cache so that only engine caching is measured

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        trt_gm = torch_trt.dynamo.compile(
            exp_program,
            tuple(inputs),
            use_python_runtime=use_python_runtime,
            enabled_precisions=enabled_precisions,
            min_block_size=min_block_size,
            immutable_weights=False,
            cache_built_engines=cache_built_engines,
            reuse_cached_engines=reuse_cached_engines,
            engine_cache_size=1 << 30,  # 1GB
        )
        # output = trt_gm(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------dynamo_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


dynamo_compile()
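Both workflows above take the same caching flags, so it can be convenient to assemble them in one place. The helper below is a minimal sketch: `engine_cache_options` is a hypothetical name, not part of Torch-TensorRT, and it only builds a plain dictionary from the option names used in this tutorial (including engine_cache_dir, which is covered in the next section).

```python
from typing import Any, Dict, Optional


def engine_cache_options(
    enable: bool = True,
    cache_dir: Optional[str] = None,
    cache_size_bytes: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect the engine caching settings used in this tutorial (hypothetical helper)."""
    options: Dict[str, Any] = {
        # Engines must be refittable before they can be inserted into the cache.
        "immutable_weights": False,
        "cache_built_engines": enable,
        "reuse_cached_engines": enable,
    }
    # Only override the defaults when the caller asks for it.
    if cache_dir is not None:
        options["engine_cache_dir"] = cache_dir
    if cache_size_bytes is not None:
        options["engine_cache_size"] = cache_size_bytes
    return options


opts = engine_cache_options(cache_size_bytes=1 << 30)  # 1GB, as in dynamo_compile
print(opts)
```

Under these assumptions, the dictionary would be merged into `options={...}` for torch.compile with the TensorRT backend, or unpacked as keyword arguments (`**opts`) for torch_trt.dynamo.compile.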
Custom Engine Cache
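By default, the engine cache is stored in the system's temporary directory. Both the cache directory and the size limit can be customized by passing engine_cache_dir and engine_cache_size. Users can also define their own engine cache implementation by extending the BaseEngineCache class, which allows for remote or shared caching if needed.
A custom engine cache should implement the following methods:
- save: Saves the engine blob to the cache.
- load: Loads the engine blob from the cache.
The hash provided by the cache system is a weight-agnostic hash of the originating PyTorch subgraph (post lowering). The blob contains the serialized engine, calling spec data, and weight map information in pickle format.
Below is an example of a custom engine cache implementation, a RAMEngineCache.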
class RAMEngineCache(BaseEngineCache):
    def __init__(
        self,
    ) -> None:
        """
        Constructs a user held engine cache in memory.
        """
        self.engine_cache: Dict[str, bytes] = {}

    def save(
        self,
        hash: str,
        blob: bytes,
    ):
        """
        Insert the engine blob to the cache.

        Args:
            hash (str): The hash key to associate with the engine blob.
            blob (bytes): The engine blob to be saved.

        Returns:
            None
        """
        self.engine_cache[hash] = blob

    def load(self, hash: str) -> Optional[bytes]:
        """
        Load the engine blob from the cache.

        Args:
            hash (str): The hash key of the engine to load.

        Returns:
            Optional[bytes]: The engine blob if found, None otherwise.
        """
        if hash in self.engine_cache:
            return self.engine_cache[hash]
        else:
            return None


def torch_compile_my_cache(iterations=3):
    times = []
    engine_cache = RAMEngineCache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration measures the compilation time without engine caching.
    # The 2nd and 3rd iterations measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # Remove the timing cache and reset dynamo so that only engine caching is measured
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
                "custom_engine_cache": engine_cache,
            },
        )
        compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile_my_cache()
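RAMEngineCache keeps blobs in process memory, so they are lost when the session ends. A file-backed variant with the same save/load shape could be sketched as follows. `DiskEngineCache` is a hypothetical name, and it is written as a plain class so that the snippet runs without Torch-TensorRT installed; to pass it as custom_engine_cache it would presumably need to extend BaseEngineCache as RAMEngineCache does. The sketch also assumes the hash string is safe to use as a file name.

```python
import os
import tempfile
from typing import Optional


class DiskEngineCache:
    """File-backed cache with the same save/load shape as RAMEngineCache.

    Assumption: a standalone class for illustration only; a real custom
    cache would subclass BaseEngineCache.
    """

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, hash: str) -> str:
        # Assumes the hash contains only file-name-safe characters.
        return os.path.join(self.cache_dir, f"{hash}.bin")

    def save(self, hash: str, blob: bytes) -> None:
        # Write to a temporary file, then rename, so readers never see a partial blob.
        tmp_path = self._path(hash) + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(blob)
        os.replace(tmp_path, self._path(hash))

    def load(self, hash: str) -> Optional[bytes]:
        try:
            with open(self._path(hash), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


cache_dir = tempfile.mkdtemp()
disk_cache = DiskEngineCache(cache_dir)
disk_cache.save("abc123", b"engine-bytes")
print(disk_cache.load("abc123"))  # b'engine-bytes'
print(disk_cache.load("missing"))  # None
```

Because the blobs live on disk, a second `DiskEngineCache` pointed at the same directory, for example in a new session or on a shared volume, would see the entries written by the first.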