Engine Caching

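As model sizes grow, so does the cost of compilation. With AOT methods like torch_tensorrt.dynamo.compile, this cost is paid upfront. However, if the weights change, the session ends, or you are using a JIT method like torch.compile, graphs are recompiled whenever they are invalidated, so the cost is paid repeatedly. Engine caching mitigates this cost by saving built engines to disk and reusing them when possible. This tutorial demonstrates how to use engine caching with TensorRT in PyTorch to significantly speed up subsequent model compilations.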

We will explore two approaches:

Using torch_tensorrt.dynamo.compile

Using torch.compile with the TensorRT backend

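The example uses a pre-trained ResNet18 model and shows the differences between compilation without caching, compilation with caching enabled, and compilation that reuses cached engines.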

import os
from typing import Dict, Optional

import numpy as np
import torch
import torch_tensorrt as torch_trt
import torchvision.models as models
from torch_tensorrt.dynamo._defaults import TIMING_CACHE_PATH
from torch_tensorrt.dynamo._engine_cache import BaseEngineCache

np.random.seed(0)
torch.manual_seed(0)

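# `pretrained=True` is deprecated in recent torchvision releases; request the default weights explicitly.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).to("cuda").eval()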
enabled_precisions = {torch.float}
min_block_size = 1
use_python_runtime = False


def remove_timing_cache(path=TIMING_CACHE_PATH):
    if os.path.exists(path):
        os.remove(path)

Engine Caching for JIT Compilation

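The primary goal of engine caching is to speed up JIT workflows. torch.compile provides a great deal of flexibility in model construction, which makes it a good first tool to try when looking to speed up a workflow. Historically, however, the cost of compilation, and of recompilation in particular, has been a barrier to entry for many users. Before engine caching was added, a subgraph that was invalidated for any reason had to be rebuilt from scratch. Now, when engines are built with cache_built_engines=True, they are saved to disk and tied to a hash of their corresponding PyTorch subgraph. If a subsequent compilation, whether in this session or a new one, encounters a subgraph with the same hash, the cache pulls the built engine and refits the weights, which can reduce compilation times by orders of magnitude. Because of this, an engine must be refittable (immutable_weights=False) in order to be inserted into the cache (i.e. cache_built_engines=True). See Refitting Torch-TensorRT Programs with New Weights for more details.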
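The lookup-or-build flow described above can be sketched in plain Python. This is only an illustration of the idea, not Torch-TensorRT's implementation: the function get_engine, the string "subgraph-hash", and the fake build callable are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple


def get_engine(
    graph_hash: str,
    build: Callable[[], bytes],
    cache: Dict[str, bytes],
    cache_built_engines: bool,
    reuse_cached_engines: bool,
) -> Tuple[bytes, str]:
    """Toy lookup-or-build flow (hypothetical helper, not a library API)."""
    if reuse_cached_engines and graph_hash in cache:
        # Cache hit: the expensive build is skipped. A real workflow would
        # then refit the loaded engine with the current weights.
        return cache[graph_hash], "hit"
    engine = build()  # stands in for the expensive engine build
    if cache_built_engines:
        cache[graph_hash] = engine
    return engine, "miss"


cache: Dict[str, bytes] = {}
statuses: List[str] = []
# Mirror the three iterations used in this tutorial:
# no caching, caching enabled (build and save), caching enabled (reuse).
for use_cache in (False, True, True):
    _, status = get_engine(
        "subgraph-hash", lambda: b"engine", cache, use_cache, use_cache
    )
    statuses.append(status)

print(statuses)  # ['miss', 'miss', 'hit']
```

Only the third call avoids the build, which is why the benchmarks in this tutorial run three iterations.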

def torch_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # remove timing cache and reset dynamo just for engine caching measurement
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
            },
        )
        with torch.no_grad():
            compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile()

Engine Caching for AOT Compilation

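As with the JIT workflow, AOT workflows can also benefit from engine caching. When the same architecture or a common subgraph is recompiled, the cache pulls the previously built engine and refits the weights.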

def dynamo_compile(iterations=3):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    example_inputs = (torch.randn((100, 3, 224, 224)).to("cuda"),)
    # Mark the dim0 of inputs as dynamic
    batch = torch.export.Dim("batch", min=1, max=200)
    exp_program = torch.export.export(
        model, args=example_inputs, dynamic_shapes={"x": {0: batch}}
    )

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100 + i, 3, 224, 224)).to("cuda")]
        remove_timing_cache()  # remove timing cache just for engine caching measurement
        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        trt_gm = torch_trt.dynamo.compile(
            exp_program,
            tuple(inputs),
            use_python_runtime=use_python_runtime,
            enabled_precisions=enabled_precisions,
            min_block_size=min_block_size,
            immutable_weights=False,
            cache_built_engines=cache_built_engines,
            reuse_cached_engines=reuse_cached_engines,
            engine_cache_size=1 << 30,  # 1GB
        )
        # output = trt_gm(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------dynamo_compile----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


dynamo_compile()

Custom Engine Cache

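By default, the engine cache is stored in the system's temporary directory. Both the cache directory and the size limit can be customized by passing engine_cache_dir and engine_cache_size. Users can also define their own engine cache implementation by subclassing BaseEngineCache, which allows for remote or shared caching if desired.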
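As a minimal sketch of customizing the cache location and size, the options below reuse the option names that appear in this tutorial. The directory name my_trt_engine_cache is a hypothetical choice, and the final torch.compile call is left as a comment because it needs a CUDA device and the model defined earlier.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory should work.
cache_dir = os.path.join(tempfile.gettempdir(), "my_trt_engine_cache")

cache_options = {
    "immutable_weights": False,      # engines must be refittable to be cached
    "cache_built_engines": True,     # save newly built engines
    "reuse_cached_engines": True,    # load engines on a cache hit
    "engine_cache_dir": cache_dir,   # custom cache directory
    "engine_cache_size": 1 << 30,    # size limit in bytes (1 GiB)
}

# With the model from this tutorial, the options would be passed as:
# compiled_model = torch.compile(model, backend="tensorrt", options=cache_options)
```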

A custom engine cache should implement the following methods:

save: Save the engine blob to the cache.
load: Load the engine blob from the cache.

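The hash provided by the cache system is a weight-agnostic hash of the originating PyTorch subgraph (post-lowering). The blob contains the serialized engine, calling spec data, and weight map information, stored in pickle format.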
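To make "weight-agnostic" concrete, here is a toy sketch that hashes only the structure of a graph (operator names and weight shapes) and ignores the weight values. The node representation and the structure_hash function are hypothetical; the actual hashing in Torch-TensorRT works on the lowered PyTorch subgraph and may differ in every detail.

```python
import hashlib
from typing import List, Tuple

# Hypothetical node format: (op name, weight shape, weight values)
Node = Tuple[str, Tuple[int, ...], List[float]]


def structure_hash(nodes: List[Node]) -> str:
    """Hash op names and shapes only; weight values are deliberately ignored."""
    digest = hashlib.sha256()
    for op, shape, _values in nodes:
        digest.update(f"{op}:{shape};".encode())
    return digest.hexdigest()


graph_a = [("linear", (2, 2), [1.0, 2.0, 3.0, 4.0]), ("relu", (), [])]
graph_b = [("linear", (2, 2), [9.0, 8.0, 7.0, 6.0]), ("relu", (), [])]  # new weights
graph_c = [("linear", (4, 2), [0.0] * 8), ("relu", (), [])]             # new shape

hash_a = structure_hash(graph_a)
hash_b = structure_hash(graph_b)
hash_c = structure_hash(graph_c)

print(hash_a == hash_b)  # True: same structure, different weights
print(hash_a == hash_c)  # False: the structure changed
```

Because the key does not depend on the weight values, a model whose weights have changed can still hit the cache and then be refit.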

Below is an example of a custom engine cache implementation, RAMEngineCache, which keeps engines in memory.

class RAMEngineCache(BaseEngineCache):
    def __init__(
        self,
    ) -> None:
        """
        Constructs a user held engine cache in memory.
        """
        self.engine_cache: Dict[str, bytes] = {}

    def save(
        self,
        hash: str,
        blob: bytes,
    ):
        """
        Insert the engine blob to the cache.

        Args:
            hash (str): The hash key to associate with the engine blob.
            blob (bytes): The engine blob to be saved.

        Returns:
            None
        """
        self.engine_cache[hash] = blob

    def load(self, hash: str) -> Optional[bytes]:
        """
        Load the engine blob from the cache.

        Args:
            hash (str): The hash key of the engine to load.

        Returns:
            Optional[bytes]: The engine blob if found, None otherwise.
        """
        if hash in self.engine_cache:
            return self.engine_cache[hash]
        else:
            return None


def torch_compile_my_cache(iterations=3):
    times = []
    engine_cache = RAMEngineCache()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # The 1st iteration is to measure the compilation time without engine caching
    # The 2nd and 3rd iterations are to measure the compilation time with engine caching.
    # Since the 2nd iteration needs to compile and save the engine, it will be slower than the 1st iteration.
    # The 3rd iteration should be faster than the 1st iteration because it loads the cached engine.
    for i in range(iterations):
        inputs = [torch.rand((100, 3, 224, 224)).to("cuda")]
        # remove timing cache and reset dynamo just for engine caching measurement
        remove_timing_cache()
        torch._dynamo.reset()

        if i == 0:
            cache_built_engines = False
            reuse_cached_engines = False
        else:
            cache_built_engines = True
            reuse_cached_engines = True

        start.record()
        compiled_model = torch.compile(
            model,
            backend="tensorrt",
            options={
                "use_python_runtime": True,
                "enabled_precisions": enabled_precisions,
                "min_block_size": min_block_size,
                "immutable_weights": False,
                "cache_built_engines": cache_built_engines,
                "reuse_cached_engines": reuse_cached_engines,
                "custom_engine_cache": engine_cache,
            },
        )
        with torch.no_grad():
            compiled_model(*inputs)  # trigger the compilation
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    print("----------------torch_compile_my_cache----------------")
    print("disable engine caching, used:", times[0], "ms")
    print("enable engine caching to cache engines, used:", times[1], "ms")
    print("enable engine caching to reuse engines, used:", times[2], "ms")


torch_compile_my_cache()
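The same save/load interface can back a cache that several processes share through a directory. The sketch below is a hypothetical, standalone class so that it runs without a GPU; to plug it into Torch-TensorRT it would presumably need to subclass BaseEngineCache, as RAMEngineCache does above. The class name and the .bin file layout are assumptions made for illustration.

```python
import os
import tempfile
from typing import Optional


class DirEngineCacheSketch:
    """Stores each engine blob as <cache_dir>/<hash>.bin (hypothetical layout)."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, hash: str) -> str:
        return os.path.join(self.cache_dir, f"{hash}.bin")

    def save(self, hash: str, blob: bytes) -> None:
        # Write to a temporary file first, then rename, so that a reader
        # never observes a partially written blob.
        tmp_path = self._path(hash) + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(blob)
        os.replace(tmp_path, self._path(hash))

    def load(self, hash: str) -> Optional[bytes]:
        try:
            with open(self._path(hash), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None  # cache miss


# Two cache objects pointing at the same directory see each other's entries.
shared_dir = tempfile.mkdtemp()
writer = DirEngineCacheSketch(shared_dir)
reader = DirEngineCacheSketch(shared_dir)
writer.save("abc123", b"engine-bytes")
hit = reader.load("abc123")
miss = reader.load("not-present")
```

Unlike RAMEngineCache, entries stored this way outlive the Python session, at the cost of disk I/O on every save and load.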
