Post Training Quantization (PTQ)

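Post-training quantization (PTQ) reduces the compute required for inference while preserving model accuracy by mapping the conventional FP32 activation space onto a smaller INT8 space. TensorRT does this with a calibration step: it runs the model on sample data from the target domain, tracks the activations in FP32, and calibrates an FP32-to-INT8 mapping that minimizes the information loss between FP32 inference and INT8 inference.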
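To make the FP32-to-INT8 mapping concrete, here is a minimal sketch in plain Python. It assumes a symmetric mapping whose range is taken from the largest magnitude seen in the calibration samples. That is a simplification chosen for illustration; it is not how TensorRT's default entropy calibrator picks its range, and the function names are hypothetical.

```python
# Illustrative sketch only: symmetric max-abs INT8 quantization.
# TensorRT's calibrators select the range differently; this only shows
# what "mapping FP32 activations to INT8" means.

def compute_scale(calibration_values):
    """Return the FP32 step size that maps the observed range onto [-127, 127]."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(value, scale):
    """Map one FP32 value to an INT8 code, saturating at the range limits."""
    code = round(value / scale)
    return max(-127, min(127, code))

def dequantize(code, scale):
    """Map an INT8 code back to an approximate FP32 value."""
    return code * scale

# Hypothetical activations recorded while running calibration data.
calibration_activations = [-2.54, -0.5, 0.0, 0.317, 1.26, 2.0]

scale = compute_scale(calibration_activations)
codes = [quantize(v, scale) for v in calibration_activations]
restored = [dequantize(c, scale) for c in codes]

# The information loss is the gap between the original and restored values.
max_error = max(abs(a - b) for a, b in zip(calibration_activations, restored))
print(codes)
print(max_error)
```

Values inside the calibrated range are off by at most half a step (`scale / 2`), while values outside it saturate at ±127. This is why the calibration data should resemble the data seen at inference time.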

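Users writing TensorRT applications must set up a calibrator class that supplies sample data to the TensorRT calibrator. Torch-TensorRT aims to make calibrators easier to implement by reusing existing PyTorch infrastructure.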

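LibTorch provides DataLoader and Dataset APIs that streamline preprocessing and batching of input data. These APIs are exposed through both the C++ and Python interfaces. In C++, we use torch::Dataset and torch::data::make_data_loader objects to construct datasets and preprocess them. The equivalent functionality in Python uses torch.utils.data.Dataset and torch.utils.data.DataLoader. These sections of the PyTorch documentation have more information: https://pytorch.ac.cn/tutorials/advanced/cpp_frontend.html#loading-data and https://pytorch.ac.cn/tutorials/recipes/recipes/loading_data_recipe.html.

Torch-TensorRT uses DataLoaders as the basis of a generic calibrator implementation. You can therefore reuse or quickly implement a torch::Dataset for your target domain, place it in a DataLoader, and create an INT8 calibrator that you hand to Torch-TensorRT so that INT8 calibration runs while your module is compiled.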

How to create your own PTQ application in C++

Here is an example interface of a torch::Dataset class for CIFAR10:

//cpp/ptq/datasets/cifar10.h
#pragma once

#include "torch/data/datasets/base.h"
#include "torch/data/example.h"
#include "torch/types.h"

#include <cstddef>
#include <string>

namespace datasets {
// The CIFAR10 Dataset
class CIFAR10 : public torch::data::datasets::Dataset<CIFAR10> {
public:
    // The mode in which the dataset is loaded
    enum class Mode { kTrain, kTest };

    // Loads CIFAR10 from un-tarred file
    // Dataset can be found https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
    // Root path should be the directory that contains the content of tarball
    explicit CIFAR10(const std::string& root, Mode mode = Mode::kTrain);

    // Returns the pair at index in the dataset
    torch::data::Example<> get(size_t index) override;

    // The size of the dataset
    c10::optional<size_t> size() const override;

    // The mode the dataset is in
    bool is_train() const noexcept;

    // Returns all images stacked into a single tensor
    const torch::Tensor& images() const;

    // Returns all targets stacked into a single tensor
    const torch::Tensor& targets() const;

    // Trims the dataset to the first n pairs
    CIFAR10&& use_subset(int64_t new_size);


private:
    Mode mode_;
    torch::Tensor images_, targets_;
};
} // namespace datasets

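The implementation of this class reads from the binary distribution of the CIFAR10 dataset and builds two tensors that hold the images and the labels.

We calibrate on a subset of the dataset, because calibration takes time and the full dataset is not needed for it to be effective. We then define the preprocessing to apply to the images and create a DataLoader from the dataset, which batches the data: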

auto calibration_dataset = datasets::CIFAR10(data_dir, datasets::CIFAR10::Mode::kTest)
                                    .use_subset(320)
                                    .map(torch::data::transforms::Normalize<>({0.4914, 0.4822, 0.4465},
                                                                            {0.2023, 0.1994, 0.2010}))
                                    .map(torch::data::transforms::Stack<>());
auto calibration_dataloader = torch::data::make_data_loader(std::move(calibration_dataset),
                                                            torch::data::DataLoaderOptions().batch_size(32)
                                                                                            .workers(2));
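The shape of that pipeline (take a subset, normalize each sample, group into batches) can be sketched without LibTorch. The snippet below is a hypothetical plain-Python stand-in, not the torch::data API: `normalize` and `make_batches` are illustrative names, and each fake "image" is a single RGB pixel.

```python
# Hypothetical stand-in for the calibration pipeline: subset -> normalize -> batch.
# The mean and std values are the per-channel CIFAR10 statistics used above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)

def normalize(pixel):
    """Normalize one (r, g, b) pixel with the per-channel mean and std."""
    return tuple((c - m) / s for c, m, s in zip(pixel, MEAN, STD))

def make_batches(samples, subset_size, batch_size):
    """Keep the first subset_size samples, normalize them, and group them into batches."""
    subset = [normalize(s) for s in samples[:subset_size]]
    return [subset[i:i + batch_size] for i in range(0, len(subset), batch_size)]

# 10,000 fake one-pixel "images" standing in for the CIFAR10 test split.
fake_dataset = [(0.5, 0.5, 0.5)] * 10000

batches = make_batches(fake_dataset, subset_size=320, batch_size=32)
print(len(batches), len(batches[0]))  # 10 32
```

With a subset of 320 samples and a batch size of 32, the calibrator sees 10 batches. Choosing a subset size that is a multiple of the batch size avoids a short final batch.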

Next, we create a calibrator from calibration_dataloader using the calibrator factory (found in torch_tensorrt/ptq.h):

#include "torch_tensorrt/ptq.h"
...

auto calibrator = torch_tensorrt::ptq::make_int8_calibrator(std::move(calibration_dataloader), calibration_cache_file, true);

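Here we also specify where to write the calibration cache file, which lets us reuse the calibration data without needing the dataset, and whether to use the cache file if it already exists. There is also a torch_tensorrt::ptq::make_int8_cache_calibrator factory, which creates a calibrator that uses only the cache. It is intended for cases where you build engines on a machine with limited storage (that is, no room for the full dataset) or want a simpler deployment application.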
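The interaction between the cache file and the "use cache" flag can be sketched as follows. This is a hypothetical plain-Python model of the behavior described above, not the Torch-TensorRT implementation: `CachingCalibrator` is an invented name, the "calibration result" is reduced to a single max-abs scale, and the real cache file format is defined by TensorRT.

```python
import os
import tempfile

class CachingCalibrator:
    """Hypothetical sketch of the cache behavior; not the Torch-TensorRT class."""

    def __init__(self, batches, cache_file, use_cache):
        self.batches = batches
        self.cache_file = cache_file
        self.use_cache = use_cache
        self.batches_consumed = 0

    def calibrate(self):
        # Reuse the cache when allowed and present, so no data is needed.
        if self.use_cache and os.path.exists(self.cache_file):
            with open(self.cache_file) as f:
                return float(f.read())
        # Otherwise walk the data and record the largest magnitude seen.
        max_abs = 0.0
        for batch in self.batches:
            self.batches_consumed += 1
            max_abs = max(max_abs, max(abs(v) for v in batch))
        scale = max_abs / 127.0
        # Persist the result so that later builds can skip the dataset.
        with open(self.cache_file, "w") as f:
            f.write(repr(scale))
        return scale

cache_path = os.path.join(tempfile.mkdtemp(), "calibration.cache")

# First build: no cache exists yet, so the data is consumed and the cache is written.
first = CachingCalibrator([[0.5, -1.27], [0.2, 0.9]], cache_path, use_cache=True)
scale_from_data = first.calibrate()

# Second build: the cache exists, so an empty dataset is sufficient.
second = CachingCalibrator([], cache_path, use_cache=True)
scale_from_cache = second.calibrate()

print(first.batches_consumed, second.batches_consumed)  # 2 0
```

The second calibrator plays the role of a cache-only calibrator: it never touches the dataset, which is what makes engine builds possible on machines that cannot hold the data.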

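The calibrator factories create a calibrator that inherits from the nvinfer1::IInt8Calibrator virtual class (nvinfer1::IInt8EntropyCalibrator2 by default), which determines the calibration algorithm used during calibration. You can select the calibration algorithm explicitly like this: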

// MinMax Calibrator is geared more towards NLP tasks
auto calibrator = torch_tensorrt::ptq::make_int8_calibrator<nvinfer1::IInt8MinMaxCalibrator>(std::move(calibration_dataloader), calibration_cache_file, true);

Then, all that is required to set up the module for INT8 calibration is to set the following compile settings in the torch_tensorrt::CompileSpec struct and compile the module:

std::vector<std::vector<int64_t>> input_shape = {{32, 3, 32, 32}};
/// Configure settings for compilation
auto compile_spec = torch_tensorrt::CompileSpec({input_shape});
/// Enable FP16 and INT8 precisions
compile_spec.enabled_precisions.insert(torch::kF16);
compile_spec.enabled_precisions.insert(torch::kI8);
/// Use the TensorRT Entropy Calibrator
compile_spec.ptq_calibrator = calibrator;

auto trt_mod = torch_tensorrt::CompileGraph(mod, compile_spec);

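If you already have a calibrator implemented for TensorRT, you can set the ptq_calibrator field to a pointer to your calibrator and it will work as well. From here, execution changes little: you can still use LibTorch as the sole interface for inference. Data should remain in FP32 precision when it is passed to trt_mod.forward. The Torch-TensorRT demos include an example application that goes from training a VGG16 network on CIFAR10 to deploying it in INT8 with Torch-TensorRT: https://github.com/pytorch/TensorRT/tree/master/cpp/ptq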

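How to create your own PTQ application in Python

The Torch-TensorRT Python API provides a simple and convenient way to use PyTorch dataloaders with TensorRT calibrators. The DataLoaderCalibrator class creates a TensorRT calibrator from the configuration you supply. The following code shows an example of how to use it: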

import torch
import torch_tensorrt
import torchvision
from torchvision import transforms

testing_dataset = torchvision.datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    ),
)

testing_dataloader = torch.utils.data.DataLoader(
    testing_dataset, batch_size=1, shuffle=False, num_workers=1
)
calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    testing_dataloader,
    cache_file="./calibration.cache",
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

trt_mod = torch_tensorrt.compile(model, inputs=[torch_tensorrt.Input((1, 3, 32, 32))],
                                    enabled_precisions={torch.float, torch.half, torch.int8},
                                    calibrator=calibrator,
                                    device={
                                         "device_type": torch_tensorrt.DeviceType.GPU,
                                         "gpu_id": 0,
                                         "dla_core": 0,
                                         "allow_gpu_fallback": False,
                                         "disable_tf32": False
                                     })

When a calibration cache file already exists and you want to use it, CacheCalibrator can be used without any dataloader. The following example shows how to use CacheCalibrator in INT8 mode:

calibrator = torch_tensorrt.ptq.CacheCalibrator("./calibration.cache")

trt_mod = torch_tensorrt.compile(model, inputs=[torch_tensorrt.Input([1, 3, 32, 32])],
                                      enabled_precisions={torch.float, torch.half, torch.int8},
                                      calibrator=calibrator)

If you already have a calibrator class implemented directly with the TensorRT API, you can set the calibrator field to your calibrator, which can be very convenient. For a demo of how to perform PTQ on a VGG network with the Torch-TensorRT API, see https://github.com/pytorch/TensorRT/blob/master/tests/py/test_ptq_dataloader_calibrator.py and https://github.com/pytorch/TensorRT/blob/master/tests/py/test_ptq_trt_calibrator.py

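Citations

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.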
