
TorchInductor C++ Wrapper Tutorial

Authors: Chunyuan Wu, Bin Bao, Jiong Gong

Prerequisites:

Introduction

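In torch.compile, the default backend, TorchInductor, generates Python wrapper code that manages memory allocation and kernel invocation. This design is flexible and easy to debug, but in performance-sensitive environments the interpreted nature of Python adds runtime overhead.

To address this limitation, TorchInductor provides a special mode that generates C++ wrapper code in place of the Python wrapper, so that execution is faster and involves Python as little as possible.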

Enabling the C++ wrapper mode

To enable the C++ wrapper mode for TorchInductor, add the following configuration to your code:

import torch._inductor.config as config
config.cpp_wrapper = True

Example code

We will use the following model code as an example:

import torch
import torch._inductor.config as config

config.cpp_wrapper = True

def fn(x, y):
    return (x + y).sum()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)
y = torch.randn(128, 128, device=device)

opt_fn = torch.compile(fn)
result = opt_fn(x, y)

For CPU

With the default Python wrapper, the main part of the code generated by TorchInductor looks like this:

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def call(self, args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (128, 128), (128, 1))
        assert_size_stride(arg1_1, (128, 128), (128, 1))
        buf0 = empty_strided_cpu((), (), torch.float32)
        cpp_fused_add_sum_0(arg0_1, arg1_1, buf0)
        del arg0_1
        del arg1_1
        return (buf0, )
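
The two assert_size_stride guards in the listing check that each input has shape (128, 128) with row-major strides (128, 1), and the scalar result buffer is allocated with empty shape and strides. The following is a minimal pure-Python sketch of that relationship. The helpers contiguous_strides and check_size_stride are hypothetical names introduced here for illustration; they are not PyTorch functions, and the real guard may differ in detail.

```python
def contiguous_strides(shape):
    """Return row-major (C-contiguous) strides, counted in elements, for shape."""
    strides = []
    step = 1
    # Walk the dimensions from innermost to outermost.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def check_size_stride(shape, stride, expected_shape, expected_stride):
    """Hypothetical simplified analogue of the guard: raise on any mismatch."""
    if tuple(shape) != tuple(expected_shape):
        raise AssertionError(f"expected size {expected_shape}, got {shape}")
    if tuple(stride) != tuple(expected_stride):
        raise AssertionError(f"expected stride {expected_stride}, got {stride}")


# A contiguous 128 x 128 input passes the guard used in the listing.
input_shape = (128, 128)
input_stride = contiguous_strides(input_shape)
check_size_stride(input_shape, input_stride, (128, 128), (128, 1))

# A 0-dimensional (scalar) buffer has empty shape and empty strides.
scalar_stride = contiguous_strides(())
```

Under these assumptions, a transposed view of the same input would report strides (1, 128) and fail the check, which is the situation the guards are there to catch before the kernel reads memory with the wrong layout.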

With the C++ wrapper enabled, the generated code for the call function becomes a C++ function, inductor_entry_impl:

cpp_wrapper_src = (
r'''
#include <torch/csrc/inductor/cpp_wrapper/cpu.h>
extern "C"  void  cpp_fused_add_sum_0(const float* in_ptr0,
                    const float* in_ptr1,
                    float* out_ptr0);
CACHE_TORCH_DTYPE(float32);
CACHE_TORCH_DEVICE(cpu);

void inductor_entry_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                        // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles  // array for writing output AtenTensorHandle; handles
                        // will be stolen by the caller; the array itself is
                        // borrowed)
) {
    py::gil_scoped_release_simple release;

    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 2);
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);
    static constexpr int64_t *int_array_0=nullptr;
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(0, int_array_0, int_array_0, cached_torch_dtype_float32, cached_torch_device_type_cpu,  0, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    cpp_fused_add_sum_0((const float*)(arg0_1.data_ptr()), (const float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    output_handles[0] = buf0.release();
} // inductor_entry_impl
...
'''
)

inductor_entry = CppWrapperCodeCache.load_pybinding(
    argtypes=["std::vector<AtenTensorHandle>"],
    main_code=cpp_wrapper_src,
    device_type="cpu",
    num_outputs=1,
    kernel_code=None,
)

call = _wrap_func(inductor_entry)
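
The comments in inductor_entry_impl describe an ownership convention: input handles are stolen by the callee, output handles are handed over to the caller, and the arrays themselves are only borrowed. The sketch below models that convention in plain Python to make the difference between reset() and release() concrete. OwnedHandle and entry are hypothetical names used only for this illustration; they are not the real RAIIAtenTensorHandle or the generated entry point, and strings stand in for raw handles.

```python
class OwnedHandle:
    """Hypothetical Python model of an owning handle wrapper."""

    def __init__(self, raw, freed_log):
        self._raw = raw
        self._freed_log = freed_log  # records which raw handles were freed

    def reset(self):
        """Free the owned resource now and become empty."""
        if self._raw is not None:
            self._freed_log.append(self._raw)
            self._raw = None

    def release(self):
        """Give up ownership without freeing; the receiver is now responsible."""
        raw, self._raw = self._raw, None
        return raw


def entry(input_handles, output_handles, freed_log):
    # Take ownership of every input handle; the list itself stays with the caller.
    inputs = [OwnedHandle(raw, freed_log) for raw in input_handles]
    # Stand-in for allocating the output buffer.
    buf0 = OwnedHandle("buf0", freed_log)
    # (The kernel call would happen here.)
    for handle in inputs:
        handle.reset()  # inputs are no longer needed, so free them early
    # Transfer ownership of the result to the caller without freeing it.
    output_handles[0] = buf0.release()


freed = []
outputs = [None]
entry(["arg0_1", "arg1_1"], outputs, freed)
```

In this model the two inputs end up in the freed log while buf0 does not, mirroring how the generated code frees its inputs with reset() but passes the result out with release().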

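For GPU

For GPU, the Python wrapper that TorchInductor generates by default looks like the following. Note that this particular listing takes a single input and calls a fused add/lift_fresh Triton kernel, so it does not correspond line for line to the fn defined in the example code.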

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((19, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_lift_fresh_0.run(constant0, arg0_1, buf0, 19, grid=grid(19), stream=stream0)
        run_intermediate_hooks('add', buf0)
        del arg0_1
        return (buf0, )

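With the C++ wrapper enabled, equivalent C++ code is generated and loaded as follows. The listing shows the Python-side loading code; the C++ source itself is held in cpp_wrapper_src.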

inductor_entry = CppWrapperCodeCache.load_pybinding(
    argtypes=["std::vector<AtenTensorHandle>"],
    main_code=cpp_wrapper_src,
    device_type="cuda",
    num_outputs=1,
    kernel_code=None,
)

def _wrap_func(f):
    def g(args):
        input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg, device='cpu') for arg in args]
        input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors)

        args.clear()
        del input_tensors

        output_handles = f(input_handles)
        output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles)
        return output_tensors

    return g

call = _wrap_func(inductor_entry)
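
The calling convention that _wrap_func sets up can be imitated without PyTorch: the caller passes a list, the wrapper promotes non-tensor arguments, empties the caller's list, and returns the outputs as a list. The following is a minimal sketch under those assumptions. FakeTensor, fake_entry, and wrap_func are hypothetical stand-ins for torch.Tensor, the compiled inductor_entry, and _wrap_func; the handle conversion done through torch._C._aoti is left out.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor that wraps a list of floats."""

    def __init__(self, values):
        self.values = [float(v) for v in values]


def fake_entry(inputs):
    """Hypothetical stand-in for the compiled entry point: computes (x + y).sum()."""
    x, y = inputs
    total = sum(a + b for a, b in zip(x.values, y.values))
    return [FakeTensor([total])]


def wrap_func(f):
    """Adapt f to the list-in, list-out convention of the generated call()."""

    def g(args):
        # Promote plain Python scalars to tensors, as the real wrapper does
        # with torch.tensor(...).
        inputs = [a if isinstance(a, FakeTensor) else FakeTensor([a]) for a in args]
        # Empty the caller's list so it no longer keeps the inputs alive.
        args.clear()
        return f(inputs)

    return g


call = wrap_func(fake_entry)
args = [FakeTensor([1.0, 2.0]), FakeTensor([3.0, 4.0])]
result = call(args)
```

After the call, args is empty and result holds a single output. Passing a mutable list and clearing it lets the wrapper drop the caller's references to the inputs as soon as they have been handed over.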

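Conclusion

This tutorial introduced the C++ wrapper feature in TorchInductor, which aims to speed up models with minimal code changes. We described the motivation for the feature, detailed the experimental API used to enable it, and compared the code generated with the default Python wrapper against the code generated with the new C++ wrapper on the CPU and GPU backends to illustrate the differences.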