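TorchInductor C++ Wrapper Tutorial
Authors: Chunyuan Wu, Bin Bao, Jiong Gong
Prerequisites:
Introduction
In torch.compile, the default backend, TorchInductor, generates Python wrapper code that manages memory allocation and kernel invocation. This design is flexible and easy to debug, but in performance-sensitive environments the interpreted nature of Python introduces runtime overhead.
To address this limitation, TorchInductor provides a special mode that generates C++ wrapper code in place of the Python wrapper, enabling faster execution with minimal Python involvement.
Enabling the C++ wrapper mode
To enable the C++ wrapper mode for TorchInductor, add the following configuration to your code: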
import torch._inductor.config as config
config.cpp_wrapper = True
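Setting the flag at module level changes it for the whole process. If the C++ wrapper should apply only to part of a program, one option is to scope the setting. The sketch below assumes that `config.patch` is used as a context manager that restores the previous value on exit; it only toggles the flag and does not compile anything.

```python
import torch._inductor.config as config

# Remember the value in effect before the scoped override.
original = config.cpp_wrapper

# Assumption: config.patch acts as a context manager here and restores
# the previous setting when the block exits.
with config.patch(cpp_wrapper=True):
    inside = config.cpp_wrapper  # value seen while the override is active

after = config.cpp_wrapper  # value seen once the block has exited

print(inside, after == original)
```

Any call to a torch.compile-d function that triggers compilation inside the `with` block would then see `cpp_wrapper = True`, while code outside the block keeps the original setting.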
Example code
We will use the following model code as an example:
import torch
import torch._inductor.config as config
config.cpp_wrapper = True
def fn(x, y):
    return (x + y).sum()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 128, device=device)
y = torch.randn(128, 128, device=device)
opt_fn = torch.compile(fn)
result = opt_fn(x, y)
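To judge whether the C++ wrapper pays off for a particular workload, it helps to time the compiled function under both settings. The helper below is a minimal sketch of such a measurement; `median_call_time` is a hypothetical name introduced for illustration and is not part of PyTorch. It is demonstrated on a plain Python callable so that it runs without a compiler toolchain.

```python
import statistics
import time


def median_call_time(fn, *args, warmup=3, iters=20):
    """Return the median wall-clock time of fn(*args) in seconds."""
    # Warm-up calls keep one-time costs (such as compilation on the first
    # call of a compiled function) out of the measurement.
    for _ in range(warmup):
        fn(*args)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without PyTorch installed.
elapsed = median_call_time(sum, range(1000))
print(f"{elapsed * 1e6:.1f} us per call")
```

For the example above, one would call `median_call_time(opt_fn, x, y)` once with `config.cpp_wrapper = False` and once with `True`, resetting the compilation state in between (for example with `torch._dynamo.reset()`) so that the function is recompiled. On CUDA devices, kernels launch asynchronously, so a `torch.cuda.synchronize()` before reading the clock is needed for meaningful numbers.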
For CPU
With the default Python wrapper, the main part of the code generated by TorchInductor looks like this:
class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def call(self, args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (128, 128), (128, 1))
        assert_size_stride(arg1_1, (128, 128), (128, 1))
        buf0 = empty_strided_cpu((), (), torch.float32)
        cpp_fused_add_sum_0(arg0_1, arg1_1, buf0)
        del arg0_1
        del arg1_1
        return (buf0, )
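The size and stride tuples in the wrapper above, `(128, 128)` with `(128, 1)` for the inputs and `()` with `()` for the scalar output buffer, are those of ordinary row-major (contiguous) tensors. The following sketch shows how such strides follow from a shape. `contiguous_strides` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def contiguous_strides(shape):
    """Return row-major strides, counted in elements, for the given shape."""
    strides = []
    step = 1
    # Walk the dimensions from innermost to outermost: the innermost
    # dimension has stride 1, and each outer stride is the product of
    # all inner dimension sizes.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


print(contiguous_strides((128, 128)))  # (128, 1)
print(contiguous_strides(()))          # ()
```

The first result matches the stride argument of the two `assert_size_stride` calls, and the second matches the empty stride tuple used when allocating the zero-dimensional output `buf0`.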
With the C++ wrapper turned on, the generated code for the call function becomes a C++ function, inductor_entry_impl:
cpp_wrapper_src = (
r'''
#include <torch/csrc/inductor/cpp_wrapper/cpu.h>
extern "C" void cpp_fused_add_sum_0(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0);
CACHE_TORCH_DTYPE(float32);
CACHE_TORCH_DEVICE(cpu);
void inductor_entry_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                       // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles // array for writing output AtenTensorHandle; handles
                       // will be stolen by the caller; the array itself is
                       // borrowed)
) {
    py::gil_scoped_release_simple release;

    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 2);
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);
    static constexpr int64_t *int_array_0=nullptr;
    AtenTensorHandle buf0_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(0, int_array_0, int_array_0, cached_torch_dtype_float32, cached_torch_device_type_cpu, 0, &buf0_handle));
    RAIIAtenTensorHandle buf0(buf0_handle);
    cpp_fused_add_sum_0((const float*)(arg0_1.data_ptr()), (const float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    output_handles[0] = buf0.release();
} // inductor_entry_impl
...
'''
)
inductor_entry = CppWrapperCodeCache.load_pybinding(
    argtypes=["std::vector<AtenTensorHandle>"],
    main_code=cpp_wrapper_src,
    device_type="cpu",
    num_outputs=1,
    kernel_code=None,
)
call = _wrap_func(inductor_entry)
For GPU
Based on the same example code, the generated code for GPU looks like this:
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((19, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_lift_fresh_0.run(constant0, arg0_1, buf0, 19, grid=grid(19), stream=stream0)
        run_intermediate_hooks('add', buf0)
        del arg0_1
        return (buf0, )
With the C++ wrapper turned on, the following equivalent C++ code is generated:
inductor_entry = CppWrapperCodeCache.load_pybinding(
    argtypes=["std::vector<AtenTensorHandle>"],
    main_code=cpp_wrapper_src,
    device_type="cuda",
    num_outputs=1,
    kernel_code=None,
)

def _wrap_func(f):
    def g(args):
        input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg, device='cpu') for arg in args]
        input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors)
        args.clear()
        del input_tensors

        output_handles = f(input_handles)
        output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles)
        return output_tensors

    return g

call = _wrap_func(inductor_entry)
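Both the Python wrapper and `_wrap_func` take the inputs as a list and call `args.clear()` before doing the real work. Passing a list lets the wrapper drop the caller's references, so an input can be released as soon as the wrapper no longer needs it. The sketch below illustrates this pattern with plain Python objects. `FakeTensor` and this `call` function are hypothetical stand-ins, and the observed behavior relies on CPython's reference counting.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor; not a PyTorch class."""


def call(args):
    (arg0,) = args
    args.clear()               # the argument list no longer references the input
    probe = weakref.ref(arg0)  # weak reference used only to observe deallocation
    del arg0                   # drop the last strong reference held inside call
    return probe() is None     # True if the input has already been released


# The list holds the only reference, so the input is freed inside call.
freed_early = call([FakeTensor()])

# The caller keeps its own reference, so the input must stay alive.
caller_copy = FakeTensor()
kept_alive = not call([caller_copy])

print(freed_early, kept_alive)
```

Under CPython this prints `True True`: the input is released inside `call` only when the list held the last reference to it.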
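Conclusion
This tutorial introduced the C++ wrapper feature in TorchInductor, which aims to improve model performance with minimal code changes. We described the motivation for the feature, detailed the experimental API used to enable it, and compared the code generated with the default Python wrapper and the new C++ wrapper on both CPU and GPU backends to illustrate the differences.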