Channels Last Memory Format in PyTorch#

Created On: Apr 20, 2020 | Last Updated: Jul 09, 2025 | Last Verified: Nov 05, 2024

Author: Vitaly Fedyunin

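What you will learn

What is the Channels Last memory format in PyTorch?
How can it be used to improve the performance of certain operators?

Prerequisites

PyTorch v1.5.0
A CUDA-capable GPU

The Channels Last memory format is an alternative way of ordering NCHW tensors in memory that preserves the order of the dimensions. Channels Last tensors are ordered so that channels become the densest dimension (that is, images are stored pixel by pixel).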

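For example, the classic (contiguous) storage of an NCHW tensor (here, two 4x4 images with 3 color channels) keeps each channel plane together: all values of one channel of an image are stored before the next channel begins.

The Channels Last memory format orders the data differently: the channel values of each pixel are stored next to each other.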
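
As a rough illustration of the two orderings, the sketch below fills a tensor with its own NCHW storage positions and then reads the Channels Last copy back in physical order. The tensor size matches the example above (two 4x4 images with 3 channels); the variable names are illustrative only.

```python
import torch

# Two 4x4 images with 3 channels; each value equals its offset in NCHW storage.
x = torch.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)
x_cl = x.to(memory_format=torch.channels_last)

# Physical order of the contiguous tensor: channel 0 of image 0 comes first.
nchw_order = x.reshape(-1)[:6].tolist()

# A channels last tensor is laid out as N, H, W, C in memory, so permuting
# to that order and flattening walks the storage from start to end.
nhwc_order = x_cl.permute(0, 2, 3, 1).reshape(-1)[:6].tolist()

print(nchw_order)  # [0, 1, 2, 3, 4, 5]
print(nhwc_order)  # [0, 16, 32, 1, 17, 33]
```

In the second list, the first three entries are the three channel values of pixel (0, 0) of the first image, followed by the three channel values of pixel (0, 1).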

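PyTorch supports memory formats by reusing the existing stride structure. For example, a 10x3x16x16 batch tensor in Channels Last format has strides equal to (768, 1, 48, 3).

The Channels Last memory format is implemented for 4D NCHW tensors only.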
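
The stride values quoted in this tutorial follow directly from the tensor shape. The sketch below uses plain Python (no PyTorch) and hypothetical helper names to derive them; it assumes strides are counted in elements and that every dimension is larger than 1, since size-1 dimensions are where the ambiguity discussed later arises.

```python
def contiguous_strides(n, c, h, w):
    # NCHW order in memory: W is densest, then H, then C, then N.
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    # NHWC order in memory, reported in NCHW dimension order:
    # C is densest, then W, then H, then N.
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    # Position of element `index` relative to the start of the storage.
    return sum(i * s for i, s in zip(index, strides))


print(channels_last_strides(10, 3, 16, 16))  # (768, 1, 48, 3)
print(contiguous_strides(10, 3, 32, 32))     # (3072, 1024, 32, 1)
print(channels_last_strides(10, 3, 32, 32))  # (3072, 1, 96, 3)

# Moving one step along C advances one element; one step along W skips C elements.
cl = channels_last_strides(10, 3, 16, 16)
print(offset((0, 1, 0, 0), cl))  # 1
print(offset((0, 0, 0, 1), cl))  # 3
```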

Memory Format API#

Here is how to convert tensors between the contiguous and Channels Last memory formats.

Classic PyTorch contiguous tensor

import torch

N, C, H, W = 10, 3, 32, 32
x = torch.empty(N, C, H, W)
print(x.stride())  # Outputs: (3072, 1024, 32, 1)

(3072, 1024, 32, 1)

Conversion operator

x = x.to(memory_format=torch.channels_last)
print(x.shape)  # Outputs: (10, 3, 32, 32) as dimensions order preserved
print(x.stride())  # Outputs: (3072, 1, 96, 3)

torch.Size([10, 3, 32, 32])
(3072, 1, 96, 3)

Back to contiguous

x = x.to(memory_format=torch.contiguous_format)
print(x.stride())  # Outputs: (3072, 1024, 32, 1)

(3072, 1024, 32, 1)

Alternative option

x = x.contiguous(memory_format=torch.channels_last)
print(x.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

Format checks

print(x.is_contiguous(memory_format=torch.channels_last))  # Outputs: True

True

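There are minor differences between the two APIs to and contiguous. We suggest sticking with to when explicitly converting the memory format of a tensor.

In the general case the two APIs behave the same. However, in special cases for a 4D tensor of size NCHW where either C==1 or H==1 && W==1, only to generates a proper stride to represent the Channels Last memory format.

This is because in either of the two cases above the memory format of the tensor is ambiguous: a contiguous tensor of size N1HW is both contiguous and Channels Last in memory storage. Such tensors are therefore already considered is_contiguous for the given memory format, so the contiguous call becomes a no-op and does not update the stride. In contrast, to restrides the tensor with a meaningful stride on the dimensions whose size is 1, in order to properly represent the intended memory format.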

special_x = torch.empty(4, 1, 4, 4)
print(special_x.is_contiguous(memory_format=torch.channels_last))  # Outputs: True
print(special_x.is_contiguous(memory_format=torch.contiguous_format))  # Outputs: True

True
True

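The same applies to the explicit permutation API permute. In the special cases where ambiguity can occur, permute is not guaranteed to produce a stride that properly carries the intended memory format. We suggest using to with an explicit memory format to avoid unintended behavior.

As a side note, in the extreme case where all three non-batch dimensions are equal to 1 (C==1 && H==1 && W==1), the current implementation cannot mark a tensor as Channels Last memory format.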
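
The difference between the two calls on an ambiguous tensor can be inspected directly. The sketch below is a minimal check, assuming a PyTorch build where the behavior described above holds; it prints the strides rather than hard-coding them, since the exact values for size-1 dimensions may vary between versions.

```python
import torch

special_x = torch.empty(4, 1, 4, 4)  # C == 1, so the memory format is ambiguous

via_contiguous = special_x.contiguous(memory_format=torch.channels_last)
via_to = special_x.to(memory_format=torch.channels_last)

# contiguous() already considers the tensor channels last, so it is a no-op
# and the strides stay exactly as they were.
print(via_contiguous.stride() == special_x.stride())

# to() is expected to restride the size-1 channel dimension; compare the two.
print(special_x.stride(), via_to.stride())

# Either way, the result passes the channels last contiguity check.
print(via_to.is_contiguous(memory_format=torch.channels_last))
```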

Create as Channels Last

x = torch.empty(N, C, H, W, memory_format=torch.channels_last)
print(x.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

clone preserves memory format

y = x.clone()
print(y.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

to, cuda, float ... preserve memory format

if torch.cuda.is_available():
    y = x.cuda()
    print(y.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

empty_like, *_like operators preserve memory format

y = torch.empty_like(x)
print(y.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

Pointwise operators preserve memory format

z = x + y
print(z.stride())  # Outputs: (3072, 1, 96, 3)

(3072, 1, 96, 3)

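Conv and Batchnorm modules using the cudnn backend support Channels Last (only for cuDNN >= 7.6). Unlike binary pointwise operators, convolution modules treat Channels Last as the dominating memory format: if all inputs are in contiguous memory format, the operator produces output in contiguous memory format; otherwise, the output is in Channels Last memory format.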

if torch.backends.cudnn.is_available() and torch.backends.cudnn.version() >= 7603:
    model = torch.nn.Conv2d(8, 4, 3).cuda().half()
    model = model.to(memory_format=torch.channels_last)  # Module parameters need to be channels last

    input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
    input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

    out = model(input)
    print(out.is_contiguous(memory_format=torch.channels_last))  # Outputs: True

True

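When an input tensor reaches an operator without Channels Last support, a permutation is applied automatically in the kernel to restore the contiguous format on the input tensor. This introduces overhead and stops the propagation of the Channels Last memory format. Nevertheless, it guarantees correct output.

Performance Gains#

Channels Last memory format optimizations are available on both GPU and CPU. On GPU, the most significant performance gains are observed on NVIDIA hardware with Tensor Cores support running in reduced precision (torch.float16). Using the AMP (Automated Mixed Precision) training scripts, we achieved over 22% performance gains with Channels Last compared to the contiguous format. Our scripts use the AMP implementation supplied by NVIDIA (NVIDIA/apex).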

python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 ./data

# opt_level = O2
# keep_batchnorm_fp32 = None <class 'NoneType'>
# loss_scale = None <class 'NoneType'>
# CUDNN VERSION: 7603
# => creating model 'resnet50'
# Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
# Defaults for this optimization level are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Processing user overrides (additional kwargs that are not None)...
# After processing overrides, optimization options are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Epoch: [0][10/125] Time 0.866 (0.866) Speed 230.949 (230.949) Loss 0.6735125184 (0.6735) Prec@1 61.000 (61.000) Prec@5 100.000 (100.000)
# Epoch: [0][20/125] Time 0.259 (0.562) Speed 773.481 (355.693) Loss 0.6968704462 (0.6852) Prec@1 55.000 (58.000) Prec@5 100.000 (100.000)
# Epoch: [0][30/125] Time 0.258 (0.461) Speed 775.089 (433.965) Loss 0.7877287269 (0.7194) Prec@1 51.500 (55.833) Prec@5 100.000 (100.000)
# Epoch: [0][40/125] Time 0.259 (0.410) Speed 771.710 (487.281) Loss 0.8285319805 (0.7467) Prec@1 48.500 (54.000) Prec@5 100.000 (100.000)
# Epoch: [0][50/125] Time 0.260 (0.380) Speed 770.090 (525.908) Loss 0.7370464802 (0.7447) Prec@1 56.500 (54.500) Prec@5 100.000 (100.000)
# Epoch: [0][60/125] Time 0.258 (0.360) Speed 775.623 (555.728) Loss 0.7592862844 (0.7472) Prec@1 51.000 (53.917) Prec@5 100.000 (100.000)
# Epoch: [0][70/125] Time 0.258 (0.345) Speed 774.746 (579.115) Loss 1.9698858261 (0.9218) Prec@1 49.500 (53.286) Prec@5 100.000 (100.000)
# Epoch: [0][80/125] Time 0.260 (0.335) Speed 770.324 (597.659) Loss 2.2505953312 (1.0879) Prec@1 50.500 (52.938) Prec@5 100.000 (100.000)

Passing --channels-last true runs the model in Channels Last format, with an observed 22% performance gain.

python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 --channels-last true ./data

# opt_level = O2
# keep_batchnorm_fp32 = None <class 'NoneType'>
# loss_scale = None <class 'NoneType'>
#
# CUDNN VERSION: 7603
#
# => creating model 'resnet50'
# Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
#
# Defaults for this optimization level are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Processing user overrides (additional kwargs that are not None)...
# After processing overrides, optimization options are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
#
# Epoch: [0][10/125] Time 0.767 (0.767) Speed 260.785 (260.785) Loss 0.7579724789 (0.7580) Prec@1 53.500 (53.500) Prec@5 100.000 (100.000)
# Epoch: [0][20/125] Time 0.198 (0.482) Speed 1012.135 (414.716) Loss 0.7007197738 (0.7293) Prec@1 49.000 (51.250) Prec@5 100.000 (100.000)
# Epoch: [0][30/125] Time 0.198 (0.387) Speed 1010.977 (516.198) Loss 0.7113101482 (0.7233) Prec@1 55.500 (52.667) Prec@5 100.000 (100.000)
# Epoch: [0][40/125] Time 0.197 (0.340) Speed 1013.023 (588.333) Loss 0.8943189979 (0.7661) Prec@1 54.000 (53.000) Prec@5 100.000 (100.000)
# Epoch: [0][50/125] Time 0.198 (0.312) Speed 1010.541 (641.977) Loss 1.7113249302 (0.9551) Prec@1 51.000 (52.600) Prec@5 100.000 (100.000)
# Epoch: [0][60/125] Time 0.198 (0.293) Speed 1011.163 (683.574) Loss 5.8537774086 (1.7716) Prec@1 50.500 (52.250) Prec@5 100.000 (100.000)
# Epoch: [0][70/125] Time 0.198 (0.279) Speed 1011.453 (716.767) Loss 5.7595844269 (2.3413) Prec@1 46.500 (51.429) Prec@5 100.000 (100.000)
# Epoch: [0][80/125] Time 0.198 (0.269) Speed 1011.827 (743.883) Loss 2.8196096420 (2.4011) Prec@1 47.500 (50.938) Prec@5 100.000 (100.000)
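
The quoted gain can be cross-checked against the two logs. The sketch below copies the per-iteration times from the logs above and skips the first logged entry of each run, which looks like it includes start-up cost (that exclusion is an assumption of this sketch, not something the logs state).

```python
# Per-iteration times in seconds, copied from the two logs above
# (entries 20/125 through 80/125 of each run).
contiguous_times = [0.259, 0.258, 0.259, 0.260, 0.258, 0.258, 0.260]
channels_last_times = [0.198, 0.198, 0.197, 0.198, 0.198, 0.198, 0.198]


def mean(values):
    return sum(values) / len(values)


t_contiguous = mean(contiguous_times)
t_channels_last = mean(channels_last_times)

# Fraction of time saved per iteration, and the equivalent throughput increase.
time_reduction = 1.0 - t_channels_last / t_contiguous
throughput_gain = t_contiguous / t_channels_last - 1.0

print(f"time per iteration: {t_contiguous:.3f}s -> {t_channels_last:.3f}s")
print(f"time reduction: {time_reduction:.1%}")
print(f"throughput gain: {throughput_gain:.1%}")
```

Measured as time saved per iteration the result is a little above 22%, which matches the figure quoted in the text; measured as images per second (the Speed column, roughly 773 versus 1011) the same data corresponds to a larger relative increase.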

The following models have full Channels Last support and show 8%-35% performance gains on Volta devices: alexnet, mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3, mobilenet_v2, resnet101, resnet152, resnet18, resnet34, resnet50, resnext50_32x4d, shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19, vgg19_bn, wide_resnet101_2, wide_resnet50_2

The following models have full Channels Last support and show 26%-76% performance gains on Intel(R) Xeon(R) Ice Lake (or newer) CPUs: alexnet, densenet121, densenet161, densenet169, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, resnet101, resnet152, resnet18, resnet34, resnet50, resnext101_32x8d, resnext50_32x4d, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19, vgg19_bn, wide_resnet101_2, wide_resnet50_2

Converting Existing Models#

Channels Last support is not limited to the models listed above: any model can be converted to Channels Last, and the format propagates through the graph as soon as the input (or certain weights) is formatted correctly.

# Need to be done once, after model initialization (or load)
model = model.to(memory_format=torch.channels_last)  # Replace with your model

# Need to be done for every input
input = input.to(memory_format=torch.channels_last)  # Replace with your input
output = model(input)

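However, not all operators have been fully converted to support Channels Last (they usually return contiguous output instead). In the example above, layers that do not support Channels Last stop the memory format propagation. Even so, because the model itself has been converted to Channels Last, each convolution layer, whose 4-dimensional weight is in Channels Last memory format, restores the Channels Last memory format and benefits from faster kernels.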
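
To see which parameters the model conversion actually touches, the sketch below converts a small hypothetical model (not one from this tutorial) and prints the stride of every parameter. It runs on CPU; the memory format of the forward output is printed rather than asserted, since it depends on the backend and PyTorch version.

```python
import torch
import torch.nn as nn

# A tiny illustrative model; the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).eval()
model = model.to(memory_format=torch.channels_last)

# Only 4D parameters (the convolution weight here) can carry channels last
# strides; 1D biases and batchnorm parameters are left as they were.
for name, param in model.named_parameters():
    is_cl = param.dim() == 4 and param.is_contiguous(memory_format=torch.channels_last)
    print(name, tuple(param.shape), param.stride(), "channels_last" if is_cl else "unchanged")

x = torch.randn(2, 3, 16, 16).to(memory_format=torch.channels_last)
with torch.no_grad():
    out = model(x)
print(out.shape, out.stride())
```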

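Operators that do not support Channels Last do, however, introduce overhead through permutation. If you want to improve the performance of the converted model, you can optionally investigate and identify the operators in your model that do not support Channels Last.

That means you need to verify the list of operators you use against the supported operators list pytorch/pytorch, or introduce memory format checks into eager execution mode and run your model.

After running the code below, an operator raises an exception if its output does not match the memory format of its input.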

def contains_cl(args):
    for t in args:
        if isinstance(t, torch.Tensor):
            if t.is_contiguous(memory_format=torch.channels_last) and not t.is_contiguous():
                return True
        elif isinstance(t, list) or isinstance(t, tuple):
            if contains_cl(list(t)):
                return True
    return False


def print_inputs(args, indent=""):
    for t in args:
        if isinstance(t, torch.Tensor):
            print(indent, t.stride(), t.shape, t.device, t.dtype)
        elif isinstance(t, list) or isinstance(t, tuple):
            print(indent, type(t))
            print_inputs(list(t), indent=indent + "    ")
        else:
            print(indent, t)


def check_wrapper(fn):
    name = fn.__name__

    def check_cl(*args, **kwargs):
        was_cl = contains_cl(args)
        try:
            result = fn(*args, **kwargs)
        except Exception as e:
            print("`{}` inputs are:".format(name))
            print_inputs(args)
            print("-------------------")
            raise e
        failed = False
        if was_cl:
            if isinstance(result, torch.Tensor):
                if result.dim() == 4 and not result.is_contiguous(memory_format=torch.channels_last):
                    print(
                        "`{}` got channels_last input, but output is not channels_last:".format(name),
                        result.shape,
                        result.stride(),
                        result.device,
                        result.dtype,
                    )
                    failed = True
        if failed:
            print("`{}` inputs are:".format(name))
            print_inputs(args)
            raise Exception("Operator `{}` lost channels_last property".format(name))
        return result

    return check_cl


old_attrs = dict()


def attribute(m):
    old_attrs[m] = dict()
    for i in dir(m):
        e = getattr(m, i)
        exclude_functions = ["is_cuda", "has_names", "numel", "stride", "Tensor", "is_contiguous", "__class__"]
        if i not in exclude_functions and not i.startswith("_") and "__call__" in dir(e):
            try:
                old_attrs[m][i] = e
                setattr(m, i, check_wrapper(e))
            except Exception as e:
                print(i)
                print(e)


attribute(torch.Tensor)
attribute(torch.nn.functional)
attribute(torch)

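If you find an operator that does not support Channels Last tensors and you want to contribute, feel free to use the following developer guide pytorch/pytorch.

The code below restores the original attributes of torch.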

for (m, attrs) in old_attrs.items():
    for (k, v) in attrs.items():
        setattr(m, k, v)
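
A less invasive alternative to patching torch attributes is to attach forward hooks to the modules of a model and record where the output stops being Channels Last. This is a different technique from the wrapper above: it only sees module boundaries, not individual operators, and the helper name report_memory_formats and the small model are hypothetical. A minimal sketch:

```python
import torch
import torch.nn as nn


def report_memory_formats(model, example_input):
    """Run one forward pass and record, per leaf module, whether its 4D
    output is contiguous in channels last format."""
    records = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and output.dim() == 4:
                is_cl = output.is_contiguous(memory_format=torch.channels_last)
                records.append((name, type(module).__name__, is_cl))
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(example_input)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return records


model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1)).eval()
model = model.to(memory_format=torch.channels_last)
example = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)

records = report_memory_formats(model, example)
for name, kind, is_cl in records:
    print(name, kind, "channels_last" if is_cl else "not channels_last")
```

Note that outputs with size-1 dimensions, such as the pooled output here, fall into the ambiguous cases described earlier, so a positive check on them says little about the kernel that produced them.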

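Work to do#

There are still many things to do, such as:

Resolving the ambiguity of N1HW and NC11 tensors;
Testing of distributed training support;
Improving operator coverage.

If you have feedback and/or suggestions for improvement, please let us know by creating an issue.

Conclusion#

This tutorial introduced the "Channels Last" memory format and demonstrated how to use it for performance gains. For a practical example of accelerating vision models on CPU using Channels Last, see the accompanying blog post.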
