CUDA 语义#

创建于：2017年1月16日 | 最后更新于：2025年9月4日

torch.cuda 用于设置和运行 CUDA 操作。它跟踪当前选定的 GPU，您分配的所有 CUDA 张量默认都会在该设备上创建。可以使用 torch.cuda.device 上下文管理器更改选定的设备。

然而，一旦张量被分配，您就可以在其上执行操作，而无需考虑选定的设备，并且结果将始终放置在与张量相同的设备上。

默认情况下不允许跨 GPU 操作，但 copy_() 和其他具有类复制功能的函数（如 to() 和 cuda()）除外。除非您启用点对点内存访问，否则尝试在跨不同设备分布的张量上启动操作将引发错误。

下面是一个小示例，展示了这一点

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

Ampere (及更高版本) 设备上的 TensorFloat-32 (TF32)#

在 PyTorch 2.9 之后，我们提供了一套新的 API 来更精细地控制 TF32 的行为，并建议使用新的 API 以获得更好的控制。我们可以按后端和算子设置 float32 精度。我们也可以为特定算子覆盖全局设置。

torch.backends.fp32_precision = "ieee"
torch.backends.cuda.matmul.fp32_precision = "ieee"
torch.backends.cudnn.fp32_precision = "ieee"
torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "tf32"

cuda/cudnn 的 fp32_precision 可以设置为 ieee 或 tf32。 ieee fp32_precision 表示我们将使用 FP32 作为内部计算精度。 tf32 fp32_precision 表示我们将允许使用 TF32 作为内部计算精度。

如果 fp32_precision 设置为 ieee，我们可以为特定算子覆盖通用设置。

torch.backends.cudnn.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "ieee"
torch.backends.cudnn.rnn.fp32_precision = "ieee"

如果 fp32_precision 设置为 ieee，我们也可以为特定后端覆盖通用设置。

torch.backends.fp32_precision = "tf32"
torch.backends.cudnn.fp32_precision = "ieee"
torch.backends.cudnn.conv.fp32_precision = "ieee"
torch.backends.cudnn.rnn.fp32_precision = "ieee"

对于上述两种情况，torch.backends.cudnn.conv.fp32_precision 和 torch.backends.cudnn.rnn.fp32_precision 都被覆盖为 ieee。

我们建议使用新的设置以获得更好的控制。并且我们不支持使用新旧设置的混合。

警告

旧设置（使用 allow_tf32）将要被弃用。我们建议使用上述新设置以获得更好的控制。并且我们不支持使用新旧设置的混合。

从 PyTorch 1.7 开始，有一个新的标志叫做 allow_tf32。这个标志在 PyTorch 1.7 到 1.11 中默认为 True，在 PyTorch 1.12 及更高版本中默认为 False。这个标志控制 PyTorch 是否允许在内部使用 TensorFloat32 (TF32) 张量核心来计算矩阵乘法（matmul，以及批处理矩阵乘法）和卷积，TensorFloat32 (TF32) 张量核心在 NVIDIA Ampere 及以后的 GPU 上可用。

TF32 张量核心旨在通过将输入数据舍入为具有 10 位尾数的精度，并以 FP32 精度累加结果（保持 FP32 动态范围），来提高在 torch.float32 张量上的矩阵乘法和卷积的性能。

矩阵乘法和卷积是分开控制的，它们的相应标志可以通过以下方式访问：

# The flag below controls whether to allow TF32 on matmul. This flag defaults to False
# in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

矩阵乘法的精度也可以通过 set_float32_matmul_precision() 更广泛地设置（不仅限于 CUDA）。请注意，除了矩阵乘法和卷积本身，内部使用矩阵乘法或卷积的函数和 nn 模块也受到影响。这些包括 nn.Linear、nn.Conv*、cdist、tensordot、affine grid 和 grid sample、adaptive log softmax、GRU 和 LSTM。

为了了解精度和速度，请参见下面的示例代码和基准测试数据（在 A100 上）

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

从上面的示例中，我们可以看到，启用 TF32 后，A100 上的速度提高了约 7 倍，与双精度相比，相对误差大约大了 2 个数量级。请注意，TF32 与单精度速度的确切比例取决于硬件代，因为内存带宽与计算的比率以及 TF32 与 FP32 矩阵乘法吞吐量的比率可能因代或模型而异。如果需要完整的 FP32 精度，用户可以通过以下方式禁用 TF32：

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

要在 C++ 中关闭 TF32 标志，您可以执行以下操作：

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);

有关 TF32 的更多信息，请参阅：

FP16 GEMM 中的低精度归约#

(与旨在提高 FP16 累加硬件吞吐量而不是 FP32 累加的完整 FP16 累加不同，请参阅完整 FP16 累加)

fp16 GEMM 可能在某些中间低精度归约中完成（例如，在 fp16 中而不是 fp32 中）。这些选择性的精度归约可以在某些工作负载（特别是 k 维度很大的工作负载）和 GPU 架构上实现更高的性能，但会以数值精度和潜在溢出为代价。

V100 上的一些示例基准测试数据

[--------------------------- bench_gemm_transformer --------------------------]
      [  m ,  k  ,  n  ]    |  allow_fp16_reduc=True  |  allow_fp16_reduc=False
1 threads: --------------------------------------------------------------------
      [4096, 4048, 4096]    |           1634.6        |           1639.8
      [4096, 4056, 4096]    |           1670.8        |           1661.9
      [4096, 4080, 4096]    |           1664.2        |           1658.3
      [4096, 4096, 4096]    |           1639.4        |           1651.0
      [4096, 4104, 4096]    |           1677.4        |           1674.9
      [4096, 4128, 4096]    |           1655.7        |           1646.0
      [4096, 4144, 4096]    |           1796.8        |           2519.6
      [4096, 5096, 4096]    |           2094.6        |           3190.0
      [4096, 5104, 4096]    |           2144.0        |           2663.5
      [4096, 5112, 4096]    |           2149.1        |           2766.9
      [4096, 5120, 4096]    |           2142.8        |           2631.0
      [4096, 9728, 4096]    |           3875.1        |           5779.8
      [4096, 16384, 4096]   |           6182.9        |           9656.5
(times in microseconds).

如果需要完整的精度归约，用户可以通过以下方式禁用 FP16 GEMM 中的低精度归约：

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

要在 C++ 中关闭低精度归约标志，您可以执行以下操作：

at::globalContext().setAllowFP16ReductionCuBLAS(false);

BF16 GEMM 中的低精度归约#

BF16 GEMM 存在一个类似的标志（如上）。请注意，此开关默认设置为 True for BF16，如果您遇到工作负载中的数值不稳定性，您可能希望将其设置为 False。

如果不需要低精度归约，用户可以通过以下方式禁用 BF16 GEMM 中的低精度归约：

torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

要在 C++ 中关闭低精度归约标志，您可以执行以下操作：

at::globalContext().setAllowBF16ReductionCuBLAS(true);

FP16 GEMM 中的完整 FP16 累加#

某些 GPU 在进行 _所有_ FP16 GEMM 累加时性能会提高，但会以数值精度和更高的溢出概率为代价。请注意，此设置仅在计算能力为 7.0 (Volta) 或更高版本的 GPU 上有效。

可以通过以下方式启用此行为：

torch.backends.cuda.matmul.allow_fp16_accumulation = True

要在 C++ 中关闭低精度归约标志，您可以执行以下操作：

at::globalContext().setAllowFP16AccumulationCuBLAS(true);

异步执行#

默认情况下，GPU 操作是异步的。当您调用一个使用 GPU 的函数时，操作会被 *排队* 到特定设备，但不一定立即执行。这允许我们并行执行更多计算，包括 CPU 或其他 GPU 上的操作。

通常，异步计算的效果对调用者来说是不可见的，因为 (1) 每个设备按排队顺序执行操作，并且 (2) PyTorch 在 CPU 和 GPU 之间或两个 GPU 之间复制数据时会自动执行必要的同步。因此，计算将像所有操作都同步执行一样进行。

您可以通过设置环境变量 CUDA_LAUNCH_BLOCKING=1 来强制同步计算。当 GPU 上发生错误时，这会很有用。（在异步执行中，此类错误直到操作实际执行后才报告，因此堆栈跟踪不会显示请求的位置。）

异步计算的一个后果是，不进行同步的时间测量是不准确的。为了获得精确的测量，您应该在测量之前调用 torch.cuda.synchronize()，或者使用 torch.cuda.Event 来记录时间，如下所示：

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)

作为例外，几个函数，如 to() 和 copy_() 允许一个显式的 non_blocking 参数，允许调用者在不必要时绕过同步。另一个例外是 CUDA 流，下面将对此进行解释。

CUDA 流#

CUDA 流是属于特定设备的一系列线性执行。您通常不需要显式创建它：默认情况下，每个设备都有自己的“默认”流。

每个流中的操作按创建顺序串行化，但来自不同流的操作可以按任何相对顺序并发执行，除非使用显式同步函数（如 synchronize() 或 wait_stream()）。例如，以下代码是不正确的：

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

当“当前流”是默认流时，PyTorch 会在数据在 CPU 和 GPU 之间或两个 GPU 之间移动时自动执行必要的同步，如上所述。但是，在使用非默认流时，用户有责任确保正确的同步。此示例的修正版本是：

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # NEW!
with torch.cuda.stream(s):
    B = torch.sum(A)
A.record_stream(s)  # NEW!

有两个新增功能。 torch.cuda.Stream.wait_stream() 调用确保在侧流上运行 sum(A) 之前 normal_() 已完成执行。 torch.Tensor.record_stream()（更多细节请参见）确保在 sum(A) 完成之前不释放 A。您也可以在稍后的某个时间点通过 torch.cuda.default_stream(cuda).wait_stream(s) 手动等待流（注意，立即等待是毫无意义的，因为它会阻止流在默认流上的其他工作并行执行）。有关何时使用其中一个的更多详细信息，请参阅 torch.Tensor.record_stream() 的文档。

请注意，即使没有读取依赖项，也需要这种同步，例如，在以下示例中：

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda)
s.wait_stream(torch.cuda.default_stream(cuda))  # STILL REQUIRED!
with torch.cuda.stream(s):
    A.normal_(0.0, 1.0)
    A.record_stream(s)

尽管 s 上的计算不读取 A 的内容，并且没有其他地方使用 A，但仍然需要同步，因为 A 可能对应于 CUDA 缓存分配器重新分配的内存，其中包含来自旧（已释放）内存的待处理操作。

反向传播的流语义#

每个反向 CUDA 操作都在用于其相应前向操作的同一流上运行。如果您的前向传递在不同的流上并行运行独立的操作，这有助于反向传递利用相同的并行性。

反向调用相对于周围操作的流语义与任何其他调用的流语义相同。反向传递会插入内部同步以确保这一点，即使反向操作在多个流上运行，如上一段所述。更具体地说，当调用 autograd.backward、autograd.grad 或 tensor.backward，并选择性地提供 CUDA 张量作为初始梯度（例如，autograd.backward(..., grad_tensors=initial_grads)、autograd.grad(..., grad_outputs=initial_grads) 或 tensor.backward(..., gradient=initial_grad)），则...

选择性填充初始梯度，
调用反向传递，以及
使用梯度

具有与任何操作组相同的流语义关系。

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

BC 注意：在默认流上使用梯度#

在 PyTorch 的早期版本（1.9 及更早版本）中，autograd 引擎始终将默认流与所有反向操作同步，因此以下模式

with torch.cuda.stream(s):
    loss.backward()
use grads

在 use grads 发生在默认流上时是安全的。在当前的 PyTorch 中，这种模式不再安全。如果 backward() 和 use grads 在不同的流上下文中，您必须同步流：

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

即使 use grads 发生在默认流上。

内存管理#

PyTorch 使用缓存内存分配器来加快内存分配速度。这允许在没有设备同步的情况下快速释放内存。但是，分配器管理的未使用内存仍会显示为在 nvidia-smi 中使用。您可以使用 memory_allocated() 和 max_memory_allocated() 来监控张量占用的内存，并使用 memory_reserved() 和 max_memory_reserved() 来监控缓存分配器管理的内存总量。调用 empty_cache() 会释放 PyTorch 中所有*未使用的*缓存内存，以便其他 GPU 应用程序可以使用它们。但是，张量占用的 GPU 内存不会被释放，因此它不能增加 PyTorch 可用的 GPU 内存量。

要更好地了解 CUDA 内存随时间的使用情况，请参阅理解 CUDA 内存使用，其中描述了用于捕获和可视化内存使用情况跟踪的工具。

对于更高级的用户，我们通过 memory_stats() 提供更全面的内存基准测试。我们还通过 memory_snapshot() 提供捕获内存分配器状态完整快照的能力，这有助于您理解代码产生的底层分配模式。

使用 `PYTORCH_CUDA_ALLOC_CONF` 优化内存使用#

使用缓存分配器可能会干扰 cuda-memcheck 等内存检查工具。要使用 cuda-memcheck 调试内存错误，请在环境中设置 PYTORCH_NO_CUDA_MEMORY_CACHING=1 以禁用缓存。

缓存分配器的行为可以通过环境变量 PYTORCH_CUDA_ALLOC_CONF 来控制。格式为 PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... 可用选项：

backend 允许选择底层分配器实现。目前，有效选项包括 native，它使用 PyTorch 的原生实现，以及 cudaMallocAsync，它使用 CUDA 内置的异步分配器。 cudaMallocAsync 需要 CUDA 11.4 或更高版本。默认为 native。 backend 适用于进程使用的所有设备，不能按设备指定。
max_split_size_mb 防止原生分配器拆分大于此大小（以 MB 为单位）的块。这可以减少碎片，并可能允许一些临界工作负载在不耗尽内存的情况下完成。性能成本可能从“零”到“显著”不等，具体取决于分配模式。默认值为无限，即所有块都可以拆分。 memory_stats() 和 memory_summary() 方法对于调优很有用。此选项应作为最后手段，用于因“内存不足”而中止并显示大量非活动拆分块的工作负载。 max_split_size_mb 仅在 backend:native 时有意义。对于 backend:cudaMallocAsync，将忽略 max_split_size_mb。
roundup_power2_divisions 有助于将请求的分配大小舍入到最近的 2 的幂次方除法，并更好地利用块。在原生的 CUDACachingAllocator 中，大小按 512 的块大小倍数向上舍入，因此这对于较小的尺寸来说效果很好。但是，对于附近的大型分配，这可能效率低下，因为每个分配都会转到不同大小的块，并且这些块的重用被最小化。这可能会产生大量未使用的块，并浪费 GPU 内存容量。此选项支持将分配大小舍入到最近的 2 的幂次方除法。例如，如果我们需要将大小舍入到 1200，并且除法次数为 4，则大小 1200 介于 1024 和 2048 之间，如果我们在这两者之间进行 4 次除法，则值为 1024、1280、1536 和 1792。因此，大小为 1200 的分配将被舍入到 1280，作为最近的 2 的幂次方除法的上限。指定一个值以应用于所有分配大小，或指定一个键值对数组以单独为每个 2 的幂次方间隔设置 2 的幂次方除法。例如，要为所有小于 256MB 的分配设置 1 次除法，为 256MB 到 512MB 之间的分配设置 2 次除法，为 512MB 到 1GB 之间的分配设置 4 次除法，为任何更大的分配设置 8 次除法，请将旋钮值设置为：[256:1,512:2,1024:4,>:8]。 roundup_power2_divisions 仅在 backend:native 时有意义。对于 backend:cudaMallocAsync，将忽略 roundup_power2_divisions。
max_non_split_rounding_mb 将允许非拆分块以实现更好的重用，例如：
一个 1024MB 的缓存块可以重用于 512MB 的分配请求。在默认情况下，我们只允许高达 20MB 的非拆分块舍入，因此一个 512MB 的块只能用 512-532 MB 大小的块来提供。如果我们此选项的值设置为 1024，则允许使用 512-1536 MB 大小的块来处理一个 512MB 的块，从而增加了大型块的重用。这也有助于减少昂贵的 cudaMalloc 调用中的停顿。
garbage_collection_threshold 有助于主动回收未使用的 GPU 内存，以避免触发昂贵的同步和回收所有操作（release_cached_blocks），这可能对延迟敏感的 GPU 应用程序（例如服务器）不利。设置此阈值后（例如，0.8），如果 GPU 内存容量使用率超过阈值（即，分配给 GPU 应用程序的总内存的 80%），分配器将开始回收 GPU 内存块。该算法优先释放旧的、未使用的块，以避免释放正在被重用的块。阈值应大于 0.0 且小于 1.0。默认值为 1.0。

garbage_collection_threshold 仅在 backend:native 时有意义。对于 backend:cudaMallocAsync，将忽略 garbage_collection_threshold。
expandable_segments（实验性，默认值：False）如果设置为 True，此设置指示分配器创建可以稍后扩展的 CUDA 分配，以更好地处理作业频繁更改分配大小的情况，例如具有可变批次大小。通常，对于大型（>2MB）分配，分配器调用 cudaMalloc 来获取与用户请求大小相同的分配。将来，这些分配的部分可以重用于其他请求，如果它们是空闲的。当程序发出许多完全相同大小或其整数倍大小的请求时，这效果很好。许多深度学习模型都遵循此行为。但是，一个常见的例外是批次大小在每次迭代之间略有变化时，例如在批处理推理中。当程序最初以批次大小 N 运行时，它将进行适合该大小的分配。如果将来，它以大小 N - 1 运行，则现有分配仍然足够大。但是，如果它以大小 N + 1 运行，那么它将不得不进行稍大的新分配。并非所有张量的大小都相同。有些可能是 (N + 1)*A，而另一些可能是 (N + 1)*A*B，其中 A 和 B 是模型中的某些非批次维度。由于分配器在现有分配足够大时会重用它们，因此一些 (N + 1)*A 分配实际上会适合现有的 N*B*A 段，尽管不完全合适。随着模型的运行，它将部分填充所有这些段，在这些段的末尾留下不可用的空闲内存切片。分配器在某个时候需要 cudaMalloc 一个新的 (N + 1)*A*B 段。如果内存不足，现在无法恢复这些段末尾的空闲内存切片。对于 50 多层的模型，此模式可能会重复 50 多次，从而创建许多 sliver。

expandable_segments 允许分配器最初创建一个段，然后在需要更多内存时扩展其大小。而不是为每个分配创建一个段，它会尝试为一个段（每个流）创建并根据需要增长。现在，当运行 N + 1 情况时，分配将整齐地排列到一个大段中，直到它填满。然后请求更多内存并附加到段的末尾。此过程不会创建许多不可用内存的 sliver，因此更有可能成功找到该内存。
pinned_use_cuda_host_register 选项是一个布尔标志，它决定是使用 CUDA API 的 cudaHostRegister 函数分配固定内存，还是使用默认的 cudaHostAlloc。当设置为 True 时，内存使用常规 malloc 分配，然后在调用 cudaHostRegister 之前将页面映射到内存。这种页面的预映射有助于减少 cudaHostRegister 执行期间的锁定时间。
pinned_num_register_threads 选项仅在 pinned_use_cuda_host_register 设置为 True 时有效。默认情况下，使用一个线程来映射页面。此选项允许使用更多线程来并行化页面映射操作，以减少固定内存的总体分配时间。基于基准测试结果，此选项的一个好值是 8。
pinned_use_background_threads 选项是一个布尔标志，用于启用后台线程来处理事件。这可以避免在快速分配路径中查询/处理事件的任何缓慢路径。此功能默认禁用。
graph_capture_record_stream_reuse（实验性，默认值：False）如果设置为 True，CUDA 缓存分配器将在 CUDA 图捕获期间尝试回收设备内存，方法是使用图拓扑（而不是 CUDA 事件）来确定何时可以安全地重用已释放的块。这可以减少在捕获时间较长的场景中跨多个流释放和重新分配缓冲区的峰值内存使用量，尤其是在捕获 DAG 频繁到达连接的边界时。注意：启用此选项可能会显着增加捕获图所花费的时间。

注意

PyTorch 的 CUDA 内存管理 API 报告的一些统计信息特定于 backend:native，并且对于 backend:cudaMallocAsync 没有意义。有关详细信息，请参阅每个函数的文档字符串。

为 CUDA 使用自定义内存分配器#

可以将分配器定义为 C/C++ 中的简单函数，并将其编译为共享库。下面的代码显示了一个基本分配器，它只跟踪所有内存操作。

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   cudaMalloc(&ptr, size);
   std::cout<<"alloc "<<ptr<<size<<std::endl;
   return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
   std::cout<<"free "<<ptr<< " "<<stream<<std::endl;
   cudaFree(ptr);
}
}

可以通过 torch.cuda.memory.CUDAPluggableAllocator 在 Python 中使用它。用户负责提供 .so 文件的路径以及与上述签名匹配的 alloc/free 函数的名称。

import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# Swap the current allocator
torch.cuda.memory.change_current_allocator(new_alloc)
# This will allocate memory in the device using the new allocator
b = torch.zeros(10, device='cuda')

import torch

# Do an initial memory allocator
b = torch.zeros(10, device='cuda')
# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(new_alloc)

在同一个程序中混合不同的 CUDA 系统分配器#

根据您的使用场景，change_current_allocator() 可能不是您想要的，因为它会交换整个程序的 CUDA 分配器（类似于 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync）。例如，如果交换的分配器没有缓存机制，您将失去 PyTorch 的 CUDACachingAllocator 的所有好处。相反，您可以使用 torch.cuda.MemPool 选择性地将 PyTorch 代码的一个区域标记为使用自定义分配器。这将允许您在同一个 PyTorch 程序中使用多个 CUDA 系统分配器，以及 CUDACachingAllocator 的大部分优点（例如缓存）。使用 torch.cuda.MemPool，您可以利用支持多种功能的自定义分配器，例如：

使用 ncclMemAlloc 分配器为 all-reduce 分配输出缓冲区可以启用 NVLink Switch Reductions (NVLS)。这可以减少重叠的计算和通信内核在 GPU 资源（SM、Copy Engines）上的争用，尤其是在张量并行工作负载上。
对于基于 Grace CPU 的系统，使用 cuMemCreate 并指定 CU_MEM_LOCATION_TYPE_HOST_NUMA 为 all-gather 分配主机输出缓冲区可以启用基于扩展 GPU 内存 (EGM) 的内存传输，从源 GPU 到目标 CPU。这可以加速 all-gather，因为传输发生在 NVLinks 上，而否则将通过带宽受限的网络接口卡 (NIC) 链接进行。这种加速的 all-gather 进而可以加速模型检查点。
如果您正在构建一个模型，并且一开始不想考虑内存密集型模块（例如，嵌入表）的最佳内存放置，或者您有一个性能不敏感且不适合 GPU 的模块，那么您可以只使用 cudaMallocManaged 分配该模块，首选 CPU 位置，然后让您的模型正常工作。

注意

虽然 cudaMallocManaged 提供了方便的 CUDA 统一虚拟内存 (UVM) 自动内存管理，但它不推荐用于 DL 工作负载。对于适合 GPU 内存的 DL 工作负载，显式放置始终优于 UVM，因为没有页面故障，并且访问模式保持可预测。当 GPU 内存饱和时，UVM 必须执行昂贵的双重传输，将页面逐出到 CPU，然后再调入新的页面。

下面的代码显示了包装在 torch.cuda.memory.CUDAPluggableAllocator 中的 ncclMemAlloc。

import os

import torch
import torch.distributed as dist
from torch.cuda.memory import CUDAPluggableAllocator
from torch.distributed.distributed_c10d import _get_default_group
from torch.utils import cpp_extension


# create allocator
nccl_allocator_source = """
#include <nccl.h>
#include <iostream>
extern "C" {

void* nccl_alloc_plug(size_t size, int device, void* stream) {
  std::cout << "Using ncclMemAlloc" << std::endl;
  void* ptr;
  ncclResult_t err = ncclMemAlloc(&ptr, size);
  return ptr;

}

void nccl_free_plug(void* ptr, size_t size, int device, void* stream) {
  std::cout << "Using ncclMemFree" << std::endl;
  ncclResult_t err = ncclMemFree(ptr);
}

}
"""
nccl_allocator_libname = "nccl_allocator"
nccl_allocator = torch.utils.cpp_extension.load_inline(
    name=nccl_allocator_libname,
    cpp_sources=nccl_allocator_source,
    with_cuda=True,
    extra_ldflags=["-lnccl"],
    verbose=True,
    is_python_module=False,
    build_directory="./",
)

allocator = CUDAPluggableAllocator(
    f"./{nccl_allocator_libname}.so", "nccl_alloc_plug", "nccl_free_plug"
).allocator()

# setup distributed
rank = int(os.getenv("RANK"))
local_rank = int(os.getenv("LOCAL_RANK"))
world_size = int(os.getenv("WORLD_SIZE"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")
default_pg = _get_default_group()
backend = default_pg._get_backend(device)

# Note: for convenience, ProcessGroupNCCL backend provides
# the ncclMemAlloc allocator as backend.mem_allocator
allocator = backend.mem_allocator

您现在可以通过将此分配器传递给 torch.cuda.MemPool 来定义一个新的内存池：

pool = torch.cuda.MemPool(allocator)

然后可以使用 torch.cuda.use_mem_pool 上下文管理器使用该池来分配张量：

with torch.cuda.use_mem_pool(pool):
    # tensor gets allocated with ncclMemAlloc passed in the pool
    tensor = torch.arange(1024 * 1024 * 2, device=device)
    print(f"tensor ptr on rank {rank} is {hex(tensor.data_ptr())}")

# register user buffers using ncclCommRegister (called under the hood)
backend.register_mem_pool(pool)

# Collective uses Zero Copy NVLS
dist.all_reduce(tensor[0:4])
torch.cuda.synchronize()
print(tensor[0:4])

请注意上面示例中 register_mem_pool 的用法。这是 NVLS 归约的一个额外步骤，用户缓冲区需要注册到 NCCL。用户可以使用类似的 deregister_mem_pool 调用注销缓冲区。

要回收内存，用户首先需要确保没有任何东西在使用该池。当没有张量持有对池的引用时，empty_cache() 将在池被删除时在内部调用，从而将所有内存返回给系统。

del tensor, del pool

用户可以在创建 MemPool 时选择性地指定一个 use_on_oom 布尔值（默认为 False）。如果为 True，则 CUDACachingAllocator 将能够使用该池中的内存作为最后的手段，而不是 OOM。

pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a

# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")

以下 torch.cuda.MemPool.use_count() 和 torch.cuda.MemPool.snapshot() API 可用于调试目的：

pool = torch.cuda.MemPool(allocator)

# pool's use count should be 1 at this point as MemPool object
# holds a reference
assert pool.use_count() == 1

nelem_1mb = 1024 * 1024 // 4

with torch.cuda.use_mem_pool(pool):
    out_0 = torch.randn(nelem_1mb, device="cuda")

    # pool's use count should be 2 at this point as use_mem_pool
    # holds a reference
    assert pool.use_count() == 2

# pool's use count should be back to 1 at this point as use_mem_pool
# released its reference
assert pool.use_count() == 1

with torch.cuda.use_mem_pool(pool):
    # pool should have 1 segment since we made a small allocation (1 MB)
    # above and so the CUDACachingAllocator packed it into a 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_1 = torch.randn(nelem_1mb, device="cuda")

    # pool should still have 1 segment since we made another small allocation
    # (1 MB) that got packed into the existing 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_2 = torch.randn(nelem_1mb, device="cuda")

    # pool now should have 2 segments since the CUDACachingAllocator had
    # to make a new 2 MB buffer to accommodate out_2
    assert len(pool.snapshot()) == 2

注意

torch.cuda.MemPool 持有对池的引用。当您使用 torch.cuda.use_mem_pool 上下文管理器时，它还将获取对池的另一个引用。退出上下文管理器时，它将释放其引用。之后，理想情况下应该只有张量持有对池的引用。一旦张量释放了它们的引用，池的使用计数将为 1，反映出只有 torch.cuda.MemPool 对象持有引用。只有到那时，池持有的内存才能在调用池的析构函数（使用 del）时返回给系统。
torch.cuda.MemPool 目前不支持 CUDACachingAllocator 的 expandable_segments 模式。
NCCL 对与 NVLS 归约兼容的缓冲区有特定要求。这些要求在动态工作负载中可能会被打破，例如，由 CUDACachingAllocator 发送给 NCCL 的缓冲区可能会被拆分，因此对齐不正确。在这些情况下，NCCL 可以使用回退算法而不是 NVLS。
像 ncclMemAlloc 这样的分配器可能会使用比请求更多的内存，因为对齐要求（CU_MULTICAST_GRANULARITY_RECOMMENDED、CU_MULTICAST_GRANULARITY_MINIMUM），这可能会导致您的工作负载内存不足。

在 H100/H200 GPU 上使用自定义内存分配器调整 NVLink 性能#

在极少数情况下，H100/H200 GPU 上的 NVLink 性能会受到数据物理内存布局的影响，这为开发人员提供了一个调整其应用程序以获得最佳吞吐量的机会。

数据物理内存布局影响性能的一个例子是通信内核发出不平衡的 NVLink 读取/写入操作。在下面的图中，我们可以看到每个 warp 在每个单一波中以一致的跨步模式访问内存地址。我们可以通过调整工作负载中的步幅大小来实现更平衡的负载，或者我们可以实现自定义 CUDA 分配器。

_______________________________  _______________________________      _______________________________
| Warp 0 Reading | No-reading |  | Warp 1 Reading | No-reading |  ...  Warp N Reading | No-reading |
_______________________________  _______________________________      _______________________________
<----------------------------->
        Stride size

这样的分配器可以为内核维护连续的虚拟内存地址，同时战略性地安排到物理内存地址的映射（例如，通过改组）。这种技术允许开发人员探索不同的物理访问模式以找到最高效的模式，从而在不修改内核逻辑的情况下解锁更高的性能。可以使用 PyTorch 的自定义分配器支持来实现此类分配器的实际实现，如前所述，其中 malloc 和 free 函数是：

// assuming a system with 8 GPUs
struct CustomAllocInfo {
  void** devPtr;  // This will be the usable virtual memory address
  CUdeviceptr dptr;
  size_t totalSize;  // Total size of the allocated memory
  size_t padded_size;
  int device_id;
  std::vector<CUmemGenericAllocationHandle> handles;  // Handles to physical memory allocations
};

// loop over pages
cudaError_t customCudaMalloc(CustomAllocInfo* info) {
    if (!info) return cudaErrorInvalidValue;

    CUdeviceptr dptr;

    // Handles to redundant physical memory allocations which help truncate stride pattern in physical memory
    std::vector<CUmemGenericAllocationHandle> handles_redundant;

    size_t granularity = 0;
    CUmemAllocationProp prop = {};

    int currentDev = info->device_id;
    size_t totalSize = info->totalSize;

    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDev;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded_size = ROUND_UP(totalSize, granularity);

    info->padded_size = padded_size;

    // loop over pages
    size_t iter_granularity = granularity * 64; // 64 * granularity with shift_size = 2 works
    uint32_t iteration_count = (totalSize + iter_granularity - 1) / iter_granularity;

    cuMemAddressReserve(&dptr, padded_size, 0ULL, 0ULL, 0ULL);

    const int shift_size = 2;
    for (size_t i = 0; i < iteration_count; i+=shift_size) {

        CUmemGenericAllocationHandle allocHandle[shift_size];
        for (int shift = 0; (shift < shift_size)&&(i+shift < iteration_count); shift++){
            CHECK_CUDA(cuMemCreate(&allocHandle[shift], iter_granularity, &prop, 0));
            info->handles.push_back(allocHandle[shift]);
        }

        for (int shift = 0; (shift < shift_size)&&(i+shift < iteration_count); shift++){

            // mapping makes the shift (shift -> (shift+1)%shift_size  )
            CHECK_CUDA(cuMemMap(dptr + (i+shift) * iter_granularity, iter_granularity, 0, allocHandle[(shift+1)%shift_size], 0));

            setupMultiGPUAccess(dptr + (i+shift) * iter_granularity, iter_granularity, {0, 1, 2, 3, 4, 5, 6, 7}); // Enable access for all 8 GPUs
        }

        // std::cout << "Here we allocate one redundant page (2MB)..." << std::endl;
        // this is an extra optimization on top of the swizzling. It helps "break"
        // the physical access pattern even more. It can be left out if workload is already
        // performing at SOL with just swizzling.
        CUmemGenericAllocationHandle allocHandle_redundant;
        CHECK_CUDA(cuMemCreate(&allocHandle_redundant, granularity, &prop, 0));
        handles_redundant.push_back(allocHandle_redundant);
    }

    *info->devPtr = (void*)dptr;
    info->dptr = dptr;

    // Release each redundant allocation
    for (auto handle : handles_redundant) {
        // std::cout << "Here we release one redundant page (2MB)..." << std::endl;
        CHECK_CUDA(cuMemRelease(handle));
    }

    return cudaSuccess;
}

void customCudaFree(CustomAllocInfo* info) {
    if (!info) return;

    // CHECK_CUDA(cudaSetDevice(info->device_id));

    CHECK_CUDA(cuMemUnmap(info->dptr, info->padded_size));

    // Unmap and release each allocation
    for (auto handle : info->handles) {
        CHECK_CUDA(cuMemRelease(handle));
    }

    // Unreserve the virtual address space
    // CHECK_CUDA(cuMemAddressFree((CUdeviceptr)*info->devPtr, info->padded_size));
    CHECK_CUDA(cuMemAddressFree(info->dptr, info->padded_size));
}

cuBLAS 工作区#

对于每个 cuBLAS 句柄和 CUDA 流的组合，如果该句柄和流组合执行需要工作区的 cuBLAS 内核，则会分配一个 cuBLAS 工作区。为了避免重复分配工作区，除非调用 torch._C._cuda_clearCublasWorkspaces()，否则这些工作区不会被释放。每个分配的工作区大小可以通过环境变量 CUBLAS_WORKSPACE_CONFIG 来指定，格式为 :[SIZE]:[COUNT]。例如，每个分配的默认工作区大小为 CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8，它指定了总大小为 2 * 4096 + 8 * 16 KiB。要强制 cuBLAS 避免使用工作区，请将 CUBLAS_WORKSPACE_CONFIG=:0:0 设置为。

cuFFT 计划缓存#

对于每个 CUDA 设备，使用一个 LRU 缓存的 cuFFT 计划来加速在具有相同几何形状和配置的 CUDA 张量上重复运行 FFT 方法（例如，torch.fft.fft()）。由于一些 cuFFT 计划可能会分配 GPU 内存，因此这些缓存具有最大容量。

您可以使用以下 API 来控制和查询当前设备的缓存属性：

torch.backends.cuda.cufft_plan_cache.max_size 显示缓存的容量（CUDA 10 及更高版本默认为 4096，旧版 CUDA 默认为 1023）。直接设置此值可修改容量。
torch.backends.cuda.cufft_plan_cache.size 显示当前缓存中的计划数量。
torch.backends.cuda.cufft_plan_cache.clear() 清除缓存。

要控制和查询非默认设备的计划缓存，您可以索引 torch.backends.cuda.cufft_plan_cache 对象，使用 torch.device 对象或设备索引，并访问上述属性之一。例如，要设置设备 1 的缓存容量，您可以编写 torch.backends.cuda.cufft_plan_cache[1].max_size = 10。

即时编译#

PyTorch 会即时编译一些操作，例如 torch.special.zeta，当在 CUDA 张量上执行时。此编译可能耗时（取决于您的硬件和软件，最多几秒钟），并且对于单个算子可能发生多次，因为许多 PyTorch 算子实际上是从各种内核中选择的，每个内核都必须编译一次，具体取决于它们的输入。此编译在每个进程中发生一次，或者如果使用内核缓存，则仅发生一次。

默认情况下，PyTorch 在 $XDG_CACHE_HOME/torch/kernels 中创建一个内核缓存（如果定义了 XDG_CACHE_HOME），如果未定义，则在 $HOME/.cache/torch/kernels 中创建（Windows 除外，那里尚不支持内核缓存）。缓存行为可以通过两个环境变量直接控制。如果 USE_PYTORCH_KERNEL_CACHE 设置为 0，则不使用缓存，如果设置了 PYTORCH_KERNEL_CACHE_PATH，则该路径将用作内核缓存而不是默认位置。

最佳实践#

设备无关的代码#

由于 PyTorch 的结构，您可能需要显式编写设备无关（CPU 或 GPU）的代码；一个例子可能是创建一个新的张量作为循环神经网络的初始隐藏状态。

第一步是确定是否应使用 GPU。一种常见模式是使用 Python 的 argparse 模块读取用户参数，并有一个可以用于禁用 CUDA 的标志，结合 is_available()。在以下示例中，args.device 生成一个 torch.device 对象，可用于将张量移动到 CPU 或 CUDA。

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

注意

在评估给定环境中 CUDA 的可用性（is_available()）时，PyTorch 的默认行为是调用 CUDA Runtime API 方法 cudaGetDeviceCount。由于此调用反过来会初始化 CUDA 驱动 API（通过 cuInit）如果它尚未初始化，则后续的 fork 进程（已运行 is_available()）将因 CUDA 初始化错误而失败。

您可以在导入执行 is_available() 的 PyTorch 模块（或直接执行它）之前，在环境中设置 PYTORCH_NVML_BASED_CUDA_CHECK=1，以指示 is_available() 尝试基于 NVML 的评估（nvmlDeviceGetCount_v2）。如果基于 NVML 的评估成功（即 NVML 发现/初始化不失败），则 is_available() 调用不会毒害后续的 fork。

如果 NVML 发现/初始化失败，is_available() 将回退到标准的 CUDA Runtime API 评估，并且上述 fork 限制将适用。

请注意，上述基于 NVML 的 CUDA 可用性评估提供的保证比默认的 CUDA Runtime API 方法（需要 CUDA 初始化成功）要弱。在某些情况下，NVML 检查可能成功，但稍后的 CUDA 初始化会失败。

现在我们有了 args.device，我们可以使用它在所需的设备上创建一个张量。

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

这可以在多种情况下用于生成设备无关的代码。下面是使用数据加载器时的示例：

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

当在一个系统上使用多个 GPU 时，您可以使用 CUDA_VISIBLE_DEVICES 环境变量来管理 PyTorch 可用的 GPU。如上所述，要手动控制张量创建在哪一个 GPU 上，最佳实践是使用 torch.cuda.device 上下文管理器。

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0

如果您有一个张量并希望创建具有相同类型和同一设备的新张量，则可以使用 torch.Tensor.new_* 方法（请参阅 torch.Tensor）。虽然前面提到的 torch.* 工厂函数（创建操作）依赖于当前 GPU 上下文和您传入的属性参数，但 torch.Tensor.new_* 方法会保留张量的设备和其他属性。

当创建需要在前向传递中内部创建新张量的模块时，这是推荐的做法。

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])

如果您想创建与另一个张量具有相同类型和大小的张量，并用 1 或 0 填充它，则提供了 ones_like() 或 zeros_like() 作为方便的辅助函数（它们也保留了张量的 torch.device 和 torch.dtype）。

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3)

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)

使用固定内存缓冲区#

警告

这是一个高级技巧。如果您过度使用固定内存，在 RAM 不足时可能会导致严重问题，并且您应该意识到固定通常是一项昂贵的操作。

当主机到 GPU 的复制源自固定（页锁定）内存时，速度会快得多。CPU 张量和存储公开了一个 pin_memory() 方法，该方法返回对象的副本，并将数据放入固定区域。

此外，一旦您固定了张量或存储，您就可以使用异步 GPU 复制。只需将额外的 non_blocking=True 参数传递给 to() 或 cuda() 调用。这可用于将数据传输与计算重叠。

通过将 pin_memory=True 传递给构造函数，您可以使 DataLoader 返回放置在固定内存中的批次。

使用 nn.parallel.DistributedDataParallel 而不是 multiprocessing 或 nn.DataParallel#

涉及批处理输入和多个 GPU 的大多数用例应默认使用 DistributedDataParallel 来利用一个以上的 GPU。

在将 CUDA 模型与 multiprocessing 结合使用时存在显著的注意事项；除非小心满足数据处理要求，否则您的程序很可能出现不正确或未定义的行为。

建议使用 DistributedDataParallel，而不是 DataParallel 来进行多 GPU 训练，即使只有一个节点。

DistributedDataParallel 和 DataParallel 之间的区别是：DistributedDataParallel 使用多进程，为每个 GPU 创建一个进程，而 DataParallel 使用多线程。通过使用多进程，每个 GPU 都有其专用的进程，这避免了 Python 解释器 GIL 造成的性能开销。

如果您使用 DistributedDataParallel，您可以使用 torch.distributed.launch 工具来启动您的程序，请参阅启动工具。

CUDA 图#

CUDA 图是 CUDA 流及其依赖流执行的工作（主要是内核及其参数）的记录。有关基本原理和底层 CUDA API 的详细信息，请参阅 Getting Started with CUDA Graphs 和 CUDA C 编程指南的 Graphs 部分。

PyTorch 支持使用*流捕获*来构建 CUDA 图，这会将 CUDA 流置于*捕获模式*。向正在捕获的流发出的 CUDA 工作实际上不会在 GPU 上运行。相反，工作会被记录在一个图中。

捕获后，可以*重放*该图以根据需要多次运行 GPU 工作。每次重放都会使用相同的参数运行相同的内核。对于指针参数，这意味着使用相同的内存地址。通过在每次重放前用新数据（例如，来自新批次）填充输入内存，您可以使用新数据重新运行相同的工作。

为什么使用 CUDA 图？#

重放图会牺牲典型即时执行的动态灵活性，以换取**大大降低的 CPU 开销**。图的参数和内核是固定的，因此图重放会跳过所有参数设置和内核分派层，包括 Python、C++ 和 CUDA 驱动程序的开销。在底层，一次重放调用 cudaGraphLaunch 将整个图的工作提交给 GPU。重放中的内核在 GPU 上的执行速度也会略快，但消除 CPU 开销是主要好处。

如果您的网络全部或部分是图安全的（通常这意味着形状和控制流是静态的，但请参阅其他约束），并且您怀疑其运行时至少在一定程度上受 CPU 限制，则应尝试使用 CUDA 图。

PyTorch API#

警告

此 API 处于 Beta 版，未来版本中可能会更改。

PyTorch 通过原始的 torch.cuda.CUDAGraph 类和两个方便的包装器 torch.cuda.graph 和 torch.cuda.make_graphed_callables 来公开图。

torch.cuda.graph 是一个简单、通用的上下文管理器，它捕获其上下文中的 CUDA 工作。在捕获之前，通过运行几次即时迭代来预热要捕获的工作负载。预热必须在侧流上进行。由于图在每次重放时读取和写入相同的内存地址，因此您必须在捕获期间维护持有输入和输出数据的张量的长期引用。要对新输入数据运行图，请将新数据复制到捕获的输入张量(s) 中，重放图，然后从捕获的输出张量(s) 中读取新输出。示例：

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

有关实际和高级模式，请参阅整个网络捕获、与 torch.cuda.amp 的用法和与多个流的用法。

make_graphed_callables 更为复杂。make_graphed_callables 接受 Python 函数和 torch.nn.Module。对于每个传入的函数或模块，它会创建前向传递和后向传递工作的单独图。请参阅部分网络捕获。

约束#

一组操作是*可捕获*的，如果它不违反任何以下约束。

约束适用于 torch.cuda.graph 上下文中的所有工作，以及您传递给 torch.cuda.make_graphed_callables() 的任何可调用对象的正向和反向传递中的所有工作。

违反任何这些都会导致运行时错误。

捕获必须在非默认流上进行。（这仅在您使用原始的 CUDAGraph.capture_begin 和 CUDAGraph.capture_end 调用时才需要考虑。 graph 和 make_graphed_callables() 会为您设置一个侧流。）
禁止同步 CPU 与 GPU 的操作（例如，.item() 调用）。
CUDA RNG 操作是允许的，并且在图中使用多个 torch.Generator 实例时，必须在图捕获之前使用 CUDAGraph.register_generator_state 注册它们。在捕获期间避免使用 Generator.get_state 和 Generator.set_state；而是使用 Generator.graphsafe_set_state 和 Generator.graphsafe_get_state 来在图上下文中安全地管理生成器状态。这确保了 CUDA 图中正确的 RNG 操作和生成器管理。

违反任何这些将可能导致无声的数值错误或未定义行为。

在一个进程中，一次只能进行一次捕获。
在捕获进行时，不允许在此进程（在任何线程上）运行任何非捕获的 CUDA 工作。
CPU 工作不被捕获。如果捕获的操作包含 CPU 工作，这些工作将在重放期间被省略。
每次重放都读取和写入相同的（虚拟）内存地址。
禁止动态控制流（基于 CPU 或 GPU 数据）。
禁止动态形状。该图假定在每个重放中，捕获的操作序列中的每个张量都具有相同的大小和布局。
允许多个流在捕获中使用，但有限制。

非约束#

捕获后，图可以在任何流上重放。

整个网络捕获#

如果您的整个网络都可以捕获，您可以捕获并重放整个迭代：

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

部分网络捕获#

如果您的部分网络不适合捕获（例如，由于动态控制流、动态形状、CPU 同步或必要的 CPU 端逻辑），您可以将不安全的部分或全部运行为即时执行，并使用 torch.cuda.make_graphed_callables() 来仅捕获可捕获的部分。

默认情况下，make_graphed_callables() 返回的可调用对象是 autograd 感知的，并且可以在训练循环中直接替换您传入的函数或 nn.Module。

make_graphed_callables() 内部创建 CUDAGraph 对象，运行预热迭代，并维护所需的静态输入和输出。因此（与 torch.cuda.graph 不同），您不需要手动处理这些。

在以下示例中，数据依赖的动态控制流意味着网络无法端到端捕获，但 make_graphed_callables() 允许我们无论如何捕获和运行图安全的节点：

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

与 torch.cuda.amp 的用法#

对于典型的优化器，GradScaler.step 会将 CPU 与 GPU 同步，这在捕获期间是被禁止的。为了避免错误，请使用部分网络捕获，或者（如果前向、损失和后向是图安全的）捕获前向、损失和后向，但不捕获优化器步骤：

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

与多个流的用法#

捕获模式会自动传播到与捕获流同步的任何流。在捕获期间，您可以通过发出不同流的调用来暴露并行性，但总体流依赖 DAG 必须从初始捕获流分支出来，并在捕获结束前重新加入初始流。

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

注意

为了避免高级用户在 nsight systems 或 nvprof 中查看重放时产生混淆：与即时执行不同，图将非平凡的流 DAG 在捕获中解释为提示，而不是命令。在重放期间，图可能会将独立的操作重新组织到不同的流上，或者以不同的顺序将它们入队（同时尊重您原始 DAG 的整体依赖关系）。

与 DistributedDataParallel 的用法#

NCCL < 2.9.6#

早于 2.9.6 的 NCCL 版本不允许捕获集合操作。您必须使用部分网络捕获，它将所有归约推迟到图以外的部分后向传递。

在包装网络为 DDP 之前，对图可捕获的网络部分调用 make_graphed_callables()。

NCCL >= 2.9.6#

NCCL 版本 2.9.6 或更高版本允许在图中使用集合操作。捕获*整个后向传递*的方法是可行的选项，但需要三个设置步骤。

禁用 DDP 的内部异步错误处理

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

在完全后向捕获之前，DDP 必须在侧流上下文中构建

with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

您的预热必须在捕获之前至少运行 11 次 DDP 启用的即时迭代。

图内存管理#

捕获的图在每次重放时都作用于相同的虚拟地址。如果 PyTorch 释放了内存，后续的重放可能会遇到非法内存访问。如果 PyTorch 将内存重新分配给新的张量，重放可能会破坏那些张量看到的值。因此，图使用的虚拟地址必须在重放之间为图保留。PyTorch 缓存分配器通过检测捕获何时正在进行，并从图私有内存池满足捕获的分配来实现这一点。私有池一直存在，直到其 CUDAGraph 对象和捕获期间创建的所有张量离开作用域。

私有池是自动维护的。默认情况下，分配器为每个捕获创建一个单独的私有池。如果您捕获多个图，这种保守的方法可以确保图重放永远不会破坏彼此的值，但有时会不必要地浪费内存。

CUDA 语义#

Ampere (及更高版本) 设备上的 TensorFloat-32 (TF32)#

FP16 GEMM 中的低精度归约#

BF16 GEMM 中的低精度归约#

FP16 GEMM 中的完整 FP16 累加#

异步执行#

CUDA 流#

反向传播的流语义#

BC 注意：在默认流上使用梯度#

内存管理#

使用 PYTORCH_CUDA_ALLOC_CONF 优化内存使用#

为 CUDA 使用自定义内存分配器#

在同一个程序中混合不同的 CUDA 系统分配器#

在 H100/H200 GPU 上使用自定义内存分配器调整 NVLink 性能#

cuBLAS 工作区#

cuFFT 计划缓存#

即时编译#

最佳实践#

设备无关的代码#

使用固定内存缓冲区#

使用 nn.parallel.DistributedDataParallel 而不是 multiprocessing 或 nn.DataParallel#

CUDA 图#

为什么使用 CUDA 图？#

PyTorch API#

约束#

非约束#

整个网络捕获#

部分网络捕获#

与 torch.cuda.amp 的用法#

与多个流的用法#

与 DistributedDataParallel 的用法#

NCCL < 2.9.6#

NCCL >= 2.9.6#

图内存管理#

跨捕获共享内存#

文档

教程

资源

使用 `PYTORCH_CUDA_ALLOC_CONF` 优化内存使用#