
Tensor CUDA Stream API#

A CUDA Stream is a linear sequence of execution that belongs to a specific CUDA device. The PyTorch C++ API supports CUDA streams with the CUDAStream class and useful helper functions to make streaming operations easy. You can find them in CUDAStream.h. This note provides more details on how to use the PyTorch C++ CUDA Stream APIs.

Acquiring CUDA Stream#

PyTorch's C++ API provides the following ways to acquire a CUDA stream:

  1. Acquire a new stream from the CUDA stream pool. Streams are preallocated in the pool and returned in round-robin order.

CUDAStream getStreamFromPool(const bool isHighPriority = false, DeviceIndex device = -1);

Tip

You can request a stream from the high-priority pool by setting `isHighPriority` to `true`, or request a stream for a specific device by setting its device index (which defaults to the device index of the current CUDA stream).

  2. Acquire the default CUDA stream for the passed CUDA device, or for the current device if no device index is passed.

CUDAStream getDefaultCUDAStream(DeviceIndex device_index = -1);

Tip

The default stream is where most computation occurs when you aren't explicitly using streams.

  3. Acquire the current CUDA stream for the passed CUDA device, or for the current device if no device index is passed.

CUDAStream getCurrentCUDAStream(DeviceIndex device_index = -1);

Tip

The current CUDA stream will usually be the default CUDA stream for the device, but it may be different if someone has called `setCurrentCUDAStream`, or used `StreamGuard` or `CUDAStreamGuard`.
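Putting the three getters together, a minimal sketch (assuming a machine with at least one CUDA device) might look like:

```cpp
#include <iostream>
#include <c10/cuda/CUDAStream.h>
#include <c10/util/Exception.h>

int main() {
  // Before any stream has been set, the current stream on a device
  // is that device's default stream.
  at::cuda::CUDAStream defaultStream = at::cuda::getDefaultCUDAStream();
  at::cuda::CUDAStream currentStream = at::cuda::getCurrentCUDAStream();
  // CUDAStream supports equality comparison, so streams can be compared directly
  TORCH_CHECK(currentStream == defaultStream);

  // A stream acquired from the pool is distinct from the default stream
  at::cuda::CUDAStream poolStream = at::cuda::getStreamFromPool();
  TORCH_CHECK(poolStream != defaultStream);

  // Each stream knows its device index and has a stream id
  std::cout << "device " << poolStream.device_index()
            << ", stream id " << poolStream.id() << std::endl;
  return 0;
}
```

This is only a sketch; it must be compiled against libtorch and run on a machine with a CUDA device.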

Set CUDA Stream#

PyTorch's C++ API provides the following ways to set a CUDA stream:

  1. Set the current stream on the device of the passed-in stream to be the passed-in stream.

void setCurrentCUDAStream(CUDAStream stream);

Note

This function may have nothing to do with the current device. It only changes the current stream on the stream's device. We recommend using `CUDAStreamGuard` instead, since it switches to the stream's device and makes it the current stream on that device. `CUDAStreamGuard` will also restore the current device and stream when it's destroyed.
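The distinction can be illustrated with a short sketch (assuming a machine with at least two CUDA devices):

```cpp
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
  // Acquire a pool stream that lives on device 1 while the current device is 0
  at::cuda::CUDAStream stream1 = at::cuda::getStreamFromPool(false, 1);

  // `setCurrentCUDAStream` acts on the stream's device, not the current device:
  // after this call the current device is still 0, but device 1's current
  // stream is now `stream1`.
  at::cuda::setCurrentCUDAStream(stream1);

  // `CUDAStreamGuard` switches both the device and the stream, and restores
  // both when it goes out of scope:
  {
    at::cuda::CUDAStreamGuard guard(stream1);
    // here the current device is 1 and its current stream is `stream1`
  }
  // here the previous current device and stream are restored
  return 0;
}
```

This is only a sketch; it must be compiled against libtorch and run on a machine with two CUDA devices.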

  2. Use `CUDAStreamGuard` to switch to a CUDA stream within a scope. It is defined in CUDAStreamGuard.h.

Tip

Use `CUDAMultiStreamGuard` if you need to set streams on multiple CUDA devices.

CUDA Stream Usage Examples#

  1. Acquiring and setting a CUDA stream on the same device

// This example shows how to acquire and set CUDA stream on the same device.
// `at::cuda::setCurrentCUDAStream` is used to set current CUDA stream

// create a tensor on device 0
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(torch::kCUDA));
// get a new CUDA stream from CUDA stream pool on device 0
at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
// set current CUDA stream from default stream to `myStream` on device 0
at::cuda::setCurrentCUDAStream(myStream);
// sum() on tensor0 uses `myStream` as current CUDA stream
tensor0.sum();

// get the default CUDA stream on device 0
at::cuda::CUDAStream defaultStream = at::cuda::getDefaultCUDAStream();
// set current CUDA stream back to default CUDA stream on device 0
at::cuda::setCurrentCUDAStream(defaultStream);
// sum() on tensor0 uses `defaultStream` as current CUDA stream
tensor0.sum();

// This example is the same as the previous one, but explicitly specifies the
// device index and uses a CUDA stream guard to set the current CUDA stream

// create a tensor on device 0
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(torch::kCUDA));
// get a new stream from CUDA stream pool on device 0
at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool(false, 0);
// set the current CUDA stream to `myStream` within the scope using CUDA stream guard
{
  at::cuda::CUDAStreamGuard guard(myStream);
  // current CUDA stream is `myStream` from here till the end of bracket.
  // sum() on tensor0 uses `myStream` as current CUDA stream
  tensor0.sum();
}
// current CUDA stream is reset to default CUDA stream after CUDA stream guard is destroyed
// sum() on tensor0 uses default CUDA stream on device 0 as current CUDA stream
tensor0.sum();

Note

The code above runs on a single CUDA device. `setCurrentCUDAStream` will always set the current CUDA stream on the current device here, but note that `setCurrentCUDAStream` actually sets the current stream on the device of the passed-in CUDA stream.

  2. Acquiring and setting CUDA streams on multiple devices

// This example shows how to acquire and set CUDA stream on two devices.

// acquire new CUDA streams from CUDA stream pool on device 0 and device 1
at::cuda::CUDAStream myStream0 = at::cuda::getStreamFromPool(false, 0);
at::cuda::CUDAStream myStream1 = at::cuda::getStreamFromPool(false, 1);

// set current CUDA stream to `myStream0` on device 0
at::cuda::setCurrentCUDAStream(myStream0);
// set current CUDA stream to `myStream1` on device 1
at::cuda::setCurrentCUDAStream(myStream1);

// create a tensor on device 0, no need to specify device index since
// current device index is 0
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(at::kCUDA));
// sum() on tensor0 uses `myStream0` as current CUDA stream on device 0
tensor0.sum();

// change the current device index to 1 by using CUDA device guard within a bracket scope
{
  at::cuda::CUDAGuard device_guard{1};
  // create a tensor on device 1
  torch::Tensor tensor1 = torch::ones({2, 2}, torch::device(at::kCUDA));
  // sum() on tensor 1 uses `myStream1` as current CUDA stream on device 1
  tensor1.sum();
}

// current device is reset to device 0 after device_guard is destroyed

// acquire a new CUDA stream on device 1
at::cuda::CUDAStream myStream1_1 = at::cuda::getStreamFromPool(false, 1);
// create a new tensor on device 1
torch::Tensor tensor1 = torch::ones({2, 2}, torch::device({torch::kCUDA, 1}));

// change the current device index to 1 and current CUDA stream on device 1
// to `myStream1_1` using CUDA stream guard within a scope
{
  at::cuda::CUDAStreamGuard stream_guard(myStream1_1);
  // sum() on tensor1 uses `myStream1_1` as current CUDA stream on device 1
  tensor1.sum();
}

// current device is reset to device 0 and current CUDA stream on device 1 is
// reset to `myStream1`

// sum() on tensor1 uses `myStream1` as current CUDA stream on device 1
tensor1.sum();

  3. Using the CUDA multistream guard

// This example shows how to use CUDA multistream guard to set
// two streams on two devices at the same time.

// create two tensors, one on device 0, one on device 1
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device({torch::kCUDA, 0}));
torch::Tensor tensor1 = torch::ones({2, 2}, torch::device({torch::kCUDA, 1}));

// acquire new CUDA streams from CUDA stream pool on device 0 and device 1
at::cuda::CUDAStream myStream0 = at::cuda::getStreamFromPool(false, 0);
at::cuda::CUDAStream myStream1 = at::cuda::getStreamFromPool(false, 1);

// set current CUDA stream on device 0 to `myStream0` and
// current CUDA stream on device 1 to `myStream1` using CUDA multistream guard
{
  at::cuda::CUDAMultiStreamGuard multi_guard({myStream0, myStream1});

  // sum() on tensor0 uses `myStream0` as current CUDA stream on device 0
  tensor0.sum();
  // sum() on tensor1 uses `myStream1` as current CUDA stream on device 1
  tensor1.sum();
}

// current CUDA stream on device 0 is reset to default CUDA stream on device 0
// current CUDA stream on device 1 is reset to default CUDA stream on device 1

// sum() on tensor0 uses default CUDA stream as current CUDA stream on device 0
tensor0.sum();
// sum() on tensor1 uses default CUDA stream as current CUDA stream on device 1
tensor1.sum();

Note

`CUDAMultiStreamGuard` does not change the current device index; it only changes the current stream on the device of each passed-in stream. Aside from the scoping, this guard is equivalent to calling `setCurrentCUDAStream` on each of the passed-in streams.

  4. A skeleton example for handling CUDA streams on multiple devices

// This is a skeleton example that shows how to handle CUDA streams on multiple devices
// Suppose you want to do work on the non-default stream on two devices simultaneously, and we
// already have streams on both devices in two vectors. The following code shows three ways
// of acquiring and setting the streams.

// Usage 0: acquire CUDA stream and set current CUDA stream with `setCurrentCUDAStream`
// Create a CUDA stream vector `streams0` on device 0
std::vector<at::cuda::CUDAStream> streams0 =
  {at::cuda::getDefaultCUDAStream(), at::cuda::getStreamFromPool()};
// set current stream as `streams0[0]` on device 0
at::cuda::setCurrentCUDAStream(streams0[0]);

// create a CUDA stream vector `streams1` on device 1 using CUDA device guard
std::vector<at::cuda::CUDAStream> streams1;
{
  // device index is set to 1 within this scope
  at::cuda::CUDAGuard device_guard(1);
  streams1.push_back(at::cuda::getDefaultCUDAStream());
  streams1.push_back(at::cuda::getStreamFromPool());
}
// device index is reset to 0 after device_guard is destroyed

// set current stream as `streams1[0]` on device 1
at::cuda::setCurrentCUDAStream(streams1[0]);


// Usage 1: use CUDA device guard to change the current device index only
{
  at::cuda::CUDAGuard device_guard(1);

  // current device index is changed to 1 within scope
  // current CUDA stream is still `streams1[0]` on device 1, no change
}
// current device index is reset to 0 after `device_guard` is destroyed


// Usage 2: use CUDA stream guard to change both current device index and current CUDA stream.
{
  at::cuda::CUDAStreamGuard stream_guard(streams1[1]);

  // current device index and current CUDA stream are set to 1 and `streams1[1]` within scope
}
// current device index and current CUDA stream are reset to 0 and `streams0[0]` after
// stream_guard is destroyed


// Usage 3: use CUDA multi-stream guard to change multiple streams on multiple devices
{
  // This is the same as calling `at::cuda::setCurrentCUDAStream` on both streams
  at::cuda::CUDAMultiStreamGuard multi_guard({streams0[1], streams1[1]});

  // current device index is not changed, still 0
  // current CUDA stream on device 0 and device 1 are set to `streams0[1]` and `streams1[1]`
}
// current CUDA stream on device 0 and device 1 are reset to `streams0[0]` and `streams1[0]`
// after `multi_guard` is destroyed.