Optical Flow: Predicting Movement with the RAFT Model¶

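Optical flow is the task of predicting movement between two images, usually two consecutive frames of a video. Optical flow models take two images as input and predict a flow: the flow gives the displacement of every pixel in the first image and maps it to its corresponding pixel in the second image. Flows are tensors of shape (2, H, W), where the first dimension holds the predicted horizontal and vertical displacements.

The following example shows how to predict flows with torchvision and its implementation of the RAFT model. It also shows how to convert the predicted flows to RGB images for visualization.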

import numpy as np
import torch
import matplotlib.pyplot as plt
import torchvision.transforms.functional as F


plt.rcParams["savefig.bbox"] = "tight"


def plot(imgs, **imshow_kwargs):
    if not isinstance(imgs[0], list):
        # Make a 2d grid even if there's just 1 row
        imgs = [imgs]

    num_rows = len(imgs)
    num_cols = len(imgs[0])
    _, axs = plt.subplots(nrows=num_rows, ncols=num_cols, squeeze=False)
    for row_idx, row in enumerate(imgs):
        for col_idx, img in enumerate(row):
            ax = axs[row_idx, col_idx]
            img = F.to_pil_image(img.to("cpu"))
            ax.imshow(np.asarray(img), **imshow_kwargs)
            ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])

    plt.tight_layout()

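Reading Videos Using Torchvision¶

We first read a video using read_video(). Alternatively, one can use the new VideoReader API (if torchvision is built from source). The video used here is freely available from pexels.com; credits go to Pavel Danilyuk.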

import tempfile
from pathlib import Path
from urllib.request import urlretrieve


video_url = "https://download.pytorch.org/tutorial/pexelscom_pavel_danilyuk_basketball_hd.mp4"
video_path = Path(tempfile.mkdtemp()) / "basketball.mp4"
_ = urlretrieve(video_url, video_path)

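read_video() returns the video frames, the audio frames, and the metadata associated with the video. In this example we only need the video frames.

Here we make just two predictions, between two pre-selected pairs of frames: (100, 101) and (150, 151). Each pair corresponds to a single model input.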

from torchvision.io import read_video
frames, _, _ = read_video(str(video_path), output_format="TCHW")

img1_batch = torch.stack([frames[100], frames[150]])
img2_batch = torch.stack([frames[101], frames[151]])

plot(img1_batch)

/pytorch/vision/torchvision/io/_video_deprecation_warning.py:9: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/pytorch/vision/torchvision/io/video.py:199: UserWarning: The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.
  warnings.warn("The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.")

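The RAFT model accepts RGB images. We first take the frames returned by read_video() and resize them so that their dimensions are divisible by 8. Note that we explicitly use antialias=False, because this is how the models were trained. We then preprocess the inputs with the transforms bundled with the weights, which rescale the values to the required [-1, 1] interval.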

from torchvision.models.optical_flow import Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()


def preprocess(img1_batch, img2_batch):
    img1_batch = F.resize(img1_batch, size=[520, 960], antialias=False)
    img2_batch = F.resize(img2_batch, size=[520, 960], antialias=False)
    return transforms(img1_batch, img2_batch)


img1_batch, img2_batch = preprocess(img1_batch, img2_batch)

print(f"shape = {img1_batch.shape}, dtype = {img1_batch.dtype}")

shape = torch.Size([2, 3, 520, 960]), dtype = torch.float32
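
The size [520, 960] above was chosen for this particular video. For other inputs, one possible way to derive a size whose dimensions are divisible by 8 is sketched below. The helper name `floor_to_multiple` is hypothetical and not part of torchvision; rounding down is only one reasonable choice.

```python
def floor_to_multiple(value, multiple=8):
    """Round value down to the nearest positive multiple of `multiple`."""
    return max(multiple, (value // multiple) * multiple)


# Example: derive a RAFT-compatible size for a few arbitrary (height, width) pairs.
for height, width in [(720, 1280), (525, 961), (100, 7)]:
    size = [floor_to_multiple(height), floor_to_multiple(width)]
    print((height, width), "->", size)
```

The resulting list could be passed as the `size` argument of `F.resize` inside `preprocess`. Note that rounding each side independently may change the aspect ratio slightly.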

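Estimating Optical Flow Using RAFT¶

We use the RAFT implementation from raft_large(), which follows the same architecture as the one described in the original paper. We also provide the raft_small() model builder, which is smaller and faster to run at the cost of some accuracy.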

from torchvision.models.optical_flow import raft_large

# If you can, run this example on a GPU, it will be a lot faster.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = raft_large(weights=Raft_Large_Weights.DEFAULT, progress=False).to(device)
model = model.eval()

list_of_flows = model(img1_batch.to(device), img2_batch.to(device))
print(f"type = {type(list_of_flows)}")
print(f"length = {len(list_of_flows)} = number of iterations of the model")

Downloading: "https://download.pytorch.org/models/raft_large_C_T_SKHT_V2-ff5fadd5.pth" to /root/.cache/torch/hub/checkpoints/raft_large_C_T_SKHT_V2-ff5fadd5.pth
type = <class 'list'>
length = 12 = number of iterations of the model

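The RAFT model outputs a list of predicted flows, where each entry is an (N, 2, H, W) batch of predicted flows that corresponds to a given "iteration" of the model. For more details on the iterative nature of the model, please refer to the original paper. Here we are only interested in the final predicted flows (they are the most accurate ones), so we retrieve only the last entry of the list.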

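As described above, a flow is a tensor of shape (2, H, W) (or (N, 2, H, W) for batches of flows), where each entry holds the horizontal and vertical displacement of a pixel from the first image to the second image. Note that the predicted flows are in "pixel" units; they are not normalized with respect to the dimensions of the images.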

predicted_flows = list_of_flows[-1]
print(f"dtype = {predicted_flows.dtype}")
print(f"shape = {predicted_flows.shape} = (N, 2, H, W)")
print(f"min = {predicted_flows.min()}, max = {predicted_flows.max()}")

dtype = torch.float32
shape = torch.Size([2, 2, 520, 960]) = (N, 2, H, W)
min = -3.8997151851654053, max = 6.400382995605469
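
Because the flow is expressed in pixels, the position a pixel moves to can be obtained by adding the flow to a grid of pixel coordinates. The sketch below illustrates this on a small synthetic flow. The function `target_coordinates` and the tensor `toy_flow` are hypothetical names, and the sketch assumes the usual image convention in which channel 0 is the horizontal (x) displacement, channel 1 is the vertical (y) displacement, and y grows downward.

```python
import torch


def target_coordinates(flow):
    """Map a (2, H, W) flow to the (x, y) position of each pixel in the second image."""
    _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype),
        torch.arange(w, dtype=flow.dtype),
        indexing="ij",
    )
    grid = torch.stack([xs, ys])  # channel 0 = x, channel 1 = y, same layout as the flow
    return grid + flow


# Synthetic flow: every pixel moves 1.5 px to the right and 2 px toward row 0.
toy_flow = torch.zeros(2, 4, 6)
toy_flow[0] = 1.5
toy_flow[1] = -2.0

coords = target_coordinates(toy_flow)
magnitude = torch.linalg.norm(toy_flow, dim=0)  # per-pixel displacement length in pixels

print(coords[:, 3, 2])  # target (x, y) of the pixel at row 3, column 2
print(magnitude[0, 0])
```

Dividing the horizontal channel by W and the vertical channel by H would give displacements relative to the image size, should a normalized representation be needed.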

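Visualizing Predicted Flows¶

Torchvision provides the flow_to_image() utility to convert a flow into an RGB image. It also supports batches of flows. Each "direction" in the flow is mapped to a given RGB color. In the images below, pixels with similar colors are predicted by the model to be moving in similar directions. The model correctly predicts the movement of the ball and the player. Note in particular the different predicted directions of the ball in the first image (moving left) and in the second image (moving up).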

from torchvision.utils import flow_to_image

flow_imgs = flow_to_image(predicted_flows)

# The images have been mapped into [-1, 1] but for plotting we want them in [0, 1]
img1_batch = [(img1 + 1) / 2 for img1 in img1_batch]

grid = [[img1, flow_img] for (img1, flow_img) in zip(img1_batch, flow_imgs)]
plot(grid)

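Bonus: Creating GIFs of Predicted Flows¶

In the example above we have only shown the predicted flows of 2 pairs of frames. A fun way to apply optical flow models is to run the model on an entire video and create a new video from all the predicted flows. Below is a snippet that can get you started. The code is commented out because this example is rendered on a machine without a GPU, and running it would take too long.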

# from torchvision.io import write_jpeg
# for i, (img1, img2) in enumerate(zip(frames, frames[1:])):
#     # Note: it would be faster to predict batches of flows instead of individual flows
#     # The model expects batched inputs, so add a batch dimension of size 1
#     img1, img2 = preprocess(img1[None], img2[None])
#
#     with torch.no_grad():  # inference only, no gradients needed
#         list_of_flows = model(img1.to(device), img2.to(device))
#     predicted_flow = list_of_flows[-1][0]
#     flow_img = flow_to_image(predicted_flow).to("cpu")
#     output_folder = "/tmp/"  # Update this to the folder of your choice
#     write_jpeg(flow_img, output_folder + f"predicted_flow_{i}.jpg")
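
As the note in the snippet suggests, predicting several pairs at once is usually faster than predicting them one by one. A minimal sketch of how consecutive frame pairs could be grouped into batches is shown below; `consecutive_pair_batches` is a hypothetical helper, not a torchvision function, and the batch size would need to be tuned to the available memory.

```python
def consecutive_pair_batches(num_frames, batch_size):
    """Yield (first_idx, second_idx) index lists covering all pairs (i, i + 1)."""
    num_pairs = num_frames - 1
    for start in range(0, num_pairs, batch_size):
        first = list(range(start, min(start + batch_size, num_pairs)))
        second = [i + 1 for i in first]
        yield first, second


# Example with 6 frames and batches of at most 2 pairs.
for first, second in consecutive_pair_batches(num_frames=6, batch_size=2):
    print(first, second)
```

With the video tensor from above, `frames[first]` and `frames[second]` would then give two batches that can be passed to `preprocess` and to the model in a single call.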

Once the .jpg flow images are saved, you can convert them into a video or a GIF using ffmpeg, for example:

ffmpeg -f image2 -framerate 30 -i predicted_flow_%d.jpg -loop -1 flow.gif
