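Efficiently writing "sparse" semantics for Adagrad with MaskedTensor

Before working through this tutorial, please review the MaskedTensor Overview and Sparsity tutorials.

Introduction and Motivation

Issue 1369 discussed the additional lines of code that were introduced while writing "sparse" semantics for Adagrad. In reality, that code uses sparsity as a proxy for masked semantics rather than for what sparsity is intended to be: a compression and optimization technique. Previously, we worked around the lack of formal masked semantics by introducing one-off semantics and operators, while forcing users to be aware of storage details such as indices and values.

Now that we have masked semantics, we are better equipped to point out when sparsity is used as a semantic extension. We will also compare and contrast this with equivalent code written using MaskedTensor. At the end, the code snippets are repeated without additional comments to show the difference in brevity.

Preparation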

# Disable prototype warnings and such

# Some hyperparameters
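
The setup code itself is not reproduced on this page. A minimal sketch of what the later snippets rely on might look like the following; the values of `eps` and `clr` and the contents of `grad` are illustrative assumptions, not values stated in this text.

```python
import warnings

import torch

# Disable prototype warnings and such (MaskedTensor is a prototype feature)
warnings.filterwarnings(action="ignore", category=UserWarning)

# Some hyperparameters (assumed values, chosen only for illustration)
eps = 1e-10
clr = 0.1

# A small sparse COO gradient with three specified entries (assumed example data)
indices = torch.tensor([[0, 1, 1], [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0])
grad = torch.sparse_coo_tensor(indices, values, (2, 4))
```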

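Simplifying Code with MaskedTensor

Before we dive in, let's describe the problem more concretely. We will look at the Adagrad (functional) implementation in PyTorch, with the ultimate goal of simplifying it and representing a masked approach more faithfully.

For reference, this is the regular, dense code path without masked gradients or sparsity: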

state_sum.addcmul_(grad, grad, value=1)
std = state_sum.sqrt().add_(eps)
param.addcdiv_(grad, std, value=-clr)
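
To see the dense path run end to end, here is a small self-contained sketch. The hyperparameters, the initial accumulator value of 0.5, and the gradient values are assumptions made for illustration.

```python
import torch

eps, clr = 1e-10, 0.1  # assumed hyperparameters

param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # assumed initial accumulator value
grad = torch.tensor([[0.0, 0.0, 3.0, 0.0],
                     [4.0, 0.0, 5.0, 0.0]])  # dense gradient, zeros stored explicitly

state_sum.addcmul_(grad, grad, value=1)  # state_sum += grad * grad
std = state_sum.sqrt().add_(eps)         # eps is added to every entry
param.addcdiv_(grad, std, value=-clr)    # param -= clr * grad / std
```

Entries where the gradient is zero leave `param` unchanged, although `state_sum` and `std` are still computed for every position.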

The vanilla tensor implementation for sparse gradients is:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))   # a different _make_sparse per layout
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

With `MaskedTensor`, the code reduces to the following snippet:

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)
std2 = std2.sqrt().add(eps)
param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

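In this tutorial we will go through each implementation line by line, but at first glance we can notice (1) how much shorter the MaskedTensor implementation is, and (2) how it avoids conversions between dense and sparse tensors.

Original Sparse Implementation

Now, let's break down the code with some inline comments: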

# We don't support sparse gradients

# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero

# We take care to make std sparse, even though state_sum clearly is not.
# This means that we're only applying the gradient to parts of the state_sum
# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.
# We currently dodge all these concerns using the private method `_values`.

# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,
# so we're forced to perform `grad_values / std_values` outside the sparse semantic and then convert back to a
# sparse tensor with `_make_sparse`.
# We'll later see that MaskedTensor will actually handle these operations for us as well as properly denote
# undefined / undefined = undefined!
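
The comments above annotate the sparse code path shown earlier. A runnable sketch that puts them back next to the statements they describe could look like this; the hyperparameters, initial state, and gradient values are assumptions for illustration.

```python
import torch

eps, clr = 1e-10, 0.1  # assumed hyperparameters


def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)


# Assumed example gradient with three specified entries
grad = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3.0, 4.0, 5.0]), (2, 4)
)

# The parameters and the accumulator are dense; only the gradient is sparse
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # assumed initial value for state sum

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

# pow(2) has the same semantics for sparse and dense layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))

# std is made sparse even though state_sum is not: only specified entries are used
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

# Division happens on the raw values, then the result is wrapped as sparse again
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)
```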

The third to last line, `std = state_sum.sparse_mask(grad)`, is where we have a very important divergence.

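The addition of eps should technically be applied to all values, but here it is only applied to specified values. Here we are using sparsity as a semantic extension and enforcing a certain pattern of defined and undefined values. If parts of the gradient are zero, they are still included if materialized, even though they could be compressed away by other sparse storage layouts. This is theoretically quite brittle! That said, one could argue that eps is always very small, so it might not matter much in practice.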
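
The brittleness can be made concrete with a gradient that stores an explicit zero. In the sketch below (example data and `eps` are assumptions), the zero stays a specified entry after `coalesce()`, so `sparse_mask` and the `eps` addition treat it as defined, while a mask derived from the nonzero values would not.

```python
import torch

eps = 1e-10  # assumed hyperparameter

# Sparse COO gradient whose second stored value happens to be exactly zero
grad = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3.0, 0.0, 5.0]), (2, 4)
).coalesce()

state_sum = torch.full((2, 4), 0.5)

# The explicit zero still counts as a specified entry in COO storage
num_specified = grad._nnz()
num_nonzero = int((grad.to_dense() != 0).sum())

# sparse_mask keeps state_sum at every specified position, explicit zero included
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

print(num_specified, num_nonzero, std_values.numel())
```

Which positions receive `eps` therefore depends on how the gradient happened to be stored, not only on its mathematical value.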

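Moreover, an implementation of `add_` that treats sparsity as a storage layout and compression scheme should cause densification, but we force it not to for performance. For this one-off case that is fine... until we want to introduce new compression schemes such as CSC, BSR, or BSC. We would then need to introduce separate tensor types for each and write variations for gradients compressed with each storage format, which is inconvenient and neither scalable nor clean.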
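
As a rough illustration of the per-layout duplication, compare building the squared gradient for COO and for CSR storage. The helper `_make_sparse_csr` is hypothetical and exists only in this sketch; it is not part of PyTorch or of the Adagrad implementation.

```python
import torch

dense_grad = torch.tensor([[0.0, 0.0, 3.0, 0.0],
                           [4.0, 0.0, 5.0, 0.0]])  # assumed example data

coo = dense_grad.to_sparse().coalesce()  # COO: one (row, col) index pair per value
csr = dense_grad.to_sparse_csr()         # CSR: compressed row pointers plus column indices


# Hypothetical CSR counterpart of _make_sparse, shown only to illustrate the duplication
def _make_sparse_csr(grad, values):
    return torch.sparse_csr_tensor(
        grad.crow_indices(), grad.col_indices(), values, size=grad.size()
    )


# The same logical operation needs layout-specific accessors and constructors
squared_coo = torch.sparse_coo_tensor(coo.indices(), coo.values().pow(2), coo.size())
squared_csr = _make_sparse_csr(csr, csr.values().pow(2))
```

Each additional layout would need its own variant of this helper, which is the scaling problem described above.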

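MaskedTensor Sparse Implementation

We have been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch. MaskedTensor proposes to disentangle the sparsity optimization from the semantic extension; for example, currently we cannot have dense semantics with sparse storage, or masked semantics with dense storage. MaskedTensor enables these ideas by deliberately separating storage from semantics.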
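
A small sketch of masked semantics on ordinary dense storage may help here. It assumes the prototype `torch.masked` API (`masked_tensor`, `get_mask`, `to_tensor`) is available in your PyTorch version, and the data is made up for illustration. Dividing the masked tensor by itself yields 1 wherever a value is specified and stays undefined everywhere else, instead of producing a visible 0 / 0.

```python
import warnings

import torch
from torch.masked import masked_tensor

# MaskedTensor is a prototype feature and emits UserWarnings
warnings.filterwarnings(action="ignore", category=UserWarning)

data = torch.tensor([[0.0, 0.0, 3.0, 0.0],
                     [4.0, 0.0, 5.0, 0.0]])  # assumed example data
mask = data != 0  # True marks a specified value

# Masked semantics on top of plain dense (strided) storage
masked_grad = masked_tensor(data, mask)

# specified / specified is computed; undefined / undefined stays undefined
ratio = masked_grad / masked_grad

# Fill the undefined entries with 0.0 to get a regular tensor back
filled = ratio.to_tensor(0.0)
```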

Consider the above example using a masked gradient:

# Let's now import MaskedTensor!

# Create an entirely new set of parameters to avoid errors

# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
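
Putting these comments back next to code, a runnable sketch of the MaskedTensor version could look as follows. It assumes the prototype `torch.masked` API is available; the hyperparameters, initial state, and gradient values are assumptions for illustration.

```python
import warnings

import torch
from torch.masked import masked_tensor

warnings.filterwarnings(action="ignore", category=UserWarning)

eps, clr = 1e-10, 0.1  # assumed hyperparameters

# Assumed example gradient with three specified entries
grad = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3.0, 4.0, 5.0]), (2, 4)
).coalesce()

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param2, 0.5)  # assumed initial value for state sum

# The mask records which gradient entries are specified
mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# Out-of-place ops only; no access to indices or values is needed
std2 = std2.sqrt().add(eps)

param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)
```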

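Note that the implementations look quite similar, but the MaskedTensor implementation is shorter and simpler. In particular, much of the boilerplate around `_make_sparse` (and the need for a separate implementation per layout) is handled for the user by `MaskedTensor`.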

At this point, let's print both this version and the original version for easier comparison:
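
One way to make that comparison is to wrap each variant in a small out-of-place function and run all of them on the same inputs. The function names `step_dense`, `step_sparse`, and `step_masked` are hypothetical, and the hyperparameters and data are assumptions; this is a sketch, not the code used by PyTorch's optimizer.

```python
import warnings

import torch
from torch.masked import masked_tensor

warnings.filterwarnings(action="ignore", category=UserWarning)

eps, clr = 1e-10, 0.1  # assumed hyperparameters


def step_dense(param, state_sum, grad):
    dense_grad = grad.to_dense()
    state_sum = state_sum + dense_grad * dense_grad
    std = state_sum.sqrt() + eps
    return param - clr * dense_grad / std


def step_sparse(param, state_sum, grad):
    indices, values, size = grad._indices(), grad._values(), grad.size()
    state_sum = state_sum + torch.sparse_coo_tensor(indices, values.pow(2), size)
    std_values = state_sum.sparse_mask(grad)._values().sqrt_().add_(eps)
    update = torch.sparse_coo_tensor(indices, values / std_values, size)
    return param.add(update, alpha=-clr)


def step_masked(param, state_sum, grad):
    mask = (grad.to_dense() != 0).to_sparse()
    masked_grad = masked_tensor(grad, mask)
    state_sum = state_sum + masked_grad.pow(2).get_data()
    std = masked_tensor(state_sum.to_sparse(), mask).sqrt().add(eps)
    return param.add((masked_grad / std).get_data(), alpha=-clr)


grad = torch.sparse_coo_tensor(
    torch.tensor([[0, 1, 1], [2, 0, 2]]), torch.tensor([3.0, 4.0, 5.0]), (2, 4)
).coalesce()
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)

results = {f.__name__: f(param, state_sum, grad) for f in (step_dense, step_sparse, step_masked)}
for name, value in results.items():
    print(name, value, sep="\n")
```

On this input all three variants produce the same parameters, because positions without a specified gradient contribute nothing to the update.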

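Conclusion

In this tutorial, we discussed how native masked semantics can enable a cleaner developer experience for Adagrad's existing implementation in PyTorch, which used sparsity as a proxy for writing masked semantics. More importantly, making masked semantics a first-class citizen through MaskedTensor removes the reliance on sparsity or unreliable hacks to mimic masking. This allows sparsity and masking to be developed independently, while still enabling sparse semantics such as the one shown here.

Further Reading

To continue learning more, you can find our final review (for now) on MaskedTensor Advanced Semantics, which covers some of the differences in design decisions between `MaskedTensor` and NumPy's MaskedArray, as well as reduction semantics.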
