Pendulum: Writing your environment and transforms with TorchRL#

Created On: Nov 09, 2023 | Last Updated: Jan 27, 2025 | Last Verified: Nov 05, 2024

Author: Vincent Moens

Creating an environment (a simulator or an interface to a physical control system) is an integrative part of reinforcement learning and control engineering.

TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.

Simple Pendulum#

Key learnings:

  • How to design an environment in TorchRL: writing specs (input, observation and reward); implementing behavior (seeding, reset and step);

  • How to transform your environment inputs and outputs, and write your own transforms;

  • How to use TensorDict to carry arbitrary data structures through the codebase.

Along the way, we will touch three crucial components of TorchRL: environments, transforms, and models (policy and value function).

To get a sense of what can be achieved with TorchRL's environments, we will design a *stateless* environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, together with the action taken. TorchRL supports both kinds of environments, but stateless environments are more generic and therefore cover a broader range of features of the environment API in TorchRL.

Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage, or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.

Another advantage of stateless environments is that they enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.
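
To make the distinction concrete, here is a minimal, torch-free sketch (hypothetical toy classes, not TorchRL API) of the two styles: a stateful simulator that owns its state, and a stateless one where the caller passes the state in at every step.

```python
class StatefulPendulum:
    """Keeps the physical state internally; ``step`` only needs the action."""

    def __init__(self, th=0.0, thdot=0.0):
        self.th, self.thdot = th, thdot

    def step(self, action, dt=0.05):
        # toy dynamics: the action directly accelerates the pendulum
        self.thdot = self.thdot + action * dt
        self.th = self.th + self.thdot * dt
        return self.th, self.thdot


def stateless_step(th, thdot, action, dt=0.05):
    """Pure function of (state, action): the caller owns and passes the state."""
    new_thdot = thdot + action * dt
    new_th = th + new_thdot * dt
    return new_th, new_thdot
```

Because ``stateless_step`` has no hidden state, it can be called on any state at any time, or vectorized over many states at once, which is exactly what the TensorDict-based environment below exploits.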

This tutorial is structured as follows:

  • We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and, finally, its specs.

  • After having coded our simulator, we will demonstrate how it can be used during training with transforms.

  • We will explore new avenues that follow from the TorchRL API, including: the possibility of transforming inputs, the vectorized execution of the simulation, and the possibility of backpropagation through the simulation graph.

  • Finally, we will train a simple policy to solve the system we implemented.

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

There are four things you must take care of when designing a new environment class:

  • EnvBase._reset(), which codes for the resetting of the simulator to a (potentially random) initial state;

  • EnvBase._step(), which codes for the state transition dynamics;

  • EnvBase._set_seed(), which implements the seeding mechanism;

  • the environment specs.

Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in the upward position (angular position of 0 by convention) and to have it standing still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that constitutes our objective function.

For the motion equation, we will update the angular velocity as follows:

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

We define our reward as

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

This reward will be maximized when the angle is close to 0 (pendulum in the upward position), the angular velocity is close to 0 (no motion), and the torque is 0 too.
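
As a sanity check on the two update equations and the reward, here is a scalar sketch in plain Python, using the default parameter values (g=10, m=1, l=1, dt=0.05) that the tutorial adopts below in gen_params():

```python
import math

G, M, L, DT = 10.0, 1.0, 1.0, 0.05  # default physical parameters

def pendulum_update(th, thdot, u):
    """One Euler step of the dynamics above, plus the associated reward."""
    new_thdot = thdot + (3 * G / (2 * L) * math.sin(th) + 3.0 / (M * L**2) * u) * DT
    new_th = th + new_thdot * DT
    reward = -(th**2 + 0.1 * thdot**2 + 0.001 * u**2)
    return new_th, new_thdot, reward
```

At the target (th=0, thdot=0, u=0) the reward reaches its maximum of 0 and the state is a fixed point of the dynamics; anywhere else the reward is negative.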

Coding the effect of an action: _step()#

The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating the action to be taken.

To facilitate reading and writing from that tensordict, and to make sure that the keys are consistent with what the library expects, the simulation part has been delegated to a private abstract method, _step(), which reads the input data from the tensordict and writes a *new* tensordict with the output data.

The _step() method should do the following:

  1. Read the input keys (such as "action") and execute the simulation based on these;

  2. Retrieve observations, done state and reward;

  3. Write the set of observations, together with the reward and done state, at the corresponding entries in a new TensorDict.

Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.
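
Schematically (a plain-dict sketch of this contract, not the actual TorchRL implementation), step() wraps _step() like this: the private method returns a fresh mapping, and the public method writes it under a "next" entry of the input, leaving the root entries untouched.

```python
def _toy_step(data):
    # private method: reads the action, returns a *new* dict
    # with observation, reward and done state
    return {
        "observation": data["observation"] + data["action"],
        "reward": -abs(data["observation"]),
        "done": False,
    }

def toy_step(data):
    # public wrapper: merge the private output under "next"
    data["next"] = _toy_step(data)
    return data
```
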

Typically, for stateful environments, this will look like this:

>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Note that the root tensordict has not changed; the only modification is the appearance of a new "next" entry that contains the new information.

In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied to it. We compute the new angular position of the pendulum, "new_th", as the previous position "th" plus the new velocity "new_thdot" over a time interval dt.

Since our goal is to turn the pendulum up and maintain it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from "upward" and/or speeds that are far from 0.

In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed, as the state needs to be read from the environment.

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
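
The same wrapping logic can be checked on plain floats (a torch-free sketch of the function above): angle_normalize maps any angle into the interval [-pi, pi), so for example 3*pi/2 becomes -pi/2.

```python
import math

def angle_normalize_scalar(x):
    # identical formula to ``angle_normalize`` above, on plain floats
    return ((x + math.pi) % (2 * math.pi)) - math.pi
```
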

Resetting the simulator: _reset()#

The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method needs to receive a command from the function that called it (for example, in multi-agent settings we may need to indicate which agents should be reset). This is why the _reset() method also expects a tensordict as input, albeit it may perfectly well be empty or None.

The parent EnvBase.reset() does some simple checks, as EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what the specs expect.

For us, the only important thing to consider is whether EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".

In this example, we do not pass a done state, as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out

Environment metadata: env.*_spec#

The specs define the input and output domains of the environment. It is important that the specs accurately describe the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).

There are four specs that we must code in our environment:

  • EnvBase.observation_spec: This will be a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs);

  • EnvBase.action_spec: It can be any type of spec, but it is required to correspond to the "action" entry in the input tensordict;

  • EnvBase.reward_spec: provides information about the reward space;

  • EnvBase.done_spec: provides information about the space of the done flag.

TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec, containing the action, and state_spec, containing everything else), and output_spec, which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec, but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be modified directly.

In other words, observation_spec and the related properties are convenient shortcuts to the content of the output and input spec containers.

TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.

Specs shape#

The leading dimensions of the environment specs must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be coded accurately in stateful settings.

For non batch-locked environments, such as the one in our example (see below), this is irrelevant, as the environment batch size will most likely be empty.

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` will be called supported
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

Reproducible experiments: seeding#

Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with a different pseudo-random and reproducible seed.

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
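
The contract can be illustrated with a torch-free sketch (a hypothetical toy class, not the TorchRL API): _set_seed only stores a seeded generator, and a later _reset draws from it, so two environments seeded identically produce identical initial states.

```python
import random

class ToySeededEnv:
    def _set_seed(self, seed):
        # only store the generator; do not touch the environment state
        self.rng = random.Random(seed)

    def _reset(self):
        # draw a random initial angle in [-pi, pi) from the stored generator
        return self.rng.uniform(-3.1416, 3.1416)
```

TorchRL's EnvBase.set_seed() builds on this idea to derive distinct, reproducible seeds when several environments run together.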

Wrapping things together: the EnvBase class#

We can finally put the pieces together and design our environment class. The spec initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().

We add a static method, PendulumEnv.gen_params(), which deterministically generates a set of hyperparameters to be used during execution:

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

We define the environment as non batch_locked by turning the homonymous attribute to False. This means that we will **not** enforce the input tensordict to have a batch-size that matches the one of the environment.

The following code will just put together the pieces we have coded above.

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_step and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

Testing our environment#

TorchRL provides a simple function, check_env_specs(), to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us give it a try:

env = PendulumEnv()
check_env_specs(env)

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The BoundedTensorSpec has been deprecated and will be removed in v0.8. Please use Bounded instead.

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The UnboundedContinuousTensorSpec has been deprecated and will be removed in v0.8. Please use Unbounded instead.

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The CompositeSpec has been deprecated and will be removed in v0.8. Please use Composite instead.

2025-08-07 18:30:44,422 [torchrl][INFO]    check_env_specs succeeded! [END]

We can have a look at our specs to get a visual representation of the environment signature:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
state_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

We can also execute a couple of commands to check that the output structure matches what is expected.

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

We can run env.rand_step() to generate an action randomly from the action_spec domain. Since our environment is stateless, a tensordict containing the hyperparameters and the current state **must** be passed. In stateful contexts, env.rand_step() works perfectly too.

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

Transforming an environment#

Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that will need to be read at the following iteration requires applying the inverse transform before the next call to step(). This is the ideal scenario to showcase all the features of TorchRL's transforms!

For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv so that they are squeezed back to their original shape once they are passed as input in the next iteration.

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)

Writing custom transforms#

TorchRL's transforms may not cover all the operations one wants to execute after an environment has been run. Writing a transform does not require much effort. As with the environment design, there are two steps in writing a transform:

  • getting the dynamics right (forward and inverse);

  • adapting the environment specs.

A transform can be used in two settings: on its own, it can be used as a Module. It can also be appended to a TransformedEnv. The structure of the class allows customizing the behavior in these different contexts.

A Transform skeleton can be summarized as follows:

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries indicated by Transform.out_keys if provided (if not, the in_keys will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will be executed, but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys. The following figure summarizes this flow for environments and replay buffers.

Transform API

In some cases, a transform will not work on a subset of keys in a unitary manner, but will instead execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be rewritten, and the _apply_transform() method can be skipped.
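
In plain-dict terms (a sketch of the dispatch logic only, not the real implementation), the forward pass over in_keys/out_keys looks like this:

```python
class ToyTransform:
    def __init__(self, in_keys, out_keys=None):
        self.in_keys = in_keys
        # if no out_keys are given, in_keys are updated in place
        self.out_keys = out_keys if out_keys is not None else in_keys

    def _apply_transform(self, value):
        # placeholder elementwise operation (e.g. sin/cos in the example below)
        return value * 2

    def forward(self, data):
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            data[out_key] = self._apply_transform(data[in_key])
        return data
```
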

Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us for learning a policy than the raw angle value:

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> None:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> None:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))

Concatenating the observations onto an "observation" entry: del_keys=False ensures that we keep these values for the next iteration.

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

Once more, let us check that our environment specs match what is received:

check_env_specs(env)

2025-08-07 18:30:44,458 [torchrl][INFO]    check_env_specs succeeded! [END]

Executing a rollout#

Executing a rollout is a succession of simple steps:

  • Reset the environment

  • While some condition is not met:

    • Compute an action given a policy

    • Execute a step given this action

    • Collect the data

    • Make a MDP step

  • Gather the data and return

These operations have been conveniently wrapped in the rollout() method, of which we provide a simplified version here below.

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)

Batched computations#

The last unexplored part of this tutorial is TorchRL's ability to compute over batches. Because our environment makes no assumptions about the shape of the input data, we can seamlessly execute it over batches of data. Even better: for non-batch-locked environments such as our pendulum, the batch size can be changed on the fly without recreating the environment. To do this, we simply generate parameters with the desired shape.
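Since the dynamics are implemented as plain tensor algebra, the same update rule works unchanged over a leading batch dimension. As a standalone illustration (independent of the environment class; the parameter values and variable names below are illustrative, not the environment defaults), the angular-velocity update from the beginning of this tutorial can be applied to 10 pendulums at once:

```python
import torch

# Euler update of the pendulum's angular velocity for 10 systems at once.
# g, m, l, dt and the torque u are illustrative values only.
batch = 10
th = torch.full((batch,), torch.pi)   # angle, one entry per environment
thdot = torch.zeros(batch)            # angular velocity
u = torch.zeros(batch)                # applied torque
g, m, l, dt = 10.0, 1.0, 1.0, 0.05

new_thdot = thdot + (3 * g / (2 * l) * th.sin() + 3.0 / (m * l**2) * u) * dt
new_th = th + new_thdot * dt
print(new_th.shape)  # the leading batch dimension is preserved
```

The same algebra works for any leading shape, which is exactly what the environment exploits below.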

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)

Executing a rollout with a batch of data requires us to reset the environment outside of the rollout function, since the batch_size must be defined dynamically and this is not supported by rollout():

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)

Training a simple policy#

In this example, we will train a simple policy using the reward as a differentiable objective, such as a negated loss. We will take advantage of the fact that our dynamic system is fully differentiable to backpropagate through the trajectory return and adjust the weights of the policy to maximize this value directly. Of course, in many settings the assumptions we make here, such as a differentiable system and full access to the underlying mechanics, do not hold.

Still, this is a very simple example that shows how a training loop can be coded with a custom environment in TorchRL.

Let us first write the policy network:

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

and our optimizer:
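The optimizer definition itself is missing at this point in the page, although the training loop below uses ``optim``. A minimal sketch, assuming Adam with a learning rate of 2e-3 as in the upstream English version of this tutorial (in the tutorial, ``policy.parameters()`` would be passed):

```python
import torch
from torch import nn

# Stand-in module so this snippet runs on its own; in the tutorial,
# ``policy.parameters()`` would be passed instead of ``net.parameters()``.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

# Assumption: Adam with lr=2e-3, matching the upstream Pendulum tutorial.
optim = torch.optim.Adam(net.parameters(), lr=2e-3)
```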

Training loop#

We will successively:

  • generate a trajectory

  • sum the rewards

  • backpropagate through the graph defined by these operations

  • clip the gradient norm and take an optimization step

  • repeat

At the end of the training loop, we should have a final reward close to 0, which demonstrates that the pendulum is upward and still as desired.

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
(plot output: "returns" and "last reward" per iteration)
  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<03:23,  3.07it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<02:41,  3.87it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<02:26,  4.25it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   1%|          | 4/625 [00:00<02:18,  4.47it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 5/625 [00:01<02:15,  4.59it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 6/625 [00:01<02:13,  4.65it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 7/625 [00:01<02:11,  4.70it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|▏         | 8/625 [00:01<02:10,  4.74it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 9/625 [00:01<02:09,  4.78it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   2%|▏         | 10/625 [00:02<02:08,  4.80it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 11/625 [00:02<02:07,  4.80it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 12/625 [00:02<02:07,  4.81it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 13/625 [00:02<02:07,  4.81it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 14/625 [00:03<02:06,  4.82it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 15/625 [00:03<02:06,  4.83it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.903:   3%|▎         | 16/625 [00:03<02:05,  4.83it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.918:   3%|▎         | 17/625 [00:03<02:05,  4.83it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8319:   3%|▎         | 18/625 [00:03<02:06,  4.81it/s]
reward: -6.3221, last reward: -6.5577, gradient norm:  0.8415:   3%|▎         | 19/625 [00:04<02:05,  4.82it/s]
reward: -6.3229, last reward: -8.3322, gradient norm:  27.31:   3%|▎         | 20/625 [00:04<02:05,  4.81it/s]
reward: -6.0258, last reward: -8.0581, gradient norm:  12.32:   3%|▎         | 21/625 [00:04<02:05,  4.83it/s]
reward: -5.7295, last reward: -6.7230, gradient norm:  24.23:   4%|▎         | 22/625 [00:04<02:04,  4.83it/s]
reward: -6.0265, last reward: -6.6077, gradient norm:  52.82:   4%|▎         | 23/625 [00:04<02:04,  4.84it/s]
reward: -6.1081, last reward: -6.1347, gradient norm:  31.16:   4%|▍         | 24/625 [00:05<02:04,  4.84it/s]
reward: -5.5231, last reward: -4.8435, gradient norm:  11.51:   4%|▍         | 25/625 [00:05<02:04,  4.84it/s]
reward: -5.5310, last reward: -6.5397, gradient norm:  13.18:   4%|▍         | 26/625 [00:05<02:03,  4.84it/s]
reward: -5.6382, last reward: -4.8204, gradient norm:  10.72:   4%|▍         | 27/625 [00:05<02:03,  4.84it/s]
reward: -5.8162, last reward: -5.1618, gradient norm:  10.44:   4%|▍         | 28/625 [00:05<02:03,  4.84it/s]
reward: -6.1180, last reward: -5.4640, gradient norm:  7.744:   5%|▍         | 29/625 [00:06<02:03,  4.84it/s]
reward: -5.8759, last reward: -5.7826, gradient norm:  1.796:   5%|▍         | 30/625 [00:06<02:03,  4.82it/s]
reward: -5.8296, last reward: -6.4808, gradient norm:  2.25:   5%|▍         | 31/625 [00:06<02:03,  4.83it/s]
reward: -5.7578, last reward: -7.5124, gradient norm:  30.52:   5%|▌         | 32/625 [00:06<02:02,  4.83it/s]
reward: -5.9313, last reward: -7.5212, gradient norm:  7.652:   5%|▌         | 33/625 [00:06<02:02,  4.83it/s]
reward: -6.0223, last reward: -6.6343, gradient norm:  4.224:   5%|▌         | 34/625 [00:07<02:25,  4.07it/s]
reward: -6.2886, last reward: -5.1441, gradient norm:  3.539:   6%|▌         | 35/625 [00:07<02:18,  4.26it/s]
reward: -6.1060, last reward: -7.1638, gradient norm:  2.407:   6%|▌         | 36/625 [00:07<02:13,  4.41it/s]
reward: -6.2230, last reward: -5.2917, gradient norm:  5.425:   6%|▌         | 37/625 [00:07<02:09,  4.52it/s]
reward: -6.2950, last reward: -6.2126, gradient norm:  6.035:   6%|▌         | 38/625 [00:08<02:07,  4.60it/s]
reward: -5.9786, last reward: -5.8757, gradient norm:  2.098:   6%|▌         | 39/625 [00:08<02:06,  4.64it/s]
reward: -6.0730, last reward: -5.1952, gradient norm:  3.982:   6%|▋         | 40/625 [00:08<02:04,  4.69it/s]
reward: -5.9481, last reward: -5.7122, gradient norm:  4.42:   7%|▋         | 41/625 [00:08<02:03,  4.73it/s]
reward: -6.0875, last reward: -6.7567, gradient norm:  7.728:   7%|▋         | 42/625 [00:08<02:02,  4.75it/s]
reward: -5.6301, last reward: -6.2249, gradient norm:  9.824:   7%|▋         | 43/625 [00:09<02:02,  4.76it/s]
reward: -5.5281, last reward: -5.7749, gradient norm:  7.223:   7%|▋         | 44/625 [00:09<02:01,  4.78it/s]
reward: -5.5904, last reward: -5.0048, gradient norm:  11.73:   7%|▋         | 45/625 [00:09<02:01,  4.79it/s]
reward: -5.7882, last reward: -4.8660, gradient norm:  2.094:   7%|▋         | 46/625 [00:09<02:00,  4.80it/s]
reward: -5.8592, last reward: -4.4848, gradient norm:  30.4:   8%|▊         | 47/625 [00:10<02:00,  4.81it/s]
reward: -5.3849, last reward: -3.5828, gradient norm:  2.244:   8%|▊         | 48/625 [00:10<02:00,  4.80it/s]
reward: -5.5785, last reward: -2.4216, gradient norm:  0.8946:   8%|▊         | 49/625 [00:10<01:59,  4.81it/s]
reward: -5.4433, last reward: -3.4306, gradient norm:  16.48:   8%|▊         | 50/625 [00:10<01:59,  4.81it/s]
reward: -5.5546, last reward: -5.3443, gradient norm:  8.319:   8%|▊         | 51/625 [00:10<01:59,  4.80it/s]
reward: -5.5681, last reward: -7.5266, gradient norm:  5.593:   8%|▊         | 52/625 [00:11<01:59,  4.81it/s]
reward: -5.6418, last reward: -8.1904, gradient norm:  12.34:   8%|▊         | 53/625 [00:11<01:58,  4.81it/s]
reward: -5.6517, last reward: -8.3856, gradient norm:  4.565:   9%|▊         | 54/625 [00:11<01:58,  4.82it/s]
reward: -5.9653, last reward: -8.4339, gradient norm:  12.73:   9%|▉         | 55/625 [00:11<01:58,  4.82it/s]
reward: -6.0832, last reward: -8.9027, gradient norm:  6.07:   9%|▉         | 56/625 [00:11<01:58,  4.81it/s]
reward: -6.2454, last reward: -8.9134, gradient norm:  9.312:   9%|▉         | 57/625 [00:12<01:58,  4.81it/s]
reward: -6.1343, last reward: -9.4171, gradient norm:  16.74:   9%|▉         | 58/625 [00:12<01:57,  4.82it/s]
reward: -5.7796, last reward: -11.1745, gradient norm:  20.83:   9%|▉         | 59/625 [00:12<01:57,  4.82it/s]
reward: -5.4783, last reward: -6.2441, gradient norm:  8.777:  10%|▉         | 60/625 [00:12<01:57,  4.83it/s]
reward: -5.5816, last reward: -4.1932, gradient norm:  6.328:  10%|▉         | 61/625 [00:12<01:56,  4.84it/s]
reward: -5.6604, last reward: -4.1629, gradient norm:  3.516:  10%|▉         | 62/625 [00:13<01:56,  4.84it/s]
reward: -5.4195, last reward: -5.1296, gradient norm:  8.378:  10%|█         | 63/625 [00:13<01:56,  4.83it/s]
reward: -5.5165, last reward: -3.0986, gradient norm:  17.72:  10%|█         | 64/625 [00:13<01:56,  4.83it/s]
reward: -5.5596, last reward: -4.2442, gradient norm:  11.38:  10%|█         | 65/625 [00:13<01:55,  4.84it/s]
reward: -5.9834, last reward: -6.0432, gradient norm:  8.038:  11%|█         | 66/625 [00:13<01:55,  4.84it/s]
reward: -5.7958, last reward: -5.1525, gradient norm:  8.564:  11%|█         | 67/625 [00:14<01:55,  4.84it/s]
reward: -5.8544, last reward: -5.2747, gradient norm:  7.632:  11%|█         | 68/625 [00:14<01:55,  4.84it/s]
reward: -5.3922, last reward: -4.5267, gradient norm:  18.13:  11%|█         | 69/625 [00:14<01:54,  4.85it/s]
reward: -5.0917, last reward: -3.3025, gradient norm:  2.33:  11%|█         | 70/625 [00:14<01:54,  4.84it/s]
reward: -5.0968, last reward: -6.1214, gradient norm:  11.27:  11%|█▏        | 71/625 [00:14<01:54,  4.84it/s]
reward: -5.2523, last reward: -4.0580, gradient norm:  22.2:  12%|█▏        | 72/625 [00:15<01:54,  4.83it/s]
reward: -5.4829, last reward: -6.6886, gradient norm:  12.37:  12%|█▏        | 73/625 [00:15<01:54,  4.83it/s]
reward: -5.7293, last reward: -9.4615, gradient norm:  15.07:  12%|█▏        | 74/625 [00:15<01:54,  4.82it/s]
reward: -5.7735, last reward: -9.0859, gradient norm:  892.4:  12%|█▏        | 75/625 [00:15<01:54,  4.81it/s]
reward: -6.1616, last reward: -9.2996, gradient norm:  9.569:  12%|█▏        | 76/625 [00:16<01:54,  4.81it/s]
reward: -6.2202, last reward: -9.3199, gradient norm:  8.919:  12%|█▏        | 77/625 [00:16<01:53,  4.81it/s]
reward: -6.1349, last reward: -9.9361, gradient norm:  10.06:  12%|█▏        | 78/625 [00:16<01:53,  4.80it/s]
reward: -6.0374, last reward: -10.4791, gradient norm:  45.37:  13%|█▎        | 79/625 [00:16<01:54,  4.77it/s]
reward: -5.6990, last reward: -9.0426, gradient norm:  32.93:  13%|█▎        | 80/625 [00:16<01:54,  4.78it/s]
reward: -5.3303, last reward: -4.9148, gradient norm:  307.4:  13%|█▎        | 81/625 [00:17<01:53,  4.79it/s]
reward: -5.2291, last reward: -3.3632, gradient norm:  2.828:  13%|█▎        | 82/625 [00:17<01:53,  4.79it/s]
reward: -5.0228, last reward: -3.1018, gradient norm:  32.56:  13%|█▎        | 83/625 [00:17<01:53,  4.79it/s]
reward: -5.0364, last reward: -3.8503, gradient norm:  8.948:  13%|█▎        | 84/625 [00:17<01:52,  4.80it/s]
reward: -4.9341, last reward: -6.9319, gradient norm:  119.2:  14%|█▎        | 85/625 [00:17<01:52,  4.81it/s]
reward: -5.0693, last reward: -6.4436, gradient norm:  5.28:  14%|█▍        | 86/625 [00:18<01:51,  4.83it/s]
reward: -4.9258, last reward: -6.0461, gradient norm:  4.376:  14%|█▍        | 87/625 [00:18<01:51,  4.83it/s]
reward: -4.9910, last reward: -4.5681, gradient norm:  25.14:  14%|█▍        | 88/625 [00:18<01:51,  4.84it/s]
reward: -5.1716, last reward: -5.3157, gradient norm:  15.5:  14%|█▍        | 89/625 [00:18<01:50,  4.83it/s]
reward: -4.9816, last reward: -3.5950, gradient norm:  7.403:  14%|█▍        | 90/625 [00:18<01:50,  4.83it/s]
reward: -4.7252, last reward: -4.8815, gradient norm:  10.07:  15%|█▍        | 91/625 [00:19<01:50,  4.84it/s]
reward: -4.9986, last reward: -5.8680, gradient norm:  14.26:  15%|█▍        | 92/625 [00:19<01:50,  4.83it/s]
reward: -4.9029, last reward: -5.7132, gradient norm:  21.65:  15%|█▍        | 93/625 [00:19<01:50,  4.84it/s]
reward: -4.7814, last reward: -6.5231, gradient norm:  27.4:  15%|█▌        | 94/625 [00:19<01:50,  4.82it/s]
reward: -4.7013, last reward: -6.0821, gradient norm:  22.53:  15%|█▌        | 95/625 [00:19<01:49,  4.82it/s]
reward: -4.3526, last reward: -5.3718, gradient norm:  28.77:  15%|█▌        | 96/625 [00:20<01:49,  4.83it/s]
reward: -5.0901, last reward: -5.0493, gradient norm:  8.428:  16%|█▌        | 97/625 [00:20<01:49,  4.83it/s]
reward: -4.9341, last reward: -4.0375, gradient norm:  17.1:  16%|█▌        | 98/625 [00:20<01:49,  4.83it/s]
reward: -5.0707, last reward: -5.9903, gradient norm:  12.01:  16%|█▌        | 99/625 [00:20<01:48,  4.83it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 99/625 [00:20<01:48,  4.83it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 100/625 [00:20<01:48,  4.84it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 100/625 [00:21<01:48,  4.84it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 101/625 [00:21<01:48,  4.84it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▌        | 101/625 [00:21<01:48,  4.84it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▋        | 102/625 [00:21<01:48,  4.83it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 102/625 [00:21<01:48,  4.83it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 103/625 [00:21<01:47,  4.83it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  16%|█▋        | 103/625 [00:21<01:47,  4.83it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  17%|█▋        | 104/625 [00:21<01:47,  4.83it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 104/625 [00:22<01:47,  4.83it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 105/625 [00:22<01:47,  4.83it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 105/625 [00:22<01:47,  4.83it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 106/625 [00:22<01:47,  4.84it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 106/625 [00:22<01:47,  4.84it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 107/625 [00:22<01:47,  4.84it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 107/625 [00:22<01:47,  4.84it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 108/625 [00:22<01:46,  4.83it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 108/625 [00:22<01:46,  4.83it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 109/625 [00:22<01:46,  4.83it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  17%|█▋        | 109/625 [00:23<01:46,  4.83it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  18%|█▊        | 110/625 [00:23<01:46,  4.84it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 110/625 [00:23<01:46,  4.84it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 111/625 [00:23<01:46,  4.83it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 111/625 [00:23<01:46,  4.83it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 112/625 [00:23<01:46,  4.84it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 112/625 [00:23<01:46,  4.84it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 113/625 [00:23<01:45,  4.83it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 113/625 [00:23<01:45,  4.83it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 114/625 [00:23<01:45,  4.84it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 114/625 [00:24<01:45,  4.84it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 115/625 [00:24<01:45,  4.83it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  18%|█▊        | 115/625 [00:24<01:45,  4.83it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  19%|█▊        | 116/625 [00:24<01:45,  4.84it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 116/625 [00:24<01:45,  4.84it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 117/625 [00:24<01:45,  4.84it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▊        | 117/625 [00:24<01:45,  4.84it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▉        | 118/625 [00:24<01:44,  4.83it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 118/625 [00:24<01:44,  4.83it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 119/625 [00:24<01:44,  4.83it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 119/625 [00:25<01:44,  4.83it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 120/625 [00:25<01:44,  4.83it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 120/625 [00:25<01:44,  4.83it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 121/625 [00:25<01:44,  4.83it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  19%|█▉        | 121/625 [00:25<01:44,  4.83it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  20%|█▉        | 122/625 [00:25<01:44,  4.83it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 122/625 [00:25<01:44,  4.83it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 123/625 [00:25<01:44,  4.82it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 123/625 [00:25<01:44,  4.82it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 124/625 [00:25<01:44,  4.82it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|█▉        | 124/625 [00:26<01:44,  4.82it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|██        | 125/625 [00:26<01:43,  4.82it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 125/625 [00:26<01:43,  4.82it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 126/625 [00:26<01:43,  4.83it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 126/625 [00:26<01:43,  4.83it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 127/625 [00:26<01:43,  4.83it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 127/625 [00:26<01:43,  4.83it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 128/625 [00:26<01:42,  4.83it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  20%|██        | 128/625 [00:26<01:42,  4.83it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  21%|██        | 129/625 [00:26<01:42,  4.83it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 129/625 [00:27<01:42,  4.83it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 130/625 [00:27<01:42,  4.84it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 130/625 [00:27<01:42,  4.84it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 131/625 [00:27<01:42,  4.83it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 131/625 [00:27<01:42,  4.83it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 132/625 [00:27<01:42,  4.83it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██        | 132/625 [00:27<01:42,  4.83it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██▏       | 133/625 [00:27<01:41,  4.82it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 133/625 [00:28<01:41,  4.82it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 134/625 [00:28<01:41,  4.83it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  21%|██▏       | 134/625 [00:28<01:41,  4.83it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  22%|██▏       | 135/625 [00:28<01:41,  4.82it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 135/625 [00:28<01:41,  4.82it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 136/625 [00:28<02:00,  4.06it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 136/625 [00:28<02:00,  4.06it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 137/625 [00:28<01:54,  4.26it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 137/625 [00:28<01:54,  4.26it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 138/625 [00:28<01:50,  4.42it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 138/625 [00:29<01:50,  4.42it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 139/625 [00:29<01:47,  4.53it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 139/625 [00:29<01:47,  4.53it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 140/625 [00:29<01:45,  4.60it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  22%|██▏       | 140/625 [00:29<01:45,  4.60it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  23%|██▎       | 141/625 [00:29<01:43,  4.66it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 141/625 [00:29<01:43,  4.66it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 142/625 [00:29<01:42,  4.71it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 142/625 [00:30<01:42,  4.71it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 143/625 [00:30<01:41,  4.74it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 143/625 [00:30<01:41,  4.74it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 144/625 [00:30<01:41,  4.76it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 144/625 [00:30<01:41,  4.76it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 145/625 [00:30<01:40,  4.77it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 145/625 [00:30<01:40,  4.77it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 146/625 [00:30<01:40,  4.78it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  23%|██▎       | 146/625 [00:30<01:40,  4.78it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  24%|██▎       | 147/625 [00:30<01:39,  4.80it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 147/625 [00:31<01:39,  4.80it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 148/625 [00:31<01:39,  4.81it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▎       | 148/625 [00:31<01:39,  4.81it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▍       | 149/625 [00:31<01:38,  4.81it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 149/625 [00:31<01:38,  4.81it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 150/625 [00:31<01:38,  4.82it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 150/625 [00:31<01:38,  4.82it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 151/625 [00:31<01:38,  4.81it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 151/625 [00:31<01:38,  4.81it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 152/625 [00:31<01:38,  4.82it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 152/625 [00:32<01:38,  4.82it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 153/625 [00:32<01:37,  4.83it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  24%|██▍       | 153/625 [00:32<01:37,  4.83it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  25%|██▍       | 154/625 [00:32<01:37,  4.83it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 154/625 [00:32<01:37,  4.83it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 155/625 [00:32<01:37,  4.84it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 155/625 [00:32<01:37,  4.84it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 156/625 [00:32<01:36,  4.84it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▍       | 156/625 [00:32<01:36,  4.84it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▌       | 157/625 [00:32<01:36,  4.84it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 157/625 [00:33<01:36,  4.84it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 158/625 [00:33<01:36,  4.85it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 158/625 [00:33<01:36,  4.85it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 159/625 [00:33<01:36,  4.83it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  25%|██▌       | 159/625 [00:33<01:36,  4.83it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  26%|██▌       | 160/625 [00:33<01:36,  4.84it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 160/625 [00:33<01:36,  4.84it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 161/625 [00:33<01:35,  4.83it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 161/625 [00:33<01:35,  4.83it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 162/625 [00:33<01:35,  4.84it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 162/625 [00:34<01:35,  4.84it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 163/625 [00:34<01:35,  4.85it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 163/625 [00:34<01:35,  4.85it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 164/625 [00:34<01:35,  4.84it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▌       | 164/625 [00:34<01:35,  4.84it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▋       | 165/625 [00:34<01:34,  4.85it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  26%|██▋       | 165/625 [00:34<01:34,  4.85it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  27%|██▋       | 166/625 [00:34<01:34,  4.84it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 166/625 [00:34<01:34,  4.84it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 167/625 [00:34<01:34,  4.84it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 167/625 [00:35<01:34,  4.84it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 168/625 [00:35<01:34,  4.83it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 168/625 [00:35<01:34,  4.83it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 169/625 [00:35<01:34,  4.83it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 169/625 [00:35<01:34,  4.83it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 170/625 [00:35<01:34,  4.83it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 170/625 [00:35<01:34,  4.83it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 171/625 [00:35<01:33,  4.83it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  27%|██▋       | 171/625 [00:36<01:33,  4.83it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  28%|██▊       | 172/625 [00:36<01:33,  4.83it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 172/625 [00:36<01:33,  4.83it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 173/625 [00:36<01:33,  4.83it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 173/625 [00:36<01:33,  4.83it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 174/625 [00:36<01:33,  4.82it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 174/625 [00:36<01:33,  4.82it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 175/625 [00:36<01:33,  4.83it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 175/625 [00:36<01:33,  4.83it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 176/625 [00:36<01:33,  4.81it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 176/625 [00:37<01:33,  4.81it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 177/625 [00:37<01:33,  4.81it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 177/625 [00:37<01:33,  4.81it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 178/625 [00:37<01:32,  4.81it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  28%|██▊       | 178/625 [00:37<01:32,  4.81it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  29%|██▊       | 179/625 [00:37<01:32,  4.81it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▊       | 179/625 [00:37<01:32,  4.81it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▉       | 180/625 [00:37<01:32,  4.82it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 180/625 [00:37<01:32,  4.82it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 181/625 [00:37<01:31,  4.83it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 181/625 [00:38<01:31,  4.83it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 182/625 [00:38<01:32,  4.81it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 182/625 [00:38<01:32,  4.81it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 183/625 [00:38<01:31,  4.81it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 183/625 [00:38<01:31,  4.81it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 184/625 [00:38<01:31,  4.82it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  29%|██▉       | 184/625 [00:38<01:31,  4.82it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  30%|██▉       | 185/625 [00:38<01:31,  4.81it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 185/625 [00:38<01:31,  4.81it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 186/625 [00:38<01:31,  4.82it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 186/625 [00:39<01:31,  4.82it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 187/625 [00:39<01:30,  4.83it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|██▉       | 187/625 [00:39<01:30,  4.83it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|███       | 188/625 [00:39<01:30,  4.81it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 188/625 [00:39<01:30,  4.81it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 189/625 [00:39<01:30,  4.82it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 189/625 [00:39<01:30,  4.82it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 190/625 [00:39<01:30,  4.82it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  30%|███       | 190/625 [00:39<01:30,  4.82it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  31%|███       | 191/625 [00:39<01:29,  4.83it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 191/625 [00:40<01:29,  4.83it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 192/625 [00:40<01:29,  4.82it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 192/625 [00:40<01:29,  4.82it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 193/625 [00:40<01:29,  4.82it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 193/625 [00:40<01:29,  4.82it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 194/625 [00:40<01:29,  4.81it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 194/625 [00:40<01:29,  4.81it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 195/625 [00:40<01:29,  4.80it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███       | 195/625 [00:41<01:29,  4.80it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███▏      | 196/625 [00:41<01:29,  4.82it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  31%|███▏      | 196/625 [00:41<01:29,  4.82it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  32%|███▏      | 197/625 [00:41<01:28,  4.82it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 197/625 [00:41<01:28,  4.82it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 198/625 [00:41<01:28,  4.81it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 198/625 [00:41<01:28,  4.81it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 199/625 [00:41<01:28,  4.82it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 199/625 [00:41<01:28,  4.82it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 200/625 [00:41<01:28,  4.81it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 200/625 [00:42<01:28,  4.81it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 201/625 [00:42<01:27,  4.82it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 201/625 [00:42<01:27,  4.82it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 202/625 [00:42<01:27,  4.82it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 202/625 [00:42<01:27,  4.82it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 203/625 [00:42<01:27,  4.81it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  32%|███▏      | 203/625 [00:42<01:27,  4.81it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  33%|███▎      | 204/625 [00:42<01:27,  4.81it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 204/625 [00:42<01:27,  4.81it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 205/625 [00:42<01:27,  4.81it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 205/625 [00:43<01:27,  4.81it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 206/625 [00:43<01:27,  4.81it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 206/625 [00:43<01:27,  4.81it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 207/625 [00:43<01:27,  4.80it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 207/625 [00:43<01:27,  4.80it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 208/625 [00:43<01:26,  4.81it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 208/625 [00:43<01:26,  4.81it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 209/625 [00:43<01:26,  4.82it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  33%|███▎      | 209/625 [00:43<01:26,  4.82it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  34%|███▎      | 210/625 [00:43<01:25,  4.83it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▎      | 210/625 [00:44<01:25,  4.83it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▍      | 211/625 [00:44<01:25,  4.82it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 211/625 [00:44<01:25,  4.82it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 212/625 [00:44<01:25,  4.82it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 212/625 [00:44<01:25,  4.82it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 213/625 [00:44<01:25,  4.83it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 213/625 [00:44<01:25,  4.83it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 214/625 [00:44<01:25,  4.83it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 214/625 [00:44<01:25,  4.83it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 215/625 [00:44<01:24,  4.83it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  34%|███▍      | 215/625 [00:45<01:24,  4.83it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  35%|███▍      | 216/625 [00:45<01:24,  4.83it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 216/625 [00:45<01:24,  4.83it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 217/625 [00:45<01:24,  4.82it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 217/625 [00:45<01:24,  4.82it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 218/625 [00:45<01:24,  4.81it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▍      | 218/625 [00:45<01:24,  4.81it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▌      | 219/625 [00:45<01:24,  4.82it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 219/625 [00:45<01:24,  4.82it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 220/625 [00:45<01:24,  4.82it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 220/625 [00:46<01:24,  4.82it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 221/625 [00:46<01:23,  4.82it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  35%|███▌      | 221/625 [00:46<01:23,  4.82it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  36%|███▌      | 222/625 [00:46<01:23,  4.81it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 222/625 [00:46<01:23,  4.81it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 223/625 [00:46<01:23,  4.82it/s]
reward: -4.4639, last reward: -5.2648, gradient norm:  24.71:  36%|███▌      | 223/625 [00:46<01:23,  4.82it/s]
reward: -4.4639, last reward: -5.2648, gradient norm:  24.71:  36%|███▌      | 224/625 [00:46<01:23,  4.83it/s]
reward: -4.6842, last reward: -4.6974, gradient norm:  14.15:  36%|███▌      | 224/625 [00:47<01:23,  4.83it/s]
reward: -4.0712, last reward: -4.1515, gradient norm:  7.923:  36%|███▋      | 227/625 [00:47<01:22,  4.83it/s]
reward: -3.5453, last reward: -2.3132, gradient norm:  10.89:  38%|███▊      | 239/625 [00:50<01:23,  4.61it/s]
reward: -3.3707, last reward: -1.6766, gradient norm:  89.46:  42%|████▏     | 260/625 [00:54<01:15,  4.83it/s]
reward: -2.2784, last reward: -0.3983, gradient norm:  2.552:  42%|████▏     | 265/625 [00:55<01:14,  4.82it/s]
reward: -2.1500, last reward: -0.0078, gradient norm:  0.7977:  45%|████▍     | 281/625 [00:58<01:11,  4.83it/s]
reward: -1.8535, last reward: -0.0574, gradient norm:  2.329:  47%|████▋     | 297/625 [01:02<01:07,  4.84it/s]
reward: -2.0906, last reward: -0.0021, gradient norm:  0.7693:  51%|█████     | 316/625 [01:06<01:03,  4.84it/s]
reward: -1.8761, last reward: -0.0040, gradient norm:  0.777:  54%|█████▍    | 336/625 [01:10<01:11,  4.05it/s]
reward: -1.5239, last reward: -0.0026, gradient norm:  0.9087:  56%|█████▋    | 353/625 [01:13<00:56,  4.83it/s]
reward: -1.8944, last reward: -0.0003, gradient norm:  0.4466:  60%|██████    | 377/625 [01:18<00:51,  4.81it/s]
reward: -2.2517, last reward: -0.0091, gradient norm:  2.363:  62%|██████▏   | 385/625 [01:20<00:49,  4.83it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 385/625 [01:20<00:49,  4.83it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 386/625 [01:20<00:49,  4.83it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 386/625 [01:20<00:49,  4.83it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 387/625 [01:20<00:49,  4.83it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 387/625 [01:21<00:49,  4.83it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 388/625 [01:21<00:49,  4.84it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 388/625 [01:21<00:49,  4.84it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 389/625 [01:21<00:48,  4.84it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 389/625 [01:21<00:48,  4.84it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 390/625 [01:21<00:48,  4.83it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  62%|██████▏   | 390/625 [01:21<00:48,  4.83it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  63%|██████▎   | 391/625 [01:21<00:48,  4.83it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 391/625 [01:21<00:48,  4.83it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 392/625 [01:21<00:48,  4.83it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 392/625 [01:22<00:48,  4.83it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 393/625 [01:22<00:48,  4.82it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 393/625 [01:22<00:48,  4.82it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 394/625 [01:22<00:47,  4.82it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 394/625 [01:22<00:47,  4.82it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 395/625 [01:22<00:47,  4.83it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 395/625 [01:22<00:47,  4.83it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 396/625 [01:22<00:47,  4.83it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  63%|██████▎   | 396/625 [01:22<00:47,  4.83it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  64%|██████▎   | 397/625 [01:22<00:47,  4.83it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 397/625 [01:23<00:47,  4.83it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 398/625 [01:23<00:46,  4.83it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▎   | 398/625 [01:23<00:46,  4.83it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▍   | 399/625 [01:23<00:46,  4.84it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 399/625 [01:23<00:46,  4.84it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 400/625 [01:23<00:46,  4.84it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 400/625 [01:23<00:46,  4.84it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 401/625 [01:23<00:46,  4.84it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 401/625 [01:23<00:46,  4.84it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 402/625 [01:23<00:45,  4.85it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 402/625 [01:24<00:45,  4.85it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 403/625 [01:24<00:45,  4.85it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  64%|██████▍   | 403/625 [01:24<00:45,  4.85it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  65%|██████▍   | 404/625 [01:24<00:45,  4.85it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 404/625 [01:24<00:45,  4.85it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 405/625 [01:24<00:45,  4.85it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 405/625 [01:24<00:45,  4.85it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 406/625 [01:24<00:45,  4.83it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▍   | 406/625 [01:24<00:45,  4.83it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▌   | 407/625 [01:24<00:45,  4.83it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 407/625 [01:25<00:45,  4.83it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 408/625 [01:25<00:44,  4.83it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 408/625 [01:25<00:44,  4.83it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 409/625 [01:25<00:44,  4.83it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  65%|██████▌   | 409/625 [01:25<00:44,  4.83it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  66%|██████▌   | 410/625 [01:25<00:44,  4.82it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 410/625 [01:25<00:44,  4.82it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 411/625 [01:25<00:44,  4.81it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 411/625 [01:26<00:44,  4.81it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 412/625 [01:26<00:44,  4.82it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 412/625 [01:26<00:44,  4.82it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 413/625 [01:26<00:43,  4.83it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 413/625 [01:26<00:43,  4.83it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 414/625 [01:26<00:43,  4.84it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▌   | 414/625 [01:26<00:43,  4.84it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▋   | 415/625 [01:26<00:43,  4.83it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  66%|██████▋   | 415/625 [01:26<00:43,  4.83it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  67%|██████▋   | 416/625 [01:26<00:43,  4.83it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 416/625 [01:27<00:43,  4.83it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 417/625 [01:27<00:43,  4.83it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 417/625 [01:27<00:43,  4.83it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 418/625 [01:27<00:42,  4.83it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 418/625 [01:27<00:42,  4.83it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 419/625 [01:27<00:42,  4.83it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 419/625 [01:27<00:42,  4.83it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 420/625 [01:27<00:42,  4.81it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 420/625 [01:27<00:42,  4.81it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 421/625 [01:27<00:42,  4.81it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  67%|██████▋   | 421/625 [01:28<00:42,  4.81it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  68%|██████▊   | 422/625 [01:28<00:42,  4.82it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 422/625 [01:28<00:42,  4.82it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 423/625 [01:28<00:41,  4.82it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 423/625 [01:28<00:41,  4.82it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 424/625 [01:28<00:41,  4.82it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 424/625 [01:28<00:41,  4.82it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 425/625 [01:28<00:41,  4.81it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 425/625 [01:28<00:41,  4.81it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 426/625 [01:28<00:41,  4.81it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 426/625 [01:29<00:41,  4.81it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 427/625 [01:29<00:41,  4.82it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 427/625 [01:29<00:41,  4.82it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 428/625 [01:29<00:40,  4.82it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  68%|██████▊   | 428/625 [01:29<00:40,  4.82it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  69%|██████▊   | 429/625 [01:29<00:40,  4.81it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▊   | 429/625 [01:29<00:40,  4.81it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▉   | 430/625 [01:29<00:40,  4.79it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 430/625 [01:29<00:40,  4.79it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 431/625 [01:29<00:40,  4.81it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 431/625 [01:30<00:40,  4.81it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 432/625 [01:30<00:40,  4.80it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 432/625 [01:30<00:40,  4.80it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 433/625 [01:30<00:39,  4.80it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 433/625 [01:30<00:39,  4.80it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 434/625 [01:30<00:39,  4.82it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  69%|██████▉   | 434/625 [01:30<00:39,  4.82it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  70%|██████▉   | 435/625 [01:30<00:39,  4.81it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 435/625 [01:31<00:39,  4.81it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 436/625 [01:31<00:46,  4.06it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 436/625 [01:31<00:46,  4.06it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 437/625 [01:31<00:44,  4.25it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|██████▉   | 437/625 [01:31<00:44,  4.25it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|███████   | 438/625 [01:31<00:42,  4.41it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 438/625 [01:31<00:42,  4.41it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 439/625 [01:31<00:41,  4.52it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 439/625 [01:31<00:41,  4.52it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 440/625 [01:31<00:40,  4.60it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  70%|███████   | 440/625 [01:32<00:40,  4.60it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  71%|███████   | 441/625 [01:32<00:39,  4.67it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 441/625 [01:32<00:39,  4.67it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 442/625 [01:32<00:38,  4.71it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 442/625 [01:32<00:38,  4.71it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 443/625 [01:32<00:38,  4.73it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 443/625 [01:32<00:38,  4.73it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 444/625 [01:32<00:38,  4.75it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 444/625 [01:32<00:38,  4.75it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 445/625 [01:32<00:37,  4.77it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████   | 445/625 [01:33<00:37,  4.77it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████▏  | 446/625 [01:33<00:37,  4.79it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  71%|███████▏  | 446/625 [01:33<00:37,  4.79it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  72%|███████▏  | 447/625 [01:33<00:37,  4.80it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 447/625 [01:33<00:37,  4.80it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 448/625 [01:33<00:36,  4.83it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 448/625 [01:33<00:36,  4.83it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 449/625 [01:33<00:36,  4.82it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 449/625 [01:34<00:36,  4.82it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 450/625 [01:34<00:36,  4.83it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 450/625 [01:34<00:36,  4.83it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 451/625 [01:34<00:36,  4.83it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 451/625 [01:34<00:36,  4.83it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 452/625 [01:34<00:35,  4.83it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 452/625 [01:34<00:35,  4.83it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 453/625 [01:34<00:35,  4.83it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  72%|███████▏  | 453/625 [01:34<00:35,  4.83it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  73%|███████▎  | 454/625 [01:34<00:35,  4.83it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 454/625 [01:35<00:35,  4.83it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 455/625 [01:35<00:35,  4.84it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 455/625 [01:35<00:35,  4.84it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 456/625 [01:35<00:34,  4.84it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 456/625 [01:35<00:34,  4.84it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 457/625 [01:35<00:34,  4.84it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 457/625 [01:35<00:34,  4.84it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 458/625 [01:35<00:34,  4.84it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 458/625 [01:35<00:34,  4.84it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 459/625 [01:35<00:34,  4.83it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  73%|███████▎  | 459/625 [01:36<00:34,  4.83it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  74%|███████▎  | 460/625 [01:36<00:34,  4.83it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▎  | 460/625 [01:36<00:34,  4.83it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▍  | 461/625 [01:36<00:33,  4.83it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 461/625 [01:36<00:33,  4.83it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 462/625 [01:36<00:33,  4.83it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 462/625 [01:36<00:33,  4.83it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 463/625 [01:36<00:33,  4.82it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 463/625 [01:36<00:33,  4.82it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 464/625 [01:36<00:33,  4.82it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 464/625 [01:37<00:33,  4.82it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 465/625 [01:37<00:33,  4.82it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  74%|███████▍  | 465/625 [01:37<00:33,  4.82it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  75%|███████▍  | 466/625 [01:37<00:33,  4.81it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 466/625 [01:37<00:33,  4.81it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 467/625 [01:37<00:32,  4.81it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 467/625 [01:37<00:32,  4.81it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 468/625 [01:37<00:32,  4.81it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▍  | 468/625 [01:37<00:32,  4.81it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▌  | 469/625 [01:37<00:32,  4.82it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 469/625 [01:38<00:32,  4.82it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 470/625 [01:38<00:32,  4.83it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 470/625 [01:38<00:32,  4.83it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 471/625 [01:38<00:31,  4.82it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  75%|███████▌  | 471/625 [01:38<00:31,  4.82it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  76%|███████▌  | 472/625 [01:38<00:31,  4.81it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 472/625 [01:38<00:31,  4.81it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 473/625 [01:38<00:31,  4.82it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 473/625 [01:39<00:31,  4.82it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 474/625 [01:39<00:31,  4.81it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 474/625 [01:39<00:31,  4.81it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 475/625 [01:39<00:31,  4.81it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 475/625 [01:39<00:31,  4.81it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 476/625 [01:39<00:30,  4.82it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▌  | 476/625 [01:39<00:30,  4.82it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▋  | 477/625 [01:39<00:30,  4.82it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 477/625 [01:39<00:30,  4.82it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 478/625 [01:39<00:30,  4.82it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  76%|███████▋  | 478/625 [01:40<00:30,  4.82it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  77%|███████▋  | 479/625 [01:40<00:30,  4.83it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 479/625 [01:40<00:30,  4.83it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 480/625 [01:40<00:30,  4.83it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 480/625 [01:40<00:30,  4.83it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 481/625 [01:40<00:29,  4.83it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 481/625 [01:40<00:29,  4.83it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 482/625 [01:40<00:29,  4.82it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 482/625 [01:40<00:29,  4.82it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 483/625 [01:40<00:29,  4.83it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 483/625 [01:41<00:29,  4.83it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 484/625 [01:41<00:29,  4.83it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  77%|███████▋  | 484/625 [01:41<00:29,  4.83it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  78%|███████▊  | 485/625 [01:41<00:28,  4.83it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 485/625 [01:41<00:28,  4.83it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 486/625 [01:41<00:28,  4.83it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 486/625 [01:41<00:28,  4.83it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 487/625 [01:41<00:28,  4.82it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 487/625 [01:41<00:28,  4.82it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 488/625 [01:41<00:28,  4.82it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 488/625 [01:42<00:28,  4.82it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 489/625 [01:42<00:28,  4.83it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 489/625 [01:42<00:28,  4.83it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 490/625 [01:42<00:28,  4.82it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  78%|███████▊  | 490/625 [01:42<00:28,  4.82it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  79%|███████▊  | 491/625 [01:42<00:27,  4.82it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 491/625 [01:42<00:27,  4.82it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 492/625 [01:42<00:27,  4.82it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▊  | 492/625 [01:42<00:27,  4.82it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▉  | 493/625 [01:42<00:27,  4.82it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 493/625 [01:43<00:27,  4.82it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 494/625 [01:43<00:27,  4.82it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 494/625 [01:43<00:27,  4.82it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 495/625 [01:43<00:27,  4.81it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 495/625 [01:43<00:27,  4.81it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 496/625 [01:43<00:26,  4.81it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  79%|███████▉  | 496/625 [01:43<00:26,  4.81it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  80%|███████▉  | 497/625 [01:43<00:26,  4.81it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 497/625 [01:43<00:26,  4.81it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 498/625 [01:43<00:26,  4.82it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 498/625 [01:44<00:26,  4.82it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 499/625 [01:44<00:26,  4.82it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|███████▉  | 499/625 [01:44<00:26,  4.82it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|████████  | 500/625 [01:44<00:25,  4.82it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 500/625 [01:44<00:25,  4.82it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 501/625 [01:44<00:25,  4.81it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 501/625 [01:44<00:25,  4.81it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 502/625 [01:44<00:25,  4.81it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 502/625 [01:45<00:25,  4.81it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 503/625 [01:45<00:25,  4.81it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  80%|████████  | 503/625 [01:45<00:25,  4.81it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  81%|████████  | 504/625 [01:45<00:25,  4.82it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 504/625 [01:45<00:25,  4.82it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 505/625 [01:45<00:25,  4.70it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 505/625 [01:45<00:25,  4.70it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 506/625 [01:45<00:25,  4.71it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 506/625 [01:45<00:25,  4.71it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 507/625 [01:45<00:24,  4.75it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████  | 507/625 [01:46<00:24,  4.75it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████▏ | 508/625 [01:46<00:24,  4.78it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 508/625 [01:46<00:24,  4.78it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 509/625 [01:46<00:24,  4.79it/s]
reward: -2.0120, last reward: -0.0066, gradient norm:  0.4458:  81%|████████▏ | 509/625 [01:46<00:24,  4.79it/s]
reward: -2.0120, last reward: -0.0066, gradient norm:  0.4458:  82%|████████▏ | 510/625 [01:46<00:23,  4.81it/s]
reward: -1.8835, last reward: -0.0095, gradient norm:  0.5518:  83%|████████▎ | 520/625 [01:48<00:21,  4.84it/s]
reward: -2.1564, last reward: -0.0035, gradient norm:  0.4355:  85%|████████▍ | 530/625 [01:50<00:19,  4.84it/s]
reward: -2.1532, last reward: -0.0066, gradient norm:  0.5248:  86%|████████▋ | 540/625 [01:52<00:18,  4.70it/s]
reward: -2.2003, last reward: -0.0001, gradient norm:  0.7362:  88%|████████▊ | 550/625 [01:54<00:15,  4.84it/s]
reward: -2.2661, last reward: -0.0258, gradient norm:  1.13:  90%|████████▉ | 560/625 [01:56<00:13,  4.83it/s]
reward: -2.2093, last reward: -0.0042, gradient norm:  0.2236:  91%|█████████ | 570/625 [01:59<00:11,  4.83it/s]
reward: -2.4461, last reward: -0.0148, gradient norm:  1.622:  93%|█████████▎| 580/625 [02:01<00:09,  4.81it/s]
reward: -1.9739, last reward: -0.0033, gradient norm:  2.222:  94%|█████████▍| 590/625 [02:03<00:07,  4.84it/s]
reward: -2.5340, last reward: -0.0005, gradient norm:  0.6933:  96%|█████████▌| 600/625 [02:05<00:05,  4.80it/s]
reward: -2.5771, last reward: -0.1840, gradient norm:  1.342:  98%|█████████▊| 610/625 [02:07<00:03,  4.83it/s]
reward: -1.9285, last reward: -0.0051, gradient norm:  0.3408:  99%|█████████▉| 620/625 [02:09<00:01,  4.82it/s]
reward: -2.0543, last reward: -0.0045, gradient norm:  0.2265: 100%|██████████| 625/625 [02:10<00:00,  4.79it/s]

Conclusion#

In this tutorial, we learned how to code a stateless environment from scratch. We touched upon the following topics:

  • the four essential components that need to be taken care of when coding an environment (step, reset, seeding, and building specs). We saw how these methods and classes interact with the TensorDict class;

  • how to test that an environment is properly coded using check_env_specs();

  • how to append transforms in a stateless environment and how to write custom transformations;

  • how to train a policy on a fully differentiable simulator.
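As a compact recap, the stateless transition at the heart of this tutorial can be sketched in plain NumPy. This is an illustration, not TorchRL's implementation: the angular-velocity update follows the equation of motion given earlier, while the cost function and constants (g, m, L, dt, torque bound) are assumptions borrowed from the Gym Pendulum convention. Because the step is stateless, a whole batch of pendulums is simulated by the same vectorized arithmetic.

```python
import numpy as np


def pendulum_step(theta, thdot, u, g=10.0, m=1.0, L=1.0, dt=0.05, max_torque=2.0):
    """Stateless transition: maps (state, action) -> (next state, reward).

    All arguments may be scalars or NumPy arrays of the same shape,
    so batched execution comes for free.
    """
    u = np.clip(u, -max_torque, max_torque)
    # Angular-velocity update from the equation of motion above.
    new_thdot = thdot + (3 * g / (2 * L) * np.sin(theta) + 3.0 / (m * L**2) * u) * dt
    new_theta = theta + new_thdot * dt

    def angle_normalize(x):
        # Wrap an angle into [-pi, pi).
        return ((x + np.pi) % (2 * np.pi)) - np.pi

    # Cost penalizes distance from upright (theta = 0), speed, and effort.
    reward = -(angle_normalize(theta) ** 2 + 0.1 * thdot**2 + 0.001 * u**2)
    return new_theta, new_thdot, reward


# A batch of two pendulums: one hanging down (theta = pi), one nearly upright.
theta = np.array([np.pi, 0.1])
thdot = np.zeros(2)
u = np.zeros(2)
new_theta, new_thdot, reward = pendulum_step(theta, thdot, u)
```

With zero torque, the downward-hanging pendulum is at an equilibrium (its angular velocity stays essentially zero), while the nearly upright one incurs only a small cost — the same behavior the TorchRL `_step` implements on TensorDicts.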

Total running time of the script: (2 minutes 11.034 seconds)