
Pendulum: Writing your environment and transforms with TorchRL#

Created On: Nov 09, 2023 | Last Updated: Jan 27, 2025 | Last Verified: Nov 05, 2024

Author: Vincent Moens

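Creating an environment (a simulator or an interface to a physical control system) is an integral part of reinforcement learning and control engineering.

TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.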

Figure: Simple Pendulum

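Key learnings:

  • How to design an environment in TorchRL:

    • writing specs (input, observation and reward);

    • implementing behavior: seeding, reset and step.

  • How to transform your environment inputs and outputs, and write your own transforms;

  • How to use TensorDict to carry arbitrary data structures through the codebase.

    In the process, we will touch on three crucial components of TorchRL.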

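To give a sense of what can be achieved with TorchRL's environments, we will design a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, along with the action undertaken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.

Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which is not always the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.

Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.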
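To make the distinction concrete, here is a minimal plain-Python sketch. It does not use TorchRL: `StatefulCounter` and `stateless_step` are hypothetical names, and the "dynamics" are reduced to a simple addition purely for illustration.

```python
class StatefulCounter:
    """Hypothetical stateful environment: it remembers its own state."""

    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += action
        return self.state


def stateless_step(state, action):
    """Hypothetical stateless environment: the caller supplies the state."""
    return state + action


toy_env = StatefulCounter()
toy_env.step(1.0)
stateful_result = toy_env.step(2.0)  # depends on the hidden history

# The stateless version can start from any state...
single = stateless_step(1.0, 2.0)
# ...and can be mapped over many (state, action) pairs at once.
batch = [stateless_step(s, a) for s, a in zip([0.0, 1.0, 5.0], [1.0, 2.0, -5.0])]
print(stateful_result, single, batch)  # 3.0 3.0 [1.0, 3.0, 0.0]
```

With tensors in place of Python floats, the list comprehension would typically become a single vectorized operation, which is the batching advantage described above.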

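This tutorial will be structured as follows:

  • We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and finally its specs.

  • After having coded our simulator, we will demonstrate how it can be used during training with transforms.

  • We will explore new avenues that follow from TorchRL's API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagating through the simulation graph.

  • Finally, we will train a simple policy to solve the system we implemented.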

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

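There are four things you must take care of when designing a new environment class:

  • EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;

  • EnvBase._step(), which codes for the state transition dynamics;

  • EnvBase._set_seed(), which implements the seeding mechanism;

  • the environment specs.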

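Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in the upward position (angular position at 0 by convention) and have it stand still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that will constitute our objective function.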

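For the motion equation, we will update the angular velocity following:

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational acceleration, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

We define our reward as

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

which is maximized when the angle is close to 0 (pendulum in the upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.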
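As a quick sanity check of the three equations above, the following plain-Python sketch evaluates one update on scalars. The function name `pendulum_step` is hypothetical, the default parameter values (g=10, m=1, L=1, dt=0.05) are the ones used by gen_params() later in this tutorial, and the torque and speed clamping performed by the real _step() is deliberately left out.

```python
import math


def pendulum_step(theta, theta_dot, u, g=10.0, m=1.0, length=1.0, dt=0.05):
    """Apply the velocity, position and reward equations once (no clamping)."""
    new_theta_dot = theta_dot + (
        3 * g / (2 * length) * math.sin(theta) + 3.0 / (m * length**2) * u
    ) * dt
    # The position update uses the *new* velocity, as in the second equation.
    new_theta = theta + new_theta_dot * dt
    # The reward is computed from the state and torque before the update.
    reward = -(theta**2 + 0.1 * theta_dot**2 + 0.001 * u**2)
    return new_theta, new_theta_dot, reward


# Upright, at rest, no torque: the state does not move and the reward is zero.
rest = pendulum_step(0.0, 0.0, 0.0)
# Upright, at rest, unit torque: the pendulum starts to rotate.
pushed = pendulum_step(0.0, 0.0, 1.0)
print(rest, pushed)
```

With a unit torque from rest, the velocity becomes 3 * 0.05 = 0.15 rad/sec and the angle 0.15 * 0.05 = 0.0075 rad (up to floating-point rounding), while the reward is -0.001 because only the torque term is non-zero.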

Writing the effects of an action: _step()#

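The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.

To facilitate the reading and writing from that tensordict and to make sure that the keys are consistent with what is expected by the library, the simulation part has been delegated to a private abstract method _step(), which reads input data from a tensordict and writes a new tensordict with the output data.

The _step() method should do the following:

  1. Read the input keys (such as "action") and execute the simulation based on these;

  2. Retrieve observations, done state and reward;

  3. Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.

Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.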
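The division of labor between step() and _step() can be sketched with plain dictionaries. This is only an illustration of the data flow under simplifying assumptions: `toy_private_step` and `toy_step` are hypothetical functions, not TorchRL methods, and the real step() performs additional checks.

```python
def toy_private_step(td):
    """Hypothetical _step(): read the inputs, simulate, return only new entries."""
    new_obs = td["observation"] + td["action"]
    return {"observation": new_obs, "reward": -abs(new_obs), "done": False}


def toy_step(td):
    """Hypothetical step(): keep the root entries, store the result under "next"."""
    out = dict(td)  # shallow copy so the caller's dictionary is left untouched
    out["next"] = toy_private_step(td)
    return out


result = toy_step({"observation": 1.0, "action": 0.5})
print(result)
```

The root "observation" still holds the value from before the step, and everything produced by the simulation lives under "next".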

Typically, for stateful environments, this will look like this:

>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

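Notice that the root tensordict has not changed; the only modification is the appearance of a new "next" entry that contains the new information.

In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied onto it. We compute the new angular position of the pendulum "new_th" as the previous position "th" plus the new velocity "new_thdot" multiplied by the time interval dt.

Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from being "upward" and/or speeds that are far from 0.

In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed as the state needs to be read from the environment.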

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
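
The cost in _step() uses angle_normalize so that angles that differ by a full turn are penalized identically. The following scalar sketch reproduces the same formula with the standard math module instead of torch, to show how angles are wrapped into the interval [-pi, pi).

```python
import math


def angle_normalize(x):
    """Wrap an angle in radians into [-pi, pi), same formula as the tensor version."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


# One full turn plus 0.1 rad points in the same direction as 0.1 rad.
wrapped = angle_normalize(2 * math.pi + 0.1)
# 3*pi/2 points in the same direction as -pi/2.
three_quarters = angle_normalize(1.5 * math.pi)
print(wrapped, three_quarters)
```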

Resetting the simulator: _reset()#

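The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method must receive a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, although it may perfectly well be empty or None.

The parent EnvBase.reset() does some simple checks like EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.

For us, the only important thing to consider is whether the output of EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".

In this example, we do not pass a done state as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.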

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out
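
The sampling logic in _reset() follows the usual "rand * (high - low) + low" recipe for drawing uniformly from an interval. Here is a scalar sketch of the same idea using the standard random module rather than torch; `sample_initial_state` is a hypothetical helper, and the bounds mirror DEFAULT_X and DEFAULT_Y defined at the top of the tutorial.

```python
import math
import random


def sample_initial_state(rng, high_th=math.pi, high_thdot=1.0):
    """Draw (th, thdot) uniformly, with low = -high for both quantities."""
    low_th, low_thdot = -high_th, -high_thdot
    th = rng.random() * (high_th - low_th) + low_th
    thdot = rng.random() * (high_thdot - low_thdot) + low_thdot
    return th, thdot


rng = random.Random(0)  # a dedicated generator, like self.rng in the environment
states = [sample_initial_state(rng) for _ in range(1000)]
print(states[0])
```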

Environment metadata: env.*_spec#

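The specs define the input and output domain of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).

There are four specs that we must code in our environment:

  • EnvBase.observation_spec: This will be a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs).

  • EnvBase.action_spec: It can be any type of spec, but it is required that it corresponds to the "action" entry in the input tensordict;

  • EnvBase.reward_spec: provides information about the reward space;

  • EnvBase.done_spec: provides information about the space of the done flag.

TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec containing the action and state_spec containing all the rest), and output_spec, which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be directly modified.

In other words, the observation_spec and related properties are convenient shortcuts to the content of the output and input spec containers.

TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.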
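
Conceptually, a spec is a description of a domain that can answer "is this value valid?" and "give me a random valid value". The toy class below illustrates that idea for a scalar interval in plain Python. `ToyBoundedSpec` is a hypothetical stand-in written for this sketch and is not the TorchRL implementation; real specs additionally carry shape, dtype and device.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyBoundedSpec:
    """Hypothetical scalar spec describing the closed interval [low, high]."""

    low: float
    high: float

    def is_in(self, value):
        return self.low <= value <= self.high

    def rand(self, rng):
        return rng.uniform(self.low, self.high)

    def project(self, value):
        # Clip an out-of-domain value back onto the interval.
        return min(max(value, self.low), self.high)


# The torque limits used in this tutorial are -2.0 and 2.0.
toy_action_spec = ToyBoundedSpec(low=-2.0, high=2.0)
sample = toy_action_spec.rand(random.Random(0))
print(toy_action_spec.is_in(sample), toy_action_spec.is_in(3.0), toy_action_spec.project(3.0))
# True False 2.0
```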

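Specs shape#

The environment specs' leading dimensions must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.

For non batch-locked environments, such as the one in our example (see below), this is irrelevant as the environment batch size will most likely be empty.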

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # ``self.action_spec = spec`` is called
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

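Reproducible experiments: seeding#

Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with a different pseudo-random and reproducible seed.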

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
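
The effect of seeding can be illustrated without torch. In this sketch, `ToySimulator` is a hypothetical class that owns its random generator, in the same spirit as the self.rng attribute above; the standard random module stands in for the torch generator.

```python
import random


class ToySimulator:
    """Hypothetical simulator that owns its random generator."""

    def set_seed(self, seed):
        # Only the generator is replaced; no reset is triggered here.
        self.rng = random.Random(seed)

    def reset(self):
        return self.rng.uniform(-1.0, 1.0)


def first_states(seed, n=3):
    sim = ToySimulator()
    sim.set_seed(seed)
    return [sim.reset() for _ in range(n)]


print(first_states(0) == first_states(0))  # True: same seed, same initial states
print(first_states(0) == first_states(1))  # False: different seeds diverge
```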

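Wrapping things together: the EnvBase class#

We can finally put together the pieces and design our environment class. The specs initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().

We add a static method PendulumEnv.gen_params() which deterministically generates a set of hyperparameters to be used during execution.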

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

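We define the environment as non-batch_locked by setting the attribute of the same name to False. This means that we will not enforce the input tensordict to have a batch-size that matches the one of the environment.

The following code will just put together the pieces we have coded above.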

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

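Testing our environment#

TorchRL provides a simple function check_env_specs() to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out:

env = PendulumEnv()
check_env_specs(env)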

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:7085: DeprecationWarning: The BoundedTensorSpec has been deprecated and will be removed in v0.8. Please use Bounded instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:7085: DeprecationWarning: The UnboundedContinuousTensorSpec has been deprecated and will be removed in v0.8. Please use Unbounded instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:7085: DeprecationWarning: The CompositeSpec has been deprecated and will be removed in v0.8. Please use Composite instead.
  warnings.warn(
2026-06-03 01:00:51,462 [torchrl][INFO]    check_env_specs succeeded! [END]

We can have a look at our specs to get a visual representation of the environment signature:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
state_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

We can also execute a couple of commands to check that the output structure matches what is expected.

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

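We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.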

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

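Transforming an environment#

Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that needs to be read at the following iteration requires applying the inverse transform before calling step() at the next iteration. This is an ideal scenario to showcase all the features of TorchRL's transforms!

For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.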

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)
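
The round trip performed by UnsqueezeTransform with matching in_keys and in_keys_inv can be pictured with nested lists instead of tensors. The helper names below are hypothetical and the sketch only mimics the shape change, () to (1,) on the way out and (1,) to () on the way back in.

```python
def unsqueeze_last(value):
    """Forward direction: wrap a scalar in a length-1 list, shape () -> (1,)."""
    return [value]


def squeeze_last(value):
    """Inverse direction: unwrap the length-1 list, shape (1,) -> ()."""
    (scalar,) = value
    return scalar


state = {"th": 0.3, "thdot": -1.2}
keys = ["th", "thdot"]

# Output side: what the policy and the later transforms get to see.
transformed = {k: unsqueeze_last(state[k]) for k in keys}
# Input side: what the base environment receives at the next step.
restored = {k: squeeze_last(transformed[k]) for k in keys}
print(transformed, restored)
```

Because the base environment expects scalars of shape (), skipping the inverse direction would feed it entries with an extra trailing dimension at the next step.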

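Writing custom transforms#

TorchRL's transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:

  • Getting the dynamics right (forward and inverse);

  • Adapting the environment specs.

A transform can be used in two settings: on its own, it can be used as a Module. It can also be used appended to a TransformedEnv. The structure of the class allows customizing the behavior in the different contexts.

A Transform skeleton can be summarized as follows: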

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

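There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries pointed to by Transform.out_keys if provided (if not, the in_keys will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will be executed but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys. The figure below summarizes this flow for environments and replay buffers.

Figure: Transform API

In some cases, a transform will not work on a subset of keys in a unitary manner, but will execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be re-written, and the _apply_transform() method can be skipped.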
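
The keyed data flow described above (read each in_key, apply a function, write to the matching out_key) can be sketched on plain dictionaries. `ToyKeyedTransform` is a hypothetical class written for this illustration; it ignores specs, the inverse direction and the distinction between forward() and _step().

```python
import math


class ToyKeyedTransform:
    """Hypothetical stand-in for the in_keys -> out_keys flow, on plain dicts."""

    def __init__(self, fn, in_keys, out_keys=None):
        self.fn = fn
        self.in_keys = in_keys
        # Without out_keys, the transformed values overwrite the inputs.
        self.out_keys = out_keys if out_keys is not None else in_keys

    def __call__(self, data):
        out = dict(data)
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            out[out_key] = self.fn(data[in_key])
        return out


to_sin = ToyKeyedTransform(math.sin, in_keys=["th"], out_keys=["sin"])
in_place_abs = ToyKeyedTransform(abs, in_keys=["thdot"])

data = in_place_abs(to_sin({"th": 0.0, "thdot": -2.0}))
print(data)  # {'th': 0.0, 'thdot': 2.0, 'sin': 0.0}
```

The first transform adds a new "sin" entry and leaves "th" in place, while the second one has no out_keys and therefore overwrites "thdot".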

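Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us to learn a policy than the raw angle value: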

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))

Concatenate the observations onto an "observation" entry. del_keys=False ensures that we keep these values for the next iteration.

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

Once more, let us check that our environment specs match what is received:

check_env_specs(env)

2026-06-03 01:00:51,498 [torchrl][INFO]    check_env_specs succeeded! [END]

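Executing a rollout#

Executing a rollout is a succession of simple steps:

  • reset the environment

  • while some condition is not met:

    • compute an action given a policy

    • execute a step given this action

    • collect the data

    • make an MDP step

  • gather the data and return

These operations have been conveniently wrapped in the rollout() method, of which we provide a simplified version below.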

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)

Batching computations

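The last part of this tutorial left to explore is batched computation in TorchRL. Because our environment makes no assumptions about the shape of its input data, we can run it over batches of data seamlessly. Better still, for non-batch-locked environments such as our pendulum, we can change the batch size on the fly without recreating the environment. To do so, we simply generate parameters with the desired shape.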

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
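
To make the batching idea concrete, here is a minimal plain-Python sketch (not the TorchRL implementation) of one simulation step applied elementwise to a batch of pendulums. It follows the angular-velocity update given earlier in this tutorial. The position update, the constant values `G`, `M`, `L`, `DT` and the helper name `batched_step` are assumptions made for illustration only.

```python
import math

# Assumed physical constants, for illustration only.
G, M, L, DT = 10.0, 1.0, 1.0, 0.05


def batched_step(th, thdot, u):
    """Advance every pendulum in the batch by one time step.

    th, thdot and u are equal-length lists with one entry per batch element.
    """
    new_thdot = [
        w + (3 * G / (2 * L) * math.sin(a) + 3.0 / (M * L**2) * tq) * DT
        for a, w, tq in zip(th, thdot, u)
    ]
    # Assumed semi-implicit Euler update: integrate the angle with the new velocity.
    new_th = [a + w * DT for a, w in zip(th, new_thdot)]
    return new_th, new_thdot


# The same function handles a batch of 1 or a batch of 10: only the inputs change.
th, thdot = batched_step([0.0] * 10, [0.0] * 10, [0.0] * 10)
print(len(th), th[0], thdot[0])
```

With tensors instead of lists, the loop over batch elements disappears and the same arithmetic broadcasts over the leading batch dimension, which is what lets the environment accept any batch size.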

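Executing a rollout on a batch of data requires us to reset the environment outside of the rollout function, because the batch size has to be defined dynamically and rollout() does not support this.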

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)
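
The shapes printed above follow a simple rule: the batch dimensions come first, the time dimension is stacked right after them, and the feature dimensions come last. The sketch below restates that rule in plain Python; `rollout_shape` is a hypothetical helper written for this illustration, not a TorchRL function.

```python
def rollout_shape(batch_size, n_steps, feature_shape=()):
    """Shape of one rollout entry: batch dims, then time, then feature dims."""
    return tuple(batch_size) + (n_steps,) + tuple(feature_shape)


print(rollout_shape([10], 3, [1]))  # entries such as action, reward, th
print(rollout_shape([10], 3, [3]))  # observation
print(rollout_shape([10], 3))       # params entries such as dt or g
```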

Training a simple policy

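In this example, we train a simple policy using the reward as a differentiable objective (that is, as a negative loss). We exploit the fact that our dynamical system is fully differentiable: we backpropagate through the trajectory return and adjust the policy weights to maximize that value directly. In many settings the assumptions made here do not hold, since they require a differentiable system and full access to the underlying mechanics.

Still, this is a very simple example that shows how to write a training loop with a custom environment in TorchRL.

Let us first write the policy network: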

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

and our optimizer:

optim = torch.optim.Adam(policy.parameters(), lr=2e-3)

Training loop

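We will successively:

  • generate a trajectory

  • aggregate the rewards (here, by taking their mean)

  • backpropagate through the graph defined by these operations

  • clip the gradient norm and perform an optimization step

  • repeat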
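
The gradient-clipping step can be sketched in plain Python as follows. This is an illustration of global-norm clipping over a flat list of gradient values, not the PyTorch implementation; the function name `clip_grad_norm` and the epsilon value are assumptions. Like `torch.nn.utils.clip_grad_norm_`, it returns the norm measured before clipping.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale grads in place so their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink: gradients already below the threshold are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    for i, g in enumerate(grads):
        grads[i] = g * scale
    return total_norm


grads = [3.0, 4.0]
norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in grads))
print(norm_before, norm_after)
```

Because the returned value is the pre-clipping norm, the training loop in this section can report gradient norms well above its threshold of 1.0 even though the update itself uses the clipped gradients.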

By the end of training, the final reward should be close to 0, which indicates that the pendulum is upright and still, as intended.
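
To see why a reward near 0 corresponds to an upright, motionless pendulum, consider the sketch below. It assumes a Pendulum-v1 style reward (the negative of the squared normalized angle plus small penalties on angular velocity and torque); the weights 0.1 and 0.001 and both function names are assumptions for illustration and may differ from the environment defined in this tutorial.

```python
import math


def angle_normalize(x):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def reward(th, thdot, u):
    """Assumed Pendulum-v1 style reward: zero only when upright, still and unforced."""
    return -(angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * u**2)


print(reward(0.0, 0.0, 0.0))      # upright and still
print(reward(math.pi, 0.0, 0.0))  # hanging straight down
```

Every term is a non-positive penalty, so the reward can only reach 0 when the angle, the angular velocity and the applied torque all vanish.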

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
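
The learning-rate schedule used in the loop above can be examined with a closed-form sketch of cosine annealing (without restarts). The helper name `cosine_annealing_lr` and the base learning rate below are assumptions for illustration; only the period of 20_000 and the batch size of 32 come from the training code.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


base_lr = 2e-3  # assumed value, for illustration only
n_iters = 20_000 // 32  # number of iterations run by the training loop
for step in (0, n_iters, 20_000):
    print(step, cosine_annealing_lr(step, base_lr, t_max=20_000))
```

Since the scheduler is stepped once per iteration but its period is 20_000, the loop covers only the very beginning of the cosine curve, and the learning rate decreases by less than 1% over the whole run.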
[Figure: two panels titled "returns" and "last reward", each plotted against the training iteration]
  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<03:31,  2.95it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 1/625 [00:00<03:31,  2.95it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<02:44,  3.78it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 2/625 [00:00<02:44,  3.78it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<02:29,  4.16it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   0%|          | 3/625 [00:00<02:29,  4.16it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   1%|          | 4/625 [00:00<02:21,  4.39it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 4/625 [00:01<02:21,  4.39it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 5/625 [00:01<02:17,  4.52it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 5/625 [00:01<02:17,  4.52it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 6/625 [00:01<02:14,  4.60it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 6/625 [00:01<02:14,  4.60it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 7/625 [00:01<02:12,  4.66it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|          | 7/625 [00:01<02:12,  4.66it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|▏         | 8/625 [00:01<02:10,  4.71it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 8/625 [00:02<02:10,  4.71it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 9/625 [00:02<02:09,  4.74it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   1%|▏         | 9/625 [00:02<02:09,  4.74it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   2%|▏         | 10/625 [00:02<02:09,  4.76it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 10/625 [00:02<02:09,  4.76it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 11/625 [00:02<02:08,  4.76it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 11/625 [00:02<02:08,  4.76it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 12/625 [00:02<02:08,  4.77it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 12/625 [00:02<02:08,  4.77it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 13/625 [00:02<02:08,  4.78it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 13/625 [00:03<02:08,  4.78it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 14/625 [00:03<02:07,  4.78it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 14/625 [00:03<02:07,  4.78it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 15/625 [00:03<02:07,  4.79it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.903:   2%|▏         | 15/625 [00:03<02:07,  4.79it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.903:   3%|▎         | 16/625 [00:03<02:06,  4.80it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.918:   3%|▎         | 16/625 [00:03<02:06,  4.80it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.918:   3%|▎         | 17/625 [00:03<02:06,  4.80it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8319:   3%|▎         | 17/625 [00:03<02:06,  4.80it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8319:   3%|▎         | 18/625 [00:03<02:06,  4.80it/s]
reward: -6.3221, last reward: -6.5577, gradient norm:  0.8415:   3%|▎         | 18/625 [00:04<02:06,  4.80it/s]
reward: -6.3221, last reward: -6.5577, gradient norm:  0.8415:   3%|▎         | 19/625 [00:04<02:06,  4.80it/s]
reward: -6.3229, last reward: -8.3322, gradient norm:  27.31:   3%|▎         | 19/625 [00:04<02:06,  4.80it/s]
reward: -6.3229, last reward: -8.3322, gradient norm:  27.31:   3%|▎         | 20/625 [00:04<02:06,  4.80it/s]
reward: -6.0258, last reward: -8.0581, gradient norm:  12.32:   3%|▎         | 20/625 [00:04<02:06,  4.80it/s]
reward: -6.0258, last reward: -8.0581, gradient norm:  12.32:   3%|▎         | 21/625 [00:04<02:05,  4.80it/s]
reward: -5.7295, last reward: -6.7230, gradient norm:  24.23:   3%|▎         | 21/625 [00:04<02:05,  4.80it/s]
reward: -5.7295, last reward: -6.7230, gradient norm:  24.23:   4%|▎         | 22/625 [00:04<02:05,  4.80it/s]
reward: -6.0265, last reward: -6.6077, gradient norm:  52.82:   4%|▎         | 22/625 [00:04<02:05,  4.80it/s]
reward: -6.0265, last reward: -6.6077, gradient norm:  52.82:   4%|▎         | 23/625 [00:04<02:05,  4.80it/s]
reward: -6.1081, last reward: -6.1347, gradient norm:  31.16:   4%|▎         | 23/625 [00:05<02:05,  4.80it/s]
reward: -6.1081, last reward: -6.1347, gradient norm:  31.16:   4%|▍         | 24/625 [00:05<02:05,  4.80it/s]
reward: -5.5231, last reward: -4.8435, gradient norm:  11.51:   4%|▍         | 24/625 [00:05<02:05,  4.80it/s]
reward: -5.5231, last reward: -4.8435, gradient norm:  11.51:   4%|▍         | 25/625 [00:05<02:05,  4.79it/s]
reward: -5.5310, last reward: -6.5397, gradient norm:  13.18:   4%|▍         | 25/625 [00:05<02:05,  4.79it/s]
reward: -5.5310, last reward: -6.5397, gradient norm:  13.18:   4%|▍         | 26/625 [00:05<02:05,  4.79it/s]
reward: -5.6382, last reward: -4.8204, gradient norm:  10.72:   4%|▍         | 26/625 [00:05<02:05,  4.79it/s]
reward: -5.6382, last reward: -4.8204, gradient norm:  10.72:   4%|▍         | 27/625 [00:05<02:04,  4.79it/s]
reward: -5.8162, last reward: -5.1618, gradient norm:  10.44:   4%|▍         | 27/625 [00:05<02:04,  4.79it/s]
reward: -5.8162, last reward: -5.1618, gradient norm:  10.44:   4%|▍         | 28/625 [00:05<02:04,  4.78it/s]
reward: -6.1180, last reward: -5.4640, gradient norm:  7.744:   4%|▍         | 28/625 [00:06<02:04,  4.78it/s]
reward: -6.1180, last reward: -5.4640, gradient norm:  7.744:   5%|▍         | 29/625 [00:06<02:04,  4.79it/s]
reward: -5.8759, last reward: -5.7826, gradient norm:  1.796:   5%|▍         | 29/625 [00:06<02:04,  4.79it/s]
reward: -5.8759, last reward: -5.7826, gradient norm:  1.796:   5%|▍         | 30/625 [00:06<02:04,  4.79it/s]
reward: -5.8296, last reward: -6.4808, gradient norm:  2.25:   5%|▍         | 30/625 [00:06<02:04,  4.79it/s]
reward: -5.8296, last reward: -6.4808, gradient norm:  2.25:   5%|▍         | 31/625 [00:06<02:03,  4.79it/s]
reward: -5.7578, last reward: -7.5124, gradient norm:  30.52:   5%|▍         | 31/625 [00:06<02:03,  4.79it/s]
reward: -5.7578, last reward: -7.5124, gradient norm:  30.52:   5%|▌         | 32/625 [00:06<02:03,  4.78it/s]
reward: -5.9313, last reward: -7.5212, gradient norm:  7.652:   5%|▌         | 32/625 [00:07<02:03,  4.78it/s]
reward: -5.9313, last reward: -7.5212, gradient norm:  7.652:   5%|▌         | 33/625 [00:07<02:03,  4.78it/s]
reward: -6.0223, last reward: -6.6343, gradient norm:  4.224:   5%|▌         | 33/625 [00:07<02:03,  4.78it/s]
reward: -6.0223, last reward: -6.6343, gradient norm:  4.224:   5%|▌         | 34/625 [00:07<02:03,  4.78it/s]
reward: -6.2886, last reward: -5.1441, gradient norm:  3.539:   5%|▌         | 34/625 [00:07<02:03,  4.78it/s]
reward: -6.2886, last reward: -5.1441, gradient norm:  3.539:   6%|▌         | 35/625 [00:07<02:03,  4.78it/s]
reward: -6.1060, last reward: -7.1638, gradient norm:  2.407:   6%|▌         | 35/625 [00:07<02:03,  4.78it/s]
reward: -6.1060, last reward: -7.1638, gradient norm:  2.407:   6%|▌         | 36/625 [00:07<02:03,  4.77it/s]
reward: -6.2230, last reward: -5.2917, gradient norm:  5.425:   6%|▌         | 36/625 [00:07<02:03,  4.77it/s]
reward: -6.2230, last reward: -5.2917, gradient norm:  5.425:   6%|▌         | 37/625 [00:07<02:02,  4.78it/s]
reward: -6.2950, last reward: -6.2126, gradient norm:  6.035:   6%|▌         | 37/625 [00:08<02:02,  4.78it/s]
reward: -6.2950, last reward: -6.2126, gradient norm:  6.035:   6%|▌         | 38/625 [00:08<02:02,  4.80it/s]
reward: -5.9786, last reward: -5.8757, gradient norm:  2.098:   6%|▌         | 38/625 [00:08<02:02,  4.80it/s]
reward: -5.9786, last reward: -5.8757, gradient norm:  2.098:   6%|▌         | 39/625 [00:08<02:02,  4.79it/s]
reward: -6.0730, last reward: -5.1952, gradient norm:  3.982:   6%|▌         | 39/625 [00:08<02:02,  4.79it/s]
reward: -6.0730, last reward: -5.1952, gradient norm:  3.982:   6%|▋         | 40/625 [00:08<02:01,  4.80it/s]
reward: -5.9481, last reward: -5.7122, gradient norm:  4.42:   6%|▋         | 40/625 [00:08<02:01,  4.80it/s]
reward: -5.9481, last reward: -5.7122, gradient norm:  4.42:   7%|▋         | 41/625 [00:08<02:01,  4.79it/s]
reward: -6.0875, last reward: -6.7567, gradient norm:  7.728:   7%|▋         | 41/625 [00:08<02:01,  4.79it/s]
reward: -6.0875, last reward: -6.7567, gradient norm:  7.728:   7%|▋         | 42/625 [00:08<02:01,  4.79it/s]
reward: -5.6301, last reward: -6.2249, gradient norm:  9.824:   7%|▋         | 42/625 [00:09<02:01,  4.79it/s]
reward: -5.6301, last reward: -6.2249, gradient norm:  9.824:   7%|▋         | 43/625 [00:09<02:01,  4.79it/s]
reward: -5.5281, last reward: -5.7749, gradient norm:  7.223:   7%|▋         | 43/625 [00:09<02:01,  4.79it/s]
reward: -5.5281, last reward: -5.7749, gradient norm:  7.223:   7%|▋         | 44/625 [00:09<02:01,  4.79it/s]
reward: -5.5904, last reward: -5.0048, gradient norm:  11.73:   7%|▋         | 44/625 [00:09<02:01,  4.79it/s]
reward: -5.5904, last reward: -5.0048, gradient norm:  11.73:   7%|▋         | 45/625 [00:09<02:00,  4.80it/s]
reward: -5.7882, last reward: -4.8660, gradient norm:  2.094:   7%|▋         | 45/625 [00:09<02:00,  4.80it/s]
reward: -5.7882, last reward: -4.8660, gradient norm:  2.094:   7%|▋         | 46/625 [00:09<02:01,  4.78it/s]
reward: -5.8592, last reward: -4.4848, gradient norm:  30.4:   7%|▋         | 46/625 [00:09<02:01,  4.78it/s]
reward: -5.8592, last reward: -4.4848, gradient norm:  30.4:   8%|▊         | 47/625 [00:09<02:01,  4.76it/s]
reward: -5.3849, last reward: -3.5828, gradient norm:  2.244:   8%|▊         | 47/625 [00:10<02:01,  4.76it/s]
reward: -5.3849, last reward: -3.5828, gradient norm:  2.244:   8%|▊         | 48/625 [00:10<02:00,  4.77it/s]
reward: -5.5785, last reward: -2.4216, gradient norm:  0.8946:   8%|▊         | 48/625 [00:10<02:00,  4.77it/s]
reward: -5.5785, last reward: -2.4216, gradient norm:  0.8946:   8%|▊         | 49/625 [00:10<02:00,  4.78it/s]
reward: -5.4433, last reward: -3.4306, gradient norm:  16.48:   8%|▊         | 49/625 [00:10<02:00,  4.78it/s]
reward: -5.4433, last reward: -3.4306, gradient norm:  16.48:   8%|▊         | 50/625 [00:10<02:00,  4.78it/s]
reward: -5.5546, last reward: -5.3443, gradient norm:  8.319:   8%|▊         | 50/625 [00:10<02:00,  4.78it/s]
reward: -5.5546, last reward: -5.3443, gradient norm:  8.319:   8%|▊         | 51/625 [00:10<02:00,  4.75it/s]
reward: -5.5681, last reward: -7.5266, gradient norm:  5.593:   8%|▊         | 51/625 [00:11<02:00,  4.75it/s]
reward: -5.5681, last reward: -7.5266, gradient norm:  5.593:   8%|▊         | 52/625 [00:11<02:00,  4.75it/s]
reward: -5.6418, last reward: -8.1904, gradient norm:  12.34:   8%|▊         | 52/625 [00:11<02:00,  4.75it/s]
reward: -5.6418, last reward: -8.1904, gradient norm:  12.34:   8%|▊         | 53/625 [00:11<02:00,  4.76it/s]
reward: -5.6517, last reward: -8.3856, gradient norm:  4.565:   8%|▊         | 53/625 [00:11<02:00,  4.76it/s]
reward: -5.6517, last reward: -8.3856, gradient norm:  4.565:   9%|▊         | 54/625 [00:11<02:24,  3.94it/s]
reward: -5.9653, last reward: -8.4339, gradient norm:  12.73:   9%|▊         | 54/625 [00:11<02:24,  3.94it/s]
reward: -5.9653, last reward: -8.4339, gradient norm:  12.73:   9%|▉         | 55/625 [00:11<02:16,  4.16it/s]
reward: -6.0832, last reward: -8.9027, gradient norm:  6.07:   9%|▉         | 55/625 [00:11<02:16,  4.16it/s]
reward: -6.0832, last reward: -8.9027, gradient norm:  6.07:   9%|▉         | 56/625 [00:11<02:11,  4.33it/s]
reward: -6.2454, last reward: -8.9134, gradient norm:  9.312:   9%|▉         | 56/625 [00:12<02:11,  4.33it/s]
reward: -6.2454, last reward: -8.9134, gradient norm:  9.312:   9%|▉         | 57/625 [00:12<02:07,  4.46it/s]
reward: -6.1343, last reward: -9.4171, gradient norm:  16.74:   9%|▉         | 57/625 [00:12<02:07,  4.46it/s]
reward: -6.1343, last reward: -9.4171, gradient norm:  16.74:   9%|▉         | 58/625 [00:12<02:04,  4.55it/s]
reward: -5.7796, last reward: -11.1745, gradient norm:  20.83:   9%|▉         | 58/625 [00:12<02:04,  4.55it/s]
reward: -5.7796, last reward: -11.1745, gradient norm:  20.83:   9%|▉         | 59/625 [00:12<02:02,  4.60it/s]
reward: -5.4783, last reward: -6.2441, gradient norm:  8.777:   9%|▉         | 59/625 [00:12<02:02,  4.60it/s]
reward: -5.4783, last reward: -6.2441, gradient norm:  8.777:  10%|▉         | 60/625 [00:12<02:01,  4.65it/s]
reward: -5.5816, last reward: -4.1932, gradient norm:  6.328:  10%|▉         | 60/625 [00:13<02:01,  4.65it/s]
reward: -5.5816, last reward: -4.1932, gradient norm:  6.328:  10%|▉         | 61/625 [00:13<02:00,  4.69it/s]
reward: -5.6604, last reward: -4.1629, gradient norm:  3.516:  10%|▉         | 61/625 [00:13<02:00,  4.69it/s]
reward: -5.6604, last reward: -4.1629, gradient norm:  3.516:  10%|▉         | 62/625 [00:13<01:59,  4.72it/s]
reward: -5.4195, last reward: -5.1296, gradient norm:  8.378:  10%|▉         | 62/625 [00:13<01:59,  4.72it/s]
reward: -5.4195, last reward: -5.1296, gradient norm:  8.378:  10%|█         | 63/625 [00:13<01:59,  4.71it/s]
reward: -5.5165, last reward: -3.0986, gradient norm:  17.72:  10%|█         | 63/625 [00:13<01:59,  4.71it/s]
reward: -5.5165, last reward: -3.0986, gradient norm:  17.72:  10%|█         | 64/625 [00:13<01:58,  4.73it/s]
reward: -5.5596, last reward: -4.2442, gradient norm:  11.38:  10%|█         | 64/625 [00:13<01:58,  4.73it/s]
reward: -5.5596, last reward: -4.2442, gradient norm:  11.38:  10%|█         | 65/625 [00:13<01:58,  4.74it/s]
reward: -5.9834, last reward: -6.0432, gradient norm:  8.038:  10%|█         | 65/625 [00:14<01:58,  4.74it/s]
reward: -5.9834, last reward: -6.0432, gradient norm:  8.038:  11%|█         | 66/625 [00:14<01:58,  4.72it/s]
reward: -5.7958, last reward: -5.1525, gradient norm:  8.564:  11%|█         | 66/625 [00:14<01:58,  4.72it/s]
reward: -5.7958, last reward: -5.1525, gradient norm:  8.564:  11%|█         | 67/625 [00:14<01:58,  4.73it/s]
reward: -5.8544, last reward: -5.2747, gradient norm:  7.632:  11%|█         | 67/625 [00:14<01:58,  4.73it/s]
reward: -5.8544, last reward: -5.2747, gradient norm:  7.632:  11%|█         | 68/625 [00:14<01:57,  4.75it/s]
reward: -5.3922, last reward: -4.5267, gradient norm:  18.13:  11%|█         | 68/625 [00:14<01:57,  4.75it/s]
reward: -5.3922, last reward: -4.5267, gradient norm:  18.13:  11%|█         | 69/625 [00:14<01:57,  4.73it/s]
reward: -5.0917, last reward: -3.3025, gradient norm:  2.33:  11%|█         | 69/625 [00:14<01:57,  4.73it/s]
reward: -5.0917, last reward: -3.3025, gradient norm:  2.33:  11%|█         | 70/625 [00:14<01:56,  4.75it/s]
reward: -5.0968, last reward: -6.1214, gradient norm:  11.27:  11%|█         | 70/625 [00:15<01:56,  4.75it/s]
reward: -5.0968, last reward: -6.1214, gradient norm:  11.27:  11%|█▏        | 71/625 [00:15<01:56,  4.76it/s]
reward: -5.2523, last reward: -4.0580, gradient norm:  22.2:  11%|█▏        | 71/625 [00:15<01:56,  4.76it/s]
reward: -5.2523, last reward: -4.0580, gradient norm:  22.2:  12%|█▏        | 72/625 [00:15<01:56,  4.77it/s]
reward: -5.4829, last reward: -6.6886, gradient norm:  12.37:  12%|█▏        | 72/625 [00:15<01:56,  4.77it/s]
reward: -5.4829, last reward: -6.6886, gradient norm:  12.37:  12%|█▏        | 73/625 [00:15<01:55,  4.76it/s]
reward: -5.7293, last reward: -9.4615, gradient norm:  15.07:  12%|█▏        | 73/625 [00:15<01:55,  4.76it/s]
reward: -5.7293, last reward: -9.4615, gradient norm:  15.07:  12%|█▏        | 74/625 [00:15<01:57,  4.71it/s]
reward: -5.7735, last reward: -9.0859, gradient norm:  892.4:  12%|█▏        | 74/625 [00:15<01:57,  4.71it/s]
reward: -5.7735, last reward: -9.0859, gradient norm:  892.4:  12%|█▏        | 75/625 [00:15<01:56,  4.74it/s]
reward: -6.1616, last reward: -9.2996, gradient norm:  9.569:  12%|█▏        | 75/625 [00:16<01:56,  4.74it/s]
reward: -6.1616, last reward: -9.2996, gradient norm:  9.569:  12%|█▏        | 76/625 [00:16<01:55,  4.74it/s]
reward: -6.2202, last reward: -9.3199, gradient norm:  8.919:  12%|█▏        | 76/625 [00:16<01:55,  4.74it/s]
reward: -6.2202, last reward: -9.3199, gradient norm:  8.919:  12%|█▏        | 77/625 [00:16<01:55,  4.75it/s]
reward: -6.1349, last reward: -9.9361, gradient norm:  10.06:  12%|█▏        | 77/625 [00:16<01:55,  4.75it/s]
reward: -6.1349, last reward: -9.9361, gradient norm:  10.06:  12%|█▏        | 78/625 [00:16<01:54,  4.76it/s]
reward: -6.0374, last reward: -10.4791, gradient norm:  45.37:  12%|█▏        | 78/625 [00:16<01:54,  4.76it/s]
reward: -6.0374, last reward: -10.4791, gradient norm:  45.37:  13%|█▎        | 79/625 [00:16<01:54,  4.76it/s]
reward: -5.6990, last reward: -9.0426, gradient norm:  32.93:  13%|█▎        | 79/625 [00:17<01:54,  4.76it/s]
reward: -5.6990, last reward: -9.0426, gradient norm:  32.93:  13%|█▎        | 80/625 [00:17<01:54,  4.77it/s]
reward: -5.3303, last reward: -4.9148, gradient norm:  307.4:  13%|█▎        | 80/625 [00:17<01:54,  4.77it/s]
reward: -5.3303, last reward: -4.9148, gradient norm:  307.4:  13%|█▎        | 81/625 [00:17<01:54,  4.77it/s]
reward: -5.2291, last reward: -3.3632, gradient norm:  2.828:  13%|█▎        | 81/625 [00:17<01:54,  4.77it/s]
reward: -5.2291, last reward: -3.3632, gradient norm:  2.828:  13%|█▎        | 82/625 [00:17<01:54,  4.76it/s]
reward: -5.0228, last reward: -3.1018, gradient norm:  32.56:  13%|█▎        | 82/625 [00:17<01:54,  4.76it/s]
reward: -5.0228, last reward: -3.1018, gradient norm:  32.56:  13%|█▎        | 83/625 [00:17<01:54,  4.75it/s]
reward: -5.0364, last reward: -3.8503, gradient norm:  8.948:  13%|█▎        | 83/625 [00:17<01:54,  4.75it/s]
reward: -5.0364, last reward: -3.8503, gradient norm:  8.948:  13%|█▎        | 84/625 [00:17<01:53,  4.77it/s]
reward: -4.9341, last reward: -6.9319, gradient norm:  119.2:  13%|█▎        | 84/625 [00:18<01:53,  4.77it/s]
reward: -4.9341, last reward: -6.9319, gradient norm:  119.2:  14%|█▎        | 85/625 [00:18<01:52,  4.78it/s]
reward: -5.0693, last reward: -6.4436, gradient norm:  5.28:  14%|█▎        | 85/625 [00:18<01:52,  4.78it/s]
reward: -5.0693, last reward: -6.4436, gradient norm:  5.28:  14%|█▍        | 86/625 [00:18<01:52,  4.78it/s]
reward: -4.9258, last reward: -6.0461, gradient norm:  4.376:  14%|█▍        | 86/625 [00:18<01:52,  4.78it/s]
reward: -4.9258, last reward: -6.0461, gradient norm:  4.376:  14%|█▍        | 87/625 [00:18<01:52,  4.77it/s]
reward: -4.9910, last reward: -4.5681, gradient norm:  25.14:  14%|█▍        | 87/625 [00:18<01:52,  4.77it/s]
reward: -4.9910, last reward: -4.5681, gradient norm:  25.14:  14%|█▍        | 88/625 [00:18<01:52,  4.79it/s]
reward: -5.1716, last reward: -5.3157, gradient norm:  15.5:  14%|█▍        | 88/625 [00:18<01:52,  4.79it/s]
reward: -5.1716, last reward: -5.3157, gradient norm:  15.5:  14%|█▍        | 89/625 [00:18<01:51,  4.79it/s]
reward: -4.9816, last reward: -3.5950, gradient norm:  7.403:  14%|█▍        | 89/625 [00:19<01:51,  4.79it/s]
reward: -4.9816, last reward: -3.5950, gradient norm:  7.403:  14%|█▍        | 90/625 [00:19<01:51,  4.79it/s]
reward: -4.7252, last reward: -4.8815, gradient norm:  10.07:  14%|█▍        | 90/625 [00:19<01:51,  4.79it/s]
reward: -4.7252, last reward: -4.8815, gradient norm:  10.07:  15%|█▍        | 91/625 [00:19<01:51,  4.80it/s]
reward: -4.9986, last reward: -5.8680, gradient norm:  14.26:  15%|█▍        | 91/625 [00:19<01:51,  4.80it/s]
reward: -4.9986, last reward: -5.8680, gradient norm:  14.26:  15%|█▍        | 92/625 [00:19<01:51,  4.79it/s]
reward: -4.9029, last reward: -5.7132, gradient norm:  21.65:  15%|█▍        | 92/625 [00:19<01:51,  4.79it/s]
reward: -4.9029, last reward: -5.7132, gradient norm:  21.65:  15%|█▍        | 93/625 [00:19<01:51,  4.76it/s]
reward: -4.7814, last reward: -6.5231, gradient norm:  27.4:  15%|█▍        | 93/625 [00:19<01:51,  4.76it/s]
reward: -4.7814, last reward: -6.5231, gradient norm:  27.4:  15%|█▌        | 94/625 [00:19<01:51,  4.77it/s]
reward: -4.7013, last reward: -6.0821, gradient norm:  22.53:  15%|█▌        | 94/625 [00:20<01:51,  4.77it/s]
reward: -4.7013, last reward: -6.0821, gradient norm:  22.53:  15%|█▌        | 95/625 [00:20<01:51,  4.77it/s]
reward: -4.3526, last reward: -5.3718, gradient norm:  28.77:  15%|█▌        | 95/625 [00:20<01:51,  4.77it/s]
reward: -4.3526, last reward: -5.3718, gradient norm:  28.77:  15%|█▌        | 96/625 [00:20<01:50,  4.78it/s]
reward: -5.0901, last reward: -5.0493, gradient norm:  8.428:  15%|█▌        | 96/625 [00:20<01:50,  4.78it/s]
reward: -5.0901, last reward: -5.0493, gradient norm:  8.428:  16%|█▌        | 97/625 [00:20<01:50,  4.79it/s]
reward: -4.9341, last reward: -4.0375, gradient norm:  17.1:  16%|█▌        | 97/625 [00:20<01:50,  4.79it/s]
reward: -4.9341, last reward: -4.0375, gradient norm:  17.1:  16%|█▌        | 98/625 [00:20<01:50,  4.78it/s]
reward: -5.0707, last reward: -5.9903, gradient norm:  12.01:  16%|█▌        | 98/625 [00:21<01:50,  4.78it/s]
reward: -5.0707, last reward: -5.9903, gradient norm:  12.01:  16%|█▌        | 99/625 [00:21<01:50,  4.76it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 99/625 [00:21<01:50,  4.76it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 100/625 [00:21<01:49,  4.78it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 100/625 [00:21<01:49,  4.78it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 101/625 [00:21<01:49,  4.77it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▌        | 101/625 [00:21<01:49,  4.77it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▋        | 102/625 [00:21<01:50,  4.75it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 102/625 [00:21<01:50,  4.75it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 103/625 [00:21<01:49,  4.77it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  16%|█▋        | 103/625 [00:22<01:49,  4.77it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  17%|█▋        | 104/625 [00:22<01:49,  4.78it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 104/625 [00:22<01:49,  4.78it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 105/625 [00:22<01:48,  4.78it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 105/625 [00:22<01:48,  4.78it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 106/625 [00:22<01:48,  4.77it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 106/625 [00:22<01:48,  4.77it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 107/625 [00:22<01:48,  4.77it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 107/625 [00:22<01:48,  4.77it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 108/625 [00:22<01:48,  4.77it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 108/625 [00:23<01:48,  4.77it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 109/625 [00:23<01:47,  4.78it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  17%|█▋        | 109/625 [00:23<01:47,  4.78it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  18%|█▊        | 110/625 [00:23<01:47,  4.78it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 110/625 [00:23<01:47,  4.78it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 111/625 [00:23<01:47,  4.78it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 111/625 [00:23<01:47,  4.78it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 112/625 [00:23<01:47,  4.76it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 112/625 [00:23<01:47,  4.76it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 113/625 [00:23<01:47,  4.76it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 113/625 [00:24<01:47,  4.76it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 114/625 [00:24<01:47,  4.77it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 114/625 [00:24<01:47,  4.77it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 115/625 [00:24<01:46,  4.78it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  18%|█▊        | 115/625 [00:24<01:46,  4.78it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  19%|█▊        | 116/625 [00:24<01:46,  4.78it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 116/625 [00:24<01:46,  4.78it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 117/625 [00:24<01:46,  4.79it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▊        | 117/625 [00:24<01:46,  4.79it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▉        | 118/625 [00:24<01:45,  4.79it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 118/625 [00:25<01:45,  4.79it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 119/625 [00:25<01:45,  4.79it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 119/625 [00:25<01:45,  4.79it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 120/625 [00:25<01:45,  4.78it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 120/625 [00:25<01:45,  4.78it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 121/625 [00:25<01:46,  4.75it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  19%|█▉        | 121/625 [00:25<01:46,  4.75it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  20%|█▉        | 122/625 [00:25<01:46,  4.74it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 122/625 [00:26<01:46,  4.74it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 123/625 [00:26<01:46,  4.72it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 123/625 [00:26<01:46,  4.72it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 124/625 [00:26<01:46,  4.71it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|█▉        | 124/625 [00:26<01:46,  4.71it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|██        | 125/625 [00:26<01:46,  4.72it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 125/625 [00:26<01:46,  4.72it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 126/625 [00:26<01:45,  4.74it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 126/625 [00:26<01:45,  4.74it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 127/625 [00:26<01:44,  4.74it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 127/625 [00:27<01:44,  4.74it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 128/625 [00:27<01:44,  4.77it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  20%|██        | 128/625 [00:27<01:44,  4.77it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  21%|██        | 129/625 [00:27<01:43,  4.78it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 129/625 [00:27<01:43,  4.78it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 130/625 [00:27<01:43,  4.78it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 130/625 [00:27<01:43,  4.78it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 131/625 [00:27<01:43,  4.78it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 131/625 [00:27<01:43,  4.78it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 132/625 [00:27<01:42,  4.79it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██        | 132/625 [00:28<01:42,  4.79it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██▏       | 133/625 [00:28<01:42,  4.79it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 133/625 [00:28<01:42,  4.79it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 134/625 [00:28<01:42,  4.80it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  21%|██▏       | 134/625 [00:28<01:42,  4.80it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  22%|██▏       | 135/625 [00:28<01:42,  4.80it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 135/625 [00:28<01:42,  4.80it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 136/625 [00:28<01:41,  4.80it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 136/625 [00:28<01:41,  4.80it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 137/625 [00:28<01:41,  4.80it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 137/625 [00:29<01:41,  4.80it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 138/625 [00:29<01:41,  4.80it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 138/625 [00:29<01:41,  4.80it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 139/625 [00:29<01:41,  4.79it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 139/625 [00:29<01:41,  4.79it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 140/625 [00:29<01:41,  4.76it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  22%|██▏       | 140/625 [00:29<01:41,  4.76it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  23%|██▎       | 141/625 [00:29<01:41,  4.78it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 141/625 [00:30<01:41,  4.78it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 142/625 [00:30<01:40,  4.78it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 142/625 [00:30<01:40,  4.78it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 143/625 [00:30<01:40,  4.79it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 143/625 [00:30<01:40,  4.79it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 144/625 [00:30<01:40,  4.79it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 144/625 [00:30<01:40,  4.79it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 145/625 [00:30<01:40,  4.76it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 145/625 [00:30<01:40,  4.76it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 146/625 [00:30<01:40,  4.74it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  23%|██▎       | 146/625 [00:31<01:40,  4.74it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  24%|██▎       | 147/625 [00:31<01:41,  4.73it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 147/625 [00:31<01:41,  4.73it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 148/625 [00:31<01:40,  4.75it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▎       | 148/625 [00:31<01:40,  4.75it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▍       | 149/625 [00:31<01:40,  4.74it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 149/625 [00:31<01:40,  4.74it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 150/625 [00:31<01:40,  4.74it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 150/625 [00:31<01:40,  4.74it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 151/625 [00:31<01:40,  4.74it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 151/625 [00:32<01:40,  4.74it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 152/625 [00:32<01:39,  4.75it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 152/625 [00:32<01:39,  4.75it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 153/625 [00:32<01:38,  4.77it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  24%|██▍       | 153/625 [00:32<01:38,  4.77it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  25%|██▍       | 154/625 [00:32<01:38,  4.78it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 154/625 [00:32<01:38,  4.78it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 155/625 [00:32<01:38,  4.79it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 155/625 [00:32<01:38,  4.79it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 156/625 [00:32<01:38,  4.78it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▍       | 156/625 [00:33<01:38,  4.78it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▌       | 157/625 [00:33<01:37,  4.79it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 157/625 [00:33<01:37,  4.79it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 158/625 [00:33<01:37,  4.79it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 158/625 [00:33<01:37,  4.79it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 159/625 [00:33<01:37,  4.79it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  25%|██▌       | 159/625 [00:33<01:37,  4.79it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  26%|██▌       | 160/625 [00:33<01:36,  4.80it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 160/625 [00:34<01:36,  4.80it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 161/625 [00:34<01:36,  4.80it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 161/625 [00:34<01:36,  4.80it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 162/625 [00:34<01:36,  4.81it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 162/625 [00:34<01:36,  4.81it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 163/625 [00:34<01:36,  4.81it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 163/625 [00:34<01:36,  4.81it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 164/625 [00:34<01:35,  4.81it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▌       | 164/625 [00:34<01:35,  4.81it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▋       | 165/625 [00:34<01:35,  4.81it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  26%|██▋       | 165/625 [00:35<01:35,  4.81it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  27%|██▋       | 166/625 [00:35<01:35,  4.81it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 166/625 [00:35<01:35,  4.81it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 167/625 [00:35<01:35,  4.80it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 167/625 [00:35<01:35,  4.80it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 168/625 [00:35<01:54,  3.98it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 168/625 [00:35<01:54,  3.98it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 169/625 [00:35<01:48,  4.19it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 169/625 [00:36<01:48,  4.19it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 170/625 [00:36<01:44,  4.36it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 170/625 [00:36<01:44,  4.36it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 171/625 [00:36<01:41,  4.48it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  27%|██▋       | 171/625 [00:36<01:41,  4.48it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  28%|██▊       | 172/625 [00:36<01:39,  4.57it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 172/625 [00:36<01:39,  4.57it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 173/625 [00:36<01:37,  4.63it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 173/625 [00:36<01:37,  4.63it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 174/625 [00:36<01:36,  4.68it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 174/625 [00:37<01:36,  4.68it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 175/625 [00:37<01:35,  4.73it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 175/625 [00:37<01:35,  4.73it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 176/625 [00:37<01:35,  4.72it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 176/625 [00:37<01:35,  4.72it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 177/625 [00:37<01:34,  4.74it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 177/625 [00:37<01:34,  4.74it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 178/625 [00:37<01:34,  4.75it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  28%|██▊       | 178/625 [00:37<01:34,  4.75it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  29%|██▊       | 179/625 [00:37<01:33,  4.75it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▊       | 179/625 [00:38<01:33,  4.75it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▉       | 180/625 [00:38<01:33,  4.76it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 180/625 [00:38<01:33,  4.76it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 181/625 [00:38<01:33,  4.77it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 181/625 [00:38<01:33,  4.77it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 182/625 [00:38<01:33,  4.76it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 182/625 [00:38<01:33,  4.76it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 183/625 [00:38<01:33,  4.73it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 183/625 [00:38<01:33,  4.73it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 184/625 [00:38<01:33,  4.73it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  29%|██▉       | 184/625 [00:39<01:33,  4.73it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  30%|██▉       | 185/625 [00:39<01:32,  4.75it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 185/625 [00:39<01:32,  4.75it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 186/625 [00:39<01:32,  4.76it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 186/625 [00:39<01:32,  4.76it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 187/625 [00:39<01:31,  4.77it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|██▉       | 187/625 [00:39<01:31,  4.77it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|███       | 188/625 [00:39<01:31,  4.76it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 188/625 [00:40<01:31,  4.76it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 189/625 [00:40<01:31,  4.76it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 189/625 [00:40<01:31,  4.76it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 190/625 [00:40<01:31,  4.75it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  30%|███       | 190/625 [00:40<01:31,  4.75it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  31%|███       | 191/625 [00:40<01:31,  4.74it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 191/625 [00:40<01:31,  4.74it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 192/625 [00:40<01:31,  4.74it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 192/625 [00:40<01:31,  4.74it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 193/625 [00:40<01:31,  4.74it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 193/625 [00:41<01:31,  4.74it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 194/625 [00:41<01:30,  4.76it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 194/625 [00:41<01:30,  4.76it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 195/625 [00:41<01:30,  4.77it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███       | 195/625 [00:41<01:30,  4.77it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███▏      | 196/625 [00:41<01:29,  4.78it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  31%|███▏      | 196/625 [00:41<01:29,  4.78it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  32%|███▏      | 197/625 [00:41<01:29,  4.79it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 197/625 [00:41<01:29,  4.79it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 198/625 [00:41<01:29,  4.79it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 198/625 [00:42<01:29,  4.79it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 199/625 [00:42<01:29,  4.78it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 199/625 [00:42<01:29,  4.78it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 200/625 [00:42<01:29,  4.77it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 200/625 [00:42<01:29,  4.77it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 201/625 [00:42<01:28,  4.78it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 201/625 [00:42<01:28,  4.78it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 202/625 [00:42<01:28,  4.79it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 202/625 [00:42<01:28,  4.79it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 203/625 [00:42<01:28,  4.80it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  32%|███▏      | 203/625 [00:43<01:28,  4.80it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  33%|███▎      | 204/625 [00:43<01:27,  4.80it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 204/625 [00:43<01:27,  4.80it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 205/625 [00:43<01:27,  4.79it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 205/625 [00:43<01:27,  4.79it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 206/625 [00:43<01:27,  4.80it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 206/625 [00:43<01:27,  4.80it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 207/625 [00:43<01:27,  4.80it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 207/625 [00:43<01:27,  4.80it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 208/625 [00:43<01:26,  4.80it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 208/625 [00:44<01:26,  4.80it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 209/625 [00:44<01:26,  4.81it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  33%|███▎      | 209/625 [00:44<01:26,  4.81it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  34%|███▎      | 210/625 [00:44<01:26,  4.79it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▎      | 210/625 [00:44<01:26,  4.79it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▍      | 211/625 [00:44<01:26,  4.80it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 211/625 [00:44<01:26,  4.80it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 212/625 [00:44<01:26,  4.80it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 212/625 [00:45<01:26,  4.80it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 213/625 [00:45<01:25,  4.81it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 213/625 [00:45<01:25,  4.81it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 214/625 [00:45<01:25,  4.81it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 214/625 [00:45<01:25,  4.81it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 215/625 [00:45<01:25,  4.79it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  34%|███▍      | 215/625 [00:45<01:25,  4.79it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  35%|███▍      | 216/625 [00:45<01:25,  4.78it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 216/625 [00:45<01:25,  4.78it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 217/625 [00:45<01:25,  4.79it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 217/625 [00:46<01:25,  4.79it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 218/625 [00:46<01:24,  4.79it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▍      | 218/625 [00:46<01:24,  4.79it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▌      | 219/625 [00:46<01:24,  4.79it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 219/625 [00:46<01:24,  4.79it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 220/625 [00:46<01:25,  4.76it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 220/625 [00:46<01:25,  4.76it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 221/625 [00:46<01:24,  4.76it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  35%|███▌      | 221/625 [00:46<01:24,  4.76it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  36%|███▌      | 222/625 [00:46<01:24,  4.76it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 222/625 [00:47<01:24,  4.76it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 223/625 [00:47<01:24,  4.78it/s]
reward: -4.4639, last reward: -5.2648, gradient norm:  24.71:  36%|███▌      | 223/625 [00:47<01:24,  4.78it/s]
reward: -4.4639, last reward: -5.2648, gradient norm:  24.71:  36%|███▌      | 224/625 [00:47<01:23,  4.79it/s]
reward: -4.6842, last reward: -4.6974, gradient norm:  14.15:  36%|███▌      | 224/625 [00:47<01:23,  4.79it/s]
reward: -4.6842, last reward: -4.6974, gradient norm:  14.15:  36%|███▌      | 225/625 [00:47<01:23,  4.79it/s]
reward: -3.8237, last reward: -3.6540, gradient norm:  21.16:  36%|███▌      | 225/625 [00:47<01:23,  4.79it/s]
reward: -3.8237, last reward: -3.6540, gradient norm:  21.16:  36%|███▌      | 226/625 [00:47<01:23,  4.80it/s]
reward: -4.0712, last reward: -4.1515, gradient norm:  7.923:  36%|███▌      | 226/625 [00:47<01:23,  4.80it/s]
reward: -4.0712, last reward: -4.1515, gradient norm:  7.923:  36%|███▋      | 227/625 [00:47<01:22,  4.80it/s]
reward: -4.0174, last reward: -3.0392, gradient norm:  16.69:  36%|███▋      | 227/625 [00:48<01:22,  4.80it/s]
reward: -4.0174, last reward: -3.0392, gradient norm:  16.69:  36%|███▋      | 228/625 [00:48<01:22,  4.80it/s]
reward: -4.0842, last reward: -3.7785, gradient norm:  19.62:  36%|███▋      | 228/625 [00:48<01:22,  4.80it/s]
reward: -4.0842, last reward: -3.7785, gradient norm:  19.62:  37%|███▋      | 229/625 [00:48<01:22,  4.79it/s]
reward: -4.0530, last reward: -4.4058, gradient norm:  16.16:  37%|███▋      | 229/625 [00:48<01:22,  4.79it/s]
reward: -4.0530, last reward: -4.4058, gradient norm:  16.16:  37%|███▋      | 230/625 [00:48<01:22,  4.78it/s]
reward: -4.0566, last reward: -3.0590, gradient norm:  46.33:  37%|███▋      | 230/625 [00:48<01:22,  4.78it/s]
reward: -4.0566, last reward: -3.0590, gradient norm:  46.33:  37%|███▋      | 231/625 [00:48<01:22,  4.79it/s]
reward: -3.8513, last reward: -2.7985, gradient norm:  47.95:  37%|███▋      | 231/625 [00:48<01:22,  4.79it/s]
reward: -3.8513, last reward: -2.7985, gradient norm:  47.95:  37%|███▋      | 232/625 [00:48<01:22,  4.79it/s]
reward: -3.7363, last reward: -3.3588, gradient norm:  6.625:  37%|███▋      | 232/625 [00:49<01:22,  4.79it/s]
reward: -3.7363, last reward: -3.3588, gradient norm:  6.625:  37%|███▋      | 233/625 [00:49<01:21,  4.78it/s]
reward: -3.7676, last reward: -4.5312, gradient norm:  5.029:  37%|███▋      | 233/625 [00:49<01:21,  4.78it/s]
reward: -3.7676, last reward: -4.5312, gradient norm:  5.029:  37%|███▋      | 234/625 [00:49<01:21,  4.78it/s]
reward: -3.7305, last reward: -3.6823, gradient norm:  23.2:  37%|███▋      | 234/625 [00:49<01:21,  4.78it/s]
reward: -3.7305, last reward: -3.6823, gradient norm:  23.2:  38%|███▊      | 235/625 [00:49<01:21,  4.76it/s]
reward: -4.1303, last reward: -4.9328, gradient norm:  19.52:  38%|███▊      | 235/625 [00:49<01:21,  4.76it/s]
reward: -4.1303, last reward: -4.9328, gradient norm:  19.52:  38%|███▊      | 236/625 [00:49<01:21,  4.75it/s]
reward: -4.1665, last reward: -5.0729, gradient norm:  33.78:  38%|███▊      | 236/625 [00:50<01:21,  4.75it/s]
reward: -4.1665, last reward: -5.0729, gradient norm:  33.78:  38%|███▊      | 237/625 [00:50<01:21,  4.77it/s]
reward: -4.1188, last reward: -5.8531, gradient norm:  36.56:  38%|███▊      | 237/625 [00:50<01:21,  4.77it/s]
reward: -4.1188, last reward: -5.8531, gradient norm:  36.56:  38%|███▊      | 238/625 [00:50<01:20,  4.79it/s]
reward: -3.5453, last reward: -2.3132, gradient norm:  10.89:  38%|███▊      | 238/625 [00:50<01:20,  4.79it/s]
reward: -3.5453, last reward: -2.3132, gradient norm:  10.89:  38%|███▊      | 239/625 [00:50<01:20,  4.79it/s]
reward: -3.2605, last reward: -2.8357, gradient norm:  13.73:  38%|███▊      | 239/625 [00:50<01:20,  4.79it/s]
reward: -3.2605, last reward: -2.8357, gradient norm:  13.73:  38%|███▊      | 240/625 [00:50<01:20,  4.76it/s]
reward: -3.7712, last reward: -1.9925, gradient norm:  45.24:  38%|███▊      | 240/625 [00:50<01:20,  4.76it/s]
reward: -3.7712, last reward: -1.9925, gradient norm:  45.24:  39%|███▊      | 241/625 [00:50<01:20,  4.76it/s]
reward: -3.7126, last reward: -2.1642, gradient norm:  6.793:  39%|███▊      | 241/625 [00:51<01:20,  4.76it/s]
reward: -3.7126, last reward: -2.1642, gradient norm:  6.793:  39%|███▊      | 242/625 [00:51<01:20,  4.75it/s]
reward: -3.4435, last reward: -2.1223, gradient norm:  30.3:  39%|███▊      | 242/625 [00:51<01:20,  4.75it/s]
reward: -3.4435, last reward: -2.1223, gradient norm:  30.3:  39%|███▉      | 243/625 [00:51<01:20,  4.75it/s]
reward: -3.8483, last reward: -1.9589, gradient norm:  76.23:  39%|███▉      | 243/625 [00:51<01:20,  4.75it/s]
reward: -3.8483, last reward: -1.9589, gradient norm:  76.23:  39%|███▉      | 244/625 [00:51<01:19,  4.76it/s]
reward: -3.7243, last reward: -3.9248, gradient norm:  77.73:  39%|███▉      | 244/625 [00:51<01:19,  4.76it/s]
reward: -3.7243, last reward: -3.9248, gradient norm:  77.73:  39%|███▉      | 245/625 [00:51<01:19,  4.76it/s]
reward: -4.7954, last reward: -3.4635, gradient norm:  13.38:  39%|███▉      | 245/625 [00:51<01:19,  4.76it/s]
reward: -4.7954, last reward: -3.4635, gradient norm:  13.38:  39%|███▉      | 246/625 [00:51<01:19,  4.77it/s]
reward: -4.6425, last reward: -4.7224, gradient norm:  14.12:  39%|███▉      | 246/625 [00:52<01:19,  4.77it/s]
reward: -4.6425, last reward: -4.7224, gradient norm:  14.12:  40%|███▉      | 247/625 [00:52<01:19,  4.77it/s]
reward: -4.2372, last reward: -4.5707, gradient norm:  21.06:  40%|███▉      | 247/625 [00:52<01:19,  4.77it/s]
reward: -4.2372, last reward: -4.5707, gradient norm:  21.06:  40%|███▉      | 248/625 [00:52<01:18,  4.79it/s]
reward: -3.9959, last reward: -3.4874, gradient norm:  60.15:  40%|███▉      | 248/625 [00:52<01:18,  4.79it/s]
reward: -3.9959, last reward: -3.4874, gradient norm:  60.15:  40%|███▉      | 249/625 [00:52<01:18,  4.78it/s]
reward: -4.0894, last reward: -3.5227, gradient norm:  14.05:  40%|███▉      | 249/625 [00:52<01:18,  4.78it/s]
reward: -4.0894, last reward: -3.5227, gradient norm:  14.05:  40%|████      | 250/625 [00:52<01:18,  4.78it/s]
reward: -4.5161, last reward: -6.4950, gradient norm:  135.6:  40%|████      | 250/625 [00:52<01:18,  4.78it/s]
reward: -4.5161, last reward: -6.4950, gradient norm:  135.6:  40%|████      | 251/625 [00:52<01:18,  4.77it/s]
reward: -4.0824, last reward: -3.0430, gradient norm:  18.15:  40%|████      | 251/625 [00:53<01:18,  4.77it/s]
reward: -4.0824, last reward: -3.0430, gradient norm:  18.15:  40%|████      | 252/625 [00:53<01:18,  4.75it/s]
reward: -4.6468, last reward: -3.6022, gradient norm:  16.69:  40%|████      | 252/625 [00:53<01:18,  4.75it/s]
reward: -4.6468, last reward: -3.6022, gradient norm:  16.69:  40%|████      | 253/625 [00:53<01:18,  4.73it/s]
reward: -4.0601, last reward: -3.4058, gradient norm:  30.29:  40%|████      | 253/625 [00:53<01:18,  4.73it/s]
reward: -4.0601, last reward: -3.4058, gradient norm:  30.29:  41%|████      | 254/625 [00:53<01:18,  4.72it/s]
reward: -4.2424, last reward: -3.7108, gradient norm:  19.45:  41%|████      | 254/625 [00:53<01:18,  4.72it/s]
reward: -4.2424, last reward: -3.7108, gradient norm:  19.45:  41%|████      | 255/625 [00:53<01:18,  4.72it/s]
reward: -3.5179, last reward: -2.3462, gradient norm:  127.3:  41%|████      | 255/625 [00:54<01:18,  4.72it/s]
reward: -3.5179, last reward: -2.3462, gradient norm:  127.3:  41%|████      | 256/625 [00:54<01:17,  4.74it/s]
reward: -3.5197, last reward: -4.0831, gradient norm:  17.4:  41%|████      | 256/625 [00:54<01:17,  4.74it/s]
reward: -3.5197, last reward: -4.0831, gradient norm:  17.4:  41%|████      | 257/625 [00:54<01:17,  4.75it/s]
reward: -3.8827, last reward: -4.6454, gradient norm:  13.75:  41%|████      | 257/625 [00:54<01:17,  4.75it/s]
reward: -3.8827, last reward: -4.6454, gradient norm:  13.75:  41%|████▏     | 258/625 [00:54<01:17,  4.74it/s]
reward: -3.4425, last reward: -2.8616, gradient norm:  30.91:  41%|████▏     | 258/625 [00:54<01:17,  4.74it/s]
reward: -3.4425, last reward: -2.8616, gradient norm:  30.91:  41%|████▏     | 259/625 [00:54<01:16,  4.75it/s]
reward: -3.3707, last reward: -1.6766, gradient norm:  89.46:  41%|████▏     | 259/625 [00:54<01:16,  4.75it/s]
reward: -3.3707, last reward: -1.6766, gradient norm:  89.46:  42%|████▏     | 260/625 [00:54<01:16,  4.76it/s]
reward: -3.7682, last reward: -2.7231, gradient norm:  15.74:  42%|████▏     | 260/625 [00:55<01:16,  4.76it/s]
reward: -3.7682, last reward: -2.7231, gradient norm:  15.74:  42%|████▏     | 261/625 [00:55<01:16,  4.77it/s]
reward: -3.9477, last reward: -3.8103, gradient norm:  14.7:  42%|████▏     | 261/625 [00:55<01:16,  4.77it/s]
reward: -3.9477, last reward: -3.8103, gradient norm:  14.7:  42%|████▏     | 262/625 [00:55<01:16,  4.77it/s]
reward: -3.7253, last reward: -3.3617, gradient norm:  15.5:  42%|████▏     | 262/625 [00:55<01:16,  4.77it/s]
reward: -3.7253, last reward: -3.3617, gradient norm:  15.5:  42%|████▏     | 263/625 [00:55<01:15,  4.77it/s]
reward: -3.8854, last reward: -2.6403, gradient norm:  46.48:  42%|████▏     | 263/625 [00:55<01:15,  4.77it/s]
reward: -3.8854, last reward: -2.6403, gradient norm:  46.48:  42%|████▏     | 264/625 [00:55<01:15,  4.77it/s]
reward: -2.2784, last reward: -0.3983, gradient norm:  2.552:  42%|████▏     | 264/625 [00:55<01:15,  4.77it/s]
reward: -2.2784, last reward: -0.3983, gradient norm:  2.552:  42%|████▏     | 265/625 [00:55<01:15,  4.78it/s]
reward: -3.3063, last reward: -1.4367, gradient norm:  12.58:  42%|████▏     | 265/625 [00:56<01:15,  4.78it/s]
reward: -3.3063, last reward: -1.4367, gradient norm:  12.58:  43%|████▎     | 266/625 [00:56<01:14,  4.79it/s]
reward: -2.9484, last reward: -2.5394, gradient norm:  28.81:  43%|████▎     | 266/625 [00:56<01:14,  4.79it/s]
reward: -2.9484, last reward: -2.5394, gradient norm:  28.81:  43%|████▎     | 267/625 [00:56<01:14,  4.79it/s]
reward: -3.4480, last reward: -4.8011, gradient norm:  69.75:  43%|████▎     | 267/625 [00:56<01:14,  4.79it/s]
reward: -3.4480, last reward: -4.8011, gradient norm:  69.75:  43%|████▎     | 268/625 [00:56<01:14,  4.80it/s]
reward: -3.2181, last reward: -1.7389, gradient norm:  18.54:  43%|████▎     | 268/625 [00:56<01:14,  4.80it/s]
reward: -3.2181, last reward: -1.7389, gradient norm:  18.54:  43%|████▎     | 269/625 [00:56<01:14,  4.81it/s]
reward: -3.5885, last reward: -2.3872, gradient norm:  1.067e+03:  43%|████▎     | 269/625 [00:56<01:14,  4.81it/s]
reward: -3.5885, last reward: -2.3872, gradient norm:  1.067e+03:  43%|████▎     | 270/625 [00:56<01:13,  4.81it/s]
reward: -3.5645, last reward: -2.3470, gradient norm:  10.39:  43%|████▎     | 270/625 [00:57<01:13,  4.81it/s]
reward: -3.5645, last reward: -2.3470, gradient norm:  10.39:  43%|████▎     | 271/625 [00:57<01:13,  4.81it/s]
reward: -3.1180, last reward: -2.9837, gradient norm:  21.35:  43%|████▎     | 271/625 [00:57<01:13,  4.81it/s]
reward: -3.1180, last reward: -2.9837, gradient norm:  21.35:  44%|████▎     | 272/625 [00:57<01:13,  4.81it/s]
reward: -3.0020, last reward: -1.7848, gradient norm:  14.11:  44%|████▎     | 272/625 [00:57<01:13,  4.81it/s]
reward: -3.0020, last reward: -1.7848, gradient norm:  14.11:  44%|████▎     | 273/625 [00:57<01:13,  4.81it/s]
reward: -2.9024, last reward: -1.2560, gradient norm:  48.93:  44%|████▎     | 273/625 [00:57<01:13,  4.81it/s]
reward: -2.9024, last reward: -1.2560, gradient norm:  48.93:  44%|████▍     | 274/625 [00:57<01:12,  4.81it/s]
reward: -2.3769, last reward: -0.9803, gradient norm:  403.4:  44%|████▍     | 274/625 [00:57<01:12,  4.81it/s]
reward: -2.3769, last reward: -0.9803, gradient norm:  403.4:  44%|████▍     | 275/625 [00:57<01:12,  4.81it/s]
reward: -3.1577, last reward: -1.9462, gradient norm:  25.01:  44%|████▍     | 275/625 [00:58<01:12,  4.81it/s]
reward: -3.1577, last reward: -1.9462, gradient norm:  25.01:  44%|████▍     | 276/625 [00:58<01:12,  4.82it/s]
reward: -3.7512, last reward: -3.6302, gradient norm:  47.82:  44%|████▍     | 276/625 [00:58<01:12,  4.82it/s]
reward: -3.7512, last reward: -3.6302, gradient norm:  47.82:  44%|████▍     | 277/625 [00:58<01:12,  4.82it/s]
reward: -3.3241, last reward: -1.4824, gradient norm:  29.08:  44%|████▍     | 277/625 [00:58<01:12,  4.82it/s]
reward: -3.3241, last reward: -1.4824, gradient norm:  29.08:  44%|████▍     | 278/625 [00:58<01:11,  4.82it/s]
reward: -2.8900, last reward: -1.5340, gradient norm:  6.86:  44%|████▍     | 278/625 [00:58<01:11,  4.82it/s]
reward: -2.8900, last reward: -1.5340, gradient norm:  6.86:  45%|████▍     | 279/625 [00:58<01:11,  4.82it/s]
reward: -2.4089, last reward: -0.1335, gradient norm:  1.654:  45%|████▍     | 279/625 [00:59<01:11,  4.82it/s]
reward: -2.4089, last reward: -0.1335, gradient norm:  1.654:  45%|████▍     | 280/625 [00:59<01:11,  4.83it/s]
reward: -2.1500, last reward: -0.0078, gradient norm:  0.7977:  45%|████▍     | 280/625 [00:59<01:11,  4.83it/s]
reward: -2.1500, last reward: -0.0078, gradient norm:  0.7977:  45%|████▍     | 281/625 [00:59<01:11,  4.83it/s]
reward: -2.8219, last reward: -0.0230, gradient norm:  1.073:  45%|████▍     | 281/625 [00:59<01:11,  4.83it/s]
reward: -2.8219, last reward: -0.0230, gradient norm:  1.073:  45%|████▌     | 282/625 [00:59<01:25,  4.00it/s]
reward: -3.3674, last reward: -2.5903, gradient norm:  28.51:  45%|████▌     | 282/625 [00:59<01:25,  4.00it/s]
reward: -3.3674, last reward: -2.5903, gradient norm:  28.51:  45%|████▌     | 283/625 [00:59<01:21,  4.21it/s]
reward: -2.6695, last reward: -1.1400, gradient norm:  9.986:  45%|████▌     | 283/625 [01:00<01:21,  4.21it/s]
reward: -2.6695, last reward: -1.1400, gradient norm:  9.986:  45%|████▌     | 284/625 [01:00<01:17,  4.37it/s]
reward: -3.9000, last reward: -2.8705, gradient norm:  21.76:  45%|████▌     | 284/625 [01:00<01:17,  4.37it/s]
reward: -3.9000, last reward: -2.8705, gradient norm:  21.76:  46%|████▌     | 285/625 [01:00<01:15,  4.50it/s]
reward: -3.3866, last reward: -2.6675, gradient norm:  25.97:  46%|████▌     | 285/625 [01:00<01:15,  4.50it/s]
reward: -3.3866, last reward: -2.6675, gradient norm:  25.97:  46%|████▌     | 286/625 [01:00<01:13,  4.58it/s]
reward: -3.1383, last reward: -2.5193, gradient norm:  28.38:  46%|████▌     | 286/625 [01:00<01:13,  4.58it/s]
reward: -3.1383, last reward: -2.5193, gradient norm:  28.38:  46%|████▌     | 287/625 [01:00<01:12,  4.64it/s]
reward: -1.9981, last reward: -1.1067, gradient norm:  22.2:  46%|████▌     | 287/625 [01:00<01:12,  4.64it/s]
reward: -1.9981, last reward: -1.1067, gradient norm:  22.2:  46%|████▌     | 288/625 [01:00<01:11,  4.69it/s]
reward: -2.4183, last reward: -0.6585, gradient norm:  12.21:  46%|████▌     | 288/625 [01:01<01:11,  4.69it/s]
reward: -2.4183, last reward: -0.6585, gradient norm:  12.21:  46%|████▌     | 289/625 [01:01<01:11,  4.72it/s]
reward: -2.2903, last reward: -0.1044, gradient norm:  1.397:  46%|████▌     | 289/625 [01:01<01:11,  4.72it/s]
reward: -2.2903, last reward: -0.1044, gradient norm:  1.397:  46%|████▋     | 290/625 [01:01<01:10,  4.74it/s]
reward: -2.3470, last reward: -0.0267, gradient norm:  1.381:  46%|████▋     | 290/625 [01:01<01:10,  4.74it/s]
reward: -2.3470, last reward: -0.0267, gradient norm:  1.381:  47%|████▋     | 291/625 [01:01<01:10,  4.76it/s]
reward: -2.4752, last reward: -0.2300, gradient norm:  0.4783:  47%|████▋     | 291/625 [01:01<01:10,  4.76it/s]
reward: -2.4752, last reward: -0.2300, gradient norm:  0.4783:  47%|████▋     | 292/625 [01:01<01:09,  4.78it/s]
reward: -2.2931, last reward: -0.0729, gradient norm:  4.72:  47%|████▋     | 292/625 [01:01<01:09,  4.78it/s]
reward: -2.2931, last reward: -0.0729, gradient norm:  4.72:  47%|████▋     | 293/625 [01:01<01:09,  4.78it/s]
reward: -2.5747, last reward: -0.0695, gradient norm:  2.437:  47%|████▋     | 293/625 [01:02<01:09,  4.78it/s]
reward: -2.5747, last reward: -0.0695, gradient norm:  2.437:  47%|████▋     | 294/625 [01:02<01:09,  4.78it/s]
reward: -2.3089, last reward: -0.0061, gradient norm:  0.6729:  47%|████▋     | 294/625 [01:02<01:09,  4.78it/s]
reward: -2.3089, last reward: -0.0061, gradient norm:  0.6729:  47%|████▋     | 295/625 [01:02<01:08,  4.79it/s]
reward: -2.3122, last reward: -0.0378, gradient norm:  1.651:  47%|████▋     | 295/625 [01:02<01:08,  4.79it/s]
reward: -2.3122, last reward: -0.0378, gradient norm:  1.651:  47%|████▋     | 296/625 [01:02<01:08,  4.79it/s]
reward: -1.8535, last reward: -0.0574, gradient norm:  2.329:  47%|████▋     | 296/625 [01:02<01:08,  4.79it/s]
reward: -1.8535, last reward: -0.0574, gradient norm:  2.329:  48%|████▊     | 297/625 [01:02<01:08,  4.80it/s]
reward: -2.3665, last reward: -0.0111, gradient norm:  0.9808:  48%|████▊     | 297/625 [01:02<01:08,  4.80it/s]
reward: -2.3665, last reward: -0.0111, gradient norm:  0.9808:  48%|████▊     | 298/625 [01:02<01:08,  4.79it/s]
reward: -2.0677, last reward: -0.0970, gradient norm:  5.651:  48%|████▊     | 298/625 [01:03<01:08,  4.79it/s]
reward: -2.0677, last reward: -0.0970, gradient norm:  5.651:  48%|████▊     | 299/625 [01:03<01:08,  4.79it/s]
reward: -2.8268, last reward: -1.0460, gradient norm:  15.6:  48%|████▊     | 299/625 [01:03<01:08,  4.79it/s]
reward: -2.8268, last reward: -1.0460, gradient norm:  15.6:  48%|████▊     | 300/625 [01:03<01:07,  4.79it/s]
reward: -2.2015, last reward: -0.2860, gradient norm:  22.44:  48%|████▊     | 300/625 [01:03<01:07,  4.79it/s]
reward: -2.2015, last reward: -0.2860, gradient norm:  22.44:  48%|████▊     | 301/625 [01:03<01:07,  4.79it/s]
reward: -2.3683, last reward: -0.0137, gradient norm:  1.152:  48%|████▊     | 301/625 [01:03<01:07,  4.79it/s]
reward: -2.3683, last reward: -0.0137, gradient norm:  1.152:  48%|████▊     | 302/625 [01:03<01:07,  4.79it/s]
reward: -1.9836, last reward: -0.0664, gradient norm:  5.29:  48%|████▊     | 302/625 [01:03<01:07,  4.79it/s]
reward: -1.9836, last reward: -0.0664, gradient norm:  5.29:  48%|████▊     | 303/625 [01:03<01:07,  4.79it/s]
reward: -2.1668, last reward: -0.0758, gradient norm:  2.976:  48%|████▊     | 303/625 [01:04<01:07,  4.79it/s]
reward: -2.1668, last reward: -0.0758, gradient norm:  2.976:  49%|████▊     | 304/625 [01:04<01:06,  4.79it/s]
reward: -1.7214, last reward: -0.0275, gradient norm:  2.978:  49%|████▊     | 304/625 [01:04<01:06,  4.79it/s]
reward: -1.7214, last reward: -0.0275, gradient norm:  2.978:  49%|████▉     | 305/625 [01:04<01:06,  4.78it/s]
reward: -2.1655, last reward: -1.0136, gradient norm:  67.86:  49%|████▉     | 305/625 [01:04<01:06,  4.78it/s]
reward: -2.1655, last reward: -1.0136, gradient norm:  67.86:  49%|████▉     | 306/625 [01:04<01:06,  4.78it/s]
reward: -2.9232, last reward: -3.2623, gradient norm:  62.61:  49%|████▉     | 306/625 [01:04<01:06,  4.78it/s]
reward: -2.9232, last reward: -3.2623, gradient norm:  62.61:  49%|████▉     | 307/625 [01:04<01:06,  4.78it/s]
reward: -2.2422, last reward: -2.5996, gradient norm:  90.63:  49%|████▉     | 307/625 [01:05<01:06,  4.78it/s]
reward: -2.2422, last reward: -2.5996, gradient norm:  90.63:  49%|████▉     | 308/625 [01:05<01:06,  4.78it/s]
reward: -2.1574, last reward: -0.0119, gradient norm:  2.67:  49%|████▉     | 308/625 [01:05<01:06,  4.78it/s]
reward: -2.1574, last reward: -0.0119, gradient norm:  2.67:  49%|████▉     | 309/625 [01:05<01:06,  4.79it/s]
reward: -1.7745, last reward: -0.1597, gradient norm:  10.93:  49%|████▉     | 309/625 [01:05<01:06,  4.79it/s]
reward: -1.7745, last reward: -0.1597, gradient norm:  10.93:  50%|████▉     | 310/625 [01:05<01:05,  4.79it/s]
reward: -1.8866, last reward: -0.5739, gradient norm:  59.4:  50%|████▉     | 310/625 [01:05<01:05,  4.79it/s]
reward: -1.8866, last reward: -0.5739, gradient norm:  59.4:  50%|████▉     | 311/625 [01:05<01:06,  4.76it/s]
reward: -2.0082, last reward: -0.0806, gradient norm:  3.376:  50%|████▉     | 311/625 [01:05<01:06,  4.76it/s]
reward: -2.0082, last reward: -0.0806, gradient norm:  3.376:  50%|████▉     | 312/625 [01:05<01:05,  4.76it/s]
reward: -2.0180, last reward: -0.0130, gradient norm:  0.8043:  50%|████▉     | 312/625 [01:06<01:05,  4.76it/s]
reward: -2.0180, last reward: -0.0130, gradient norm:  0.8043:  50%|█████     | 313/625 [01:06<01:05,  4.77it/s]
reward: -2.1591, last reward: -0.1254, gradient norm:  7.212:  50%|█████     | 313/625 [01:06<01:05,  4.77it/s]
reward: -2.1591, last reward: -0.1254, gradient norm:  7.212:  50%|█████     | 314/625 [01:06<01:05,  4.78it/s]
reward: -1.9418, last reward: -0.0125, gradient norm:  0.6393:  50%|█████     | 314/625 [01:06<01:05,  4.78it/s]
reward: -1.9418, last reward: -0.0125, gradient norm:  0.6393:  50%|█████     | 315/625 [01:06<01:04,  4.79it/s]
reward: -2.0906, last reward: -0.0021, gradient norm:  0.7693:  50%|█████     | 315/625 [01:06<01:04,  4.79it/s]
reward: -2.0906, last reward: -0.0021, gradient norm:  0.7693:  51%|█████     | 316/625 [01:06<01:04,  4.79it/s]
reward: -2.1884, last reward: -0.0084, gradient norm:  0.9224:  51%|█████     | 316/625 [01:06<01:04,  4.79it/s]
reward: -2.1884, last reward: -0.0084, gradient norm:  0.9224:  51%|█████     | 317/625 [01:06<01:04,  4.81it/s]
reward: -2.0722, last reward: -0.0024, gradient norm:  0.6936:  51%|█████     | 317/625 [01:07<01:04,  4.81it/s]
reward: -2.0722, last reward: -0.0024, gradient norm:  0.6936:  51%|█████     | 318/625 [01:07<01:03,  4.81it/s]
reward: -2.2271, last reward: -0.0027, gradient norm:  0.3025:  51%|█████     | 318/625 [01:07<01:03,  4.81it/s]
reward: -2.2271, last reward: -0.0027, gradient norm:  0.3025:  51%|█████     | 319/625 [01:07<01:03,  4.81it/s]
reward: -2.0207, last reward: -0.0060, gradient norm:  1.949:  51%|█████     | 319/625 [01:07<01:03,  4.81it/s]
reward: -2.0207, last reward: -0.0060, gradient norm:  1.949:  51%|█████     | 320/625 [01:07<01:03,  4.81it/s]
reward: -1.8973, last reward: -0.0129, gradient norm:  0.6215:  51%|█████     | 320/625 [01:07<01:03,  4.81it/s]
reward: -1.8973, last reward: -0.0129, gradient norm:  0.6215:  51%|█████▏    | 321/625 [01:07<01:03,  4.79it/s]
reward: -1.7585, last reward: -0.0027, gradient norm:  0.5406:  51%|█████▏    | 321/625 [01:07<01:03,  4.79it/s]
reward: -1.7585, last reward: -0.0027, gradient norm:  0.5406:  52%|█████▏    | 322/625 [01:07<01:03,  4.81it/s]
reward: -2.2886, last reward: -0.0517, gradient norm:  10.62:  52%|█████▏    | 322/625 [01:08<01:03,  4.81it/s]
reward: -2.2886, last reward: -0.0517, gradient norm:  10.62:  52%|█████▏    | 323/625 [01:08<01:02,  4.81it/s]
reward: -1.8662, last reward: -0.0046, gradient norm:  2.198:  52%|█████▏    | 323/625 [01:08<01:02,  4.81it/s]
reward: -1.8662, last reward: -0.0046, gradient norm:  2.198:  52%|█████▏    | 324/625 [01:08<01:02,  4.81it/s]
reward: -2.0652, last reward: -0.0135, gradient norm:  2.58:  52%|█████▏    | 324/625 [01:08<01:02,  4.81it/s]
reward: -2.0652, last reward: -0.0135, gradient norm:  2.58:  52%|█████▏    | 325/625 [01:08<01:02,  4.81it/s]
reward: -2.0966, last reward: -0.0214, gradient norm:  1.656:  52%|█████▏    | 325/625 [01:08<01:02,  4.81it/s]
reward: -2.0966, last reward: -0.0214, gradient norm:  1.656:  52%|█████▏    | 326/625 [01:08<01:02,  4.81it/s]
reward: -2.5183, last reward: -0.0011, gradient norm:  0.705:  52%|█████▏    | 326/625 [01:08<01:02,  4.81it/s]
reward: -2.5183, last reward: -0.0011, gradient norm:  0.705:  52%|█████▏    | 327/625 [01:08<01:01,  4.81it/s]
reward: -2.3712, last reward: -0.0457, gradient norm:  1.244:  52%|█████▏    | 327/625 [01:09<01:01,  4.81it/s]
reward: -2.3712, last reward: -0.0457, gradient norm:  1.244:  52%|█████▏    | 328/625 [01:09<01:01,  4.80it/s]
reward: -2.2987, last reward: -0.0218, gradient norm:  1.368:  52%|█████▏    | 328/625 [01:09<01:01,  4.80it/s]
reward: -2.2987, last reward: -0.0218, gradient norm:  1.368:  53%|█████▎    | 329/625 [01:09<01:01,  4.81it/s]
reward: -2.3155, last reward: -0.0095, gradient norm:  0.7518:  53%|█████▎    | 329/625 [01:09<01:01,  4.81it/s]
reward: -2.3155, last reward: -0.0095, gradient norm:  0.7518:  53%|█████▎    | 330/625 [01:09<01:01,  4.81it/s]
reward: -2.1199, last reward: -0.1257, gradient norm:  5.305:  53%|█████▎    | 330/625 [01:09<01:01,  4.81it/s]
reward: -2.1199, last reward: -0.1257, gradient norm:  5.305:  53%|█████▎    | 331/625 [01:09<01:01,  4.80it/s]
reward: -1.9859, last reward: -0.0679, gradient norm:  5.372:  53%|█████▎    | 331/625 [01:10<01:01,  4.80it/s]
reward: -1.9859, last reward: -0.0679, gradient norm:  5.372:  53%|█████▎    | 332/625 [01:10<01:01,  4.79it/s]
reward: -2.4061, last reward: -0.6118, gradient norm:  94.16:  53%|█████▎    | 332/625 [01:10<01:01,  4.79it/s]
reward: -2.4061, last reward: -0.6118, gradient norm:  94.16:  53%|█████▎    | 333/625 [01:10<01:00,  4.80it/s]
reward: -3.0361, last reward: -3.3765, gradient norm:  103.5:  53%|█████▎    | 333/625 [01:10<01:00,  4.80it/s]
reward: -3.0361, last reward: -3.3765, gradient norm:  103.5:  53%|█████▎    | 334/625 [01:10<01:00,  4.80it/s]
reward: -2.2451, last reward: -0.1210, gradient norm:  3.228:  53%|█████▎    | 334/625 [01:10<01:00,  4.80it/s]
reward: -2.2451, last reward: -0.1210, gradient norm:  3.228:  54%|█████▎    | 335/625 [01:10<01:00,  4.79it/s]
reward: -1.8761, last reward: -0.0040, gradient norm:  0.777:  54%|█████▎    | 335/625 [01:10<01:00,  4.79it/s]
reward: -1.8761, last reward: -0.0040, gradient norm:  0.777:  54%|█████▍    | 336/625 [01:10<01:00,  4.79it/s]
reward: -2.9146, last reward: -3.2809, gradient norm:  51.08:  54%|█████▍    | 336/625 [01:11<01:00,  4.79it/s]
reward: -2.9146, last reward: -3.2809, gradient norm:  51.08:  54%|█████▍    | 337/625 [01:11<01:00,  4.79it/s]
reward: -3.0197, last reward: -2.2499, gradient norm:  20.1:  54%|█████▍    | 337/625 [01:11<01:00,  4.79it/s]
reward: -3.0197, last reward: -2.2499, gradient norm:  20.1:  54%|█████▍    | 338/625 [01:11<00:59,  4.80it/s]
reward: -2.9844, last reward: -2.3444, gradient norm:  18.91:  54%|█████▍    | 338/625 [01:11<00:59,  4.80it/s]
reward: -2.9844, last reward: -2.3444, gradient norm:  18.91:  54%|█████▍    | 339/625 [01:11<00:59,  4.80it/s]
reward: -2.4492, last reward: -2.3984, gradient norm:  62.17:  54%|█████▍    | 339/625 [01:11<00:59,  4.80it/s]
reward: -2.4492, last reward: -2.3984, gradient norm:  62.17:  54%|█████▍    | 340/625 [01:11<00:59,  4.81it/s]
reward: -2.1010, last reward: -0.0191, gradient norm:  1.736:  54%|█████▍    | 340/625 [01:11<00:59,  4.81it/s]
reward: -2.1010, last reward: -0.0191, gradient norm:  1.736:  55%|█████▍    | 341/625 [01:11<00:59,  4.80it/s]
reward: -2.6114, last reward: -0.2858, gradient norm:  2.123:  55%|█████▍    | 341/625 [01:12<00:59,  4.80it/s]
reward: -2.6114, last reward: -0.2858, gradient norm:  2.123:  55%|█████▍    | 342/625 [01:12<00:59,  4.79it/s]
reward: -2.4618, last reward: -0.0410, gradient norm:  2.15:  55%|█████▍    | 342/625 [01:12<00:59,  4.79it/s]
reward: -2.4618, last reward: -0.0410, gradient norm:  2.15:  55%|█████▍    | 343/625 [01:12<00:58,  4.81it/s]
reward: -2.5515, last reward: -0.4695, gradient norm:  5.609:  55%|█████▍    | 343/625 [01:12<00:58,  4.81it/s]
reward: -2.5515, last reward: -0.4695, gradient norm:  5.609:  55%|█████▌    | 344/625 [01:12<00:58,  4.81it/s]
reward: -2.8009, last reward: -2.1572, gradient norm:  34.87:  55%|█████▌    | 344/625 [01:12<00:58,  4.81it/s]
reward: -2.8009, last reward: -2.1572, gradient norm:  34.87:  55%|█████▌    | 345/625 [01:12<00:58,  4.82it/s]
reward: -3.2082, last reward: -5.0086, gradient norm:  45.63:  55%|█████▌    | 345/625 [01:12<00:58,  4.82it/s]
reward: -3.2082, last reward: -5.0086, gradient norm:  45.63:  55%|█████▌    | 346/625 [01:12<00:57,  4.82it/s]
reward: -2.8382, last reward: -3.4997, gradient norm:  50.9:  55%|█████▌    | 346/625 [01:13<00:57,  4.82it/s]
reward: -2.8382, last reward: -3.4997, gradient norm:  50.9:  56%|█████▌    | 347/625 [01:13<00:57,  4.83it/s]
reward: -2.4106, last reward: -0.8440, gradient norm:  20.79:  56%|█████▌    | 347/625 [01:13<00:57,  4.83it/s]
reward: -2.4106, last reward: -0.8440, gradient norm:  20.79:  56%|█████▌    | 348/625 [01:13<00:57,  4.82it/s]
reward: -1.9518, last reward: -0.0163, gradient norm:  1.572:  56%|█████▌    | 348/625 [01:13<00:57,  4.82it/s]
reward: -1.9518, last reward: -0.0163, gradient norm:  1.572:  56%|█████▌    | 349/625 [01:13<00:57,  4.82it/s]
reward: -2.0997, last reward: -0.0540, gradient norm:  6.954:  56%|█████▌    | 349/625 [01:13<00:57,  4.82it/s]
reward: -2.0997, last reward: -0.0540, gradient norm:  6.954:  56%|█████▌    | 350/625 [01:13<00:56,  4.82it/s]
reward: -2.0961, last reward: -0.0805, gradient norm:  2.763:  56%|█████▌    | 350/625 [01:13<00:56,  4.82it/s]
reward: -2.0961, last reward: -0.0805, gradient norm:  2.763:  56%|█████▌    | 351/625 [01:13<00:56,  4.82it/s]
reward: -2.0131, last reward: -0.0443, gradient norm:  2.295:  56%|█████▌    | 351/625 [01:14<00:56,  4.82it/s]
reward: -2.0131, last reward: -0.0443, gradient norm:  2.295:  56%|█████▋    | 352/625 [01:14<00:56,  4.83it/s]
reward: -1.5239, last reward: -0.0026, gradient norm:  0.9087:  56%|█████▋    | 352/625 [01:14<00:56,  4.83it/s]
reward: -1.5239, last reward: -0.0026, gradient norm:  0.9087:  56%|█████▋    | 353/625 [01:14<00:56,  4.82it/s]
reward: -2.3815, last reward: -0.0786, gradient norm:  5.712:  56%|█████▋    | 353/625 [01:14<00:56,  4.82it/s]
reward: -2.3815, last reward: -0.0786, gradient norm:  5.712:  57%|█████▋    | 354/625 [01:14<00:56,  4.82it/s]
reward: -2.2704, last reward: -0.0027, gradient norm:  2.876:  57%|█████▋    | 354/625 [01:14<00:56,  4.82it/s]
reward: -2.2704, last reward: -0.0027, gradient norm:  2.876:  57%|█████▋    | 355/625 [01:14<00:55,  4.82it/s]
reward: -2.2578, last reward: -0.0315, gradient norm:  1.772:  57%|█████▋    | 355/625 [01:15<00:55,  4.82it/s]
reward: -2.2578, last reward: -0.0315, gradient norm:  1.772:  57%|█████▋    | 356/625 [01:15<00:55,  4.83it/s]
reward: -2.7637, last reward: -2.6112, gradient norm:  44.13:  57%|█████▋    | 356/625 [01:15<00:55,  4.83it/s]
reward: -2.7637, last reward: -2.6112, gradient norm:  44.13:  57%|█████▋    | 357/625 [01:15<00:55,  4.80it/s]
reward: -2.6214, last reward: -2.8094, gradient norm:  34.44:  57%|█████▋    | 357/625 [01:15<00:55,  4.80it/s]
reward: -2.6214, last reward: -2.8094, gradient norm:  34.44:  57%|█████▋    | 358/625 [01:15<00:55,  4.80it/s]
reward: -2.6773, last reward: -0.9341, gradient norm:  17.79:  57%|█████▋    | 358/625 [01:15<00:55,  4.80it/s]
reward: -2.6773, last reward: -0.9341, gradient norm:  17.79:  57%|█████▋    | 359/625 [01:15<00:55,  4.78it/s]
reward: -2.0646, last reward: -0.0045, gradient norm:  0.8423:  57%|█████▋    | 359/625 [01:15<00:55,  4.78it/s]
reward: -2.0646, last reward: -0.0045, gradient norm:  0.8423:  58%|█████▊    | 360/625 [01:15<00:55,  4.78it/s]
reward: -2.2144, last reward: -0.0755, gradient norm:  2.833:  58%|█████▊    | 360/625 [01:16<00:55,  4.78it/s]
reward: -2.2144, last reward: -0.0755, gradient norm:  2.833:  58%|█████▊    | 361/625 [01:16<00:55,  4.76it/s]
reward: -2.1301, last reward: -0.1504, gradient norm:  4.438:  58%|█████▊    | 361/625 [01:16<00:55,  4.76it/s]
reward: -2.1301, last reward: -0.1504, gradient norm:  4.438:  58%|█████▊    | 362/625 [01:16<00:55,  4.77it/s]
reward: -2.2999, last reward: -0.1190, gradient norm:  3.388:  58%|█████▊    | 362/625 [01:16<00:55,  4.77it/s]
reward: -2.2999, last reward: -0.1190, gradient norm:  3.388:  58%|█████▊    | 363/625 [01:16<00:54,  4.77it/s]
reward: -2.0784, last reward: -0.0349, gradient norm:  1.901:  58%|█████▊    | 363/625 [01:16<00:54,  4.77it/s]
reward: -2.0784, last reward: -0.0349, gradient norm:  1.901:  58%|█████▊    | 364/625 [01:16<00:55,  4.67it/s]
reward: -2.2406, last reward: -0.0235, gradient norm:  1.598:  58%|█████▊    | 364/625 [01:16<00:55,  4.67it/s]
reward: -2.2406, last reward: -0.0235, gradient norm:  1.598:  58%|█████▊    | 365/625 [01:16<00:55,  4.70it/s]
reward: -2.4914, last reward: -0.5533, gradient norm:  18.79:  58%|█████▊    | 365/625 [01:17<00:55,  4.70it/s]
reward: -2.4914, last reward: -0.5533, gradient norm:  18.79:  59%|█████▊    | 366/625 [01:17<00:54,  4.73it/s]
reward: -2.1190, last reward: -1.1747, gradient norm:  50.33:  59%|█████▊    | 366/625 [01:17<00:54,  4.73it/s]
reward: -2.1190, last reward: -1.1747, gradient norm:  50.33:  59%|█████▊    | 367/625 [01:17<00:54,  4.75it/s]
reward: -1.9734, last reward: -0.0011, gradient norm:  6.159:  59%|█████▊    | 367/625 [01:17<00:54,  4.75it/s]
reward: -1.9734, last reward: -0.0011, gradient norm:  6.159:  59%|█████▉    | 368/625 [01:17<00:53,  4.77it/s]
reward: -2.4497, last reward: -0.0361, gradient norm:  1.444:  59%|█████▉    | 368/625 [01:17<00:53,  4.77it/s]
reward: -2.4497, last reward: -0.0361, gradient norm:  1.444:  59%|█████▉    | 369/625 [01:17<00:53,  4.77it/s]
reward: -1.6725, last reward: -0.0607, gradient norm:  2.076:  59%|█████▉    | 369/625 [01:17<00:53,  4.77it/s]
reward: -1.6725, last reward: -0.0607, gradient norm:  2.076:  59%|█████▉    | 370/625 [01:17<00:53,  4.78it/s]
reward: -2.1384, last reward: -0.0464, gradient norm:  1.567:  59%|█████▉    | 370/625 [01:18<00:53,  4.78it/s]
reward: -2.1384, last reward: -0.0464, gradient norm:  1.567:  59%|█████▉    | 371/625 [01:18<00:52,  4.79it/s]
reward: -1.7059, last reward: -0.0138, gradient norm:  1.031:  59%|█████▉    | 371/625 [01:18<00:52,  4.79it/s]
reward: -1.7059, last reward: -0.0138, gradient norm:  1.031:  60%|█████▉    | 372/625 [01:18<00:52,  4.80it/s]
reward: -1.9927, last reward: -0.0054, gradient norm:  0.5594:  60%|█████▉    | 372/625 [01:18<00:52,  4.80it/s]
reward: -1.9927, last reward: -0.0054, gradient norm:  0.5594:  60%|█████▉    | 373/625 [01:18<00:52,  4.80it/s]
reward: -2.4160, last reward: -0.5060, gradient norm:  29.92:  60%|█████▉    | 373/625 [01:18<00:52,  4.80it/s]
reward: -2.4160, last reward: -0.5060, gradient norm:  29.92:  60%|█████▉    | 374/625 [01:18<00:52,  4.81it/s]
reward: -2.5828, last reward: -0.1384, gradient norm:  4.958:  60%|█████▉    | 374/625 [01:18<00:52,  4.81it/s]
reward: -2.5828, last reward: -0.1384, gradient norm:  4.958:  60%|██████    | 375/625 [01:18<00:51,  4.81it/s]
reward: -1.9523, last reward: -0.0269, gradient norm:  1.721:  60%|██████    | 375/625 [01:19<00:51,  4.81it/s]
reward: -1.9523, last reward: -0.0269, gradient norm:  1.721:  60%|██████    | 376/625 [01:19<00:51,  4.82it/s]
reward: -1.8944, last reward: -0.0003, gradient norm:  0.4466:  60%|██████    | 376/625 [01:19<00:51,  4.82it/s]
reward: -1.8944, last reward: -0.0003, gradient norm:  0.4466:  60%|██████    | 377/625 [01:19<00:51,  4.81it/s]
reward: -2.2882, last reward: -0.0140, gradient norm:  1.393:  60%|██████    | 377/625 [01:19<00:51,  4.81it/s]
reward: -2.2882, last reward: -0.0140, gradient norm:  1.393:  60%|██████    | 378/625 [01:19<00:51,  4.81it/s]
reward: -2.2007, last reward: -0.0201, gradient norm:  0.9149:  60%|██████    | 378/625 [01:19<00:51,  4.81it/s]
reward: -2.2007, last reward: -0.0201, gradient norm:  0.9149:  61%|██████    | 379/625 [01:19<00:51,  4.81it/s]
reward: -2.1404, last reward: -0.2498, gradient norm:  0.7904:  61%|██████    | 379/625 [01:20<00:51,  4.81it/s]
reward: -2.1404, last reward: -0.2498, gradient norm:  0.7904:  61%|██████    | 380/625 [01:20<00:50,  4.80it/s]
reward: -1.9428, last reward: -0.0002, gradient norm:  0.3416:  61%|██████    | 380/625 [01:20<00:50,  4.80it/s]
reward: -1.9428, last reward: -0.0002, gradient norm:  0.3416:  61%|██████    | 381/625 [01:20<00:50,  4.80it/s]
reward: -1.6321, last reward: -0.0189, gradient norm:  1.258:  61%|██████    | 381/625 [01:20<00:50,  4.80it/s]
reward: -1.6321, last reward: -0.0189, gradient norm:  1.258:  61%|██████    | 382/625 [01:20<00:50,  4.80it/s]
reward: -1.9240, last reward: -0.0407, gradient norm:  0.8453:  61%|██████    | 382/625 [01:20<00:50,  4.80it/s]
reward: -1.9240, last reward: -0.0407, gradient norm:  0.8453:  61%|██████▏   | 383/625 [01:20<00:50,  4.80it/s]
reward: -1.7657, last reward: -0.1190, gradient norm:  3.86:  61%|██████▏   | 383/625 [01:20<00:50,  4.80it/s]
reward: -1.7657, last reward: -0.1190, gradient norm:  3.86:  61%|██████▏   | 384/625 [01:20<00:50,  4.80it/s]
reward: -2.2517, last reward: -0.0091, gradient norm:  2.363:  61%|██████▏   | 384/625 [01:21<00:50,  4.80it/s]
reward: -2.2517, last reward: -0.0091, gradient norm:  2.363:  62%|██████▏   | 385/625 [01:21<00:49,  4.80it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 385/625 [01:21<00:49,  4.80it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 386/625 [01:21<00:49,  4.80it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 386/625 [01:21<00:49,  4.80it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 387/625 [01:21<00:49,  4.80it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 387/625 [01:21<00:49,  4.80it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 388/625 [01:21<00:49,  4.79it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 388/625 [01:21<00:49,  4.79it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 389/625 [01:21<00:49,  4.80it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 389/625 [01:22<00:49,  4.80it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 390/625 [01:22<00:48,  4.81it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  62%|██████▏   | 390/625 [01:22<00:48,  4.81it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  63%|██████▎   | 391/625 [01:22<00:48,  4.82it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 391/625 [01:22<00:48,  4.82it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 392/625 [01:22<00:48,  4.81it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 392/625 [01:22<00:48,  4.81it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 393/625 [01:22<00:48,  4.81it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 393/625 [01:22<00:48,  4.81it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 394/625 [01:22<00:48,  4.81it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 394/625 [01:23<00:48,  4.81it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 395/625 [01:23<00:47,  4.80it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 395/625 [01:23<00:47,  4.80it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 396/625 [01:23<00:57,  3.98it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  63%|██████▎   | 396/625 [01:23<00:57,  3.98it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  64%|██████▎   | 397/625 [01:23<00:54,  4.20it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 397/625 [01:23<00:54,  4.20it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 398/625 [01:23<00:52,  4.36it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▎   | 398/625 [01:24<00:52,  4.36it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▍   | 399/625 [01:24<00:50,  4.49it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 399/625 [01:24<00:50,  4.49it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 400/625 [01:24<00:49,  4.58it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 400/625 [01:24<00:49,  4.58it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 401/625 [01:24<00:48,  4.64it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 401/625 [01:24<00:48,  4.64it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 402/625 [01:24<00:47,  4.69it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 402/625 [01:24<00:47,  4.69it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 403/625 [01:24<00:46,  4.72it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  64%|██████▍   | 403/625 [01:25<00:46,  4.72it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  65%|██████▍   | 404/625 [01:25<00:46,  4.75it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 404/625 [01:25<00:46,  4.75it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 405/625 [01:25<00:46,  4.77it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 405/625 [01:25<00:46,  4.77it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 406/625 [01:25<00:45,  4.77it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▍   | 406/625 [01:25<00:45,  4.77it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▌   | 407/625 [01:25<00:45,  4.78it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 407/625 [01:26<00:45,  4.78it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 408/625 [01:26<00:45,  4.77it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 408/625 [01:26<00:45,  4.77it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 409/625 [01:26<00:45,  4.75it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  65%|██████▌   | 409/625 [01:26<00:45,  4.75it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  66%|██████▌   | 410/625 [01:26<00:45,  4.75it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 410/625 [01:26<00:45,  4.75it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 411/625 [01:26<00:44,  4.77it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 411/625 [01:26<00:44,  4.77it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 412/625 [01:26<00:44,  4.79it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 412/625 [01:27<00:44,  4.79it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 413/625 [01:27<00:44,  4.78it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 413/625 [01:27<00:44,  4.78it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 414/625 [01:27<00:44,  4.78it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▌   | 414/625 [01:27<00:44,  4.78it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▋   | 415/625 [01:27<00:44,  4.76it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  66%|██████▋   | 415/625 [01:27<00:44,  4.76it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  67%|██████▋   | 416/625 [01:27<00:43,  4.77it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 416/625 [01:27<00:43,  4.77it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 417/625 [01:27<00:43,  4.77it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 417/625 [01:28<00:43,  4.77it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 418/625 [01:28<00:43,  4.79it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 418/625 [01:28<00:43,  4.79it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 419/625 [01:28<00:43,  4.73it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 419/625 [01:28<00:43,  4.73it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 420/625 [01:28<00:43,  4.72it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 420/625 [01:28<00:43,  4.72it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 421/625 [01:28<00:42,  4.75it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  67%|██████▋   | 421/625 [01:28<00:42,  4.75it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  68%|██████▊   | 422/625 [01:28<00:42,  4.77it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 422/625 [01:29<00:42,  4.77it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 423/625 [01:29<00:42,  4.79it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 423/625 [01:29<00:42,  4.79it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 424/625 [01:29<00:41,  4.80it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 424/625 [01:29<00:41,  4.80it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 425/625 [01:29<00:41,  4.80it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 425/625 [01:29<00:41,  4.80it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 426/625 [01:29<00:41,  4.80it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 426/625 [01:29<00:41,  4.80it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 427/625 [01:29<00:41,  4.80it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 427/625 [01:30<00:41,  4.80it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 428/625 [01:30<00:41,  4.80it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  68%|██████▊   | 428/625 [01:30<00:41,  4.80it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  69%|██████▊   | 429/625 [01:30<00:40,  4.80it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▊   | 429/625 [01:30<00:40,  4.80it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▉   | 430/625 [01:30<00:40,  4.81it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 430/625 [01:30<00:40,  4.81it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 431/625 [01:30<00:40,  4.81it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 431/625 [01:31<00:40,  4.81it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 432/625 [01:31<00:40,  4.80it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 432/625 [01:31<00:40,  4.80it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 433/625 [01:31<00:40,  4.79it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 433/625 [01:31<00:40,  4.79it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 434/625 [01:31<00:39,  4.79it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  69%|██████▉   | 434/625 [01:31<00:39,  4.79it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  70%|██████▉   | 435/625 [01:31<00:39,  4.80it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 435/625 [01:31<00:39,  4.80it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 436/625 [01:31<00:39,  4.80it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 436/625 [01:32<00:39,  4.80it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 437/625 [01:32<00:39,  4.79it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|██████▉   | 437/625 [01:32<00:39,  4.79it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|███████   | 438/625 [01:32<00:39,  4.78it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 438/625 [01:32<00:39,  4.78it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 439/625 [01:32<00:38,  4.78it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 439/625 [01:32<00:38,  4.78it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 440/625 [01:32<00:38,  4.80it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  70%|███████   | 440/625 [01:32<00:38,  4.80it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  71%|███████   | 441/625 [01:32<00:38,  4.81it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 441/625 [01:33<00:38,  4.81it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 442/625 [01:33<00:38,  4.80it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 442/625 [01:33<00:38,  4.80it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 443/625 [01:33<00:37,  4.80it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 443/625 [01:33<00:37,  4.80it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 444/625 [01:33<00:37,  4.81it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 444/625 [01:33<00:37,  4.81it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 445/625 [01:33<00:37,  4.79it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████   | 445/625 [01:33<00:37,  4.79it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████▏  | 446/625 [01:33<00:37,  4.78it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  71%|███████▏  | 446/625 [01:34<00:37,  4.78it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  72%|███████▏  | 447/625 [01:34<00:37,  4.80it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 447/625 [01:34<00:37,  4.80it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 448/625 [01:34<00:36,  4.81it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 448/625 [01:34<00:36,  4.81it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 449/625 [01:34<00:36,  4.81it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 449/625 [01:34<00:36,  4.81it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 450/625 [01:34<00:36,  4.79it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 450/625 [01:34<00:36,  4.79it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 451/625 [01:34<00:36,  4.80it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 451/625 [01:35<00:36,  4.80it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 452/625 [01:35<00:36,  4.81it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 452/625 [01:35<00:36,  4.81it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 453/625 [01:35<00:35,  4.81it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  72%|███████▏  | 453/625 [01:35<00:35,  4.81it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  73%|███████▎  | 454/625 [01:35<00:35,  4.82it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 454/625 [01:35<00:35,  4.82it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 455/625 [01:35<00:35,  4.81it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 455/625 [01:36<00:35,  4.81it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 456/625 [01:36<00:35,  4.82it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 456/625 [01:36<00:35,  4.82it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 457/625 [01:36<00:34,  4.81it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 457/625 [01:36<00:34,  4.81it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 458/625 [01:36<00:34,  4.81it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 458/625 [01:36<00:34,  4.81it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 459/625 [01:36<00:34,  4.81it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  73%|███████▎  | 459/625 [01:36<00:34,  4.81it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  74%|███████▎  | 460/625 [01:36<00:34,  4.80it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▎  | 460/625 [01:37<00:34,  4.80it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▍  | 461/625 [01:37<00:34,  4.80it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 461/625 [01:37<00:34,  4.80it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 462/625 [01:37<00:33,  4.80it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 462/625 [01:37<00:33,  4.80it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 463/625 [01:37<00:33,  4.81it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 463/625 [01:37<00:33,  4.81it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 464/625 [01:37<00:33,  4.79it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 464/625 [01:37<00:33,  4.79it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 465/625 [01:37<00:33,  4.79it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  74%|███████▍  | 465/625 [01:38<00:33,  4.79it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  75%|███████▍  | 466/625 [01:38<00:33,  4.79it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 466/625 [01:38<00:33,  4.79it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 467/625 [01:38<00:33,  4.76it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 467/625 [01:38<00:33,  4.76it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 468/625 [01:38<00:32,  4.77it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▍  | 468/625 [01:38<00:32,  4.77it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▌  | 469/625 [01:38<00:32,  4.76it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 469/625 [01:38<00:32,  4.76it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 470/625 [01:38<00:32,  4.77it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 470/625 [01:39<00:32,  4.77it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 471/625 [01:39<00:32,  4.79it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  75%|███████▌  | 471/625 [01:39<00:32,  4.79it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  76%|███████▌  | 472/625 [01:39<00:31,  4.80it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 472/625 [01:39<00:31,  4.80it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 473/625 [01:39<00:31,  4.82it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 473/625 [01:39<00:31,  4.82it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 474/625 [01:39<00:31,  4.82it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 474/625 [01:39<00:31,  4.82it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 475/625 [01:39<00:31,  4.82it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 475/625 [01:40<00:31,  4.82it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 476/625 [01:40<00:30,  4.82it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▌  | 476/625 [01:40<00:30,  4.82it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▋  | 477/625 [01:40<00:30,  4.82it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 477/625 [01:40<00:30,  4.82it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 478/625 [01:40<00:30,  4.81it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  76%|███████▋  | 478/625 [01:40<00:30,  4.81it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  77%|███████▋  | 479/625 [01:40<00:30,  4.81it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 479/625 [01:41<00:30,  4.81it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 480/625 [01:41<00:30,  4.81it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 480/625 [01:41<00:30,  4.81it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 481/625 [01:41<00:29,  4.81it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 481/625 [01:41<00:29,  4.81it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 482/625 [01:41<00:29,  4.82it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 482/625 [01:41<00:29,  4.82it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 483/625 [01:41<00:29,  4.82it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 483/625 [01:41<00:29,  4.82it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 484/625 [01:41<00:29,  4.82it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  77%|███████▋  | 484/625 [01:42<00:29,  4.82it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  78%|███████▊  | 485/625 [01:42<00:29,  4.82it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 485/625 [01:42<00:29,  4.82it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 486/625 [01:42<00:28,  4.82it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 486/625 [01:42<00:28,  4.82it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 487/625 [01:42<00:28,  4.82it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 487/625 [01:42<00:28,  4.82it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 488/625 [01:42<00:28,  4.79it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 488/625 [01:42<00:28,  4.79it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 489/625 [01:42<00:28,  4.77it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 489/625 [01:43<00:28,  4.77it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 490/625 [01:43<00:28,  4.75it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  78%|███████▊  | 490/625 [01:43<00:28,  4.75it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  79%|███████▊  | 491/625 [01:43<00:28,  4.75it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 491/625 [01:43<00:28,  4.75it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 492/625 [01:43<00:27,  4.77it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▊  | 492/625 [01:43<00:27,  4.77it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▉  | 493/625 [01:43<00:27,  4.78it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 493/625 [01:43<00:27,  4.78it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 494/625 [01:43<00:27,  4.78it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 494/625 [01:44<00:27,  4.78it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 495/625 [01:44<00:27,  4.76it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 495/625 [01:44<00:27,  4.76it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 496/625 [01:44<00:27,  4.74it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  79%|███████▉  | 496/625 [01:44<00:27,  4.74it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  80%|███████▉  | 497/625 [01:44<00:27,  4.73it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 497/625 [01:44<00:27,  4.73it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 498/625 [01:44<00:26,  4.73it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 498/625 [01:45<00:26,  4.73it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 499/625 [01:45<00:26,  4.72it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|███████▉  | 499/625 [01:45<00:26,  4.72it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|████████  | 500/625 [01:45<00:26,  4.75it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 500/625 [01:45<00:26,  4.75it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 501/625 [01:45<00:26,  4.74it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 501/625 [01:45<00:26,  4.74it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 502/625 [01:45<00:25,  4.75it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 502/625 [01:45<00:25,  4.75it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 503/625 [01:45<00:25,  4.75it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  80%|████████  | 503/625 [01:46<00:25,  4.75it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  81%|████████  | 504/625 [01:46<00:25,  4.77it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 504/625 [01:46<00:25,  4.77it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 505/625 [01:46<00:25,  4.78it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 505/625 [01:46<00:25,  4.78it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 506/625 [01:46<00:24,  4.79it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 506/625 [01:46<00:24,  4.79it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 507/625 [01:46<00:24,  4.80it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████  | 507/625 [01:46<00:24,  4.80it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████▏ | 508/625 [01:46<00:24,  4.81it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 508/625 [01:47<00:24,  4.81it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 509/625 [01:47<00:24,  4.81it/s]
reward: -2.0120, last reward: -0.0066, gradient norm:  0.4458:  81%|████████▏ | 509/625 [01:47<00:24,  4.81it/s]
reward: -2.0120, last reward: -0.0066, gradient norm:  0.4458:  82%|████████▏ | 510/625 [01:47<00:28,  3.98it/s]
reward: -2.0880, last reward: -0.0170, gradient norm:  0.4251:  82%|████████▏ | 510/625 [01:47<00:28,  3.98it/s]
reward: -2.0880, last reward: -0.0170, gradient norm:  0.4251:  82%|████████▏ | 511/625 [01:47<00:27,  4.19it/s]
reward: -2.7379, last reward: -0.5845, gradient norm:  22.38:  82%|████████▏ | 511/625 [01:47<00:27,  4.19it/s]
reward: -2.7379, last reward: -0.5845, gradient norm:  22.38:  82%|████████▏ | 512/625 [01:47<00:25,  4.36it/s]
reward: -2.5455, last reward: -0.2139, gradient norm:  6.013:  82%|████████▏ | 512/625 [01:48<00:25,  4.36it/s]
reward: -2.5455, last reward: -0.2139, gradient norm:  6.013:  82%|████████▏ | 513/625 [01:48<00:24,  4.49it/s]
reward: -2.4104, last reward: -0.0107, gradient norm:  0.9234:  82%|████████▏ | 513/625 [01:48<00:24,  4.49it/s]
reward: -2.4104, last reward: -0.0107, gradient norm:  0.9234:  82%|████████▏ | 514/625 [01:48<00:24,  4.59it/s]
reward: -1.9657, last reward: -0.0201, gradient norm:  2.032:  82%|████████▏ | 514/625 [01:48<00:24,  4.59it/s]
reward: -1.9657, last reward: -0.0201, gradient norm:  2.032:  82%|████████▏ | 515/625 [01:48<00:23,  4.64it/s]
reward: -2.2164, last reward: -0.0025, gradient norm:  0.2708:  82%|████████▏ | 515/625 [01:48<00:23,  4.64it/s]
reward: -2.2164, last reward: -0.0025, gradient norm:  0.2708:  83%|████████▎ | 516/625 [01:48<00:23,  4.68it/s]
reward: -2.2957, last reward: -0.0005, gradient norm:  0.9441:  83%|████████▎ | 516/625 [01:48<00:23,  4.68it/s]
reward: -2.2957, last reward: -0.0005, gradient norm:  0.9441:  83%|████████▎ | 517/625 [01:48<00:22,  4.71it/s]
reward: -1.9742, last reward: -0.0045, gradient norm:  0.3999:  83%|████████▎ | 517/625 [01:49<00:22,  4.71it/s]
reward: -1.9742, last reward: -0.0045, gradient norm:  0.3999:  83%|████████▎ | 518/625 [01:49<00:22,  4.75it/s]
reward: -2.1574, last reward: -0.0078, gradient norm:  0.8513:  83%|████████▎ | 518/625 [01:49<00:22,  4.75it/s]
reward: -2.1574, last reward: -0.0078, gradient norm:  0.8513:  83%|████████▎ | 519/625 [01:49<00:22,  4.76it/s]
reward: -1.8835, last reward: -0.0095, gradient norm:  0.5518:  83%|████████▎ | 519/625 [01:49<00:22,  4.76it/s]
reward: -1.8835, last reward: -0.0095, gradient norm:  0.5518:  83%|████████▎ | 520/625 [01:49<00:21,  4.78it/s]
reward: -2.4242, last reward: -0.4031, gradient norm:  225.8:  83%|████████▎ | 520/625 [01:49<00:21,  4.78it/s]
reward: -2.4242, last reward: -0.4031, gradient norm:  225.8:  83%|████████▎ | 521/625 [01:49<00:21,  4.78it/s]
reward: -1.9132, last reward: -0.0034, gradient norm:  0.4315:  83%|████████▎ | 521/625 [01:49<00:21,  4.78it/s]
reward: -1.9132, last reward: -0.0034, gradient norm:  0.4315:  84%|████████▎ | 522/625 [01:49<00:21,  4.79it/s]
reward: -2.3352, last reward: -0.0129, gradient norm:  0.2119:  84%|████████▎ | 522/625 [01:50<00:21,  4.79it/s]
reward: -2.3352, last reward: -0.0129, gradient norm:  0.2119:  84%|████████▎ | 523/625 [01:50<00:21,  4.80it/s]
reward: -2.0629, last reward: -0.2873, gradient norm:  7.375:  84%|████████▎ | 523/625 [01:50<00:21,  4.80it/s]
reward: -2.0629, last reward: -0.2873, gradient norm:  7.375:  84%|████████▍ | 524/625 [01:50<00:21,  4.80it/s]
reward: -2.2347, last reward: -0.0025, gradient norm:  0.4424:  84%|████████▍ | 524/625 [01:50<00:21,  4.80it/s]
reward: -2.2347, last reward: -0.0025, gradient norm:  0.4424:  84%|████████▍ | 525/625 [01:50<00:20,  4.80it/s]
reward: -2.2983, last reward: -0.0170, gradient norm:  0.5518:  84%|████████▍ | 525/625 [01:50<00:20,  4.80it/s]
reward: -2.2983, last reward: -0.0170, gradient norm:  0.5518:  84%|████████▍ | 526/625 [01:50<00:20,  4.80it/s]
reward: -1.6817, last reward: -0.0020, gradient norm:  0.4182:  84%|████████▍ | 526/625 [01:50<00:20,  4.80it/s]
reward: -1.6817, last reward: -0.0020, gradient norm:  0.4182:  84%|████████▍ | 527/625 [01:50<00:20,  4.80it/s]
reward: -2.2043, last reward: -0.0008, gradient norm:  0.2703:  84%|████████▍ | 527/625 [01:51<00:20,  4.80it/s]
reward: -2.2043, last reward: -0.0008, gradient norm:  0.2703:  84%|████████▍ | 528/625 [01:51<00:20,  4.81it/s]
reward: -1.8662, last reward: -0.0026, gradient norm:  1.062:  84%|████████▍ | 528/625 [01:51<00:20,  4.81it/s]
reward: -1.8662, last reward: -0.0026, gradient norm:  1.062:  85%|████████▍ | 529/625 [01:51<00:19,  4.80it/s]
reward: -2.1564, last reward: -0.0035, gradient norm:  0.4355:  85%|████████▍ | 529/625 [01:51<00:19,  4.80it/s]
reward: -2.1564, last reward: -0.0035, gradient norm:  0.4355:  85%|████████▍ | 530/625 [01:51<00:19,  4.81it/s]
reward: -2.5856, last reward: -0.0278, gradient norm:  0.4754:  85%|████████▍ | 530/625 [01:51<00:19,  4.81it/s]
reward: -2.5856, last reward: -0.0278, gradient norm:  0.4754:  85%|████████▍ | 531/625 [01:51<00:19,  4.81it/s]
reward: -2.3204, last reward: -0.0163, gradient norm:  0.5904:  85%|████████▍ | 531/625 [01:52<00:19,  4.81it/s]
reward: -2.3204, last reward: -0.0163, gradient norm:  0.5904:  85%|████████▌ | 532/625 [01:52<00:19,  4.82it/s]
reward: -2.6885, last reward: -0.2438, gradient norm:  2.277:  85%|████████▌ | 532/625 [01:52<00:19,  4.82it/s]
reward: -2.6885, last reward: -0.2438, gradient norm:  2.277:  85%|████████▌ | 533/625 [01:52<00:19,  4.82it/s]
reward: -2.2555, last reward: -0.0452, gradient norm:  0.9628:  85%|████████▌ | 533/625 [01:52<00:19,  4.82it/s]
reward: -2.2555, last reward: -0.0452, gradient norm:  0.9628:  85%|████████▌ | 534/625 [01:52<00:18,  4.82it/s]
reward: -3.0695, last reward: -0.7870, gradient norm:  30.08:  85%|████████▌ | 534/625 [01:52<00:18,  4.82it/s]
reward: -3.0695, last reward: -0.7870, gradient norm:  30.08:  86%|████████▌ | 535/625 [01:52<00:18,  4.82it/s]
reward: -2.9792, last reward: -0.7378, gradient norm:  15.69:  86%|████████▌ | 535/625 [01:52<00:18,  4.82it/s]
reward: -2.9792, last reward: -0.7378, gradient norm:  15.69:  86%|████████▌ | 536/625 [01:52<00:18,  4.81it/s]
reward: -3.3185, last reward: -0.8053, gradient norm:  10.1:  86%|████████▌ | 536/625 [01:53<00:18,  4.81it/s]
reward: -3.3185, last reward: -0.8053, gradient norm:  10.1:  86%|████████▌ | 537/625 [01:53<00:18,  4.82it/s]
reward: -3.3615, last reward: -0.7426, gradient norm:  32.47:  86%|████████▌ | 537/625 [01:53<00:18,  4.82it/s]
reward: -3.3615, last reward: -0.7426, gradient norm:  32.47:  86%|████████▌ | 538/625 [01:53<00:18,  4.82it/s]
reward: -2.8675, last reward: -0.8165, gradient norm:  107.7:  86%|████████▌ | 538/625 [01:53<00:18,  4.82it/s]
reward: -2.8675, last reward: -0.8165, gradient norm:  107.7:  86%|████████▌ | 539/625 [01:53<00:17,  4.82it/s]
reward: -2.1532, last reward: -0.0066, gradient norm:  0.5248:  86%|████████▌ | 539/625 [01:53<00:17,  4.82it/s]
reward: -2.1532, last reward: -0.0066, gradient norm:  0.5248:  86%|████████▋ | 540/625 [01:53<00:17,  4.81it/s]
reward: -1.9298, last reward: -0.0014, gradient norm:  0.328:  86%|████████▋ | 540/625 [01:53<00:17,  4.81it/s]
reward: -1.9298, last reward: -0.0014, gradient norm:  0.328:  87%|████████▋ | 541/625 [01:53<00:17,  4.81it/s]
reward: -2.4598, last reward: -0.0155, gradient norm:  0.431:  87%|████████▋ | 541/625 [01:54<00:17,  4.81it/s]
reward: -2.4598, last reward: -0.0155, gradient norm:  0.431:  87%|████████▋ | 542/625 [01:54<00:17,  4.81it/s]
reward: -2.2100, last reward: -0.0003, gradient norm:  0.4409:  87%|████████▋ | 542/625 [01:54<00:17,  4.81it/s]
reward: -2.2100, last reward: -0.0003, gradient norm:  0.4409:  87%|████████▋ | 543/625 [01:54<00:17,  4.81it/s]
reward: -2.0063, last reward: -0.0017, gradient norm:  0.3312:  87%|████████▋ | 543/625 [01:54<00:17,  4.81it/s]
reward: -2.0063, last reward: -0.0017, gradient norm:  0.3312:  87%|████████▋ | 544/625 [01:54<00:16,  4.81it/s]
reward: -2.1692, last reward: -0.0344, gradient norm:  0.6026:  87%|████████▋ | 544/625 [01:54<00:16,  4.81it/s]
reward: -2.1692, last reward: -0.0344, gradient norm:  0.6026:  87%|████████▋ | 545/625 [01:54<00:16,  4.81it/s]
reward: -2.4494, last reward: -0.0029, gradient norm:  0.2738:  87%|████████▋ | 545/625 [01:54<00:16,  4.81it/s]
reward: -2.4494, last reward: -0.0029, gradient norm:  0.2738:  87%|████████▋ | 546/625 [01:54<00:16,  4.81it/s]
reward: -1.9326, last reward: -0.0023, gradient norm:  0.3547:  87%|████████▋ | 546/625 [01:55<00:16,  4.81it/s]
reward: -1.9326, last reward: -0.0023, gradient norm:  0.3547:  88%|████████▊ | 547/625 [01:55<00:16,  4.81it/s]
reward: -2.0056, last reward: -0.0011, gradient norm:  0.4607:  88%|████████▊ | 547/625 [01:55<00:16,  4.81it/s]
reward: -2.0056, last reward: -0.0011, gradient norm:  0.4607:  88%|████████▊ | 548/625 [01:55<00:16,  4.81it/s]
reward: -2.2037, last reward: -0.0005, gradient norm:  0.4285:  88%|████████▊ | 548/625 [01:55<00:16,  4.81it/s]
reward: -2.2037, last reward: -0.0005, gradient norm:  0.4285:  88%|████████▊ | 549/625 [01:55<00:15,  4.82it/s]
reward: -2.2003, last reward: -0.0001, gradient norm:  0.7362:  88%|████████▊ | 549/625 [01:55<00:15,  4.82it/s]
reward: -2.2003, last reward: -0.0001, gradient norm:  0.7362:  88%|████████▊ | 550/625 [01:55<00:15,  4.82it/s]
reward: -1.2650, last reward: -0.0000, gradient norm:  0.2252:  88%|████████▊ | 550/625 [01:55<00:15,  4.82it/s]
reward: -1.2650, last reward: -0.0000, gradient norm:  0.2252:  88%|████████▊ | 551/625 [01:55<00:15,  4.81it/s]
reward: -1.5291, last reward: -0.0001, gradient norm:  0.351:  88%|████████▊ | 551/625 [01:56<00:15,  4.81it/s]
reward: -1.5291, last reward: -0.0001, gradient norm:  0.351:  88%|████████▊ | 552/625 [01:56<00:15,  4.82it/s]
reward: -2.1972, last reward: -0.0454, gradient norm:  6.832:  88%|████████▊ | 552/625 [01:56<00:15,  4.82it/s]
reward: -2.1972, last reward: -0.0454, gradient norm:  6.832:  88%|████████▊ | 553/625 [01:56<00:14,  4.81it/s]
reward: -1.9404, last reward: -0.0000, gradient norm:  0.4075:  88%|████████▊ | 553/625 [01:56<00:14,  4.81it/s]
reward: -1.9404, last reward: -0.0000, gradient norm:  0.4075:  89%|████████▊ | 554/625 [01:56<00:14,  4.82it/s]
reward: -2.3901, last reward: -0.0043, gradient norm:  0.2454:  89%|████████▊ | 554/625 [01:56<00:14,  4.82it/s]
reward: -2.3901, last reward: -0.0043, gradient norm:  0.2454:  89%|████████▉ | 555/625 [01:56<00:14,  4.82it/s]
reward: -2.1442, last reward: -0.0016, gradient norm:  0.398:  89%|████████▉ | 555/625 [01:57<00:14,  4.82it/s]
reward: -2.1442, last reward: -0.0016, gradient norm:  0.398:  89%|████████▉ | 556/625 [01:57<00:14,  4.83it/s]
reward: -2.5808, last reward: -0.0063, gradient norm:  3.177:  89%|████████▉ | 556/625 [01:57<00:14,  4.83it/s]
reward: -2.5808, last reward: -0.0063, gradient norm:  3.177:  89%|████████▉ | 557/625 [01:57<00:14,  4.83it/s]
reward: -2.3110, last reward: -0.1865, gradient norm:  4.909:  89%|████████▉ | 557/625 [01:57<00:14,  4.83it/s]
reward: -2.3110, last reward: -0.1865, gradient norm:  4.909:  89%|████████▉ | 558/625 [01:57<00:13,  4.81it/s]
reward: -2.2579, last reward: -0.0129, gradient norm:  0.3089:  89%|████████▉ | 558/625 [01:57<00:13,  4.81it/s]
reward: -2.2579, last reward: -0.0129, gradient norm:  0.3089:  89%|████████▉ | 559/625 [01:57<00:13,  4.82it/s]
reward: -2.2661, last reward: -0.0258, gradient norm:  1.13:  89%|████████▉ | 559/625 [01:57<00:13,  4.82it/s]
reward: -2.2661, last reward: -0.0258, gradient norm:  1.13:  90%|████████▉ | 560/625 [01:57<00:13,  4.82it/s]
reward: -2.2963, last reward: -0.4148, gradient norm:  11.31:  90%|████████▉ | 560/625 [01:58<00:13,  4.82it/s]
reward: -2.2963, last reward: -0.4148, gradient norm:  11.31:  90%|████████▉ | 561/625 [01:58<00:13,  4.83it/s]
reward: -2.0830, last reward: -0.0138, gradient norm:  0.3773:  90%|████████▉ | 561/625 [01:58<00:13,  4.83it/s]
reward: -2.0830, last reward: -0.0138, gradient norm:  0.3773:  90%|████████▉ | 562/625 [01:58<00:13,  4.80it/s]
reward: -2.0689, last reward: -0.0016, gradient norm:  1.096:  90%|████████▉ | 562/625 [01:58<00:13,  4.80it/s]
reward: -2.0689, last reward: -0.0016, gradient norm:  1.096:  90%|█████████ | 563/625 [01:58<00:12,  4.80it/s]
reward: -2.2374, last reward: -0.0940, gradient norm:  3.178:  90%|█████████ | 563/625 [01:58<00:12,  4.80it/s]
reward: -2.2374, last reward: -0.0940, gradient norm:  3.178:  90%|█████████ | 564/625 [01:58<00:12,  4.79it/s]
reward: -2.4075, last reward: -0.0054, gradient norm:  0.4273:  90%|█████████ | 564/625 [01:58<00:12,  4.79it/s]
reward: -2.4075, last reward: -0.0054, gradient norm:  0.4273:  90%|█████████ | 565/625 [01:58<00:12,  4.80it/s]
reward: -2.5810, last reward: -0.4576, gradient norm:  30.6:  90%|█████████ | 565/625 [01:59<00:12,  4.80it/s]
reward: -2.5810, last reward: -0.4576, gradient norm:  30.6:  91%|█████████ | 566/625 [01:59<00:12,  4.80it/s]
reward: -2.0336, last reward: -0.0071, gradient norm:  0.3727:  91%|█████████ | 566/625 [01:59<00:12,  4.80it/s]
reward: -2.0336, last reward: -0.0071, gradient norm:  0.3727:  91%|█████████ | 567/625 [01:59<00:12,  4.81it/s]
reward: -2.4358, last reward: -0.0337, gradient norm:  2.027:  91%|█████████ | 567/625 [01:59<00:12,  4.81it/s]
reward: -2.4358, last reward: -0.0337, gradient norm:  2.027:  91%|█████████ | 568/625 [01:59<00:11,  4.82it/s]
reward: -2.3988, last reward: -0.0015, gradient norm:  0.4643:  91%|█████████ | 568/625 [01:59<00:11,  4.82it/s]
reward: -2.3988, last reward: -0.0015, gradient norm:  0.4643:  91%|█████████ | 569/625 [01:59<00:11,  4.82it/s]
reward: -2.2093, last reward: -0.0042, gradient norm:  0.2236:  91%|█████████ | 569/625 [01:59<00:11,  4.82it/s]
reward: -2.2093, last reward: -0.0042, gradient norm:  0.2236:  91%|█████████ | 570/625 [01:59<00:11,  4.79it/s]
reward: -1.7894, last reward: -0.0001, gradient norm:  0.5424:  91%|█████████ | 570/625 [02:00<00:11,  4.79it/s]
reward: -1.7894, last reward: -0.0001, gradient norm:  0.5424:  91%|█████████▏| 571/625 [02:00<00:11,  4.78it/s]
reward: -2.0149, last reward: -0.0005, gradient norm:  0.5926:  91%|█████████▏| 571/625 [02:00<00:11,  4.78it/s]
reward: -2.0149, last reward: -0.0005, gradient norm:  0.5926:  92%|█████████▏| 572/625 [02:00<00:11,  4.80it/s]
reward: -2.3232, last reward: -0.0703, gradient norm:  1.67:  92%|█████████▏| 572/625 [02:00<00:11,  4.80it/s]
reward: -2.3232, last reward: -0.0703, gradient norm:  1.67:  92%|█████████▏| 573/625 [02:00<00:10,  4.79it/s]
reward: -1.5762, last reward: -0.0003, gradient norm:  0.3608:  92%|█████████▏| 573/625 [02:00<00:10,  4.79it/s]
reward: -1.5762, last reward: -0.0003, gradient norm:  0.3608:  92%|█████████▏| 574/625 [02:00<00:10,  4.80it/s]
reward: -2.3711, last reward: -0.0000, gradient norm:  0.3172:  92%|█████████▏| 574/625 [02:00<00:10,  4.80it/s]
reward: -2.3711, last reward: -0.0000, gradient norm:  0.3172:  92%|█████████▏| 575/625 [02:00<00:10,  4.80it/s]
reward: -2.3527, last reward: -0.0001, gradient norm:  3.841:  92%|█████████▏| 575/625 [02:01<00:10,  4.80it/s]
reward: -2.3527, last reward: -0.0001, gradient norm:  3.841:  92%|█████████▏| 576/625 [02:01<00:10,  4.81it/s]
reward: -1.9138, last reward: -0.0004, gradient norm:  0.363:  92%|█████████▏| 576/625 [02:01<00:10,  4.81it/s]
reward: -1.9138, last reward: -0.0004, gradient norm:  0.363:  92%|█████████▏| 577/625 [02:01<00:09,  4.81it/s]
reward: -2.3048, last reward: -0.0007, gradient norm:  0.399:  92%|█████████▏| 577/625 [02:01<00:09,  4.81it/s]
reward: -2.3048, last reward: -0.0007, gradient norm:  0.399:  92%|█████████▏| 578/625 [02:01<00:09,  4.81it/s]
reward: -1.9566, last reward: -0.0011, gradient norm:  0.5855:  92%|█████████▏| 578/625 [02:01<00:09,  4.81it/s]
reward: -1.9566, last reward: -0.0011, gradient norm:  0.5855:  93%|█████████▎| 579/625 [02:01<00:09,  4.80it/s]
reward: -2.4461, last reward: -0.0148, gradient norm:  1.622:  93%|█████████▎| 579/625 [02:01<00:09,  4.80it/s]
reward: -2.4461, last reward: -0.0148, gradient norm:  1.622:  93%|█████████▎| 580/625 [02:01<00:09,  4.81it/s]
reward: -2.6084, last reward: -0.0063, gradient norm:  6.955:  93%|█████████▎| 580/625 [02:02<00:09,  4.81it/s]
reward: -2.6084, last reward: -0.0063, gradient norm:  6.955:  93%|█████████▎| 581/625 [02:02<00:09,  4.81it/s]
reward: -3.1225, last reward: -0.7400, gradient norm:  92.97:  93%|█████████▎| 581/625 [02:02<00:09,  4.81it/s]
reward: -3.1225, last reward: -0.7400, gradient norm:  92.97:  93%|█████████▎| 582/625 [02:02<00:08,  4.79it/s]
reward: -3.3131, last reward: -1.9206, gradient norm:  591.2:  93%|█████████▎| 582/625 [02:02<00:08,  4.79it/s]
reward: -3.3131, last reward: -1.9206, gradient norm:  591.2:  93%|█████████▎| 583/625 [02:02<00:08,  4.79it/s]
reward: -2.5562, last reward: -0.2136, gradient norm:  4.752:  93%|█████████▎| 583/625 [02:02<00:08,  4.79it/s]
reward: -2.5562, last reward: -0.2136, gradient norm:  4.752:  93%|█████████▎| 584/625 [02:02<00:08,  4.78it/s]
reward: -1.9200, last reward: -0.0085, gradient norm:  0.5597:  93%|█████████▎| 584/625 [02:03<00:08,  4.78it/s]
reward: -1.9200, last reward: -0.0085, gradient norm:  0.5597:  94%|█████████▎| 585/625 [02:03<00:08,  4.79it/s]
reward: -2.2839, last reward: -0.0135, gradient norm:  0.5916:  94%|█████████▎| 585/625 [02:03<00:08,  4.79it/s]
reward: -2.2839, last reward: -0.0135, gradient norm:  0.5916:  94%|█████████▍| 586/625 [02:03<00:08,  4.80it/s]
reward: -2.1346, last reward: -0.0095, gradient norm:  2.234:  94%|█████████▍| 586/625 [02:03<00:08,  4.80it/s]
reward: -2.1346, last reward: -0.0095, gradient norm:  2.234:  94%|█████████▍| 587/625 [02:03<00:07,  4.79it/s]
reward: -2.2311, last reward: -0.0026, gradient norm:  0.3546:  94%|█████████▍| 587/625 [02:03<00:07,  4.79it/s]
reward: -2.2311, last reward: -0.0026, gradient norm:  0.3546:  94%|█████████▍| 588/625 [02:03<00:07,  4.80it/s]
reward: -1.8353, last reward: -0.0001, gradient norm:  0.4645:  94%|█████████▍| 588/625 [02:03<00:07,  4.80it/s]
reward: -1.8353, last reward: -0.0001, gradient norm:  0.4645:  94%|█████████▍| 589/625 [02:03<00:07,  4.80it/s]
reward: -1.9739, last reward: -0.0033, gradient norm:  2.222:  94%|█████████▍| 589/625 [02:04<00:07,  4.80it/s]
reward: -1.9739, last reward: -0.0033, gradient norm:  2.222:  94%|█████████▍| 590/625 [02:04<00:07,  4.81it/s]
reward: -2.2696, last reward: -0.1279, gradient norm:  3.818:  94%|█████████▍| 590/625 [02:04<00:07,  4.81it/s]
reward: -2.2696, last reward: -0.1279, gradient norm:  3.818:  95%|█████████▍| 591/625 [02:04<00:07,  4.80it/s]
reward: -2.2685, last reward: -0.0089, gradient norm:  0.844:  95%|█████████▍| 591/625 [02:04<00:07,  4.80it/s]
reward: -2.2685, last reward: -0.0089, gradient norm:  0.844:  95%|█████████▍| 592/625 [02:04<00:06,  4.80it/s]
reward: -2.2583, last reward: -0.0056, gradient norm:  0.2895:  95%|█████████▍| 592/625 [02:04<00:06,  4.80it/s]
reward: -2.2583, last reward: -0.0056, gradient norm:  0.2895:  95%|█████████▍| 593/625 [02:04<00:06,  4.80it/s]
reward: -2.3198, last reward: -0.2449, gradient norm:  18.06:  95%|█████████▍| 593/625 [02:04<00:06,  4.80it/s]
reward: -2.3198, last reward: -0.2449, gradient norm:  18.06:  95%|█████████▌| 594/625 [02:04<00:06,  4.79it/s]
reward: -2.2948, last reward: -0.0019, gradient norm:  0.4655:  95%|█████████▌| 594/625 [02:05<00:06,  4.79it/s]
reward: -2.2948, last reward: -0.0019, gradient norm:  0.4655:  95%|█████████▌| 595/625 [02:05<00:06,  4.80it/s]
reward: -2.1368, last reward: -0.1032, gradient norm:  1.97:  95%|█████████▌| 595/625 [02:05<00:06,  4.80it/s]
reward: -2.1368, last reward: -0.1032, gradient norm:  1.97:  95%|█████████▌| 596/625 [02:05<00:06,  4.80it/s]
reward: -2.0820, last reward: -0.0000, gradient norm:  0.2516:  95%|█████████▌| 596/625 [02:05<00:06,  4.80it/s]
reward: -2.0820, last reward: -0.0000, gradient norm:  0.2516:  96%|█████████▌| 597/625 [02:05<00:05,  4.81it/s]
reward: -2.3768, last reward: -0.0006, gradient norm:  0.723:  96%|█████████▌| 597/625 [02:05<00:05,  4.81it/s]
reward: -2.3768, last reward: -0.0006, gradient norm:  0.723:  96%|█████████▌| 598/625 [02:05<00:05,  4.82it/s]
reward: -2.2649, last reward: -0.0010, gradient norm:  0.8623:  96%|█████████▌| 598/625 [02:05<00:05,  4.82it/s]
reward: -2.2649, last reward: -0.0010, gradient norm:  0.8623:  96%|█████████▌| 599/625 [02:05<00:05,  4.81it/s]
reward: -2.5340, last reward: -0.0005, gradient norm:  0.6933:  96%|█████████▌| 599/625 [02:06<00:05,  4.81it/s]
reward: -2.5340, last reward: -0.0005, gradient norm:  0.6933:  96%|█████████▌| 600/625 [02:06<00:05,  4.82it/s]
reward: -2.5290, last reward: -0.0018, gradient norm:  2.335:  96%|█████████▌| 600/625 [02:06<00:05,  4.82it/s]
reward: -2.5290, last reward: -0.0018, gradient norm:  2.335:  96%|█████████▌| 601/625 [02:06<00:04,  4.82it/s]
reward: -2.1673, last reward: -0.0003, gradient norm:  3.073:  96%|█████████▌| 601/625 [02:06<00:04,  4.82it/s]
reward: -2.1673, last reward: -0.0003, gradient norm:  3.073:  96%|█████████▋| 602/625 [02:06<00:04,  4.82it/s]
reward: -2.6205, last reward: -0.0079, gradient norm:  5.206:  96%|█████████▋| 602/625 [02:06<00:04,  4.82it/s]
reward: -2.6205, last reward: -0.0079, gradient norm:  5.206:  96%|█████████▋| 603/625 [02:06<00:04,  4.82it/s]
reward: -5.1828, last reward: -4.6680, gradient norm:  54.94:  96%|█████████▋| 603/625 [02:06<00:04,  4.82it/s]
reward: -5.1828, last reward: -4.6680, gradient norm:  54.94:  97%|█████████▋| 604/625 [02:06<00:04,  4.83it/s]
reward: -5.8211, last reward: -5.8027, gradient norm:  13.15:  97%|█████████▋| 604/625 [02:07<00:04,  4.83it/s]
reward: -5.8211, last reward: -5.8027, gradient norm:  13.15:  97%|█████████▋| 605/625 [02:07<00:04,  4.82it/s]
reward: -6.0052, last reward: -5.2599, gradient norm:  7.317:  97%|█████████▋| 605/625 [02:07<00:04,  4.82it/s]
reward: -6.0052, last reward: -5.2599, gradient norm:  7.317:  97%|█████████▋| 606/625 [02:07<00:03,  4.80it/s]
reward: -5.9510, last reward: -5.8142, gradient norm:  6.936:  97%|█████████▋| 606/625 [02:07<00:03,  4.80it/s]
reward: -5.9510, last reward: -5.8142, gradient norm:  6.936:  97%|█████████▋| 607/625 [02:07<00:03,  4.81it/s]
reward: -5.4776, last reward: -5.6192, gradient norm:  13.72:  97%|█████████▋| 607/625 [02:07<00:03,  4.81it/s]
reward: -5.4776, last reward: -5.6192, gradient norm:  13.72:  97%|█████████▋| 608/625 [02:07<00:03,  4.78it/s]
reward: -5.0379, last reward: -3.9016, gradient norm:  25.06:  97%|█████████▋| 608/625 [02:08<00:03,  4.78it/s]
reward: -5.0379, last reward: -3.9016, gradient norm:  25.06:  97%|█████████▋| 609/625 [02:08<00:03,  4.79it/s]
reward: -2.5771, last reward: -0.1840, gradient norm:  1.342:  97%|█████████▋| 609/625 [02:08<00:03,  4.79it/s]
reward: -2.5771, last reward: -0.1840, gradient norm:  1.342:  98%|█████████▊| 610/625 [02:08<00:03,  4.79it/s]
reward: -2.4566, last reward: -0.3031, gradient norm:  46.21:  98%|█████████▊| 610/625 [02:08<00:03,  4.79it/s]
reward: -2.4566, last reward: -0.3031, gradient norm:  46.21:  98%|█████████▊| 611/625 [02:08<00:02,  4.80it/s]
reward: -2.3758, last reward: -0.0001, gradient norm:  0.6069:  98%|█████████▊| 611/625 [02:08<00:02,  4.80it/s]
reward: -2.3758, last reward: -0.0001, gradient norm:  0.6069:  98%|█████████▊| 612/625 [02:08<00:02,  4.79it/s]
reward: -2.2030, last reward: -0.0016, gradient norm:  0.5892:  98%|█████████▊| 612/625 [02:08<00:02,  4.79it/s]
reward: -2.2030, last reward: -0.0016, gradient norm:  0.5892:  98%|█████████▊| 613/625 [02:08<00:02,  4.79it/s]
reward: -1.9065, last reward: -0.0472, gradient norm:  1.085:  98%|█████████▊| 613/625 [02:09<00:02,  4.79it/s]
reward: -1.9065, last reward: -0.0472, gradient norm:  1.085:  98%|█████████▊| 614/625 [02:09<00:02,  4.79it/s]
reward: -2.7741, last reward: -0.4854, gradient norm:  23.05:  98%|█████████▊| 614/625 [02:09<00:02,  4.79it/s]
reward: -2.7741, last reward: -0.4854, gradient norm:  23.05:  98%|█████████▊| 615/625 [02:09<00:02,  4.80it/s]
reward: -2.3814, last reward: -2.3419, gradient norm:  107.5:  98%|█████████▊| 615/625 [02:09<00:02,  4.80it/s]
reward: -2.3814, last reward: -2.3419, gradient norm:  107.5:  99%|█████████▊| 616/625 [02:09<00:01,  4.81it/s]
reward: -2.7114, last reward: -1.2236, gradient norm:  16.27:  99%|█████████▊| 616/625 [02:09<00:01,  4.81it/s]
reward: -2.7114, last reward: -1.2236, gradient norm:  16.27:  99%|█████████▊| 617/625 [02:09<00:01,  4.82it/s]
reward: -2.3560, last reward: -0.0010, gradient norm:  2.488:  99%|█████████▊| 617/625 [02:09<00:01,  4.82it/s]
reward: -2.3560, last reward: -0.0010, gradient norm:  2.488:  99%|█████████▉| 618/625 [02:09<00:01,  4.79it/s]
reward: -1.7539, last reward: -0.0022, gradient norm:  0.4706:  99%|█████████▉| 618/625 [02:10<00:01,  4.79it/s]
reward: -1.7539, last reward: -0.0022, gradient norm:  0.4706:  99%|█████████▉| 619/625 [02:10<00:01,  4.78it/s]
reward: -1.9285, last reward: -0.0051, gradient norm:  0.3408:  99%|█████████▉| 619/625 [02:10<00:01,  4.78it/s]
reward: -1.9285, last reward: -0.0051, gradient norm:  0.3408:  99%|█████████▉| 620/625 [02:10<00:01,  4.79it/s]
reward: -2.3782, last reward: -0.0073, gradient norm:  0.4432:  99%|█████████▉| 620/625 [02:10<00:01,  4.79it/s]
reward: -2.3782, last reward: -0.0073, gradient norm:  0.4432:  99%|█████████▉| 621/625 [02:10<00:00,  4.79it/s]
reward: -2.0915, last reward: -0.0086, gradient norm:  0.3351:  99%|█████████▉| 621/625 [02:10<00:00,  4.79it/s]
reward: -2.0915, last reward: -0.0086, gradient norm:  0.3351: 100%|█████████▉| 622/625 [02:10<00:00,  4.80it/s]
reward: -2.5187, last reward: -0.1573, gradient norm:  7.866: 100%|█████████▉| 622/625 [02:10<00:00,  4.80it/s]
reward: -2.5187, last reward: -0.1573, gradient norm:  7.866: 100%|█████████▉| 623/625 [02:10<00:00,  4.80it/s]
reward: -2.4126, last reward: -0.0157, gradient norm:  0.8849: 100%|█████████▉| 623/625 [02:11<00:00,  4.80it/s]
reward: -2.4126, last reward: -0.0157, gradient norm:  0.8849: 100%|█████████▉| 624/625 [02:11<00:00,  3.98it/s]
reward: -2.0543, last reward: -0.0045, gradient norm:  0.2265: 100%|█████████▉| 624/625 [02:11<00:00,  3.98it/s]
reward: -2.0543, last reward: -0.0045, gradient norm:  0.2265: 100%|██████████| 625/625 [02:11<00:00,  4.18it/s]
reward: -2.0543, last reward: -0.0045, gradient norm:  0.2265: 100%|██████████| 625/625 [02:11<00:00,  4.75it/s]

Conclusion

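In this tutorial, we learned how to write a stateless environment from scratch. We covered the following topics:

  • The four essential components to take care of when writing an environment (step, reset, seeding, and building the specs). We saw how these methods and classes interact with the TensorDict class;

  • How to use check_env_specs() to test that an environment is written correctly;

  • How to append transforms in the context of stateless environments, and how to write custom transforms;

  • How to train a policy on a fully differentiable simulator.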
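As a compact recap of the stateless idea summarised above, the following is a minimal sketch in plain Python, deliberately without TorchRL or TensorDict. The function name `stateless_step`, the `PARAMS` dictionary, and the constants in it are assumptions chosen for illustration (they mirror common Pendulum-v1 defaults), and the reward shape is likewise an assumed quadratic cost; only the angular-velocity update follows the motion equation given earlier in the tutorial.

```python
import math

# Assumed constants for illustration only (typical Pendulum-v1 defaults).
PARAMS = {"g": 10.0, "m": 1.0, "l": 1.0, "dt": 0.05,
          "max_torque": 2.0, "max_speed": 8.0}


def angle_normalize(th):
    """Wrap an angle into the interval [-pi, pi)."""
    return ((th + math.pi) % (2 * math.pi)) - math.pi


def stateless_step(state, u, params=PARAMS):
    """Map (state, action) to (next_state, reward).

    Nothing is stored between calls: the caller owns the state,
    which is what makes the environment "stateless".
    """
    th, thdot = state
    g, m, l, dt = params["g"], params["m"], params["l"], params["dt"]

    # Clamp the torque to the allowed range.
    u = max(-params["max_torque"], min(params["max_torque"], u))

    # Assumed quadratic cost on angle, angular velocity and torque.
    reward = -(angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2)

    # Angular-velocity update from the tutorial's motion equation.
    new_thdot = thdot + (3 * g / (2 * l) * math.sin(th) + 3.0 / (m * l ** 2) * u) * dt
    new_thdot = max(-params["max_speed"], min(params["max_speed"], new_thdot))

    # Explicit Euler update of the angle using the new velocity.
    new_th = th + new_thdot * dt
    return (new_th, new_thdot), reward


# Short rollout: the caller threads the state through every call.
state = (math.pi, 0.0)  # hanging straight down, at rest
total_reward = 0.0
for _ in range(10):
    state, reward = stateless_step(state, u=2.0)
    total_reward += reward
```

Because the state is passed in explicitly, any intermediate state can be saved and replayed, which is the property the tutorial relies on for resetting at arbitrary points and for batching.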

Total running time of the script: (2 minutes 12.071 seconds)