KLRewardTransform
- class torchrl.envs.transforms.KLRewardTransform(actor: ProbabilisticTensorDictModule, coef=1.0, in_keys=None, out_keys=None, requires_grad=False, log_prob_key: NestedKey = 'sample_log_prob', action_key: NestedKey | None = None, functional: bool | None = None, device: torch.device | None = None)[source]
A transform that adds a KL[pi_current||pi_0] correction term to the reward.
This transform is used to constrain the policy to remain close to its original configuration, which limits overfitting during RLHF fine-tuning.
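For illustration, in the usual RLHF formulation the corrected reward for a transition (s, a) takes the form below; the exact sign convention and key names are those of the installed TorchRL version, so treat this only as the standard textbook expression:

    r_{\mathrm{KL}}(s, a) = r(s, a) - \mathrm{coef} \cdot \mathrm{KL}\big(\pi_{\mathrm{current}}(\cdot \mid s) \,\|\, \pi_0(\cdot \mid s)\big)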
- Parameters:
  - actor (ProbabilisticTensorDictModule) – a probabilistic actor. It must have the following features: it must have a set of input (in_keys) and output keys (out_keys), and it must have a get_dist method that outputs the distribution of the action.
  - coef (float) – the coefficient of the KL term. Defaults to 1.0.
  - in_keys (str or list of str/tuples of str) – the input key where the reward should be fetched. Defaults to "reward".
  - out_keys (str or list of str/tuples of str) – the output key where the reward should be written. Defaults to "reward".
  - requires_grad (bool, optional) – if True, the frozen parameters will consist of differentiable clones of the original parameters. Defaults to False.
Note
If the parameters are not differentiable (the default), they will *not* follow the module when dtype or device casting operations (such as cuda(), to(), etc.) are called. When requires_grad=True, casting operations work as expected.
Examples
>>> import torch
>>> from torchrl.envs.libs.gym import GymEnv
>>> from torchrl.envs import TransformedEnv
>>> from torchrl.envs.transforms import KLRewardTransform
>>> from tensordict.nn import TensorDictModule as Mod, NormalParamExtractor
>>> from torchrl.modules import ProbabilisticActor
>>> from tensordict import TensorDict
>>> from torchrl.modules.distributions import TanhNormal
>>> from torch import nn
>>> base_env = GymEnv("Pendulum-v1")
>>> n_obs = base_env.observation_spec["observation"].shape[-1]
>>> n_act = base_env.action_spec.shape[-1]
>>> module = Mod(
...     nn.Sequential(nn.Linear(n_obs, n_act * 2), NormalParamExtractor()),
...     in_keys=["observation"],
...     out_keys=["loc", "scale"],
... )
>>> actor = ProbabilisticActor(
...     module,
...     in_keys=["loc", "scale"],
...     distribution_class=TanhNormal,
...     return_log_prob=True,
... )
>>> transform = KLRewardTransform(actor, out_keys="reward_kl")
>>> env = TransformedEnv(base_env, transform)
>>> with torch.no_grad():
...     # modify the actor parameters
...     _ = TensorDict(dict(actor.named_parameters()), []).apply_(lambda x: x.data.copy_(x.data + 1))
...     td = env.rollout(3, actor)
>>> # check that rewards have been modified
>>> assert (td.get(("next", "reward")) != td.get(("next", "reward_kl"))).all()
Note
Because the closed-form KL formula is not always available and the parameters of the original distribution may not have been recorded, we use a stochastic estimate of the KL divergence.
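For illustration, the single-sample estimator compares the log-probability of the sampled action under the current (data-collecting) policy with its log-probability under the frozen reference actor. Below is a minimal sketch on plain tensors, following the standard RLHF penalty convention; the helper kl_corrected_reward and its argument names are hypothetical and not part of the TorchRL API:

import torch

def kl_corrected_reward(reward, log_prob_current, log_prob_ref, kl_coef=1.0):
    # Single-sample (Monte Carlo) estimate of KL[pi_current || pi_0]:
    # for an action a ~ pi_current, log pi_current(a) - log pi_0(a) is an unbiased estimate.
    kl_estimate = log_prob_current - log_prob_ref
    # Penalize the reward in proportion to the estimated divergence.
    return reward - kl_coef * kl_estimate

reward = torch.tensor([1.0, 0.5])
log_prob_current = torch.tensor([-0.1, -0.3])  # log-prob of the sampled actions under the current policy
log_prob_ref = torch.tensor([-0.4, -0.2])      # log-prob of the same actions under the frozen reference actor
print(kl_corrected_reward(reward, log_prob_current, log_prob_ref, kl_coef=0.1))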
- forward(tensordict: TensorDictBase) → TensorDictBase[source]
Reads the input tensordict and, for the selected keys, applies the transform.
By default, this method:
- calls _apply_transform() directly.
- does not call _step() or _call().
This method is not called within env.step at any point. However, it is called within sample().
Note
forward also works with regular keyword arguments, using dispatch to cast the argument names to the keys.
Examples
>>> class TransformThatMeasuresBytes(Transform):
...     '''Measures the number of bytes in the tensordict, and writes it under `"bytes"`.'''
...     def __init__(self):
...         super().__init__(in_keys=[], out_keys=["bytes"])
...
...     def forward(self, tensordict: TensorDictBase) -> TensorDictBase:
...         bytes_in_td = tensordict.bytes()
...         tensordict["bytes"] = bytes_in_td
...         return tensordict
>>> t = TransformThatMeasuresBytes()
>>> env = env.append_transform(t)  # works within envs
>>> t(TensorDict(a=0))  # Works offline too.