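torchrl.modules package

TensorDict modules: Actors, exploration, value models and generative models

TorchRL offers a series of module wrappers that make it easy to build RL models from scratch. These wrappers are based exclusively on tensordict.nn.TensorDictModule and tensordict.nn.TensorDictSequential. They fall roughly into three categories: policies (actors), including exploration strategies; value models; and simulation models (in model-based settings).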

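The main features are:

Integration of specs into your model, to ensure that the model output matches the input the environment expects;
Probabilistic modules that can automatically sample from a chosen distribution and/or return the distribution of interest;
Dedicated containers for Q-value learning, model-based agents and others.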

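TensorDictModules and SafeModules

TorchRL's SafeModule lets you check that your model output matches what the environment expects. Use it whenever a model is reused across multiple environments and you want to make sure that its outputs (e.g. the action) always satisfy the bounds imposed by the environment. Here is an example of this feature with the Actor class: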

>>> from torch import nn
>>> from torchrl.envs import GymEnv
>>> from torchrl.modules import Actor
>>> env = GymEnv("Pendulum-v1")
>>> action_spec = env.action_spec
>>> model = nn.LazyLinear(action_spec.shape[-1])
>>> policy = Actor(model, in_keys=["observation"], spec=action_spec, safe=True)

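The `safe` flag ensures that the output always lies within the bounds of the `action_spec` domain: if the network output violates these bounds, it is projected (in an L1 manner) onto the desired domain.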
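
For a box-bounded action spec, the projection that `safe=True` performs can be pictured as a per-coordinate clip. The sketch below is plain Python rather than TorchRL code: `project_to_box`, `low` and `high` are illustrative names, and the Pendulum-v1 torque bounds of ±2 serve only as an example.

```python
def project_to_box(action, low, high):
    """Clip each coordinate of `action` into [low_i, high_i]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]

# Example bounds (the Pendulum-v1 torque lies in [-2, 2]).
low, high = [-2.0], [2.0]
print(project_to_box([3.7], low, high))   # [2.0]
print(project_to_box([-0.4], low, high))  # [-0.4]
```

An action that already satisfies the bounds is returned unchanged, so the projection only intervenes when the network output leaves the domain.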

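`Actor`(*args, **kwargs)	General class for deterministic actors in RL.
`MultiStepActorWrapper`(*args, **kwargs)	A wrapper around a multi-action actor.
`SafeModule`(*args, **kwargs)	`tensordict.nn.TensorDictModule` subclass that accepts a `TensorSpec` as argument to control the output domain.
`SafeSequential`(*args, **kwargs)	A safe sequence of TensorDictModules.
`TanhModule`(*args, **kwargs)	A Tanh module for deterministic policies with a bounded action space.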
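
Where clipping is too abrupt, a deterministic policy can instead squash its raw output smoothly into the action bounds, which is the idea behind `TanhModule`. The plain-Python sketch below illustrates that mapping. `tanh_squash` is a hypothetical helper, and the affine rescaling to `[low, high]` is an assumption about the usual construction, not a copy of the library's code.

```python
import math

def tanh_squash(x, low=-1.0, high=1.0):
    """Map an unbounded value x into the range between low and high."""
    return low + (high - low) * (math.tanh(x) + 1.0) / 2.0

for raw in (-10.0, 0.0, 0.3, 10.0):
    print(raw, tanh_squash(raw, low=-2.0, high=2.0))
```

Unlike clipping, the mapping is monotonic and differentiable everywhere, so gradients still flow when the raw output is far outside the bounds.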

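Exploration wrappers and modules

To explore the environment efficiently, TorchRL provides a series of modules that overwrite the action sampled by the policy with a noisier version. Their behavior is controlled by `exploration_type()`: exploration is active only when the exploration type is set to `ExplorationType.RANDOM`. In all other cases, the action written to the tensordict is simply the network output.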
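
This switch can be sketched in plain Python with an epsilon-greedy rule: under the random exploration type the action is occasionally replaced by a random one, and in every other mode the policy's action passes through untouched. The string constants and the `explore` function below are illustrative stand-ins, not the `ExplorationType` enum or `EGreedyModule` themselves.

```python
import random

RANDOM, DETERMINISTIC = "random", "deterministic"  # stand-ins for exploration types

def explore(action, n_actions, eps, exploration_type, rng):
    """Return a possibly randomized action index."""
    if exploration_type == RANDOM and rng.random() < eps:
        return rng.randrange(n_actions)  # overwrite the policy's choice
    return action                        # keep the network output

rng = random.Random(0)
greedy_action = 2
explored = [explore(greedy_action, 4, 0.5, RANDOM, rng) for _ in range(1000)]
untouched = [explore(greedy_action, 4, 0.5, DETERMINISTIC, rng) for _ in range(1000)]
print(sum(a != greedy_action for a in explored))   # a sizeable fraction differs
print(sum(a != greedy_action for a in untouched))  # 0
```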

Note

Unlike the other exploration modules, `ConsistentDropoutModule` uses the `train`/`eval` mode, in line with the standard Dropout API in PyTorch. The `set_exploration_type()` context manager has no effect on this module.

`AdditiveGaussianModule`(*args, **kwargs)	Additive Gaussian PO module.
`ConsistentDropoutModule`(*args, **kwargs)	A TensorDictModule wrapper for `ConsistentDropout`.
`EGreedyModule`(*args, **kwargs)	Epsilon-greedy exploration module.
`OrnsteinUhlenbeckProcessModule`(*args, **kwargs)	Ornstein-Uhlenbeck exploration policy module.
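
Among the modules listed above, Ornstein-Uhlenbeck exploration adds temporally correlated noise rather than independent noise at each step. The sketch below simulates the standard discretized process in plain Python. The parameter names `theta`, `mu`, `sigma` and `dt` follow common textbook notation and are not claimed to match the signature of `OrnsteinUhlenbeckProcessModule`.

```python
import math
import random

def ou_step(x, rng, theta=0.15, mu=0.0, sigma=0.2, dt=0.01):
    """One Euler-Maruyama step of dx = theta * (mu - x) * dt + sigma * dW."""
    return x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
noise, trajectory = 0.0, []
for _ in range(5):
    noise = ou_step(noise, rng)   # each value starts from the previous one
    trajectory.append(noise)
print(trajectory)
```

Because every step starts from the previous noise value, consecutive perturbations are correlated, which tends to produce smoother exploratory motion in continuous-control tasks.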

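Probabilistic actors

Some algorithms, such as PPO, require a probabilistic policy. In TorchRL, such a policy takes the form of a model followed by a distribution constructor.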

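Note

The choice between a probabilistic and a regular actor depends on the algorithm being implemented. On-policy algorithms usually require a probabilistic actor, whereas off-policy algorithms usually have a deterministic actor with an extra exploration strategy. There are, however, many exceptions to this rule.

The model reads an input (typically some observation from the environment) and outputs the parameters of a distribution, while the distribution constructor reads these parameters and draws a random sample from the distribution and/or provides a `torch.distributions.Distribution` object.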

>>> from tensordict.nn import NormalParamExtractor, TensorDictSequential, TensorDictModule
>>> from torchrl.modules import SafeProbabilisticModule
>>> from torchrl.envs import GymEnv
>>> from torch.distributions import Normal
>>> from torch import nn
>>>
>>> env = GymEnv("Pendulum-v1")
>>> action_spec = env.action_spec
>>> model = nn.Sequential(nn.LazyLinear(action_spec.shape[-1] * 2), NormalParamExtractor())
>>> # build the first module, which maps the observation on the mean and sd of the normal distribution
>>> model = TensorDictModule(model, in_keys=["observation"], out_keys=["loc", "scale"])
>>> # build the distribution constructor
>>> prob_module = SafeProbabilisticModule(
...     in_keys=["loc", "scale"],
...     out_keys=["action"],
...     distribution_class=Normal,
...     return_log_prob=True,
...     spec=action_spec,
... )
>>> policy = TensorDictSequential(model, prob_module)
>>> # execute a rollout
>>> env.rollout(3, policy)

To make probabilistic policies easier to build, a dedicated ProbabilisticActor is provided:

>>> from torchrl.modules import ProbabilisticActor
>>> policy = ProbabilisticActor(
...     model,
...     in_keys=["loc", "scale"],
...     out_keys=["action"],
...     distribution_class=Normal,
...     return_log_prob=True,
...     spec=action_spec,
... )

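It removes the need to specify a constructor and to chain it with the module in a sequence.

The output of this policy contains the "loc" and "scale" entries, an "action" sampled from the normal distribution, and the log-probability of that action.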
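
To make the contents of that output concrete, the following plain-Python sketch draws an action from a normal distribution with given `loc` and `scale` and evaluates its log-probability with the closed-form Gaussian density. The helper `normal_log_prob` and the numeric values are hypothetical; in TorchRL these quantities come from `torch.distributions.Normal`.

```python
import math
import random

def normal_log_prob(x, loc, scale):
    """Log-density of N(loc, scale^2) evaluated at x."""
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(0)
loc, scale = 0.3, 0.5                            # would be produced by the network
action = rng.gauss(loc, scale)                   # the sampled "action" entry
log_prob = normal_log_prob(action, loc, scale)   # the log-probability entry
print({"loc": loc, "scale": scale, "action": action, "log_prob": log_prob})
```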

`ProbabilisticActor`(*args, **kwargs)	General class for probabilistic actors in RL.
`SafeProbabilisticModule`(*args, **kwargs)	`tensordict.nn.ProbabilisticTensorDictModule` subclass that accepts a `TensorSpec` as argument to control the output domain.
`SafeProbabilisticTensorDictSequential`(*args, ..., **kwargs)	`tensordict.nn.ProbabilisticTensorDictSequential` subclass that accepts a `TensorSpec` as argument to control the output domain.

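Q-value actors

A Q-value actor is a policy that selects the action with the highest value (or "quality") among the state-action pairs available in the current state. This value can be represented as a table or as a function. For discrete action spaces with continuous states, it is customary to represent this function with a non-linear model such as a neural network.

QValueActor

The QValueActor class takes a module and an action spec, and outputs the selected action and its corresponding Q-value.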

>>> import torch
>>> from tensordict import TensorDict
>>> from torch import nn
>>> from torchrl.data import OneHot
>>> from torchrl.modules.tensordict_module.actors import QValueActor
>>> # Create a tensor dict with an observation
>>> td = TensorDict({'observation': torch.randn(5, 3)}, [5])
>>> # Define the action space
>>> action_spec = OneHot(4)
>>> # Create a linear module to output action values
>>> module = nn.Linear(3, 4)
>>> # Create a QValueActor instance
>>> qvalue_actor = QValueActor(module=module, spec=action_spec)
>>> # Run the actor on the tensor dict
>>> qvalue_actor(td)
>>> print(td)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
        action_value: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False),
        chosen_action_value: Tensor(shape=torch.Size([5, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        observation: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([5]),
    device=None,
    is_shared=False)

This outputs a tensordict containing the selected action and its corresponding Q-value.
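
The relation between the `action_value`, `action` and `chosen_action_value` entries in the printout can be reproduced by hand for a single sample. The following plain-Python sketch performs the greedy selection for a one-hot action space. `greedy_one_hot` is a hypothetical helper, not part of TorchRL.

```python
def greedy_one_hot(action_values):
    """Return (one-hot action, chosen value) for a list of Q-values."""
    best = max(range(len(action_values)), key=lambda i: action_values[i])
    one_hot = [int(i == best) for i in range(len(action_values))]
    return one_hot, action_values[best]

q_values = [0.1, 0.5, 0.2, 0.0]   # one row of "action_value"
action, chosen = greedy_one_hot(q_values)
print(action)  # [0, 1, 0, 0]
print(chosen)  # 0.5
```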

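Distributional Q-learning

Distributional Q-learning is a variant of Q-learning that represents the value function as a probability distribution over possible values rather than as a single scalar. This lets the agent learn about the uncertainty in the environment and make more informed decisions. In TorchRL, distributional Q-learning is implemented with the `DistributionalQValueActor` class. This class takes a module, an action spec and a support vector, and outputs the selected action and its corresponding Q-value distribution.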

>>> import torch
>>> from tensordict import TensorDict
>>> from torch import nn
>>> from torchrl.data import OneHot
>>> from torchrl.modules import DistributionalQValueActor, MLP
>>> # Create a tensor dict with an observation
>>> td = TensorDict({'observation': torch.randn(5, 4)}, [5])
>>> # Define the action space
>>> action_spec = OneHot(4)
>>> # Define the number of bins for the value distribution
>>> nbins = 3
>>> # Create an MLP module to output logits for the value distribution
>>> module = MLP(out_features=(nbins, 4), depth=2)
>>> # Create a DistributionalQValueActor instance
>>> qvalue_actor = DistributionalQValueActor(module=module, spec=action_spec, support=torch.arange(nbins))
>>> # Run the actor on the tensor dict
>>> td = qvalue_actor(td)
>>> print(td)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.int64, is_shared=False),
        action_value: Tensor(shape=torch.Size([5, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False),
        observation: Tensor(shape=torch.Size([5, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([5]),
    device=None,
    is_shared=False)

This outputs a tensordict containing the selected action and its corresponding Q-value distribution.
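
One way to see how a value distribution leads to an action choice is to compute the expected value of each action over the support and pick the largest. The sketch below does this in plain Python for a single sample, assuming the network emits one score per (bin, action) pair that is normalized with a softmax over bins. That convention, and the helper `expected_q`, are assumptions made for illustration; the exact output format the library expects (logits versus log-probabilities) should be checked against the class documentation.

```python
import math

def expected_q(logits, support):
    """logits[b][a]: score of bin b for action a; returns one expected value per action."""
    n_bins, n_actions = len(logits), len(logits[0])
    values = []
    for a in range(n_actions):
        column = [logits[b][a] for b in range(n_bins)]
        top = max(column)                                # for numerical stability
        weights = [math.exp(v - top) for v in column]    # unnormalized softmax over bins
        total = sum(weights)
        values.append(sum(w / total * s for w, s in zip(weights, support)))
    return values

support = [0.0, 1.0, 2.0]                        # nbins = 3, as in the example above
logits = [[0.0, 5.0], [0.0, 0.0], [0.0, -5.0]]   # 3 bins x 2 actions
values = expected_q(logits, support)
best_action = max(range(len(values)), key=lambda a: values[a])
print(values, best_action)
```

Here the first action spreads its mass evenly over the support, while the second concentrates on the lowest bin, so the first action has the higher expected value and is selected.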

`QValueActor`(*args, **kwargs)	A Q-value actor class.
`QValueModule`(*args, **kwargs)	Q-value TensorDictModule for Q-value policies.
`DistributionalQValueActor`(*args, **kwargs)	A distributional DQN actor class.
`DistributionalQValueModule`(*args, **kwargs)	Distributional Q-value hook for Q-value policies.

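Value operators and joint models

TorchRL provides a series of value operators that wrap value networks to smooth the interface with the rest of the library. The basic building block is `torchrl.modules.tensordict_module.ValueOperator`: given an input state (and possibly an action), it automatically writes a "state_value" (or "state_action_value") entry in the tensordict, depending on the input. The same class therefore handles both value networks and Q-value networks. Three further classes combine policy and value networks:

`ActorCriticOperator` is a joint actor-critic network with shared parameters: it reads an observation, passes it through a common backbone, writes a hidden state, feeds this hidden state to the policy, then takes the hidden state and the action and returns the Q-value of the state-action pair.
`ActorValueOperator` is a joint actor-value network with shared parameters: it reads an observation, passes it through a common backbone, writes a hidden state, and feeds this hidden state to the policy and value modules to output an action and a state value.
`ActorCriticWrapper` is a joint actor and value network without shared parameters. It is mainly intended as a drop-in replacement for `ActorValueOperator` when a script needs to support both options.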

>>> actor = make_actor()
>>> value = make_value()
>>> if shared_params:
...     common = make_common()
...     model = ActorValueOperator(common, actor, value)
... else:
...     model = ActorCriticWrapper(actor, value)
>>> policy = model.get_policy_operator()  # will work in both cases
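
The data flow of the shared-parameter case can be mimicked with ordinary dictionaries standing in for tensordicts. In the sketch below, `common`, `actor`, `value` and `run` are toy, hypothetical functions that only show which entries each stage reads and writes. They are not the TorchRL classes, and the arithmetic inside them is arbitrary.

```python
def common(td):
    td["hidden"] = [2.0 * x for x in td["observation"]]  # shared features
    return td

def actor(td):
    td["action"] = sum(td["hidden"])                     # toy policy head
    return td

def value(td):
    td["state_value"] = max(td["hidden"])                # toy value head
    return td

def run(td, *modules):
    """Apply modules in order; each reads and writes entries of the same dict."""
    for module in modules:
        td = module(td)
    return td

full = run({"observation": [1.0, -0.5]}, common, actor, value)
policy_only = run({"observation": [1.0, -0.5]}, common, actor)
print(sorted(full))         # ['action', 'hidden', 'observation', 'state_value']
print(sorted(policy_only))  # ['action', 'hidden', 'observation']
```

Extracting the policy amounts to keeping the backbone and the actor head, which is why the policy-only run never writes a "state_value" entry.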

`ActorCriticOperator`(*args, **kwargs)	Actor-critic operator.
`ActorCriticWrapper`(*args, **kwargs)	Actor-value operator without a common module.
`ActorValueOperator`(*args, **kwargs)	Actor-value operator.
`ValueOperator`(*args, **kwargs)	General class for value functions in RL.
`DecisionTransformerInferenceWrapper`(*args, ..., **kwargs)	Inference action wrapper for the Decision Transformer.

Domain-specific TensorDict modules

These modules include dedicated solutions for MBRL or RLHF pipelines.

`LMHeadActorValueOperator`(*args, **kwargs)	Builds an actor-value operator from a huggingface-style `LMHeadModel`.
`WorldModelWrapper`(*args, **kwargs)	World model wrapper.

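Hooks

The Q-value hooks are used by the `QValueActor` and `DistributionalQValueActor` modules. Those modules should generally be preferred, as they are easier to create and use.

`QValueHook`(action_space[, var_nums, ...])	Q-value hook for Q-value policies.
`DistributionalQValueHook`(action_space, support)	Distributional Q-value hook for Q-value policies.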

Models

TorchRL provides a series of useful "regular" (i.e. non-tensordict) nn.Module classes for RL usage.

Regular modules

`BatchRenorm1d`(num_features, *, momentum=0.1, **kwargs)	BatchRenorm module (https://arxiv.org/abs/1702.03275).
`ConsistentDropout`([p])	Implements a `Dropout` variant with consistent dropout.
`Conv3dNet`(in_features, depth, num_cells, **kwargs)	A 3D convolutional neural network.
`ConvNet`(in_features, depth, num_cells, **kwargs)	A convolutional neural network.
`MLP`(in_features, out_features, depth, **kwargs)	A multi-layer perceptron.
`Squeeze2dLayer`()	Squeezing layer for convolutional neural networks.
`SqueezeLayer`([dims])	Squeezing layer.

Algorithm-specific modules

These networks implement sub-networks that are useful for specific algorithms, such as DQN, DDPG or Dreamer.

`DTActor`(state_dim, action_dim[, ...])	Decision Transformer actor class.
`DdpgCnnActor`(action_dim[, conv_net_kwargs, ...])	DDPG convolutional actor class.
`DdpgCnnQNet`([conv_net_kwargs, ...])	DDPG convolutional Q-value class.
`DdpgMlpActor`(action_dim[, mlp_net_kwargs, ...])	DDPG actor class.
`DdpgMlpQNet`([mlp_net_kwargs_net1, ...])	DDPG Q-value MLP class.
`DecisionTransformer`(state_dim, action_dim[, ...])	Online Decision Transformer.
`DistributionalDQNnet`(*args, **kwargs)	Distributional deep Q-network softmax layer.
`DreamerActor`(out_features[, depth, ...])	Dreamer actor network.
`DuelingCnnDQNet`(out_features[, ...])	Dueling CNN Q-network.
`GRUCell`(input_size, hidden_size[, bias, ...])	A gated recurrent unit (GRU) cell that performs the same operation as `nn.GRUCell` but is written entirely in Python.
`GRU`(input_size, hidden_size[, num_layers, ...])	A PyTorch module for running a multi-layer GRU over multiple steps.
`GRUModule`(*args, **kwargs)	An embedder for a GRU module.
`LSTMCell`(input_size, hidden_size[, bias, ...])	A long short-term memory (LSTM) cell that performs the same operation as `nn.LSTMCell` but is written entirely in Python.
`LSTM`(input_size, hidden_size[, num_layers, ...])	A PyTorch module for running a multi-layer LSTM over multiple steps.
`LSTMModule`(*args, **kwargs)	An embedder for an LSTM module.
`ObsDecoder`([channels, num_layers, ...])	Observation decoder network.
`ObsEncoder`([channels, num_layers, depth])	Observation encoder network.
`OnlineDTActor`(state_dim, action_dim[, ...])	Online Decision Transformer actor class.
`RSSMPosterior`([hidden_dim, state_dim, scale_lb])	The posterior network of the RSSM.
`RSSMPrior`(action_spec[, hidden_dim, ...])	The prior network of the RSSM.
`set_recurrent_mode`([mode])	Context manager for setting the recurrent mode of RNNs.
`recurrent_mode`()	Returns the current sampling type.

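Multi-agent-specific modules

These networks implement models that can be used in multi-agent settings. They use `vmap()` to run multiple networks on the network inputs in a single call. Because the parameters are batched, initialization may differ from what is usually done with other PyTorch modules; see `get_stateful_net()` for more information.

`MultiAgentNetBase`(*, n_agents, **kwargs)	A base class for multi-agent networks.
`MultiAgentMLP`(n_agent_inputs, ..., **kwargs)	Multi-agent MLP.
`MultiAgentConvNet`(n_agents, centralized, ..., **kwargs)	Multi-agent CNN.
`QMixer`(state_shape, mixing_embed_dim, ..., **kwargs)	QMix mixer.
`VDNMixer`(n_agents, device)	Value-Decomposition Network (VDN) mixer.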
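
Of the mixers listed above, value decomposition is the simplest to illustrate: the joint value is the sum of the per-agent chosen action values. The following plain-Python sketch shows that rule. `vdn_mix` is a hypothetical helper and does not reproduce the tensor shapes that `VDNMixer` works with.

```python
def vdn_mix(chosen_action_values):
    """Combine per-agent chosen Q-values into one joint value by summation."""
    return sum(chosen_action_values)

per_agent_q = [0.5, 1.25, -0.25]   # one chosen action value per agent
print(vdn_mix(per_agent_q))        # 1.5
```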

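Exploration

Noisy linear layers are a popular way of exploring the environment: instead of altering the actions, they integrate the stochasticity into the weight configuration.

`NoisyLinear`(in_features, out_features[, ...])	Noisy linear layer.
`NoisyLazyLinear`(out_features[, bias, ...])	Noisy lazy linear layer.
`reset_noise`(layer)	Resets the noise of noisy layers.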
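
The idea of placing the randomness in the weights can be sketched in plain Python with a single output unit whose effective weights are `mu + sigma * eps`. `TinyNoisyLinear` is a hypothetical class that uses independent Gaussian noise per weight. It is meant to illustrate the principle and the role of resetting the noise, not the parameterization used by `NoisyLinear`.

```python
import random

class TinyNoisyLinear:
    """y = sum_i (mu_i + sigma_i * eps_i) * x_i, with eps resampled on demand."""

    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng
        self.reset_noise()

    def reset_noise(self):
        # Draw a fresh noise vector; it stays fixed until the next reset.
        self.eps = [self.rng.gauss(0.0, 1.0) for _ in self.mu]

    def __call__(self, x):
        weights = [m + s * e for m, s, e in zip(self.mu, self.sigma, self.eps)]
        return sum(w * xi for w, xi in zip(weights, x))

layer = TinyNoisyLinear(mu=[1.0, -1.0], sigma=[0.1, 0.1], rng=random.Random(0))
x = [0.5, 2.0]
first = layer(x)
layer.reset_noise()   # new noise, hence a different perturbed weight vector
second = layer(x)
print(first, second)
```

Between two resets the layer behaves deterministically, so the exploration it induces is consistent across consecutive steps.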

Planners

`CEMPlanner`(*args, **kwargs)	CEMPlanner module.
`MPCPlannerBase`(*args, **kwargs)	MPCPlannerBase abstract module.
`MPPIPlanner`(*args, **kwargs)	MPPI planner module.
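
The cross-entropy method behind a CEM planner can be sketched on a scalar toy problem: sample candidates from a Gaussian, keep the best ones, and refit the Gaussian to them. The function `cem_maximize`, its parameters and the quadratic score are all hypothetical. An actual planner would score candidate action sequences by simulating rollouts in a model of the environment.

```python
import random
import statistics

def cem_maximize(score, rng, mean=0.0, std=2.0, n_samples=50, n_elite=10, n_iters=20):
    """Cross-entropy method on a scalar: refit a Gaussian to the best candidates."""
    for _ in range(n_iters):
        candidates = [rng.gauss(mean, std) for _ in range(n_samples)]
        elite = sorted(candidates, key=score, reverse=True)[:n_elite]
        mean = statistics.mean(elite)
        std = statistics.pstdev(elite) + 1e-6   # keep the sampler from collapsing
    return mean

# Toy "return" with its maximum at 3.0.
best = cem_maximize(lambda a: -(a - 3.0) ** 2, random.Random(0))
print(best)  # close to 3.0
```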

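Distributions

Some distributions are commonly used in RL scripts.

`Delta`(param[, atol, rtol, batch_shape, ...])	Delta distribution.
`IndependentNormal`(loc, scale[, upscale, ...])	Implements an independent normal distribution with location scaling.
`TanhNormal`(loc, scale[, upscale, low, high, ...])	Implements a TanhNormal distribution with location scaling.
`TruncatedNormal`(loc, scale[, upscale, low, ...])	Implements a truncated normal distribution with location scaling.
`TanhDelta`(param[, low, high, event_dims, ...])	Implements a Tanh-transformed Delta distribution.
`OneHotCategorical`([logits, probs, grad_method])	One-hot categorical distribution.
`LLMMaskedCategorical`(logits, mask[, ...])	LLM-optimized masked categorical distribution.
`MaskedCategorical`([logits, probs, mask, ...])	MaskedCategorical distribution.
`MaskedOneHotCategorical`([logits, probs, ...])	MaskedCategorical distribution.
`Ordinal`(scores)	A discrete distribution for learning to sample from finite ordered sets.
`OneHotOrdinal`(scores)	The one-hot version of the `Ordinal` distribution.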
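Utils

The module utilities include functions for performing custom mappings, as well as a tool for building `TensorDictPrimer` instances from a given module.

`mappings`(key)	Given an input string, returns a surjective function f(x): R -> R^+.
`inv_softplus`(bias)	Inverse softplus function.
`biased_softplus`(bias[, min_val])	A biased softplus module.
`get_primers_from_module`(module)	Gets all tensordict primers from all submodules of a module.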
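
The positive mappings listed above can be illustrated on scalars. The sketch below writes softplus, its inverse, and a biased variant in plain Python, under the assumption that the bias is the value returned for a zero input and that `min_val` acts as a lower bound. The functions operate on floats and are illustrative rather than the TorchRL modules themselves.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def inv_softplus(y):
    """Inverse of softplus for y > 0."""
    return math.log(math.expm1(y))

def biased_softplus(x, bias=1.0, min_val=0.01):
    """Positive mapping that returns `bias` at x = 0 and stays above `min_val`."""
    return softplus(x + inv_softplus(bias - min_val)) + min_val

print(biased_softplus(0.0))           # approximately 1.0
print(biased_softplus(-5.0) > 0.01)   # True
```

Such a mapping is convenient for turning an unconstrained network output into a standard deviation: an output of zero yields a scale equal to the bias.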

VmapModule(*args, **kwargs)

A TensorDictModule wrapper to vmap over the input.