TorchRL LLM: Building Tool-Enabled Environments

Author: Vincent Moens

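This tutorial demonstrates how to build and compose LLM environments with tool capabilities in TorchRL. We will show how to create a complete environment that can execute tools, format responses, and handle interactions between the LLM and external tools.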

The tutorial uses web browsing as its example, but the concepts apply to any tool integration in TorchRL's LLM framework.

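Key takeaways

Understanding TorchRL's LLM environment composition
Creating and appending tool transforms
Formatting tool responses and LLM interactions
Handling tool execution and state management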

Prerequisites: basic familiarity with TorchRL's environment concepts.

Installation

First, install TorchRL with LLM support. If you are running this in a Jupyter notebook, you can install the package with the following command:

%pip install "torchrl[llm]"    # Install TorchRL with all LLM dependencies

The torchrl[llm] extra pulls in the dependencies needed for the LLM features, including transformers, vllm, and playwright (for browser automation).

After installation, you need to set up the browser automation components:

!playwright install            # Install browser binaries

Note: the ! and %pip prefixes only work in Jupyter notebooks. In a regular terminal, run these commands without the prefix.
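Before moving on, it can help to confirm that the optional dependencies are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and not part of TorchRL; the package names it checks are the ones listed above.

```python
from __future__ import annotations

from importlib.util import find_spec


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the dependency list above.
    missing = missing_packages(["torchrl", "transformers", "vllm", "playwright"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All LLM dependencies are importable.")
```

This only checks that the packages can be located. It does not verify that the Playwright browser binaries have been downloaded.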

Environment Setup

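TorchRL's LLM interface is built around composable environments and transforms. The key components are:

A base environment (ChatEnv)
Tool execution transforms
Data loading transforms
Reward computation transforms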
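To make the idea of composition concrete, here is a toy, pure-Python sketch. It does not use TorchRL: `ToyChatEnv` and the two transform functions are hypothetical stand-ins that only mimic the pattern of a base environment whose step output is passed through an ordered chain of transforms.

```python
from __future__ import annotations

from typing import Callable

State = dict
TransformFn = Callable[[State], State]


class ToyChatEnv:
    """Hypothetical stand-in for a base environment (not TorchRL's ChatEnv)."""

    def __init__(self) -> None:
        self.transforms: list[TransformFn] = []

    def append_transform(self, fn: TransformFn) -> "ToyChatEnv":
        # Transforms run in the order they were appended.
        self.transforms.append(fn)
        return self

    def step(self, state: State) -> State:
        # Base behaviour: append the LLM response to the conversation history.
        out = {"history": state.get("history", []) + [state["text_response"]]}
        for fn in self.transforms:
            out = fn(out)
        return out


def toy_tool_transform(state: State) -> State:
    # Pretend a tool ran and append its (fake) result to the history.
    return {**state, "history": state["history"] + ["[tool] executed successfully"]}


def toy_reward_transform(state: State) -> State:
    # The reward depends on what the previous transform wrote.
    last = state["history"][-1]
    return {**state, "reward": 1.0 if "executed successfully" in last else 0.0}


toy_env = (
    ToyChatEnv()
    .append_transform(toy_tool_transform)
    .append_transform(toy_reward_transform)
)
result = toy_env.step({"text_response": "navigate to google.com"})
print(result["history"], result["reward"])
```

The order matters: the reward transform reads what the tool transform wrote, so it has to be appended after it.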

Let's import the necessary components and set up our environment.

from __future__ import annotations

import warnings

import torch

from tensordict import set_list_to_stack, TensorDict
from torchrl import torchrl_logger
from torchrl.data import CompositeSpec, Unbounded
from torchrl.envs import Transform
from torchrl.envs.llm import ChatEnv
from torchrl.envs.llm.transforms.browser import BrowserTransform
from transformers import AutoTokenizer

warnings.filterwarnings("ignore")

Step 1: Basic Environment Configuration

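We will create a ChatEnv and configure it with browser automation capabilities. First, we enable list-to-stack conversion for TensorDict, which is required for proper batch handling in LLM environments.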

# Enable list-to-stack conversion for TensorDict
set_list_to_stack(True).set()

Now we create the tokenizer and the base environment. The environment requires a batch size, even if we only run a single instance.

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
env = ChatEnv(
    batch_size=(1,),
    tokenizer=tokenizer,
    apply_template=True,
    system_prompt=(
        "You are a helpful assistant that can use tools to accomplish tasks. "
        "Tools will be executed and their responses will be added to our conversation."
    ),
)

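Next, we add the browser transform with a safety configuration. This transform enables browsing capabilities and restricts them to an allow-list of domains for security.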

browser_transform = BrowserTransform(
    allowed_domains=["google.com", "github.com"],
    headless=False,  # Set to False to see the browser actions
)
env = env.append_transform(browser_transform)
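To illustrate what a domain restriction means in practice, here is a minimal sketch of an allow-list check. It is an assumption about how such a check could work, not BrowserTransform's actual implementation; the function name `is_allowed` is hypothetical.

```python
from __future__ import annotations

from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    # hostname is None for strings that do not parse as a URL with a host.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["google.com", "github.com"]
print(is_allowed("https://www.google.com/search?q=x", allowed))
print(is_allowed("https://google.com.evil.org", allowed))
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting look-alike hosts such as google.com.evil.org.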

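We can also design a transform that assigns rewards to the environment. For example, we can parse the result of the browser transform and assign a reward whenever a specific goal is reached. Here, the reward is 2 if the LLM finds the answer to the question (Paris), 1 if it reaches the target website (google.com), and 0 otherwise.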

class RewardTransform(Transform):
    """A transform that assigns rewards based on the LLM's responses.

    This transform parses the browser responses in the environment's history and assigns
    rewards based on specific achievements:

    - Finding the correct answer (Paris): reward = 2.0
    - Successfully reaching Google: reward = 1.0
    - Otherwise: reward = 0.0

    """

    def _call(self, tensordict: TensorDict) -> TensorDict:
        """Process the tensordict and assign rewards based on the LLM's response.

        Args:
            tensordict (TensorDict): The tensordict containing the environment state.
                Must have a "history" key containing the conversation history.

        Returns:
            TensorDict: The tensordict with an added "reward" key containing the
                computed reward with shape (B, 1) where B is the batch size.
        """
        # ChatEnv has created a history item. We just pick up the last item,
        # and check if `"Paris"` is in the response.
        # We use index 0 because we are in a single-instance environment.
        history = tensordict[0]["history"]
        last_item = history[-1]
        if "Paris" in last_item.content:
            torchrl_logger.info("Found the answer to the question: Paris")
            # Recall that rewards have a trailing singleton dimension.
            tensordict["reward"] = torch.full((1, 1), 2.0)
        # Check if we successfully reached the website
        elif (
            "google.com" in last_item.content
            and "executed successfully" in last_item.content
        ):
            torchrl_logger.info("Reached the website google.com")
            tensordict["reward"] = torch.full((1, 1), 1.0)
        else:
            tensordict["reward"] = torch.full((1, 1), 0.0)
        return tensordict

    def transform_reward_spec(self, reward_spec: CompositeSpec) -> CompositeSpec:
        """Transform the reward spec to include our custom reward.

        This method is required to override the reward spec since the environment
        is initially reward-agnostic.

        Args:
            reward_spec (CompositeSpec): The original reward spec from the environment.

        Returns:
            CompositeSpec: The transformed reward spec with our custom reward definition.
                The reward will have shape (B, 1) where B is the batch size.
        """
        reward_spec["reward"] = Unbounded(
            shape=reward_spec.shape + (1,), dtype=torch.float32
        )
        return reward_spec


# We append the reward transform to the environment.
env = env.append_transform(RewardTransform())
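The reward rule itself can be checked without an environment or a browser. The sketch below restates the same branching as a plain function. `score_response` is a hypothetical helper introduced for illustration; it mirrors the conditions in the transform above.

```python
def score_response(content: str) -> float:
    """Mirror the branching of the reward transform for a single response string."""
    # Highest priority: the answer appears in the last history item.
    if "Paris" in content:
        return 2.0
    # Next: the browser reported a successful action on google.com.
    if "google.com" in content and "executed successfully" in content:
        return 1.0
    return 0.0


print(score_response("The capital of France is Paris."))
print(score_response("navigate to https://google.com executed successfully"))
print(score_response("Something unrelated"))
```

Note that the match is a case-sensitive substring test, so a response containing only "paris" would score 0 under this rule.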

Step 2: Tool Execution Helper

To keep our interactions with the tools organized, we create a helper function that executes a tool action and displays the result.

def execute_tool_action(
    env: ChatEnv,
    current_state: TensorDict,
    action: str,
    verbose: bool = True,
) -> tuple[TensorDict, TensorDict]:
    """Execute a tool action and show the formatted interaction."""
    s = current_state.set("text_response", [action])
    s, s_ = env.step_and_maybe_reset(s)

    if verbose:
        print("\nLLM Action:")
        print("-----------")
        print(action)
        print("\nEnvironment Response:")
        print("--------------------")
        torchrl_logger.info(s_["history"].apply_chat_template(tokenizer=env.tokenizer))

    return s, s_

Step 3: Starting the Interaction

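Let's start by initializing the environment with a question, then navigate to a search engine. Note that the tensordict used as input to the environment must have the same batch size as the environment. The text query is wrapped in a list of length 1 so that it is compatible with the environment's batch size.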

reset = env.reset(
    TensorDict(
        text=["What is the capital of France?"],
        batch_size=(1,),
    )
)

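Now we navigate to Google using the browser transform. The transform expects actions in a specific JSON format wrapped in tool tags. In practice, this action would be the output of our LLM, which writes its response string under the "text_response" key.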

s, s_ = execute_tool_action(
    env,
    reset,
    """
    Let me search for that:
    <tool>browser
    {
        "action": "navigate",
        "url": "https://google.com"
    }
    </tool><|im_end|>
    """,
)
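The action format used above (a tool name followed by a JSON object, wrapped in tool tags) can be parsed with a few lines of standard-library code. This is a sketch under the assumption that the format is exactly what the example shows; it is not the parser used by BrowserTransform, and `parse_tool_call` is a hypothetical name.

```python
from __future__ import annotations

import json
import re

# <tool>NAME { ...json... }</tool>, allowing whitespace and newlines in between.
TOOL_PATTERN = re.compile(r"<tool>\s*(\w+)\s*(\{.*?\})\s*</tool>", re.DOTALL)


def parse_tool_call(text: str) -> tuple[str, dict] | None:
    """Extract (tool_name, arguments) from the first tool block, or None."""
    match = TOOL_PATTERN.search(text)
    if match is None:
        return None
    name, payload = match.groups()
    try:
        return name, json.loads(payload)
    except json.JSONDecodeError:
        # Malformed JSON inside the tool tags is treated as "no valid call".
        return None


example = """
    Let me search for that:
    <tool>browser
    {
        "action": "navigate",
        "url": "https://google.com"
    }
    </tool><|im_end|>
"""
print(parse_tool_call(example))
```

Returning None for malformed input keeps the caller's control flow simple: an LLM response without a valid tool block is just treated as plain text.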

Step 4: Performing the Search

With the browser open, we can now type our query and run the search. First, we type the search query into Google's search box.

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Let me type the search query:
    <tool>browser
    {
        "action": "type",
        "selector": "[name='q']",
        "text": "What is the capital of France?"
    }
    </tool><|im_end|>
    """,
)

Next, we click the search button to run the search. Note how CSS selectors are used to identify elements on the page.

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Now let me click the search button:
    <tool>browser
    {
        "action": "click",
        "selector": "[name='btnK']"
    }
    </tool><|im_end|>
    """,
)
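The action strings above all share the same shape, so they could be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of TorchRL. It assumes the transform tolerates the whitespace layout produced by `json.dumps`, and it reuses the `<|im_end|>` end-of-turn marker from the examples.

```python
import json


def make_tool_action(
    preamble: str,
    payload: dict,
    tool: str = "browser",
    eos: str = "<|im_end|>",
) -> str:
    """Build an action string: free text, then a tool block holding a JSON payload."""
    # json.dumps guarantees valid JSON and handles quoting inside selectors.
    body = json.dumps(payload, indent=4)
    return f"{preamble}\n<tool>{tool}\n{body}\n</tool>{eos}"


action = make_tool_action(
    "Now let me click the search button:",
    {"action": "click", "selector": "[name='btnK']"},
)
print(action)
```

The resulting string could then be passed as the action argument of execute_tool_action in place of a hand-written literal.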

Step 5: Extracting the Results

Finally, we extract the search results from the page. The browser transform can extract both text content and HTML from a specified element.

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Let me extract the results:
    <tool>browser
    {
        "action": "extract",
        "selector": "#search",
        "extract_type": "text"
    }
    </tool><|im_end|>
    """,
)

Let's close the environment.

env.close()

Conclusion

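This tutorial demonstrated how to build and compose LLM environments with tool capabilities in TorchRL. We showed how to create a complete environment that can execute tools, format responses, and handle interactions between the LLM and external tools.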

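The key concepts are:

Understanding TorchRL's LLM environment composition
Creating and appending tool transforms
Formatting tool responses and LLM interactions
Handling tool execution and state management
Integrating with LLM wrappers (vLLM, Transformers)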

For more information on building tool-enabled environments with TorchRL, see the ref_llms tutorial.
