Hyperparameter Tuning with Ray Tune

Created On: Aug 31, 2020 | Last Updated: Jun 24, 2025 | Last Verified: Nov 05, 2024

Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, simple decisions like choosing a different learning rate or changing a network layer size can have a dramatic impact on model performance.

Fortunately, there are tools that help with finding the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with various analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.

In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend this tutorial from the PyTorch documentation to train a CIFAR10 image classifier.

As you will see, we only need to make a few small modifications. In particular, we need to

  1. wrap data loading and training in functions,

  2. make some network parameters configurable,

  3. add checkpointing (optional),

  4. and define the search space for the model tuning


To run this tutorial, please make sure the following packages are installed:

  • ray[tune]: Distributed hyperparameter tuning library

  • torchvision: For the data transformers
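
Both can be installed via pip. The exact command may vary with your environment, but something along these lines should work:

pip install "ray[tune]" torchvision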

Setup / Imports

Let's start with the imports:

from functools import partial
import os
import tempfile
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray import train
from ray.train import Checkpoint, get_checkpoint
from ray.tune.schedulers import ASHAScheduler
import ray.cloudpickle as pickle

Most of the imports are needed for building the PyTorch model. Only the last few imports are for Ray Tune.

Data Loaders

We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.

def load_data(data_dir="./data"):
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    return trainset, testset
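
Since all trials read from the same directory, it helps to download the data once before Tune launches any trials; the main function further down does exactly that:

data_dir = os.path.abspath("./data")
load_data(data_dir)  # downloads CIFAR10 once so trials don't race to fetch it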

Configurable Neural Network

We can only tune those parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers:

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

The Training Function

Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.

We wrap the training script in a function train_cifar(config, data_dir=None). The config parameter receives the hyperparameters we would like to train with. The data_dir specifies the directory where we load and store the data, so that multiple runs can share the same data source. We also load the model and optimizer state at the start of the run, if a checkpoint is provided. Further down in this tutorial you will find information on how to save the checkpoint and what it is used for.

net = Net(config["l1"], config["l2"])

checkpoint = get_checkpoint()
if checkpoint:
    with checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            checkpoint_state = pickle.load(fp)
        start_epoch = checkpoint_state["epoch"]
        net.load_state_dict(checkpoint_state["net_state_dict"])
        optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
    start_epoch = 0

The learning rate of the optimizer is made configurable, too:

optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

We also split the training data into a training and validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and test sets are configurable as well.
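
Concretely, this is the excerpt from the full training function below that performs the split and builds the loaders:

test_abs = int(len(trainset) * 0.8)
train_subset, val_subset = random_split(
    trainset, [test_abs, len(trainset) - test_abs]
)

trainloader = torch.utils.data.DataLoader(
    train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
)
valloader = torch.utils.data.DataLoader(
    val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
)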

Adding (Multi) GPU Support with DataParallel

Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data parallel training on multiple GPUs:

device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)

By using a device variable, we make sure that training also works when no GPU is available. PyTorch requires us to send our data to GPU memory explicitly, like this:

for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)

The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials, as long as the model still fits in GPU memory. We will come back to that later.

Communicating with Ray Tune

The most interesting part is the communication with Ray Tune:

checkpoint_data = {
    "epoch": epoch,
    "net_state_dict": net.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "wb") as fp:
        pickle.dump(checkpoint_data, fp)

    checkpoint = Checkpoint.from_directory(checkpoint_dir)
    train.report(
        {"loss": val_loss / val_steps, "accuracy": correct / total},
        checkpoint=checkpoint,
    )

Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early, in order to avoid wasting resources on those trials.

Saving the checkpoint is optional; however, it is necessary if we want to use advanced schedulers like Population Based Training. Also, by saving the checkpoint we can later load the trained models and validate them on a test set. Lastly, saving checkpoints is useful for fault tolerance: it allows us to interrupt training and continue later.

Full Training Function

The full code example looks like this:

def train_cifar(config, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    checkpoint = get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "rb") as fp:
                checkpoint_state = pickle.load(fp)
            start_epoch = checkpoint_state["epoch"]
            net.load_state_dict(checkpoint_state["net_state_dict"])
            optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )

    for epoch in range(start_epoch, 10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "wb") as fp:
                pickle.dump(checkpoint_data, fp)

            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=checkpoint,
            )

    print("Finished Training")

As you can see, most of the code is adapted directly from the original example.

Test Set Accuracy

Commonly, the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training. We also wrap this in a function:

def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2
    )

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

The function also expects a device parameter, so we can do the test set validation on a GPU.

Configuring the Search Space

Lastly, we need to define Ray Tune's search space. Here is an example:

config = {
    "l1": tune.choice([2 ** i for i in range(9)]),
    "l2": tune.choice([2 ** i for i in range(9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}

tune.choice() accepts a list of values that are uniformly sampled from. In this example, the l1 and l2 parameters are powers of 2 between 1 and 256, so 1, 2, 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.

In each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among them. We also use the ASHAScheduler, which terminates badly performing trials early.
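
For reference, this is how the scheduler is constructed in the main function below; max_t caps how many epochs any trial may run, grace_period sets the minimum number of epochs before a trial can be stopped, and reduction_factor controls how aggressively trials are culled:

scheduler = ASHAScheduler(
    metric="loss",
    mode="min",
    max_t=max_num_epochs,
    grace_period=1,
    reduction_factor=2,
)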

We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial:

gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    checkpoint_at_end=True)

You can specify the number of CPUs, which are then available, e.g., to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that haven't been requested for them, so you don't have to worry about two trials using the same set of resources.

Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in GPU memory.
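
As a minimal sketch (the cpu value here is just an example), requesting half a GPU per trial would look like this, letting two trials share one GPU:

gpus_per_trial = 0.5
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 4, "gpu": gpus_per_trial},  # two trials per GPU
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
)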

After training the models, we will find the best performing one and load the trained network from its checkpoint file. We then obtain the test set accuracy and report everything by printing.

The full main function looks like this:

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.choice([2**i for i in range(9)]),
        "l2": tune.choice([2**i for i in range(9)]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2,
    )
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print(f"Best trial config: {best_trial.config}")
    print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
    print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint = result.get_best_checkpoint(trial=best_trial, metric="accuracy", mode="max")
    with best_checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            best_checkpoint_data = pickle.load(fp)

        best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])
        test_acc = test_accuracy(best_trained_model, device)
        print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
100%|██████████| 170M/170M [00:04<00:00, 37.4MB/s]
2025-08-07 18:22:07,276 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147467264 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-08-07 18:22:07,439 INFO worker.py:1642 -- Started a local Ray instance.
2025-08-07 18:22:08,368 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`.
2025-08-07 18:22:08,370 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     train_cifar_2025-08-07_18-22-08   │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator             │
│ Scheduler                        AsyncHyperBandScheduler           │
│ Number of trials                 10                                │
╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08
To visualize your results with TensorBoard, run: `tensorboard --logdir /var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08`

Trial status: 10 PENDING
Current time: 2025-08-07 18:22:08. Total running time: 0s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭───────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status       l1     l2            lr     batch_size │
├───────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000   PENDING       2    128   0.0509604               16 │
│ train_cifar_6da15_00001   PENDING       1      4   0.000583917              4 │
│ train_cifar_6da15_00002   PENDING       1    128   0.00404185               8 │
│ train_cifar_6da15_00003   PENDING      16      1   0.0300776                2 │
│ train_cifar_6da15_00004   PENDING       1     16   0.047715                16 │
│ train_cifar_6da15_00005   PENDING       1      1   0.0207046                8 │
│ train_cifar_6da15_00006   PENDING     256    256   0.000109715             16 │
│ train_cifar_6da15_00007   PENDING      32    128   0.00200394              16 │
│ train_cifar_6da15_00008   PENDING       2     32   0.00087078               2 │
│ train_cifar_6da15_00009   PENDING       4     64   0.00109616               8 │
╰───────────────────────────────────────────────────────────────────────────────╯

Trial train_cifar_6da15_00006 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                    16 │
│ l1                                           256 │
│ l2                                           256 │
│ lr                                       0.00011 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00002 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00002 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                     8 │
│ l1                                             1 │
│ l2                                           128 │
│ lr                                       0.00404 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00001 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                     4 │
│ l1                                             1 │
│ l2                                             4 │
│ lr                                       0.00058 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00003 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00003 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                     2 │
│ l1                                            16 │
│ l2                                             1 │
│ lr                                       0.03008 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00004 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00004 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                    16 │
│ l1                                             1 │
│ l2                                            16 │
│ lr                                       0.04771 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00000 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00000 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                    16 │
│ l1                                             2 │
│ l2                                           128 │
│ lr                                       0.05096 │
╰──────────────────────────────────────────────────╯

Trial train_cifar_6da15_00007 started with configuration:
╭────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 config           │
├────────────────────────────────────────────────┤
│ batch_size                                  16 │
│ l1                                          32 │
│ l2                                         128 │
│ lr                                       0.002 │
╰────────────────────────────────────────────────╯

Trial train_cifar_6da15_00005 started with configuration:
╭─────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00005 config            │
├─────────────────────────────────────────────────┤
│ batch_size                                    8 │
│ l1                                            1 │
│ l2                                            1 │
│ lr                                       0.0207 │
╰─────────────────────────────────────────────────╯
(func pid=3879) [1,  2000] loss: 2.337
(func pid=3882) [1,  2000] loss: 2.301 [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)

Trial status: 8 RUNNING | 2 PENDING
Current time: 2025-08-07 18:22:38. Total running time: 30s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭───────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status       l1     l2            lr     batch_size │
├───────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000   RUNNING       2    128   0.0509604               16 │
│ train_cifar_6da15_00001   RUNNING       1      4   0.000583917              4 │
│ train_cifar_6da15_00002   RUNNING       1    128   0.00404185               8 │
│ train_cifar_6da15_00003   RUNNING      16      1   0.0300776                2 │
│ train_cifar_6da15_00004   RUNNING       1     16   0.047715                16 │
│ train_cifar_6da15_00005   RUNNING       1      1   0.0207046                8 │
│ train_cifar_6da15_00006   RUNNING     256    256   0.000109715             16 │
│ train_cifar_6da15_00007   RUNNING      32    128   0.00200394              16 │
│ train_cifar_6da15_00008   PENDING       2     32   0.00087078               2 │
│ train_cifar_6da15_00009   PENDING       4     64   0.00109616               8 │
╰───────────────────────────────────────────────────────────────────────────────╯

Trial train_cifar_6da15_00007 finished iteration 1 at 2025-08-07 18:22:40. Total running time: 31s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  27.01455 │
│ time_total_s                                      27.01455 │
│ training_iteration                                       1 │
│ accuracy                                            0.4034 │
│ loss                                               1.64608 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000000
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000000)
(func pid=3879) [1,  4000] loss: 1.169

Trial train_cifar_6da15_00000 finished iteration 1 at 2025-08-07 18:22:40. Total running time: 32s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00000 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  27.47424 │
│ time_total_s                                      27.47424 │
│ training_iteration                                       1 │
│ accuracy                                            0.0995 │
│ loss                                               2.31793 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00000 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00000_0_batch_size=16,l1=2,l2=128,lr=0.0510_2025-08-07_18-22-08/checkpoint_000000

Trial train_cifar_6da15_00000 completed after 1 iterations at 2025-08-07 18:22:40. Total running time: 32s

Trial train_cifar_6da15_00008 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 config             │
├──────────────────────────────────────────────────┤
│ batch_size                                     2 │
│ l1                                             2 │
│ l2                                            32 │
│ lr                                       0.00087 │
╰──────────────────────────────────────────────────╯
(func pid=3878) [1,  4000] loss: 1.111

Trial train_cifar_6da15_00004 finished iteration 1 at 2025-08-07 18:22:41. Total running time: 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00004 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  28.52594 │
│ time_total_s                                      28.52594 │
│ training_iteration                                       1 │
│ accuracy                                            0.1035 │
│ loss                                               2.30867 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00004 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00004_4_batch_size=16,l1=1,l2=16,lr=0.0477_2025-08-07_18-22-08/checkpoint_000000

Trial train_cifar_6da15_00004 completed after 1 iterations at 2025-08-07 18:22:41. Total running time: 33s

Trial train_cifar_6da15_00009 started with configuration:
╭─────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 config            │
├─────────────────────────────────────────────────┤
│ batch_size                                    8 │
│ l1                                            4 │
│ l2                                           64 │
│ lr                                       0.0011 │
╰─────────────────────────────────────────────────╯

Trial train_cifar_6da15_00006 finished iteration 1 at 2025-08-07 18:22:42. Total running time: 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  29.37421 │
│ time_total_s                                      29.37421 │
│ training_iteration                                       1 │
│ accuracy                                            0.1145 │
│ loss                                               2.29772 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000000
(func pid=3879) [1,  6000] loss: 0.781 [repeated 3x across cluster]

Trial train_cifar_6da15_00005 finished iteration 1 at 2025-08-07 18:22:56. Total running time: 48s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00005 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  43.20412 │
│ time_total_s                                      43.20412 │
│ training_iteration                                       1 │
│ accuracy                                            0.1022 │
│ loss                                               2.31003 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00005 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00005_5_batch_size=8,l1=1,l2=1,lr=0.0207_2025-08-07_18-22-08/checkpoint_000000

Trial train_cifar_6da15_00005 completed after 1 iterations at 2025-08-07 18:22:56. Total running time: 48s
(func pid=3881) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00005_5_batch_size=8,l1=1,l2=1,lr=0.0207_2025-08-07_18-22-08/checkpoint_000000) [repeated 4x across cluster]
(func pid=3880) [1,  2000] loss: 2.191 [repeated 4x across cluster]

Trial train_cifar_6da15_00002 finished iteration 1 at 2025-08-07 18:22:57. Total running time: 48s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00002 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                   44.1901 │
│ time_total_s                                       44.1901 │
│ training_iteration                                       1 │
│ accuracy                                            0.1008 │
│ loss                                               2.31397 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00002 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00002_2_batch_size=8,l1=1,l2=128,lr=0.0040_2025-08-07_18-22-08/checkpoint_000000

Trial train_cifar_6da15_00002 completed after 1 iterations at 2025-08-07 18:22:57. Total running time: 48s

Trial train_cifar_6da15_00007 finished iteration 2 at 2025-08-07 18:23:01. Total running time: 52s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000001 │
│ time_this_iter_s                                  21.00816 │
│ time_total_s                                      48.02271 │
│ training_iteration                                       2 │
│ accuracy                                             0.502 │
│ loss                                               1.36508 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000001
(func pid=3876) [1,  4000] loss: 1.153 [repeated 4x across cluster]

Trial train_cifar_6da15_00006 finished iteration 2 at 2025-08-07 18:23:05. Total running time: 56s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000001 │
│ time_this_iter_s                                  22.63908 │
│ time_total_s                                       52.0133 │
│ training_iteration                                       2 │
│ accuracy                                            0.1685 │
│ loss                                               2.28008 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000001

Trial train_cifar_6da15_00006 completed after 2 iterations at 2025-08-07 18:23:05. Total running time: 56s
(func pid=3882) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000001) [repeated 3x across cluster]

Trial status: 5 TERMINATED | 5 RUNNING
Current time: 2025-08-07 18:23:08. Total running time: 1min 0s
Logical resource usage: 10.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00001   RUNNING         1      4   0.000583917              4                                                    │
│ train_cifar_6da15_00003   RUNNING        16      1   0.0300776                2                                                    │
│ train_cifar_6da15_00007   RUNNING        32    128   0.00200394              16        2            48.0227   1.36508       0.502  │
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2                                                    │
│ train_cifar_6da15_00009   RUNNING         4     64   0.00109616               8                                                    │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3878) [1, 10000] loss: 0.392 [repeated 3x across cluster]

Trial train_cifar_6da15_00009 finished iteration 1 at 2025-08-07 18:23:14. Total running time: 1min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  33.09376 │
│ time_total_s                                      33.09376 │
│ training_iteration                                       1 │
│ accuracy                                            0.3654 │
│ loss                                               1.69085 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000000
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000000)
(func pid=3879) [1, 12000] loss: 0.389 [repeated 3x across cluster]

Trial train_cifar_6da15_00001 finished iteration 1 at 2025-08-07 18:23:17. Total running time: 1min 8s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  63.91659 │
│ time_total_s                                      63.91659 │
│ training_iteration                                       1 │
│ accuracy                                            0.1995 │
│ loss                                               1.92404 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00001 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000000
(func pid=3878) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000000)

Trial train_cifar_6da15_00007 finished iteration 3 at 2025-08-07 18:23:17. Total running time: 1min 9s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000002 │
│ time_this_iter_s                                  16.28883 │
│ time_total_s                                      64.31153 │
│ training_iteration                                       3 │
│ accuracy                                             0.533 │
│ loss                                               1.30026 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000002
(func pid=3879) [1, 14000] loss: 0.334 [repeated 2x across cluster]
(func pid=3883) [4,  2000] loss: 1.245 [repeated 4x across cluster]

Trial train_cifar_6da15_00007 finished iteration 4 at 2025-08-07 18:23:33. Total running time: 1min 25s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000003 │
│ time_this_iter_s                                  15.78644 │
│ time_total_s                                      80.09798 │
│ training_iteration                                       4 │
│ accuracy                                            0.5692 │
│ loss                                               1.21548 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000003
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000003) [repeated 2x across cluster]
(func pid=3876) [1, 12000] loss: 0.384 [repeated 4x across cluster]

Trial status: 5 TERMINATED | 5 RUNNING
Current time: 2025-08-07 18:23:38. Total running time: 1min 30s
Logical resource usage: 10.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00001   RUNNING         1      4   0.000583917              4        1            63.9166   1.92404       0.1995 │
│ train_cifar_6da15_00003   RUNNING        16      1   0.0300776                2                                                    │
│ train_cifar_6da15_00007   RUNNING        32    128   0.00200394              16        4            80.098    1.21548       0.5692 │
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2                                                    │
│ train_cifar_6da15_00009   RUNNING         4     64   0.00109616               8        1            33.0938   1.69085       0.3654 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3878) [2,  6000] loss: 0.635 [repeated 2x across cluster]

Trial train_cifar_6da15_00009 finished iteration 2 at 2025-08-07 18:23:41. Total running time: 1min 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000001 │
│ time_this_iter_s                                  26.73551 │
│ time_total_s                                      59.82926 │
│ training_iteration                                       2 │
│ accuracy                                            0.4238 │
│ loss                                               1.50893 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000001
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000001)

Trial train_cifar_6da15_00007 finished iteration 5 at 2025-08-07 18:23:49. Total running time: 1min 40s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000004 │
│ time_this_iter_s                                  15.66899 │
│ time_total_s                                      95.76696 │
│ training_iteration                                       5 │
│ accuracy                                             0.586 │
│ loss                                               1.17639 │
╰────────────────────────────────────────────────────────────╯
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000004)
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000004
(func pid=3878) [2,  8000] loss: 0.473 [repeated 4x across cluster]
(func pid=3878) [2, 10000] loss: 0.375 [repeated 3x across cluster]

Trial train_cifar_6da15_00003 finished iteration 1 at 2025-08-07 18:23:58. Total running time: 1min 50s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00003 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                 105.78079 │
│ time_total_s                                     105.78079 │
│ training_iteration                                       1 │
│ accuracy                                            0.1024 │
│ loss                                               2.33666 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00003 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00003_3_batch_size=2,l1=16,l2=1,lr=0.0301_2025-08-07_18-22-08/checkpoint_000000

Trial train_cifar_6da15_00003 completed after 1 iterations at 2025-08-07 18:23:58. Total running time: 1min 50s
(func pid=3879) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00003_3_batch_size=2,l1=16,l2=1,lr=0.0301_2025-08-07_18-22-08/checkpoint_000000)

Trial train_cifar_6da15_00001 finished iteration 2 at 2025-08-07 18:24:03. Total running time: 1min 54s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000001 │
│ time_this_iter_s                                  46.35663 │
│ time_total_s                                     110.27322 │
│ training_iteration                                       2 │
│ accuracy                                            0.2069 │
│ loss                                               1.91341 │
╰────────────────────────────────────────────────────────────╯
(func pid=3878) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000001)
Trial train_cifar_6da15_00001 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000001

Trial train_cifar_6da15_00001 completed after 2 iterations at 2025-08-07 18:24:03. Total running time: 1min 54s

Trial train_cifar_6da15_00007 finished iteration 6 at 2025-08-07 18:24:04. Total running time: 1min 56s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000005 │
│ time_this_iter_s                                  15.49965 │
│ time_total_s                                     111.26661 │
│ training_iteration                                       6 │
│ accuracy                                            0.6106 │
│ loss                                               1.11135 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000005
(func pid=3876) [1, 20000] loss: 0.202 [repeated 4x across cluster]

Trial train_cifar_6da15_00009 finished iteration 3 at 2025-08-07 18:24:07. Total running time: 1min 58s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000002 │
│ time_this_iter_s                                  25.67368 │
│ time_total_s                                      85.50294 │
│ training_iteration                                       3 │
│ accuracy                                            0.4735 │
│ loss                                               1.41079 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000002

Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-08-07 18:24:08. Total running time: 2min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00007   RUNNING        32    128   0.00200394              16        6           111.267    1.11135       0.6106 │
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2                                                    │
│ train_cifar_6da15_00009   RUNNING         4     64   0.00109616               8        3            85.5029   1.41079       0.4735 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3883) [7,  2000] loss: 1.060

Trial train_cifar_6da15_00008 finished iteration 1 at 2025-08-07 18:24:14. Total running time: 2min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000000 │
│ time_this_iter_s                                  93.84313 │
│ time_total_s                                      93.84313 │
│ training_iteration                                       1 │
│ accuracy                                            0.1973 │
│ loss                                               1.96308 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000000
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000000) [repeated 3x across cluster]
(func pid=3880) [4,  2000] loss: 1.375

Trial train_cifar_6da15_00007 finished iteration 7 at 2025-08-07 18:24:18. Total running time: 2min 10s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000006 │
│ time_this_iter_s                                  14.08048 │
│ time_total_s                                      125.3471 │
│ training_iteration                                       7 │
│ accuracy                                            0.6064 │
│ loss                                               1.12497 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000006
(func pid=3876) [2,  2000] loss: 1.970
(func pid=3880) [4,  4000] loss: 0.692
(func pid=3883) [8,  2000] loss: 1.031 [repeated 2x across cluster]

Trial train_cifar_6da15_00009 finished iteration 4 at 2025-08-07 18:24:29. Total running time: 2min 21s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000003 │
│ time_this_iter_s                                  22.29746 │
│ time_total_s                                      107.8004 │
│ training_iteration                                       4 │
│ accuracy                                            0.5039 │
│ loss                                               1.33079 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000003

Trial train_cifar_6da15_00009 completed after 4 iterations at 2025-08-07 18:24:29. Total running time: 2min 21s
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000003) [repeated 2x across cluster]

Trial train_cifar_6da15_00007 finished iteration 8 at 2025-08-07 18:24:32. Total running time: 2min 24s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000007 │
│ time_this_iter_s                                  13.66092 │
│ time_total_s                                     139.00802 │
│ training_iteration                                       8 │
│ accuracy                                            0.6076 │
│ loss                                               1.11366 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000007
(func pid=3876) [2,  8000] loss: 0.461 [repeated 2x across cluster]

Trial status: 8 TERMINATED | 2 RUNNING
Current time: 2025-08-07 18:24:38. Total running time: 2min 30s
Logical resource usage: 4.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00007   RUNNING        32    128   0.00200394              16        8           139.008    1.11366       0.6076 │
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        1            93.8431   1.96308       0.1973 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [2, 10000] loss: 0.358 [repeated 2x across cluster]

Trial train_cifar_6da15_00007 finished iteration 9 at 2025-08-07 18:24:44. Total running time: 2min 35s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000008 │
│ time_this_iter_s                                  11.80722 │
│ time_total_s                                     150.81524 │
│ training_iteration                                       9 │
│ accuracy                                            0.5799 │
│ loss                                                1.1971 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000008
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000008) [repeated 2x across cluster]
(func pid=3876) [2, 12000] loss: 0.291
(func pid=3883) [10,  2000] loss: 0.974

Trial train_cifar_6da15_00007 finished iteration 10 at 2025-08-07 18:24:55. Total running time: 2min 47s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000009 │
│ time_this_iter_s                                  11.67832 │
│ time_total_s                                     162.49356 │
│ training_iteration                                      10 │
│ accuracy                                            0.6162 │
│ loss                                                1.0932 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000009

Trial train_cifar_6da15_00007 completed after 10 iterations at 2025-08-07 18:24:55. Total running time: 2min 47s
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000009)
(func pid=3876) [2, 16000] loss: 0.213 [repeated 2x across cluster]
(func pid=3876) [2, 18000] loss: 0.190

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:25:08. Total running time: 3min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        1            93.8431   1.96308       0.1973 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [2, 20000] loss: 0.170

Trial train_cifar_6da15_00008 finished iteration 2 at 2025-08-07 18:25:17. Total running time: 3min 9s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000001 │
│ time_this_iter_s                                  62.93553 │
│ time_total_s                                     156.77867 │
│ training_iteration                                       2 │
│ accuracy                                            0.3405 │
│ loss                                               1.69089 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000001
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000001)
(func pid=3876) [3,  2000] loss: 1.652
(func pid=3876) [3,  4000] loss: 0.841
(func pid=3876) [3,  6000] loss: 0.563
(func pid=3876) [3,  8000] loss: 0.420

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:25:39. Total running time: 3min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        2           156.779    1.69089       0.3405 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [3, 10000] loss: 0.334
(func pid=3876) [3, 12000] loss: 0.279
(func pid=3876) [3, 14000] loss: 0.239
(func pid=3876) [3, 16000] loss: 0.208
(func pid=3876) [3, 18000] loss: 0.185
(func pid=3876) [3, 20000] loss: 0.166
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:26:09. Total running time: 4min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        2           156.779    1.69089       0.3405 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Trial train_cifar_6da15_00008 finished iteration 3 at 2025-08-07 18:26:16. Total running time: 4min 7s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000002 │
│ time_this_iter_s                                  58.77257 │
│ time_total_s                                     215.55124 │
│ training_iteration                                       3 │
│ accuracy                                            0.3434 │
│ loss                                               1.70959 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000002
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000002)
(func pid=3876) [4,  2000] loss: 1.621
(func pid=3876) [4,  4000] loss: 0.825
(func pid=3876) [4,  6000] loss: 0.551
(func pid=3876) [4,  8000] loss: 0.412

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:26:39. Total running time: 4min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        3           215.551    1.70959       0.3434 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [4, 10000] loss: 0.325
(func pid=3876) [4, 12000] loss: 0.273
(func pid=3876) [4, 14000] loss: 0.231
(func pid=3876) [4, 16000] loss: 0.208
(func pid=3876) [4, 18000] loss: 0.183
(func pid=3876) [4, 20000] loss: 0.166
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:27:09. Total running time: 5min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008   RUNNING         2     32   0.00087078               2        3           215.551    1.70959       0.3434 │
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Trial train_cifar_6da15_00008 finished iteration 4 at 2025-08-07 18:27:14. Total running time: 5min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result                       │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                      checkpoint_000003 │
│ time_this_iter_s                                  58.50352 │
│ time_total_s                                     274.05476 │
│ training_iteration                                       4 │
│ accuracy                                            0.3533 │
│ loss                                               1.65703 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000003

Trial train_cifar_6da15_00008 completed after 4 iterations at 2025-08-07 18:27:14. Total running time: 5min 6s

Trial status: 10 TERMINATED
Current time: 2025-08-07 18:27:14. Total running time: 5min 6s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000   TERMINATED      2    128   0.0509604               16        1            27.4742   2.31793       0.0995 │
│ train_cifar_6da15_00001   TERMINATED      1      4   0.000583917              4        2           110.273    1.91341       0.2069 │
│ train_cifar_6da15_00002   TERMINATED      1    128   0.00404185               8        1            44.1901   2.31397       0.1008 │
│ train_cifar_6da15_00003   TERMINATED     16      1   0.0300776                2        1           105.781    2.33666       0.1024 │
│ train_cifar_6da15_00004   TERMINATED      1     16   0.047715                16        1            28.5259   2.30867       0.1035 │
│ train_cifar_6da15_00005   TERMINATED      1      1   0.0207046                8        1            43.2041   2.31003       0.1022 │
│ train_cifar_6da15_00006   TERMINATED    256    256   0.000109715             16        2            52.0133   2.28008       0.1685 │
│ train_cifar_6da15_00007   TERMINATED     32    128   0.00200394              16       10           162.494    1.0932        0.6162 │
│ train_cifar_6da15_00008   TERMINATED      2     32   0.00087078               2        4           274.055    1.65703       0.3533 │
│ train_cifar_6da15_00009   TERMINATED      4     64   0.00109616               8        4           107.8      1.33079       0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Best trial config: {'l1': 32, 'l2': 128, 'lr': 0.0020039435509646582, 'batch_size': 16}
Best trial final validation loss: 1.0931986199855805
Best trial final validation accuracy: 0.6162
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000003)
Best trial test set accuracy: 0.6193

If you run the code yourself, an example output might look like this:

Number of trials: 10/10 (10 TERMINATED)
+-----+--------------+------+------+-------------+--------+---------+------------+
| ... |   batch_size |   l1 |   l2 |          lr |   iter |    loss |   accuracy |
|-----+--------------+------+------+-------------+--------+---------+------------|
| ... |            2 |    1 |  256 | 0.000668163 |      1 | 2.31479 |     0.0977 |
| ... |            4 |   64 |    8 | 0.0331514   |      1 | 2.31605 |     0.0983 |
| ... |            4 |    2 |    1 | 0.000150295 |      1 | 2.30755 |     0.1023 |
| ... |           16 |   32 |   32 | 0.0128248   |     10 | 1.66912 |     0.4391 |
| ... |            4 |    8 |  128 | 0.00464561  |      2 | 1.7316  |     0.3463 |
| ... |            8 |  256 |    8 | 0.00031556  |      1 | 2.19409 |     0.1736 |
| ... |            4 |   16 |  256 | 0.00574329  |      2 | 1.85679 |     0.3368 |
| ... |            8 |    2 |    2 | 0.00325652  |      1 | 2.30272 |     0.0984 |
| ... |            2 |    2 |    2 | 0.000342987 |      2 | 1.76044 |     0.292  |
| ... |            4 |   64 |   32 | 0.003734    |      8 | 1.53101 |     0.4761 |
+-----+--------------+------+------+-------------+--------+---------+------------+

Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0037339984519545164, 'batch_size': 4}
Best trial final validation loss: 1.5310075663924216
Best trial final validation accuracy: 0.4761
Best trial test set accuracy: 0.4737

Most trials were stopped early in order to avoid wasting resources. The best performing trial reached a validation accuracy of about 47%, which was confirmed on the test set.
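
The best trial is looked up programmatically once the run finishes. Below is a minimal sketch of that lookup, assuming the search was launched with tune.run and its return value was bound to result (an ExperimentAnalysis object); best_trained_model and the variable names are illustrative, and the checkpoint layout matches the data.pkl format used by the training function above.

# Sketch: select the best trial by validation loss (legacy Ray Tune API).
best_trial = result.get_best_trial("loss", "min", "last")
print(f"Best trial config: {best_trial.config}")
print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")

# Rebuild the network with the best hyperparameters and restore its
# checkpoint so the model can be evaluated on the held-out test set.
best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
best_checkpoint = result.get_best_checkpoint(
    trial=best_trial, metric="accuracy", mode="max"
)
with best_checkpoint.as_directory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "rb") as fp:
        best_checkpoint_data = pickle.load(fp)
    best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])

After restoring the state dict, the model can be run over the test loader exactly as in a regular PyTorch evaluation loop, which is how the "Best trial test set accuracy" line in the output above is produced.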

That's it! You can now tune the parameters of your PyTorch models.

Total running time of the script: (5 minutes 23.228 seconds)