Note
Go to the end to download the full example code.
Hyperparameter Tuning with Ray Tune#
Created On: Aug 31, 2020 | Last Updated: Jun 24, 2025 | Last Verified: Nov 05, 2024
Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often simple things like choosing a different learning rate or changing a network layer size can have a dramatic impact on your model's performance.
Fortunately, there are tools that help with finding the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with various analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.
In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend this tutorial from the PyTorch documentation for training a CIFAR10 image classifier.
As you will see, we only need to make a few small modifications. Specifically, we need to
wrap the data loading and training in functions,
make some network parameters configurable,
add checkpointing (optional),
and define the search space for the model tuning
To run this tutorial, please make sure the following packages are installed:
ray[tune]: the distributed hyperparameter tuning library
torchvision: for the data transforms
Setup / Imports#
Let's start with the imports:
from functools import partial
import os
import tempfile
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray import train
from ray.train import Checkpoint, get_checkpoint
from ray.tune.schedulers import ASHAScheduler
import ray.cloudpickle as pickle
Most of the imports are needed for building the PyTorch model. Only the last few are for Ray Tune.
Data loaders#
We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.
def load_data(data_dir="./data"):
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
trainset = torchvision.datasets.CIFAR10(
root=data_dir, train=True, download=True, transform=transform
)
testset = torchvision.datasets.CIFAR10(
root=data_dir, train=False, download=True, transform=transform
)
return trainset, testset
Configurable neural network#
We can only tune those parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers:
class Net(nn.Module):
def __init__(self, l1=120, l2=84):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, l1)
self.fc2 = nn.Linear(l1, l2)
self.fc3 = nn.Linear(l2, 10)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = torch.flatten(x, 1) # flatten all dimensions except batch
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
The train function#
Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.
We wrap the training script in a function train_cifar(config, data_dir=None). The config parameter receives the hyperparameters we would like to train with. The data_dir specifies the directory where we load and store the data, so that multiple runs can share the same data source. We also load the model and optimizer state at the start of the run, if a checkpoint is provided. Further down in this tutorial you will find information on how to save the checkpoint and what it is used for.
net = Net(config["l1"], config["l2"])
checkpoint = get_checkpoint()
if checkpoint:
with checkpoint.as_directory() as checkpoint_dir:
data_path = Path(checkpoint_dir) / "data.pkl"
with open(data_path, "rb") as fp:
checkpoint_state = pickle.load(fp)
start_epoch = checkpoint_state["epoch"]
net.load_state_dict(checkpoint_state["net_state_dict"])
optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
start_epoch = 0
The learning rate of the optimizer is made configurable, too:
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
We also split the training data into a training and validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and validation sets are configurable as well.
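The 80/20 split arithmetic can be illustrated with plain Python lists standing in for the CIFAR10 training set. The actual code below uses torch.utils.data.random_split, which does the equivalent on a Dataset; `split_train_val` here is a hypothetical helper for illustration only:

```python
import random

def split_train_val(dataset, val_fraction=0.2, seed=0):
    """Shuffle indices, then cut them into train/val parts (80/20 by default)."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_train = int(len(dataset) * (1 - val_fraction))
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    return [dataset[i] for i in train_idx], [dataset[i] for i in val_idx]

# Stand-in for the 50,000 CIFAR10 training images.
data = list(range(50000))
train_subset, val_subset = split_train_val(data)
print(len(train_subset), len(val_subset))  # 40000 10000
```

Because the split is random rather than a plain prefix/suffix cut, the validation subset is representative of the whole training distribution.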
Adding (multi) GPU support with DataParallel#
Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data parallel training on multiple GPUs:
device = "cpu"
if torch.cuda.is_available():
device = "cuda:0"
if torch.cuda.device_count() > 1:
net = nn.DataParallel(net)
net.to(device)
By using a device variable, we make sure that training also works when no GPU is available. PyTorch requires us to send our data to the GPU memory explicitly, like this:
for i, data in enumerate(trainloader, 0):
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share a GPU among trials, as long as the model still fits in GPU memory. We will come back to that later.
Communicating with Ray Tune#
The most interesting part is the communication with Ray Tune:
checkpoint_data = {
"epoch": epoch,
"net_state_dict": net.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
}
with tempfile.TemporaryDirectory() as checkpoint_dir:
data_path = Path(checkpoint_dir) / "data.pkl"
with open(data_path, "wb") as fp:
pickle.dump(checkpoint_data, fp)
checkpoint = Checkpoint.from_directory(checkpoint_dir)
train.report(
{"loss": val_loss / val_steps, "accuracy": correct / total},
checkpoint=checkpoint,
)
Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early, in order to avoid wasting resources on those trials.
Saving the checkpoint is optional. However, it is necessary if we want to use advanced schedulers like Population Based Training. Also, by saving the checkpoint we can later load the trained models and validate them on a test set. Lastly, saving checkpoints is useful for fault tolerance, as it allows us to interrupt training and continue it later.
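The save/restore round trip used above can be exercised in isolation with the standard library. The real code additionally wraps the directory in Checkpoint.from_directory and retrieves it via get_checkpoint; the state dicts below are stand-in values for illustration:

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for the real state dicts saved by train_cifar.
checkpoint_data = {"epoch": 3, "net_state_dict": {}, "optimizer_state_dict": {}}

with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    # Save: serialize the training state into the checkpoint directory.
    with open(data_path, "wb") as fp:
        pickle.dump(checkpoint_data, fp)
    # Restore: this mirrors the loading branch at the top of train_cifar.
    with open(data_path, "rb") as fp:
        restored = pickle.load(fp)

print(restored["epoch"])  # 3
```

Restoring the epoch counter is what lets an interrupted trial resume from where it left off instead of restarting at epoch 0.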
Full training function#
The full code example looks like this:
def train_cifar(config, data_dir=None):
net = Net(config["l1"], config["l2"])
device = "cpu"
if torch.cuda.is_available():
device = "cuda:0"
if torch.cuda.device_count() > 1:
net = nn.DataParallel(net)
net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
checkpoint = get_checkpoint()
if checkpoint:
with checkpoint.as_directory() as checkpoint_dir:
data_path = Path(checkpoint_dir) / "data.pkl"
with open(data_path, "rb") as fp:
checkpoint_state = pickle.load(fp)
start_epoch = checkpoint_state["epoch"]
net.load_state_dict(checkpoint_state["net_state_dict"])
optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
start_epoch = 0
trainset, testset = load_data(data_dir)
test_abs = int(len(trainset) * 0.8)
train_subset, val_subset = random_split(
trainset, [test_abs, len(trainset) - test_abs]
)
trainloader = torch.utils.data.DataLoader(
train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
)
valloader = torch.utils.data.DataLoader(
val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
)
for epoch in range(start_epoch, 10): # loop over the dataset multiple times
running_loss = 0.0
epoch_steps = 0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
epoch_steps += 1
if i % 2000 == 1999: # print every 2000 mini-batches
print(
"[%d, %5d] loss: %.3f"
% (epoch + 1, i + 1, running_loss / epoch_steps)
)
running_loss = 0.0
# Validation loss
val_loss = 0.0
val_steps = 0
total = 0
correct = 0
for i, data in enumerate(valloader, 0):
with torch.no_grad():
inputs, labels = data
inputs, labels = inputs.to(device), labels.to(device)
outputs = net(inputs)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
loss = criterion(outputs, labels)
val_loss += loss.cpu().numpy()
val_steps += 1
checkpoint_data = {
"epoch": epoch,
"net_state_dict": net.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
}
with tempfile.TemporaryDirectory() as checkpoint_dir:
data_path = Path(checkpoint_dir) / "data.pkl"
with open(data_path, "wb") as fp:
pickle.dump(checkpoint_data, fp)
checkpoint = Checkpoint.from_directory(checkpoint_dir)
train.report(
{"loss": val_loss / val_steps, "accuracy": correct / total},
checkpoint=checkpoint,
)
print("Finished Training")
As you can see, most of the code is adapted directly from the original example.
Test set accuracy#
Commonly, the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training the model. We also wrap this in a function:
def test_accuracy(net, device="cpu"):
trainset, testset = load_data()
testloader = torch.utils.data.DataLoader(
testset, batch_size=4, shuffle=False, num_workers=2
)
correct = 0
total = 0
with torch.no_grad():
for data in testloader:
images, labels = data
images, labels = images.to(device), labels.to(device)
outputs = net(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
return correct / total
The function also expects a device parameter, so we can do the test set validation on a GPU.
Configuring the search space#
Lastly, we need to define Ray Tune's search space. Here is an example:
config = {
"l1": tune.choice([2 ** i for i in range(9)]),
"l2": tune.choice([2 ** i for i in range(9)]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([2, 4, 8, 16])
}
tune.choice() accepts a list of values that are uniformly sampled from. In this example, the l1 and l2 parameters are powers of 2 between 1 and 256: 1, 2, 4, 8, 16, 32, 64, 128, or 256 (2 ** i for i in range(9) covers 2^0 through 2^8). The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.
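To build intuition for what a single draw from this search space looks like, here is a stdlib-only sketch that mimics the samplers. Ray Tune performs this sampling internally; `sample_config` is a hypothetical helper for illustration, not part of the Tune API:

```python
import math
import random

rng = random.Random(42)

def sample_config():
    """Mimic one random draw from the search space defined above."""
    return {
        "l1": rng.choice([2 ** i for i in range(9)]),  # 1, 2, 4, ..., 256
        "l2": rng.choice([2 ** i for i in range(9)]),
        # Log-uniform: sample uniformly in log-space, then exponentiate,
        # so 1e-4..1e-3 is as likely as 1e-2..1e-1.
        "lr": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }

config = sample_config()
print(config)
```

Log-uniform sampling matters for learning rates because their useful values span several orders of magnitude; a plain uniform draw over [1e-4, 1e-1] would almost never land below 1e-2.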
At each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among them. We also use the ASHAScheduler, which will terminate badly performing trials early.
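To see why this saves resources: with the settings used below (grace_period=1, reduction_factor=2, max_t=10), ASHA evaluates trials at exponentially spaced epoch milestones and only lets the better-scoring fraction continue past each one. The milestone arithmetic can be sketched as follows (an illustration of the successive-halving schedule, not Ray's implementation):

```python
def asha_rungs(grace_period=1, reduction_factor=2, max_t=10):
    """Epoch milestones at which ASHA decides whether a trial may continue."""
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs

print(asha_rungs())  # [1, 2, 4, 8]
```

At each rung, roughly the top 1/reduction_factor of trials advance, so most of the compute budget ends up concentrated on promising configurations.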
We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial:
gpus_per_trial = 2
# ...
result = tune.run(
partial(train_cifar, data_dir=data_dir),
resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
config=config,
num_samples=num_samples,
scheduler=scheduler,
checkpoint_at_end=True)
You can specify the number of CPUs, which are then available, e.g., to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that haven't been requested for them, so you don't have to worry about two trials using the same set of resources.
Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share a GPU among each other. You just have to make sure that the models still fit in the GPU memory.
After training the models, we will find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.
The full main function looks like this:
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
data_dir = os.path.abspath("./data")
load_data(data_dir)
config = {
"l1": tune.choice([2**i for i in range(9)]),
"l2": tune.choice([2**i for i in range(9)]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([2, 4, 8, 16]),
}
scheduler = ASHAScheduler(
metric="loss",
mode="min",
max_t=max_num_epochs,
grace_period=1,
reduction_factor=2,
)
result = tune.run(
partial(train_cifar, data_dir=data_dir),
resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
config=config,
num_samples=num_samples,
scheduler=scheduler,
)
best_trial = result.get_best_trial("loss", "min", "last")
print(f"Best trial config: {best_trial.config}")
print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")
best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
device = "cpu"
if torch.cuda.is_available():
device = "cuda:0"
if gpus_per_trial > 1:
best_trained_model = nn.DataParallel(best_trained_model)
best_trained_model.to(device)
best_checkpoint = result.get_best_checkpoint(trial=best_trial, metric="accuracy", mode="max")
with best_checkpoint.as_directory() as checkpoint_dir:
data_path = Path(checkpoint_dir) / "data.pkl"
with open(data_path, "rb") as fp:
best_checkpoint_data = pickle.load(fp)
best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])
test_acc = test_accuracy(best_trained_model, device)
print("Best trial test set accuracy: {}".format(test_acc))
if __name__ == "__main__":
# You can change the number of GPUs per trial here:
main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
0%|          | 0.00/170M [00:00<?, ?B/s]
100%|██████████| 170M/170M [00:04<00:00, 37.4MB/s]
2025-08-07 18:22:07,276 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147467264 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-08-07 18:22:07,439 INFO worker.py:1642 -- Started a local Ray instance.
2025-08-07 18:22:08,368 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`.
2025-08-07 18:22:08,370 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment train_cifar_2025-08-07_18-22-08 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler AsyncHyperBandScheduler │
│ Number of trials 10 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08
To visualize your results with TensorBoard, run: `tensorboard --logdir /var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08`
Trial status: 10 PENDING
Current time: 2025-08-07 18:22:08. Total running time: 0s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭───────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size │
├───────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000 PENDING 2 128 0.0509604 16 │
│ train_cifar_6da15_00001 PENDING 1 4 0.000583917 4 │
│ train_cifar_6da15_00002 PENDING 1 128 0.00404185 8 │
│ train_cifar_6da15_00003 PENDING 16 1 0.0300776 2 │
│ train_cifar_6da15_00004 PENDING 1 16 0.047715 16 │
│ train_cifar_6da15_00005 PENDING 1 1 0.0207046 8 │
│ train_cifar_6da15_00006 PENDING 256 256 0.000109715 16 │
│ train_cifar_6da15_00007 PENDING 32 128 0.00200394 16 │
│ train_cifar_6da15_00008 PENDING 2 32 0.00087078 2 │
│ train_cifar_6da15_00009 PENDING 4 64 0.00109616 8 │
╰───────────────────────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 config │
├──────────────────────────────────────────────────┤
│ batch_size 16 │
│ l1 256 │
│ l2 256 │
│ lr 0.00011 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00002 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00002 config │
├──────────────────────────────────────────────────┤
│ batch_size 8 │
│ l1 1 │
│ l2 128 │
│ lr 0.00404 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00001 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 config │
├──────────────────────────────────────────────────┤
│ batch_size 4 │
│ l1 1 │
│ l2 4 │
│ lr 0.00058 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00003 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00003 config │
├──────────────────────────────────────────────────┤
│ batch_size 2 │
│ l1 16 │
│ l2 1 │
│ lr 0.03008 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00004 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00004 config │
├──────────────────────────────────────────────────┤
│ batch_size 16 │
│ l1 1 │
│ l2 16 │
│ lr 0.04771 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00000 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00000 config │
├──────────────────────────────────────────────────┤
│ batch_size 16 │
│ l1 2 │
│ l2 128 │
│ lr 0.05096 │
╰──────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 started with configuration:
╭────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 config │
├────────────────────────────────────────────────┤
│ batch_size 16 │
│ l1 32 │
│ l2 128 │
│ lr 0.002 │
╰────────────────────────────────────────────────╯
Trial train_cifar_6da15_00005 started with configuration:
╭─────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00005 config │
├─────────────────────────────────────────────────┤
│ batch_size 8 │
│ l1 1 │
│ l2 1 │
│ lr 0.0207 │
╰─────────────────────────────────────────────────╯
(func pid=3879) [1, 2000] loss: 2.337
(func pid=3882) [1, 2000] loss: 2.301 [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Trial status: 8 RUNNING | 2 PENDING
Current time: 2025-08-07 18:22:38. Total running time: 30s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭───────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size │
├───────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000 RUNNING 2 128 0.0509604 16 │
│ train_cifar_6da15_00001 RUNNING 1 4 0.000583917 4 │
│ train_cifar_6da15_00002 RUNNING 1 128 0.00404185 8 │
│ train_cifar_6da15_00003 RUNNING 16 1 0.0300776 2 │
│ train_cifar_6da15_00004 RUNNING 1 16 0.047715 16 │
│ train_cifar_6da15_00005 RUNNING 1 1 0.0207046 8 │
│ train_cifar_6da15_00006 RUNNING 256 256 0.000109715 16 │
│ train_cifar_6da15_00007 RUNNING 32 128 0.00200394 16 │
│ train_cifar_6da15_00008 PENDING 2 32 0.00087078 2 │
│ train_cifar_6da15_00009 PENDING 4 64 0.00109616 8 │
╰───────────────────────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 finished iteration 1 at 2025-08-07 18:22:40. Total running time: 31s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 27.01455 │
│ time_total_s 27.01455 │
│ training_iteration 1 │
│ accuracy 0.4034 │
│ loss 1.64608 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000000
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000000)
(func pid=3879) [1, 4000] loss: 1.169
Trial train_cifar_6da15_00000 finished iteration 1 at 2025-08-07 18:22:40. Total running time: 32s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00000 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 27.47424 │
│ time_total_s 27.47424 │
│ training_iteration 1 │
│ accuracy 0.0995 │
│ loss 2.31793 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00000 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00000_0_batch_size=16,l1=2,l2=128,lr=0.0510_2025-08-07_18-22-08/checkpoint_000000
Trial train_cifar_6da15_00000 completed after 1 iterations at 2025-08-07 18:22:40. Total running time: 32s
Trial train_cifar_6da15_00008 started with configuration:
╭──────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 config │
├──────────────────────────────────────────────────┤
│ batch_size 2 │
│ l1 2 │
│ l2 32 │
│ lr 0.00087 │
╰──────────────────────────────────────────────────╯
(func pid=3878) [1, 4000] loss: 1.111
Trial train_cifar_6da15_00004 finished iteration 1 at 2025-08-07 18:22:41. Total running time: 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00004 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 28.52594 │
│ time_total_s 28.52594 │
│ training_iteration 1 │
│ accuracy 0.1035 │
│ loss 2.30867 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00004 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00004_4_batch_size=16,l1=1,l2=16,lr=0.0477_2025-08-07_18-22-08/checkpoint_000000
Trial train_cifar_6da15_00004 completed after 1 iterations at 2025-08-07 18:22:41. Total running time: 33s
Trial train_cifar_6da15_00009 started with configuration:
╭─────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 config │
├─────────────────────────────────────────────────┤
│ batch_size 8 │
│ l1 4 │
│ l2 64 │
│ lr 0.0011 │
╰─────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 finished iteration 1 at 2025-08-07 18:22:42. Total running time: 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 29.37421 │
│ time_total_s 29.37421 │
│ training_iteration 1 │
│ accuracy 0.1145 │
│ loss 2.29772 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000000
(func pid=3879) [1, 6000] loss: 0.781 [repeated 3x across cluster]
Trial train_cifar_6da15_00005 finished iteration 1 at 2025-08-07 18:22:56. Total running time: 48s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00005 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 43.20412 │
│ time_total_s 43.20412 │
│ training_iteration 1 │
│ accuracy 0.1022 │
│ loss 2.31003 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00005 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00005_5_batch_size=8,l1=1,l2=1,lr=0.0207_2025-08-07_18-22-08/checkpoint_000000
Trial train_cifar_6da15_00005 completed after 1 iterations at 2025-08-07 18:22:56. Total running time: 48s
(func pid=3881) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00005_5_batch_size=8,l1=1,l2=1,lr=0.0207_2025-08-07_18-22-08/checkpoint_000000) [repeated 4x across cluster]
(func pid=3880) [1, 2000] loss: 2.191 [repeated 4x across cluster]
Trial train_cifar_6da15_00002 finished iteration 1 at 2025-08-07 18:22:57. Total running time: 48s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00002 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 44.1901 │
│ time_total_s 44.1901 │
│ training_iteration 1 │
│ accuracy 0.1008 │
│ loss 2.31397 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00002 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00002_2_batch_size=8,l1=1,l2=128,lr=0.0040_2025-08-07_18-22-08/checkpoint_000000
Trial train_cifar_6da15_00002 completed after 1 iterations at 2025-08-07 18:22:57. Total running time: 48s
Trial train_cifar_6da15_00007 finished iteration 2 at 2025-08-07 18:23:01. Total running time: 52s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000001 │
│ time_this_iter_s 21.00816 │
│ time_total_s 48.02271 │
│ training_iteration 2 │
│ accuracy 0.502 │
│ loss 1.36508 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000001
(func pid=3876) [1, 4000] loss: 1.153 [repeated 4x across cluster]
Trial train_cifar_6da15_00006 finished iteration 2 at 2025-08-07 18:23:05. Total running time: 56s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00006 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000001 │
│ time_this_iter_s 22.63908 │
│ time_total_s 52.0133 │
│ training_iteration 2 │
│ accuracy 0.1685 │
│ loss 2.28008 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00006 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000001
Trial train_cifar_6da15_00006 completed after 2 iterations at 2025-08-07 18:23:05. Total running time: 56s
(func pid=3882) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00006_6_batch_size=16,l1=256,l2=256,lr=0.0001_2025-08-07_18-22-08/checkpoint_000001) [repeated 3x across cluster]
Trial status: 5 TERMINATED | 5 RUNNING
Current time: 2025-08-07 18:23:08. Total running time: 1min 0s
Logical resource usage: 10.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00001 RUNNING 1 4 0.000583917 4 │
│ train_cifar_6da15_00003 RUNNING 16 1 0.0300776 2 │
│ train_cifar_6da15_00007 RUNNING 32 128 0.00200394 16 2 48.0227 1.36508 0.502 │
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 │
│ train_cifar_6da15_00009 RUNNING 4 64 0.00109616 8 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3878) [1, 10000] loss: 0.392 [repeated 3x across cluster]
Trial train_cifar_6da15_00009 finished iteration 1 at 2025-08-07 18:23:14. Total running time: 1min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 33.09376 │
│ time_total_s 33.09376 │
│ training_iteration 1 │
│ accuracy 0.3654 │
│ loss 1.69085 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000000
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000000)
(func pid=3879) [1, 12000] loss: 0.389 [repeated 3x across cluster]
Trial train_cifar_6da15_00001 finished iteration 1 at 2025-08-07 18:23:17. Total running time: 1min 8s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 63.91659 │
│ time_total_s 63.91659 │
│ training_iteration 1 │
│ accuracy 0.1995 │
│ loss 1.92404 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00001 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000000
(func pid=3878) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000000)
Trial train_cifar_6da15_00007 finished iteration 3 at 2025-08-07 18:23:17. Total running time: 1min 9s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000002 │
│ time_this_iter_s 16.28883 │
│ time_total_s 64.31153 │
│ training_iteration 3 │
│ accuracy 0.533 │
│ loss 1.30026 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000002
(func pid=3879) [1, 14000] loss: 0.334 [repeated 2x across cluster]
(func pid=3883) [4, 2000] loss: 1.245 [repeated 4x across cluster]
Trial train_cifar_6da15_00007 finished iteration 4 at 2025-08-07 18:23:33. Total running time: 1min 25s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000003 │
│ time_this_iter_s 15.78644 │
│ time_total_s 80.09798 │
│ training_iteration 4 │
│ accuracy 0.5692 │
│ loss 1.21548 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000003
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000003) [repeated 2x across cluster]
(func pid=3876) [1, 12000] loss: 0.384 [repeated 4x across cluster]
Trial status: 5 TERMINATED | 5 RUNNING
Current time: 2025-08-07 18:23:38. Total running time: 1min 30s
Logical resource usage: 10.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00001 RUNNING 1 4 0.000583917 4 1 63.9166 1.92404 0.1995 │
│ train_cifar_6da15_00003 RUNNING 16 1 0.0300776 2 │
│ train_cifar_6da15_00007 RUNNING 32 128 0.00200394 16 4 80.098 1.21548 0.5692 │
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 │
│ train_cifar_6da15_00009 RUNNING 4 64 0.00109616 8 1 33.0938 1.69085 0.3654 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3878) [2, 6000] loss: 0.635 [repeated 2x across cluster]
Trial train_cifar_6da15_00009 finished iteration 2 at 2025-08-07 18:23:41. Total running time: 1min 33s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000001 │
│ time_this_iter_s 26.73551 │
│ time_total_s 59.82926 │
│ training_iteration 2 │
│ accuracy 0.4238 │
│ loss 1.50893 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000001
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000001)
Trial train_cifar_6da15_00007 finished iteration 5 at 2025-08-07 18:23:49. Total running time: 1min 40s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000004 │
│ time_this_iter_s 15.66899 │
│ time_total_s 95.76696 │
│ training_iteration 5 │
│ accuracy 0.586 │
│ loss 1.17639 │
╰────────────────────────────────────────────────────────────╯
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000004)
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000004
(func pid=3878) [2, 8000] loss: 0.473 [repeated 4x across cluster]
(func pid=3878) [2, 10000] loss: 0.375 [repeated 3x across cluster]
Trial train_cifar_6da15_00003 finished iteration 1 at 2025-08-07 18:23:58. Total running time: 1min 50s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00003 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 105.78079 │
│ time_total_s 105.78079 │
│ training_iteration 1 │
│ accuracy 0.1024 │
│ loss 2.33666 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00003 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00003_3_batch_size=2,l1=16,l2=1,lr=0.0301_2025-08-07_18-22-08/checkpoint_000000
Trial train_cifar_6da15_00003 completed after 1 iterations at 2025-08-07 18:23:58. Total running time: 1min 50s
(func pid=3879) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00003_3_batch_size=2,l1=16,l2=1,lr=0.0301_2025-08-07_18-22-08/checkpoint_000000)
Trial train_cifar_6da15_00001 finished iteration 2 at 2025-08-07 18:24:03. Total running time: 1min 54s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00001 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000001 │
│ time_this_iter_s 46.35663 │
│ time_total_s 110.27322 │
│ training_iteration 2 │
│ accuracy 0.2069 │
│ loss 1.91341 │
╰────────────────────────────────────────────────────────────╯
(func pid=3878) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000001)
Trial train_cifar_6da15_00001 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00001_1_batch_size=4,l1=1,l2=4,lr=0.0006_2025-08-07_18-22-08/checkpoint_000001
Trial train_cifar_6da15_00001 completed after 2 iterations at 2025-08-07 18:24:03. Total running time: 1min 54s
Trial train_cifar_6da15_00007 finished iteration 6 at 2025-08-07 18:24:04. Total running time: 1min 56s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000005 │
│ time_this_iter_s 15.49965 │
│ time_total_s 111.26661 │
│ training_iteration 6 │
│ accuracy 0.6106 │
│ loss 1.11135 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000005
(func pid=3876) [1, 20000] loss: 0.202 [repeated 4x across cluster]
Trial train_cifar_6da15_00009 finished iteration 3 at 2025-08-07 18:24:07. Total running time: 1min 58s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000002 │
│ time_this_iter_s 25.67368 │
│ time_total_s 85.50294 │
│ training_iteration 3 │
│ accuracy 0.4735 │
│ loss 1.41079 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000002
Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-08-07 18:24:08. Total running time: 2min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00007 RUNNING 32 128 0.00200394 16 6 111.267 1.11135 0.6106 │
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 │
│ train_cifar_6da15_00009 RUNNING 4 64 0.00109616 8 3 85.5029 1.41079 0.4735 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3883) [7, 2000] loss: 1.060
Trial train_cifar_6da15_00008 finished iteration 1 at 2025-08-07 18:24:14. Total running time: 2min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000000 │
│ time_this_iter_s 93.84313 │
│ time_total_s 93.84313 │
│ training_iteration 1 │
│ accuracy 0.1973 │
│ loss 1.96308 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000000
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000000) [repeated 3x across cluster]
(func pid=3880) [4, 2000] loss: 1.375
Trial train_cifar_6da15_00007 finished iteration 7 at 2025-08-07 18:24:18. Total running time: 2min 10s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000006 │
│ time_this_iter_s 14.08048 │
│ time_total_s 125.3471 │
│ training_iteration 7 │
│ accuracy 0.6064 │
│ loss 1.12497 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000006
(func pid=3876) [2, 2000] loss: 1.970
(func pid=3880) [4, 4000] loss: 0.692
(func pid=3883) [8, 2000] loss: 1.031 [repeated 2x across cluster]
Trial train_cifar_6da15_00009 finished iteration 4 at 2025-08-07 18:24:29. Total running time: 2min 21s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00009 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000003 │
│ time_this_iter_s 22.29746 │
│ time_total_s 107.8004 │
│ training_iteration 4 │
│ accuracy 0.5039 │
│ loss 1.33079 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00009 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000003
Trial train_cifar_6da15_00009 completed after 4 iterations at 2025-08-07 18:24:29. Total running time: 2min 21s
(func pid=3880) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00009_9_batch_size=8,l1=4,l2=64,lr=0.0011_2025-08-07_18-22-08/checkpoint_000003) [repeated 2x across cluster]
Trial train_cifar_6da15_00007 finished iteration 8 at 2025-08-07 18:24:32. Total running time: 2min 24s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000007 │
│ time_this_iter_s 13.66092 │
│ time_total_s 139.00802 │
│ training_iteration 8 │
│ accuracy 0.6076 │
│ loss 1.11366 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000007
(func pid=3876) [2, 8000] loss: 0.461 [repeated 2x across cluster]
Trial status: 8 TERMINATED | 2 RUNNING
Current time: 2025-08-07 18:24:38. Total running time: 2min 30s
Logical resource usage: 4.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00007 RUNNING 32 128 0.00200394 16 8 139.008 1.11366 0.6076 │
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 1 93.8431 1.96308 0.1973 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [2, 10000] loss: 0.358 [repeated 2x across cluster]
Trial train_cifar_6da15_00007 finished iteration 9 at 2025-08-07 18:24:44. Total running time: 2min 35s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000008 │
│ time_this_iter_s 11.80722 │
│ time_total_s 150.81524 │
│ training_iteration 9 │
│ accuracy 0.5799 │
│ loss 1.1971 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000008
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000008) [repeated 2x across cluster]
(func pid=3876) [2, 12000] loss: 0.291
(func pid=3883) [10, 2000] loss: 0.974
Trial train_cifar_6da15_00007 finished iteration 10 at 2025-08-07 18:24:55. Total running time: 2min 47s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00007 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000009 │
│ time_this_iter_s 11.67832 │
│ time_total_s 162.49356 │
│ training_iteration 10 │
│ accuracy 0.6162 │
│ loss 1.0932 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00007 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000009
Trial train_cifar_6da15_00007 completed after 10 iterations at 2025-08-07 18:24:55. Total running time: 2min 47s
(func pid=3883) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00007_7_batch_size=16,l1=32,l2=128,lr=0.0020_2025-08-07_18-22-08/checkpoint_000009)
(func pid=3876) [2, 16000] loss: 0.213 [repeated 2x across cluster]
(func pid=3876) [2, 18000] loss: 0.190
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:25:08. Total running time: 3min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 1 93.8431 1.96308 0.1973 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [2, 20000] loss: 0.170
Trial train_cifar_6da15_00008 finished iteration 2 at 2025-08-07 18:25:17. Total running time: 3min 9s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000001 │
│ time_this_iter_s 62.93553 │
│ time_total_s 156.77867 │
│ training_iteration 2 │
│ accuracy 0.3405 │
│ loss 1.69089 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000001
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000001)
(func pid=3876) [3, 2000] loss: 1.652
(func pid=3876) [3, 4000] loss: 0.841
(func pid=3876) [3, 6000] loss: 0.563
(func pid=3876) [3, 8000] loss: 0.420
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:25:39. Total running time: 3min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 2 156.779 1.69089 0.3405 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [3, 10000] loss: 0.334
(func pid=3876) [3, 12000] loss: 0.279
(func pid=3876) [3, 14000] loss: 0.239
(func pid=3876) [3, 16000] loss: 0.208
(func pid=3876) [3, 18000] loss: 0.185
(func pid=3876) [3, 20000] loss: 0.166
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:26:09. Total running time: 4min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 2 156.779 1.69089 0.3405 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 finished iteration 3 at 2025-08-07 18:26:16. Total running time: 4min 7s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000002 │
│ time_this_iter_s 58.77257 │
│ time_total_s 215.55124 │
│ training_iteration 3 │
│ accuracy 0.3434 │
│ loss 1.70959 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000002
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000002)
(func pid=3876) [4, 2000] loss: 1.621
(func pid=3876) [4, 4000] loss: 0.825
(func pid=3876) [4, 6000] loss: 0.551
(func pid=3876) [4, 8000] loss: 0.412
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:26:39. Total running time: 4min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 3 215.551 1.70959 0.3434 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(func pid=3876) [4, 10000] loss: 0.325
(func pid=3876) [4, 12000] loss: 0.273
(func pid=3876) [4, 14000] loss: 0.231
(func pid=3876) [4, 16000] loss: 0.208
(func pid=3876) [4, 18000] loss: 0.183
(func pid=3876) [4, 20000] loss: 0.166
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-08-07 18:27:09. Total running time: 5min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00008 RUNNING 2 32 0.00087078 2 3 215.551 1.70959 0.3434 │
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 finished iteration 4 at 2025-08-07 18:27:14. Total running time: 5min 6s
╭────────────────────────────────────────────────────────────╮
│ Trial train_cifar_6da15_00008 result │
├────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000003 │
│ time_this_iter_s 58.50352 │
│ time_total_s 274.05476 │
│ training_iteration 4 │
│ accuracy 0.3533 │
│ loss 1.65703 │
╰────────────────────────────────────────────────────────────╯
Trial train_cifar_6da15_00008 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000003
Trial train_cifar_6da15_00008 completed after 4 iterations at 2025-08-07 18:27:14. Total running time: 5min 6s
Trial status: 10 TERMINATED
Current time: 2025-08-07 18:27:14. Total running time: 5min 6s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_cifar_6da15_00000 TERMINATED 2 128 0.0509604 16 1 27.4742 2.31793 0.0995 │
│ train_cifar_6da15_00001 TERMINATED 1 4 0.000583917 4 2 110.273 1.91341 0.2069 │
│ train_cifar_6da15_00002 TERMINATED 1 128 0.00404185 8 1 44.1901 2.31397 0.1008 │
│ train_cifar_6da15_00003 TERMINATED 16 1 0.0300776 2 1 105.781 2.33666 0.1024 │
│ train_cifar_6da15_00004 TERMINATED 1 16 0.047715 16 1 28.5259 2.30867 0.1035 │
│ train_cifar_6da15_00005 TERMINATED 1 1 0.0207046 8 1 43.2041 2.31003 0.1022 │
│ train_cifar_6da15_00006 TERMINATED 256 256 0.000109715 16 2 52.0133 2.28008 0.1685 │
│ train_cifar_6da15_00007 TERMINATED 32 128 0.00200394 16 10 162.494 1.0932 0.6162 │
│ train_cifar_6da15_00008 TERMINATED 2 32 0.00087078 2 4 274.055 1.65703 0.3533 │
│ train_cifar_6da15_00009 TERMINATED 4 64 0.00109616 8 4 107.8 1.33079 0.5039 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Best trial config: {'l1': 32, 'l2': 128, 'lr': 0.0020039435509646582, 'batch_size': 16}
Best trial final validation loss: 1.0931986199855805
Best trial final validation accuracy: 0.6162
(func pid=3876) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-08-07_18-22-08/train_cifar_6da15_00008_8_batch_size=2,l1=2,l2=32,lr=0.0009_2025-08-07_18-22-08/checkpoint_000003)
Best trial test set accuracy: 0.6193
If you run the code yourself, an example output could look like this:
Number of trials: 10/10 (10 TERMINATED)
+-----+--------------+------+------+-------------+--------+---------+------------+
| ... | batch_size | l1 | l2 | lr | iter | loss | accuracy |
|-----+--------------+------+------+-------------+--------+---------+------------|
| ... | 2 | 1 | 256 | 0.000668163 | 1 | 2.31479 | 0.0977 |
| ... | 4 | 64 | 8 | 0.0331514 | 1 | 2.31605 | 0.0983 |
| ... | 4 | 2 | 1 | 0.000150295 | 1 | 2.30755 | 0.1023 |
| ... | 16 | 32 | 32 | 0.0128248 | 10 | 1.66912 | 0.4391 |
| ... | 4 | 8 | 128 | 0.00464561 | 2 | 1.7316 | 0.3463 |
| ... | 8 | 256 | 8 | 0.00031556 | 1 | 2.19409 | 0.1736 |
| ... | 4 | 16 | 256 | 0.00574329 | 2 | 1.85679 | 0.3368 |
| ... | 8 | 2 | 2 | 0.00325652 | 1 | 2.30272 | 0.0984 |
| ... | 2 | 2 | 2 | 0.000342987 | 2 | 1.76044 | 0.292 |
| ... | 4 | 64 | 32 | 0.003734 | 8 | 1.53101 | 0.4761 |
+-----+--------------+------+------+-------------+--------+---------+------------+
Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0037339984519545164, 'batch_size': 4}
Best trial final validation loss: 1.5310075663924216
Best trial final validation accuracy: 0.4761
Best trial test set accuracy: 0.4737
Most trials were stopped early in order to avoid wasting resources. The best-performing trial achieved a validation accuracy of about 47%, which was confirmed on the test set.
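The early stopping above comes from the ASHA scheduler: trials are compared at geometrically spaced "rungs", and the weaker ones are terminated at each rung. A minimal sketch in plain Python (no Ray required) of how those decision points are spaced, assuming illustrative `grace_period=1`, `reduction_factor=2`, and `max_t=10` values:

```python
def asha_rungs(grace_period=1, reduction_factor=2, max_t=10):
    """Return the iteration counts at which ASHA makes stop/continue decisions."""
    rungs = []
    t = grace_period
    while t < max_t:
        rungs.append(t)
        t *= reduction_factor  # rungs are spaced geometrically
    return rungs

print(asha_rungs())  # [1, 2, 4, 8]
```

This is consistent with the trial table above, where terminated trials stopped after 1, 2, or 4 iterations, and only the strongest trials ran to the full 10.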
That's it! You can now tune the parameters of your PyTorch models.
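To make the result concrete, the best config dict reported above maps directly onto the tunable `l1`/`l2` sizes of the `Net` defined earlier. A small illustrative sketch (the `fc_shapes` helper is hypothetical, not part of the tutorial code) showing the fully connected layer shapes that the best trial's config produces:

```python
# Best trial config as reported in the output above.
best_config = {"l1": 32, "l2": 128, "lr": 0.0020039435509646582, "batch_size": 16}

def fc_shapes(config):
    """(in_features, out_features) for Net's three fully connected layers."""
    return [
        (16 * 5 * 5, config["l1"]),  # fc1: flattened conv output -> l1
        (config["l1"], config["l2"]),  # fc2: l1 -> l2
        (config["l2"], 10),            # fc3: l2 -> 10 CIFAR10 classes
    ]

print(fc_shapes(best_config))  # [(400, 32), (32, 128), (128, 10)]
```

Passing this config to `Net(best_config["l1"], best_config["l2"])` rebuilds the winning architecture, which the tutorial then restores from the saved checkpoint for test-set evaluation.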
Total running time of the script: (5 minutes 23.228 seconds)