Note: Go to the end to download the full example code.
Training with PyTorch
Created On: Nov 30, 2021 | Last Updated: May 31, 2023 | Last Verified: Nov 05, 2024
Follow along with the video below or on youtube.
Introduction
In past videos, we've discussed and demonstrated:
Building models with the neural network layers and functions of the torch.nn module
The mechanics of automated gradient computation, which is central to gradient-based model training
Using TensorBoard to visualize training progress and other activities
In this video, we'll be adding some new tools to your inventory:
We'll get familiar with the `Dataset` and `DataLoader` abstractions, and how they ease the process of feeding data to your model during a training loop
We'll discuss specific loss functions and when to use them
We'll look at PyTorch optimizers, which implement algorithms to adjust model weights based on the outcome of a loss function
Finally, we'll pull all of these together and see a full PyTorch training loop in action.
Dataset and DataLoader
The `Dataset` and `DataLoader` classes encapsulate the process of pulling your data from storage and exposing it to your training loop in batches.
The `Dataset` is responsible for accessing and processing single instances of data.
The `DataLoader` pulls instances of data from the `Dataset` (either automatically or with a sampler that you define), collects them in batches, and returns them for consumption by your training loop. The `DataLoader` works with all kinds of datasets, regardless of the type of data they contain.
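To illustrate that flexibility, here is a minimal sketch of a custom map-style `Dataset` that a `DataLoader` can batch just like the built-in ones. The class name and the random tensors here are purely illustrative, not part of this tutorial's pipeline:

import torch
from torch.utils.data import Dataset, DataLoader

# A map-style dataset only needs __len__ and __getitem__
class RandomVectorDataset(Dataset):
    def __init__(self, num_samples=100, dim=8):
        self.features = torch.randn(num_samples, dim)
        self.labels = torch.randint(0, 2, (num_samples,))

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

loader = DataLoader(RandomVectorDataset(), batch_size=4, shuffle=True)
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([4, 8]) torch.Size([4])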
For this tutorial, we'll be using the Fashion-MNIST dataset provided by TorchVision. We use `torchvision.transforms.Normalize()` to zero-center and normalize the distribution of the image tile content, and download both training and validation data splits.
import torch
import torchvision
import torchvision.transforms as transforms
# PyTorch TensorBoard support
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])
# Create datasets for training & validation, download if necessary
training_set = torchvision.datasets.FashionMNIST('./data', train=True, transform=transform, download=True)
validation_set = torchvision.datasets.FashionMNIST('./data', train=False, transform=transform, download=True)
# Create data loaders for our datasets; shuffle for training, not for validation
training_loader = torch.utils.data.DataLoader(training_set, batch_size=4, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=4, shuffle=False)
# Class labels
classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot')
# Report split sizes
print('Training set has {} instances'.format(len(training_set)))
print('Validation set has {} instances'.format(len(validation_set)))
Training set has 60000 instances
Validation set has 10000 instances
As always, let's visualize the data as a sanity check:
import matplotlib.pyplot as plt
import numpy as np
# Helper function for inline image display
def matplotlib_imshow(img, one_channel=False):
    if one_channel:
        img = img.mean(dim=0)
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    if one_channel:
        plt.imshow(npimg, cmap="Greys")
    else:
        plt.imshow(np.transpose(npimg, (1, 2, 0)))
dataiter = iter(training_loader)
images, labels = next(dataiter)
# Create a grid from the images and show them
img_grid = torchvision.utils.make_grid(images)
matplotlib_imshow(img_grid, one_channel=True)
print(' '.join(classes[labels[j]] for j in range(4)))

Sandal Sneaker Shirt Bag
The Model
The model we'll use in this example is a variant of LeNet-5; it should be familiar if you've watched the previous videos in this series.
import torch.nn as nn
import torch.nn.functional as F
# PyTorch models inherit from torch.nn.Module
class GarmentClassifier(nn.Module):
    def __init__(self):
        super(GarmentClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
model = GarmentClassifier()
Loss Function
For this example, we'll be using a cross-entropy loss. For demonstration purposes, we'll create batches of dummy output and label values, run them through the loss function, and examine the result.
loss_fn = torch.nn.CrossEntropyLoss()
# NB: Loss functions expect data in batches, so we're creating batches of 4
# Represents the model's confidence in each of the 10 classes for a given input
dummy_outputs = torch.rand(4, 10)
# Represents the correct class among the 10 being tested
dummy_labels = torch.tensor([1, 5, 3, 7])
print(dummy_outputs)
print(dummy_labels)
loss = loss_fn(dummy_outputs, dummy_labels)
print('Total loss for this batch: {}'.format(loss.item()))
tensor([[0.5981, 0.7205, 0.4472, 0.4691, 0.1565, 0.5347, 0.4308, 0.1182, 0.9646,
0.4539],
[0.6230, 0.4794, 0.2207, 0.2924, 0.7148, 0.8645, 0.5875, 0.5251, 0.6756,
0.0916],
[0.0501, 0.7904, 0.7441, 0.5225, 0.3061, 0.6760, 0.3924, 0.6372, 0.5151,
0.8732],
[0.2018, 0.5311, 0.8389, 0.1922, 0.0745, 0.7502, 0.9822, 0.4657, 0.7697,
0.1901]])
tensor([1, 5, 3, 7])
Total loss for this batch: 2.2027742862701416
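As a side note on what the loss function above computes: `nn.CrossEntropyLoss` takes raw, unnormalized scores (logits) and is equivalent to a log-softmax followed by a negative log-likelihood loss. A quick sketch of that equivalence, using the same dummy batch shapes as above:

import torch
import torch.nn.functional as F

logits = torch.rand(4, 10)            # raw scores for 10 classes
targets = torch.tensor([1, 5, 3, 7])  # correct class per sample

# CrossEntropyLoss == log_softmax + negative log-likelihood loss
loss_a = torch.nn.CrossEntropyLoss()(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_a, loss_b))  # True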
Optimizer
For this example, we'll be using simple stochastic gradient descent with momentum.
It can be instructive to try some variations on this optimization scheme:
Learning rate determines the size of the steps the optimizer takes. What does a different learning rate do to your training results, in terms of accuracy and convergence time?
Momentum nudges the optimizer in the direction of strongest gradient over multiple steps. What does changing this value do to your results?
Try some different optimization algorithms, such as averaged SGD, Adagrad, or Adam. How do your results differ? (A sketch of a few drop-in alternatives follows the code below.)
# Optimizers specified in the torch.optim package
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
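If you want to experiment along the lines suggested above, any of these can be assigned in place of the SGD optimizer. The hyperparameter values shown are just reasonable starting points, not recommendations:

# Averaged SGD
optimizer = torch.optim.ASGD(model.parameters(), lr=0.001)

# Adagrad adapts the learning rate per parameter
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

# Adam combines adaptive per-parameter learning rates with momentum-like terms
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)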
The Training Loop
Below, we have a function that performs one training epoch. It enumerates data from the DataLoader, and on each pass of the loop does the following:
Gets a batch of training data from the DataLoader
Zeros the optimizer's gradients
Performs an inference, that is, gets predictions from the model for an input batch
Computes the loss for that set of predictions vs. the labels on the dataset
Calculates the backward gradients over the learning weights
Tells the optimizer to perform one learning step, that is, adjust the model's learning weights based on the observed gradients for this batch, according to the optimization algorithm we chose
It reports on the loss for every 1000 batches.
Finally, it reports the average per-batch loss for the last 1000 batches, for comparison with a validation run.
def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    last_loss = 0.

    # Here, we use enumerate(training_loader) instead of
    # iter(training_loader) so that we can track the batch
    # index and do some intra-epoch reporting
    for i, data in enumerate(training_loader):
        # Every data instance is an input + label pair
        inputs, labels = data

        # Zero your gradients for every batch!
        optimizer.zero_grad()

        # Make predictions for this batch
        outputs = model(inputs)

        # Compute the loss and its gradients
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Adjust learning weights
        optimizer.step()

        # Gather data and report
        running_loss += loss.item()
        if i % 1000 == 999:
            last_loss = running_loss / 1000 # loss per batch
            print('  batch {} loss: {}'.format(i + 1, last_loss))
            tb_x = epoch_index * len(training_loader) + i + 1
            tb_writer.add_scalar('Loss/train', last_loss, tb_x)
            running_loss = 0.

    return last_loss
Per-Epoch Activity
There are a couple of things we'll want to do once per epoch:
Perform validation by checking our relative loss on a set of data that was not used for training, and report this
Save a copy of the model
Here, we'll do our reporting in TensorBoard. This will require going to the command line to start TensorBoard, and opening it in another browser tab.
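If you haven't launched it before, TensorBoard is typically started with `tensorboard --logdir=runs` from the directory containing the `runs/` folder that the `SummaryWriter` below writes to.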
# Initializing in a separate cell so we can easily add more epochs to the same run
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter('runs/fashion_trainer_{}'.format(timestamp))
epoch_number = 0
EPOCHS = 5
best_vloss = 1_000_000.
for epoch in range(EPOCHS):
    print('EPOCH {}:'.format(epoch_number + 1))

    # Make sure gradient tracking is on, and do a pass over the data
    model.train(True)
    avg_loss = train_one_epoch(epoch_number, writer)

    running_vloss = 0.0
    # Set the model to evaluation mode, disabling dropout and using population
    # statistics for batch normalization.
    model.eval()

    # Disable gradient computation and reduce memory consumption.
    with torch.no_grad():
        for i, vdata in enumerate(validation_loader):
            vinputs, vlabels = vdata
            voutputs = model(vinputs)
            vloss = loss_fn(voutputs, vlabels)
            running_vloss += vloss

    avg_vloss = running_vloss / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))

    # Log the running loss averaged per batch
    # for both training and validation
    writer.add_scalars('Training vs. Validation Loss',
                       { 'Training' : avg_loss, 'Validation' : avg_vloss },
                       epoch_number + 1)
    writer.flush()

    # Track best performance, and save the model's state
    if avg_vloss < best_vloss:
        best_vloss = avg_vloss
        model_path = 'model_{}_{}'.format(timestamp, epoch_number)
        torch.save(model.state_dict(), model_path)

    epoch_number += 1
EPOCH 1:
batch 1000 loss: 1.666409859918058
batch 2000 loss: 0.8216810051053762
batch 3000 loss: 0.693145078105852
batch 4000 loss: 0.6443965511168354
batch 5000 loss: 0.6123742864592933
batch 6000 loss: 0.5695766103928909
batch 7000 loss: 0.5409413252712693
batch 8000 loss: 0.5383153433622793
batch 9000 loss: 0.48026449825975576
batch 10000 loss: 0.4591459574009059
batch 11000 loss: 0.45217856835146086
batch 12000 loss: 0.431060717097309
batch 13000 loss: 0.41652981538244055
batch 14000 loss: 0.435001613863511
batch 15000 loss: 0.4117226452493924
LOSS train 0.4117226452493924 valid 0.42531269788742065
EPOCH 2:
batch 1000 loss: 0.3943638932242757
batch 2000 loss: 0.39510620032442967
batch 3000 loss: 0.40187308340048183
batch 4000 loss: 0.41561483964993384
batch 5000 loss: 0.37135440780574575
batch 6000 loss: 0.3847427979120985
batch 7000 loss: 0.3660853395376471
batch 8000 loss: 0.3599262051352125
batch 9000 loss: 0.36613601676898544
batch 10000 loss: 0.34619443843280895
batch 11000 loss: 0.3421523532573119
batch 12000 loss: 0.37944928950941537
batch 13000 loss: 0.3445565646337418
batch 14000 loss: 0.3472710616480363
batch 15000 loss: 0.3482665803800919
LOSS train 0.3482665803800919 valid 0.37191668152809143
EPOCH 3:
batch 1000 loss: 0.35298623689083614
batch 2000 loss: 0.31526475692175154
batch 3000 loss: 0.354445223361603
batch 4000 loss: 0.31954076824391087
batch 5000 loss: 0.30167409399730966
batch 6000 loss: 0.32178128572105563
batch 7000 loss: 0.31245879809299365
batch 8000 loss: 0.3102076395740296
batch 9000 loss: 0.3193566365780716
batch 10000 loss: 0.3245317395089805
batch 11000 loss: 0.32724233834208283
batch 12000 loss: 0.3273154704665576
batch 13000 loss: 0.3198279506397084
batch 14000 loss: 0.3135476417306054
batch 15000 loss: 0.31637832522210374
LOSS train 0.31637832522210374 valid 0.3359675407409668
EPOCH 4:
batch 1000 loss: 0.27941275065656734
batch 2000 loss: 0.2823940862530035
batch 3000 loss: 0.2894134281675447
batch 4000 loss: 0.3015546747631597
batch 5000 loss: 0.28293730535544453
batch 6000 loss: 0.2941953631842043
batch 7000 loss: 0.3244606865464448
batch 8000 loss: 0.2946359610656218
batch 9000 loss: 0.3051677185113658
batch 10000 loss: 0.2765467494608965
batch 11000 loss: 0.31629972430641645
batch 12000 loss: 0.3217379439852521
batch 13000 loss: 0.2986907337167504
batch 14000 loss: 0.2571812377775459
batch 15000 loss: 0.2835259589429043
LOSS train 0.2835259589429043 valid 0.33261457085609436
EPOCH 5:
batch 1000 loss: 0.27721858343805705
batch 2000 loss: 0.2762320360558242
batch 3000 loss: 0.2741196601182746
batch 4000 loss: 0.27815906952939257
batch 5000 loss: 0.2765891040311908
batch 6000 loss: 0.28914274197602935
batch 7000 loss: 0.27360277335835415
batch 8000 loss: 0.2811103402964691
batch 9000 loss: 0.2858065232049412
batch 10000 loss: 0.25068630761879285
batch 11000 loss: 0.2620843322443907
batch 12000 loss: 0.29563811091540265
batch 13000 loss: 0.2757980781155493
batch 14000 loss: 0.27335850994923383
batch 15000 loss: 0.2731545760410836
LOSS train 0.2731545760410836 valid 0.3064103424549103
To load a saved version of the model:
# PATH should point to a checkpoint saved above, e.g. the model_path
# written by the training loop
saved_model = GarmentClassifier()
saved_model.load_state_dict(torch.load(PATH))
Once you've loaded the model, it's ready for whatever you need it for: more training, inference, or analysis.
Do note that if your model has constructor parameters that affect model structure, you'll need to provide them and configure the model identically to the state in which it was saved.
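For instance, a hypothetical model whose constructor takes a hidden-layer size would need to be rebuilt with the same value before loading its weights. One common pattern, sketched here with made-up names, is to save that configuration alongside the state dict:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical model with a structural constructor parameter
class ConfigurableClassifier(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(16 * 4 * 4, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

# Save the configuration together with the weights...
model = ConfigurableClassifier(hidden_size=120)
torch.save({'hidden_size': 120, 'state_dict': model.state_dict()}, 'ckpt.pt')

# ...so the model can be reconstructed identically before loading
ckpt = torch.load('ckpt.pt')
restored = ConfigurableClassifier(hidden_size=ckpt['hidden_size'])
restored.load_state_dict(ckpt['state_dict'])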
Other Resources
Docs on the data utilities, including Dataset and DataLoader, at pytorch.org
A note on the use of pinned memory for GPU training
Documentation on the datasets available in TorchVision, TorchText, and TorchAudio
Documentation on the loss functions available in PyTorch
Documentation on the torch.optim package, which includes optimizers and related tools, such as learning rate scheduling
A detailed tutorial on saving and loading models
The Tutorials section of pytorch.org contains tutorials on a broad variety of training tasks, including classification in different domains, generative adversarial networks, reinforcement learning, and more
Total running time of the script: (3 minutes 3.170 seconds)