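Serialization

Serialization and deserialization come up often, especially when torchao is integrated with other libraries. This page describes how serialization and deserialization work for models optimized (quantized or sparsified) with torchao.

Serialization and deserialization flow

The full flow, from quantizing a model to saving and reloading its state_dict, looks like this: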

import copy
import tempfile
import torch
from torchao.utils import get_model_size_in_bytes
from torchao.quantization.quant_api import (
    quantize_,
    Int4WeightOnlyConfig,
)

class ToyLinearModel(torch.nn.Module):
    def __init__(self, m=64, n=32, k=64):
        super().__init__()
        self.linear1 = torch.nn.Linear(m, n, bias=False)
        self.linear2 = torch.nn.Linear(n, k, bias=False)

    def example_inputs(self, batch_size=1, dtype=torch.float32, device="cpu"):
        return (torch.randn(batch_size, self.linear1.in_features, dtype=dtype, device=device),)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

dtype = torch.bfloat16
m = ToyLinearModel(1024, 1024, 1024).eval().to(dtype).to("cuda")
print(f"original model size: {get_model_size_in_bytes(m) / 1024 / 1024} MB")

example_inputs = m.example_inputs(dtype=dtype, device="cuda")
quantize_(m, Int4WeightOnlyConfig())
print(f"quantized model size: {get_model_size_in_bytes(m) / 1024 / 1024} MB")

ref = m(*example_inputs)
with tempfile.NamedTemporaryFile() as f:
    torch.save(m.state_dict(), f)
    f.seek(0)
    state_dict = torch.load(f)

with torch.device("meta"):
    m_loaded = ToyLinearModel(1024, 1024, 1024).eval().to(dtype)

# `linear.weight` is nn.Parameter, so we check the type of `linear.weight.data`
print(f"type of weight before loading: {type(m_loaded.linear1.weight.data), type(m_loaded.linear2.weight.data)}")
m_loaded.load_state_dict(state_dict, assign=True)
print(f"type of weight after loading: {type(m_loaded.linear1.weight), type(m_loaded.linear2.weight)}")

res = m_loaded(*example_inputs)
assert torch.equal(res, ref)

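What happens when serializing an optimized model?

To serialize an optimized model, we only need to call torch.save(m.state_dict(), f). torchao uses tensor subclasses to represent different dtypes and to support optimization techniques such as quantization and sparsity, so after optimization the only change is that the weight tensors are replaced by optimized weight tensors. The model structure itself is unchanged. For example:

Original floating-point model state_dict: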

{"linear1.weight": float_weight1, "linear2.weight": float_weight2}

Quantized model state_dict:

{"linear1.weight": quantized_weight1, "linear2.weight": quantized_weight2, ...}

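The quantized model is usually smaller than the original floating-point model, though by how much depends on the technique and implementation used. You can print the model size with the torchao.utils.get_model_size_in_bytes utility. For the example above, quantized with Int4WeightOnlyConfig, the size drops by roughly 4x: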

original model size: 4.0 MB
quantized model size: 1.0625 MB
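
To see where these two numbers could come from, the arithmetic can be sketched in plain Python. This is only a back-of-the-envelope estimate, not how get_model_size_in_bytes is implemented: it assumes group-wise int4 quantization with a group size of 128, and one scale plus one zero point per group stored in 2 bytes each. Both are assumptions chosen because they reproduce the sizes printed above.

```python
def mib(num_bytes):
    """Convert a byte count to MiB (the unit printed as "MB" above)."""
    return num_bytes / 1024 / 1024

# Two bias-free 1024x1024 linear layers, as in ToyLinearModel(1024, 1024, 1024).
num_layers = 2
elements_per_layer = 1024 * 1024

# bfloat16 stores each weight in 2 bytes.
original_bytes = num_layers * elements_per_layer * 2

# ASSUMPTION: group-wise int4 quantization with 128 weights per group.
group_size = 128
# ASSUMPTION: each group keeps one scale and one zero point, 2 bytes each.
metadata_bytes_per_group = 2 * 2

packed_weight_bytes = elements_per_layer // 2  # two 4-bit values per byte
groups_per_layer = elements_per_layer // group_size
quantized_bytes = num_layers * (
    packed_weight_bytes + groups_per_layer * metadata_bytes_per_group
)

original_mib = mib(original_bytes)
quantized_mib = mib(quantized_bytes)
ratio = original_mib / quantized_mib

print(f"original:  {original_mib} MiB")
print(f"quantized: {quantized_mib} MiB")
print(f"ratio:     {ratio:.2f}x")
```

Under these assumptions the reduction comes out slightly below 4x, because the per-group scales and zero points add a small overhead on top of the packed 4-bit weights.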

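What happens when deserializing an optimized model?

To deserialize an optimized model, we can initialize the floating-point model on the meta device and then load the optimized state_dict with model.load_state_dict, passing assign=True: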

with torch.device("meta"):
    m_loaded = ToyLinearModel(1024, 1024, 1024).eval().to(dtype)

# `linear.weight` is nn.Parameter, so we check the type of `linear.weight.data`
print(f"type of weight before loading: {type(m_loaded.linear1.weight.data), type(m_loaded.linear2.weight.data)}")
m_loaded.load_state_dict(state_dict, assign=True)
print(f"type of weight after loading: {type(m_loaded.linear1.weight), type(m_loaded.linear2.weight)}")

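We initialize the model on the meta device to avoid materializing the original floating-point model, which may not fit on the device we want to use for inference.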

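During m_loaded.load_state_dict(state_dict, assign=True), each corresponding weight (e.g. m_loaded.linear1.weight) is replaced by the tensor from state_dict, which is an instance of an optimized tensor subclass (e.g. an int4 AffineQuantizedTensor). This step does not depend on torchao to work.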

We can also verify that the weights were loaded correctly by checking the type of the weight tensors:

type of weight before loading: (<class 'torch.Tensor'>, <class 'torch.Tensor'>)
type of weight after loading: (<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>, <class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>)
