(Part 3) Serving on vLLM, SGLang, ExecuTorch¶
TorchAO provides an end-to-end pre-training, fine-tuning, and model optimization flow by leveraging our quantization and sparsity techniques integrated into partner frameworks. This is part 3 of a 3-part tutorial series showcasing this end-to-end flow, focusing on the serving step.

This tutorial demonstrates how to perform post-training quantization and deploy the model for inference, using torchao as the underlying optimization engine with seamless integration into HuggingFace Transformers, vLLM, and ExecuTorch.
Post-training Quantization with HuggingFace¶
HuggingFace Transformers provides seamless integration with torchao quantization. TorchAoConfig automatically applies torchao's optimized quantization algorithms during model loading.
First, install the required packages:
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate
In this example, we will use Float8DynamicActivationFloat8WeightConfig on the Phi-4 mini-instruct model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
model_id = "microsoft/Phi-4-mini-instruct"
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Push the model to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
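Optionally, run a quick smoke test to confirm the quantized model still generates sensible text. This is a minimal sketch; the prompt is arbitrary:
# Quick smoke test of the quantized model
prompt = "What is the capital of France?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(quantized_model.device)
output_ids = quantized_model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))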
Note
For more information on supported quantization and sparsity configurations, see the HF-TorchAO documentation.
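Other torchao configs can be swapped in the same way by passing them to TorchAoConfig. For example, here is a minimal sketch using int4 weight-only quantization instead, assuming a recent torchao build that provides Int4WeightOnlyConfig and a GPU with int4 kernel support (the group_size value is just an illustration):
# Example: use a different torchao config, e.g. int4 weight-only quantization
from torchao.quantization import Int4WeightOnlyConfig
int4_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=int4_config)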
Serving and Inference¶
Serving and Inference with vLLM¶
vLLM automatically leverages torchao's optimized kernels when serving quantized models, delivering significantly higher throughput.
First, install vLLM with torchao support:
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
To serve the model in vLLM, we use the checkpoint that was quantized and pushed to the Hugging Face hub in the previous step, Post-training Quantization with HuggingFace.
# Server
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Phi-4-mini-instruct-float8dq",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
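Since vLLM exposes an OpenAI-compatible API, the same request can also be issued from Python. Here is a minimal sketch using the openai client package (the api_key value is a placeholder; vLLM does not validate it by default):
# Query the vLLM server through its OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)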
Serving the float8 dynamic quantized model with vLLM shows a 36% VRAM reduction and a 1.15-1.2x inference speedup on H100, with almost no impact on accuracy. See the Memory Benchmarking and Performance Benchmarking sections below for more details.
Note
For more information on the vLLM integration, see the detailed guide Integration with VLLM: Architecture and Usage Guide.
Serving and Inference with SGLang¶
(Coming soon!)
Inference with Transformers¶
Install the required packages:
pip install git+https://github.com/huggingface/transformers@main
pip install torchao
pip install torch
pip install accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
torch.random.manual_seed(0)
model_path = "pytorch/Phi-4-mini-instruct-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
Mobile Deployment with ExecuTorch¶
ExecuTorch enables on-device inference through torchao's mobile-optimized quantization schemes. The 8da4w configuration (8-bit dynamic activations, 4-bit weights) is designed specifically for mobile deployment. Optionally, before lowering to ExecuTorch, we can fine-tune the model with QAT ((Part 2) Fine-tuning with QAT, QLoRA, and float8), which has been shown to improve the quality of the quantized model.
[Optional] Untie Embedding Weights¶
Optionally, we can quantize the embedding and lm_head differently. Since these layers are tied, we first need to untie the model:
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
)
import torch
from transformers.modeling_utils import find_tied_parameters
model_id = "microsoft/Phi-4-mini-instruct"
untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(untied_model)
print("tied weights:", find_tied_parameters(untied_model))
if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"):
    setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False)
untied_model._tied_weights_keys = []
untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone())
print("tied weights:", find_tied_parameters(untied_model))
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights"
untied_model.push_to_hub(save_to)
tokenizer.push_to_hub(save_to)
# or save locally
save_to_local_path = f"{MODEL_NAME}-untied-weights"
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)
Step 1: Create Mobile-Optimized Quantization¶
Quantize the model for mobile deployment with TorchAO's Int8DynamicActivationIntxWeightConfig configuration. If the embedding and lm_head were untied in the previous step, we can quantize the embedding with the IntxWeightOnlyConfig configuration and the lm_head with the Int8DynamicActivationIntxWeightConfig configuration.
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    TorchAoConfig,
)
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch
# we start from the model with untied weights
model_id = "microsoft/Phi-4-mini-instruct"
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights"
untied_model_local_path = f"{MODEL_NAME}-untied-weights"
# embedding_config is required only if we untied the embedding and lm_head in the previous step, else we can use only linear config for quantization
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
# either use `untied_model_id` or `untied_model_local_path`
quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Push to hub
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
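If you prefer not to push to the Hub, the quantized checkpoint can also be saved to a local directory instead. A minimal sketch (the local_save_path name is just an example; safe_serialization=False is needed for the same reason as in the push_to_hub call above):
# Alternatively, save the quantized model and tokenizer locally
local_save_path = f"{MODEL_NAME}-8da4w"
quantized_model.save_pretrained(local_save_path, safe_serialization=False)
tokenizer.save_pretrained(local_save_path)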
Step 2: Export to ExecuTorch¶
Convert the quantized model into a .pte file that can run on mobile devices.
# Install ExecuTorch
git clone https://github.com/pytorch/executorch.git
cd executorch
./install_requirements.sh
# Convert checkpoint format for ExecuTorch
python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin
# Export to PTE format with torchao optimizations preserved
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
--model "phi_4_mini" \
--checkpoint "pytorch_model_converted.bin" \
--params "$PARAMS" \
-kv \
--use_sdpa_with_kv_cache \
-X \
--metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
--max_seq_length 128 \
--max_context_length 128 \
--output_name="phi4-mini-8da4w.pte"
The .pte file can be run on a mobile phone with ExecuTorch. Follow the instructions to try this on an iOS device.
Mobile Performance Characteristics¶
The torchao-optimized 8da4w model provides:
Memory: ~3.2GB on iPhone 15 Pro
Speed: ~17 tokens/sec on iPhone 15 Pro
Accuracy: stays within 5-10% of the original model on most benchmarks
Note
For detailed instructions on testing the ExecuTorch model and reproducing the benchmarks, see the HF Phi-4-mini-instruct-8da4w model.
Evaluation¶
Model Quality Evaluation¶
Evaluate quantized models with lm-evaluation-harness:
# Install evaluation framework
# Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
# Evaluate baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
# Evaluate torchao-quantized model (float8dq)
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
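lm-eval can also be driven from Python. A minimal sketch using its simple_evaluate API (assuming lm-eval is installed from source as noted above):
# Run the same hellaswag evaluation programmatically
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])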
Memory Benchmarking¶
For Phi-4-mini-instruct, float8 dynamic quantization reduces peak memory usage by 36% compared to the baseline model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
torch.cuda.reset_peak_memory_stats()
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
Output:
Prompt: Hey, are you conscious? Can you talk to me?
Templated prompt: <|system|><|end|><|user|>Hey, are you conscious? Can you talk to me?<|end|><|assistant|>
Response: Hello! Yes, I am a digital assistant, and I am fully operational and ready to assist you. How can I help you today?
Peak Memory Usage: 5.70 GB
| Benchmark | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
|---|---|---|
| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
Performance Benchmarking¶
Latency Benchmarking¶
# baseline
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
# float8dq
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
Serving Benchmarking¶
We also benchmarked throughput in a serving setup.
# Setup: Get vllm source code
git clone git@github.com:vllm-project/vllm.git
# Install vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
# Run the benchmarks under vllm root folder:
# Download sharegpt dataset:
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
# Note: you can change the number of prompts to be benchmarked with --num-prompts argument for benchmark_serving script.
# For baseline
# Server:
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
# For float8dq
# Server:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
Results (H100 machine)¶
| Benchmark | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
|---|---|---|
| Latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) |
| Latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) |
| Serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) |
| Serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) |
Conclusion¶
This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the ML deployment stack:
HuggingFace Transformers provides easy model loading with torchao quantization
vLLM leverages torchao's optimized kernels for high-throughput serving
ExecuTorch enables mobile deployment with torchao's mobile-optimized schemes
lm-evaluation-harness provides model quality evaluation
All of these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and easy integration. The quantization techniques shown deliver significant memory savings (3-4x) and performance improvements (1.5-2x) while keeping model quality within an acceptable range for most applications.
For production deployments, always benchmark on your specific use case and hardware to validate the performance and accuracy trade-offs.