Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend

This tutorial demonstrates how to export Llama 3 8B Instruct for the Qualcomm AI Engine Direct backend and run the model on a Qualcomm device.

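Prerequisites

If you have not done so already, set up the ExecuTorch repository and environment by following the instructions in Setting up ExecuTorch.
Read the Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend page to understand how to export and run a model with the Qualcomm AI Engine Direct backend on a Qualcomm device.
Follow the README for executorch llama to learn how to run a llama model on mobile devices via ExecuTorch.
A Qualcomm device with 16GB RAM
- We are continuing to optimize memory usage to ensure compatibility with lower-memory devices.
Qualcomm AI Engine Direct SDK version 2.28.0 or above.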

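Instructions

Step 1: Prepare the model checkpoint and the optimized matrix from SpinQuant

For the Llama 3 tokenizer and checkpoint, refer to https://github.com/meta-llama/llama-models/blob/main/README.md for instructions on downloading tokenizer.model, consolidated.00.pth, and params.json.
To get the optimized matrix, refer to SpinQuant on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Choose LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0.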
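
Before moving on, it can help to confirm that the files from this step are in place. The sketch below is a minimal check, assuming the three files were downloaded into a single directory. That layout and the helper name missing_artifacts are illustrative choices, not part of ExecuTorch.

```python
from pathlib import Path

# File names come from Step 1; keeping them in one directory is an assumption.
REQUIRED_FILES = ("tokenizer.model", "consolidated.00.pth", "params.json")


def missing_artifacts(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Example: report what is still missing from the current directory.
print("Missing:", missing_artifacts("."))
```

An empty list means all three files were found. The rotation matrix from SpinQuant is not checked here because its file name depends on the download you choose.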

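Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend

Deploying large language models like Llama 3 on-device presents the following challenges:

The model is too large to fit in device memory for inference.
Model loading and inference take a long time.
Quantization is difficult.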

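To address these challenges, we have implemented the following solutions:

Use quantization.pt2e_quantize = "qnn_16a4w" to quantize activations to 16 bits and weights to 4 bits, which reduces the on-disk model size and alleviates memory pressure during inference.
Use backend.qnn.num_sharding = 8 to shard the model into sub-parts.
Perform graph transformations to convert operations into more accelerator-friendly operations.
Use backend.qnn.optimized_rotation_path = "<path_to_optimized_matrix>" to apply R1 and R2 of SpinQuant to improve accuracy.
Use quantization.calibration_data = "<|start_header_id|>system<|end_header_id|>..." to ensure that calibration during quantization includes the special tokens from the prompt template. For more details on the prompt template, refer to the model card.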
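
To see why weight quantization and sharding matter on a 16GB device, the following back-of-the-envelope sketch estimates the storage needed for the weights alone. It assumes a round 8 billion parameters and equally sized shards. Both are simplifications, and the estimate ignores quantization scales, activations, and the KV cache.

```python
def weight_size_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: a round 8 billion parameters for Llama 3 8B.
NUM_PARAMS = 8e9
NUM_SHARDS = 8  # matches backend.qnn.num_sharding above

fp16 = weight_size_gib(NUM_PARAMS, 16)
w4 = weight_size_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16:.1f} GiB")
print(f" 4-bit weights: {w4:.1f} GiB")
# Assumption: shards are equally sized, which the exporter does not guarantee.
print(f"per shard (4-bit, {NUM_SHARDS} shards): {w4 / NUM_SHARDS:.2f} GiB")
```

Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights, which is what brings the model within reach of the device memory listed in the prerequisites.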

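Before exporting with the Qualcomm AI Engine Direct backend, note the following:

The host machine needs more than 100GB of memory (RAM + swap space).
The entire process can take a few hours.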
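
The memory requirement can be checked up front. The sketch below is a Linux-only example that sums MemTotal and SwapTotal from /proc/meminfo. The helper name ram_plus_swap_gib is hypothetical, and the 100GB figure from this tutorial is treated as GiB for simplicity.

```python
def ram_plus_swap_gib(meminfo_text):
    """Sum MemTotal and SwapTotal (reported in kB) and convert to GiB."""
    totals = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "SwapTotal"):
            totals[key] = int(rest.split()[0])  # first field is the kB value
    return (totals.get("MemTotal", 0) + totals.get("SwapTotal", 0)) / 1024**2


REQUIRED_GIB = 100  # threshold from this tutorial, treated as GiB for simplicity

try:
    with open("/proc/meminfo") as f:
        available = ram_plus_swap_gib(f.read())
    status = "OK" if available > REQUIRED_GIB else "likely insufficient"
    print(f"RAM + swap: {available:.1f} GiB ({status})")
except OSError:
    print("Could not read /proc/meminfo; this check only works on Linux.")
```

If the total falls short, adding swap space is one way to reach the threshold, at the cost of a slower export.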

# path/to/config.yaml
base:
  model_class: llama3
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  tokenizer_path: path/to/tokenizer.model
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
model:
  use_kv_cache: True
  enable_dynamic_shape: False
quantization:
  pt2e_quantize: qnn_16a4w
  # Please note that calibration_data must include the prompt template for special tokens.
  calibration_data: "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
backend:
  qnn:
    enabled: True
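    num_sharding: 8
    # Optimized SpinQuant rotation matrix downloaded in Step 1.
    optimized_rotation_path: "<path_to_optimized_matrix>"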

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml
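
The calibration_data string in the config follows the Llama 3 chat layout: a system turn, a user turn, and an open assistant header. As a sketch, the helper below (llama3_prompt is a hypothetical name, not an ExecuTorch API) rebuilds that exact string from its two messages, which makes it easier to calibrate with a different prompt while keeping the special tokens intact.

```python
def llama3_prompt(system, user):
    """Format a system and user message the way calibration_data does above."""
    return (
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = llama3_prompt("You are a funny chatbot.", "Could you tell me about Facebook?")
print(repr(prompt))
```

The helper mirrors the string used in this tutorial and adds no other special tokens. Consult the model card before changing the template itself.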

Step 3: Invoke the runtime on an Android smartphone with a Qualcomm SoC

Build executorch with the Qualcomm AI Engine Direct backend for Android

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out .

cmake --build cmake-android-out -j16 --target install --config Release

Build the llama runner for Android

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama examples/models/llama

cmake --build cmake-android-out/examples/models/llama -j16 --config Release

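Run on Android via adb shell

Prerequisite: Make sure USB debugging is enabled in the developer options on your phone.

3.1 Connect your Android phone

3.2 Push the required QNN libraries to the device.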

# Make sure you have write permission on the path below.
DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
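
The push commands above follow a regular pattern: two common libraries, plus one Stub and one Skel library per HTP version. As a sketch, the hypothetical helper below generates the same list, which can be handy when pushing libraries for only a subset of versions. Which versions a particular device needs is not covered here and should be confirmed against the Qualcomm documentation.

```python
import posixpath


def qnn_library_paths(sdk_root, htp_versions=("69", "73", "75")):
    """List the QNN libraries pushed in 3.2 for the given HTP versions."""
    android_lib = posixpath.join(sdk_root, "lib", "aarch64-android")
    paths = [
        posixpath.join(android_lib, "libQnnHtp.so"),
        posixpath.join(android_lib, "libQnnSystem.so"),
    ]
    # Stub libraries, one per HTP version.
    paths += [posixpath.join(android_lib, f"libQnnHtpV{v}Stub.so") for v in htp_versions]
    # Skel libraries live under lib/hexagon-v<version>/unsigned.
    paths += [
        posixpath.join(sdk_root, "lib", f"hexagon-v{v}", "unsigned", f"libQnnHtpV{v}Skel.so")
        for v in htp_versions
    ]
    return paths


# Print the adb commands, leaving the shell variables for the shell to expand.
for path in qnn_library_paths("${QNN_SDK_ROOT}"):
    print(f"adb push {path} ${{DEVICE_DIR}}")
```

With the default versions this prints the same eight adb push commands listed above.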

3.3 Upload the model, tokenizer, and llama runner binary to the phone

adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb push cmake-android-out/examples/models/llama/llama_main ${DEVICE_DIR}

3.4 Run the model

adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"
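
The command above nests a quoted prompt inside a quoted adb shell argument, which is easy to get wrong when editing the prompt. As an alternative sketch, the hypothetical helper below builds the same invocation in Python and lets shlex handle quoting for the device-side shell. The file names are placeholders, and only the llama_main flags shown in this tutorial are used.

```python
import shlex


def adb_run_command(device_dir, model, tokenizer, prompt, seq_len=128):
    """Build the argv for `adb shell` that runs llama_main as in 3.4."""
    runner = [
        "./llama_main",
        "--model_path", model,
        "--tokenizer_path", tokenizer,
        "--prompt", prompt,
        "--seq_len", str(seq_len),
    ]
    # shlex.join quotes each argument for the device-side shell.
    remote = f"cd {shlex.quote(device_dir)} && {shlex.join(runner)}"
    return ["adb", "shell", remote]


# File names below are placeholders; the prompt keeps the literal \n sequences
# used in the command above.
prompt = (
    r"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|>"
    r"<|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|>"
    r"<|start_header_id|>assistant<|end_header_id|>\n\n"
)
argv = adb_run_command("/data/local/tmp/llama", "model.pte", "tokenizer.model", prompt)
print(argv[2])
# To execute on a connected device: subprocess.run(argv, check=True)
```

Either form launches the same llama_main invocation on the device.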

You should see output like the following:

<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the

What is coming?

Performance improvements
Reduced memory pressure during inference to support 12GB Qualcomm devices
Support for more LLMs (Qwen, Phi-4-mini, etc.)

FAQ

If you encounter any issues while reproducing this tutorial, please file a GitHub issue on the ExecuTorch repository and use the #qcom_aisw tag.