Build Instructions¶

Note: The most up-to-date build instructions are embedded in a set of scripts bundled in the FBGEMM repo under setup_env.bash.

The currently available FBGEMM_GPU build variants are:

CPU-only
CUDA
ROCm

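The general steps for building FBGEMM_GPU are as follows:

Set up an isolated build environment.
Set up the toolchain for either a CPU-only, CUDA, or ROCm build.
Install PyTorch.
Run the build script.

Set Up an Isolated Build Environment¶

Install Miniconda¶

Setting up a Miniconda environment is recommended for reproducible builds.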

export PLATFORM_NAME="$(uname -s)-$(uname -m)"

# Set the Miniconda prefix directory
miniconda_prefix=$HOME/miniconda

# Download the Miniconda installer
wget -q "https://repo.anaconda.com/miniconda/Miniconda3-latest-${PLATFORM_NAME}.sh" -O miniconda.sh

# Run the installer
bash miniconda.sh -b -p "$miniconda_prefix" -u

# Load the shortcuts
. ~/.bashrc

# Run updates
conda update -n base -c conda-forge -y conda
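The `uname`-derived `PLATFORM_NAME` above matches the installer naming on Linux (for example `Linux-x86_64`). On macOS, `uname -s` prints `Darwin`, while the Miniconda installers use `MacOSX` in the file name, so a small mapping may be needed there. The following is a minimal sketch; the function name `miniconda_platform` is hypothetical and is not part of the FBGEMM scripts.

```shell
# Map `uname -s` / `uname -m` output to the platform string used in
# Miniconda installer file names (hypothetical helper, not from FBGEMM).
miniconda_platform() {
  local kernel="$1" machine="$2"
  case "${kernel}" in
    Darwin) kernel="MacOSX" ;;   # installers are named MacOSX-*, not Darwin-*
  esac
  printf '%s-%s\n' "${kernel}" "${machine}"
}

PLATFORM_NAME="$(miniconda_platform "$(uname -s)" "$(uname -m)")"
echo "Installer: Miniconda3-latest-${PLATFORM_NAME}.sh"
```

The resulting `PLATFORM_NAME` can then be used in the `wget` command exactly as shown above.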

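From this point on, all installation commands are run inside or against the Conda environment.

Set Up the Conda Environment¶

Create a Conda environment with the specified Python version.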

env_name=<ENV NAME>
python_version=3.13

# Create the environment
conda create -y -n ${env_name} -c conda-forge python="${python_version}"

# Upgrade PIP and pyOpenSSL package
conda run -n ${env_name} pip install --upgrade pip
conda run -n ${env_name} python -m pip install "pyOpenSSL>22.1.0"

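Set Up for CPU-Only Build¶

Follow the instructions in Set Up an Isolated Build Environment to set up the Conda environment, then proceed to Install the Build Tools.

Set Up for CUDA Build¶

The CUDA build of FBGEMM_GPU requires a recent version of nvcc that supports compute capability 3.5+. A machine can be set up for CUDA builds of FBGEMM_GPU either through a pre-built Docker image or through a Conda installation on bare metal. Note that neither a GPU nor the NVIDIA drivers are required to build, since they are only used at runtime.

CUDA Docker Image¶

For a Docker-based setup, simply pull the CUDA Docker image for the desired Linux distribution and CUDA version.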

# Run for Ubuntu 22.04, CUDA 12.6
docker run -it --entrypoint "/bin/bash" nvidia/cuda:12.6.0-devel-ubuntu22.04

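From here, the rest of the build environment can be constructed through Conda, as it remains the recommended mechanism for creating an isolated and reproducible build environment.

Install CUDA¶

Install the full CUDA package through Conda, which includes NVML.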

# See https://anaconda.org/nvidia/cuda for all available versions of CUDA
cuda_version=12.4.1

# Install the full CUDA package
conda install -n ${env_name} -y cuda -c "nvidia/label/cuda-${cuda_version}"

Verify that cuda_runtime.h, libnvidia-ml.so, and libnccl.so* are found.

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)

find "${conda_prefix}" -name cuda_runtime.h
find "${conda_prefix}" -name libnvidia-ml.so
find "${conda_prefix}" -name "libnccl.so*"
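Because `find` exits successfully even when nothing matches, a missing file is easy to overlook in the output of the commands above. The sketch below wraps the lookup so that a missing file is reported explicitly; `require_file` is a hypothetical helper name, not something provided by FBGEMM or Conda.

```shell
# Succeed only if at least one file matching the pattern exists under the
# given directory (hypothetical helper, not from FBGEMM).
require_file() {
  local root="$1" pattern="$2"
  if [ -n "$(find "${root}" -name "${pattern}" -print 2>/dev/null | head -n 1)" ]; then
    echo "found: ${pattern}"
  else
    echo "MISSING: ${pattern}" >&2
    return 1
  fi
}

# Example usage; conda_prefix is assumed to be set as shown above.
if [ -n "${conda_prefix:-}" ]; then
  for f in cuda_runtime.h libnvidia-ml.so "libnccl.so*"; do
    require_file "${conda_prefix}" "${f}" || echo "check the CUDA installation"
  done
fi
```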

Install cuDNN¶

cuDNN is a build-time dependency for the CUDA variant of FBGEMM_GPU. Download and extract the cuDNN package for the given CUDA version.

# cuDNN package URLs for each platform and CUDA version can be found in:
# https://github.com/pytorch/builder/blob/main/common/install_cuda.sh
cudnn_url=https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.2.26_cuda12-archive.tar.xz

# Download and unpack cuDNN
wget -q "${cudnn_url}" -O cudnn.tar.xz
tar -xvf cudnn.tar.xz
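The CUDA build later expects the cuDNN header and library locations in `CUDNN_INCLUDE_DIR` and `CUDNN_LIBRARY`. As a rough sketch, these can be derived from the archive URL, assuming the archive unpacks into a directory named after the archive itself (without the `.tar.xz` suffix) that contains `include/` and `lib/` subdirectories. That layout is an assumption here and should be confirmed against the extracted files.

```shell
# Derive the cuDNN include/library paths from the archive URL.
# ASSUMPTION: the archive unpacks into a directory named after the archive
# (minus .tar.xz) containing include/ and lib/ subdirectories.
cudnn_url=https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.2.26_cuda12-archive.tar.xz

cudnn_dir="$(pwd)/$(basename "${cudnn_url}" .tar.xz)"

export CUDNN_INCLUDE_DIR="${cudnn_dir}/include"
export CUDNN_LIBRARY="${cudnn_dir}/lib"

echo "CUDNN_INCLUDE_DIR=${CUDNN_INCLUDE_DIR}"
echo "CUDNN_LIBRARY=${CUDNN_LIBRARY}"
```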

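Install CUTLASS¶

This section applies only to building the experimental FBGEMM_GPU GenAI module. CUTLASS should already be available in the repository as a git submodule (see Preparing the Build). The following include paths are already added to the CMake configuration.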

set(THIRDPARTY ${FBGEMM}/external)

${THIRDPARTY}/cutlass/include
${THIRDPARTY}/cutlass/tools/util/include

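Set Up for ROCm Build¶

FBGEMM_GPU supports running on AMD (ROCm) devices. A machine can be set up for ROCm builds of FBGEMM_GPU either through a pre-built Docker image or on bare metal.

ROCm Docker Image¶

For a Docker-based setup, simply pull the minimal ROCm Docker image for the desired ROCm version.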

# Run for ROCm 6.3
docker run -it --entrypoint "/bin/bash" rocm/rocm-terminal:6.3

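The full ROCm Docker image comes with all ROCm packages pre-installed, but this results in a very large Docker container. For this reason, the minimal image is recommended for building and running FBGEMM_GPU.

From here, the rest of the build environment can be constructed through Conda, as it remains the recommended mechanism for creating an isolated and reproducible build environment.

Install ROCm¶

Install the full ROCm package through the operating system package manager. The full instructions can be found in the ROCm installation guide.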

# [OPTIONAL] Disable apt installation prompts
export DEBIAN_FRONTEND=noninteractive

# Update the repo DB
apt update

# Download the installer
wget -q https://repo.radeon.com/amdgpu-install/6.3.1/ubuntu/focal/amdgpu-install_6.3.60301-1_all.deb -O amdgpu-install.deb

# Run the installer
apt install ./amdgpu-install.deb

# Install ROCm
amdgpu-install -y --usecase=hiplibsdk,rocm --no-dkms

Install MIOpen¶

MIOpen is a dependency of the ROCm variant of FBGEMM_GPU and needs to be installed.

apt install hipify-clang miopen-hip miopen-hip-dev

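Install the Build Tools¶

The instructions in this section apply to builds of all variants of FBGEMM_GPU.

C/C++ Compiler (GCC)¶

Install a version of the GCC toolchain that supports C++20. The sysroot package also needs to be installed, to avoid issues with missing versioned GLIBCXX symbols when compiling FBGEMM_CPU.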

# Set GCC to 10.4.0 to keep compatibility with older versions of GLIBCXX
#
# Newer versions of GCC also work, but will need to be accompanied by an
# appropriate updated version of the sysroot_linux package.
gcc_version=10.4.0

conda install -n ${env_name} -c conda-forge --override-channels -y \
  gxx_linux-64=${gcc_version} \
  sysroot_linux-64=2.17

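While newer versions of GCC can be used, binaries compiled with them will not be compatible with older systems such as Ubuntu 20.04 or CentOS Stream 8, because the compiled library will reference symbols from newer versions of GLIBCXX that the system's libstdc++.so.6 does not support. To see which GLIBC and GLIBCXX versions the available libstdc++.so.6 supports: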

libcxx_path=/path/to/libstdc++.so.6

# Print the supported GLIBC versions
objdump -TC "${libcxx_path}" | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat

# Print the supported GLIBCXX versions
objdump -TC "${libcxx_path}" | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat
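The compatibility rule described above amounts to a version comparison: the highest GLIBCXX version referenced by a binary must not exceed the highest version the target's libstdc++.so.6 supports. A minimal sketch of that comparison follows, using the same GNU `sort -V` as the commands above. The helper name `versions_supported` and the version strings in the example are illustrative only.

```shell
# Return success if every version read from stdin is <= the given maximum.
# Relies on GNU `sort -V`, which the commands above already use.
versions_supported() {
  local max_supported="$1" highest
  highest="$(sort -V | tail -n 1)"
  # The larger of the two sorts last; it must be the supported maximum.
  [ "$(printf '%s\n%s\n' "${highest}" "${max_supported}" | sort -V | tail -n 1)" = "${max_supported}" ]
}

# Example with illustrative version strings (not measured from a real system)
if printf 'GLIBCXX_3.4.9\nGLIBCXX_3.4.21\n' | versions_supported "GLIBCXX_3.4.28"; then
  echo "compatible"
else
  echo "binary requires a newer libstdc++"
fi
```

In practice, the versions on stdin would come from running the `objdump` pipeline on the built library, and the maximum from running it on the target system's libstdc++.so.6.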

C/C++ Compiler (Clang)¶

It is possible to build FBGEMM and FBGEMM_GPU (CPU-only and CUDA variants) using Clang as the host compiler. To do so, install a version of the Clang toolchain that supports C++20.

# Minimum LLVM+Clang version required for FBGEMM_GPU
llvm_version=16.0.6

# NOTE: libcxx from conda-forge is outdated for linux-aarch64, so we cannot
# explicitly specify the version number
conda install -n ${env_name} -c conda-forge --override-channels -y \
    clangxx=${llvm_version} \
    libcxx \
    llvm-openmp=${llvm_version} \
    compiler-rt=${llvm_version}

# Append $CONDA_PREFIX/lib to $LD_LIBRARY_PATH in the Conda environment
ld_library_path=$(conda run -n ${env_name} printenv LD_LIBRARY_PATH)
conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)
conda env config vars set -n ${env_name} LD_LIBRARY_PATH="${ld_library_path}:${conda_prefix}/lib"

# Set NVCC_PREPEND_FLAGS in the Conda environment for Clang to work correctly as the host compiler
# NOTE: clangxx_path is assumed to point to the clang++ installed above; adjust as needed
clangxx_path="${conda_prefix}/bin/clang++"
conda env config vars set -n ${env_name} NVCC_PREPEND_FLAGS="-std=c++20 -Xcompiler -std=c++20 -Xcompiler -stdlib=libstdc++ -ccbin ${clangxx_path} -allow-unsupported-compiler"
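One detail of the `LD_LIBRARY_PATH` update above is worth noting: if the variable is empty or unset in the environment, joining with a fixed `:` produces a leading empty entry, which the dynamic loader interprets as the current directory. The sketch below shows one way to avoid that. The helper name `append_path` and the example paths are hypothetical.

```shell
# Append a directory to a colon-separated path list without creating an
# empty entry when the list is initially empty (hypothetical helper).
append_path() {
  local current="$1" extra="$2"
  printf '%s\n' "${current:+${current}:}${extra}"
}

# Illustrative paths only
append_path "" "/opt/conda/envs/build/lib"
append_path "/usr/local/lib" "/opt/conda/envs/build/lib"
```

The result of `append_path "${ld_library_path}" "${conda_prefix}/lib"` could then be passed to `conda env config vars set` in place of the manually joined string.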

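Note that for CUDA code compilation, even though nvcc supports Clang as the host compiler, only libstdc++ (GCC's implementation of the C++ standard library) is supported for any host compiler used by nvcc.

This means that GCC is a required dependency for the CUDA variant of FBGEMM_GPU, regardless of whether it is built with Clang or not. In this scenario, it is recommended to install the GCC toolchain first and the Clang toolchain afterwards; see C/C++ Compiler (GCC) for instructions.

Compiler Symlinks¶

After installing the compiler toolchains, symlink the C and C++ compilers to the binpath (overriding existing symlinks as needed). In a Conda environment, the binpath is located at $CONDA_PREFIX/bin.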

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)

ln -sf "${path_to_either_gcc_or_clang}" "${conda_prefix}/bin/cc"
ln -sf "${path_to_either_gcc_or_clang}" "${conda_prefix}/bin/c++"

These symlinks will be used later, in the FBGEMM_GPU build configuration stage.
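Since `ln -sf` creates the link even when the target path is wrong, it can be useful to confirm that the `cc` and `c++` links resolve to real executables before moving on. The following is a minimal sketch that relies on `readlink -f` from GNU coreutils; `check_compiler_link` is a hypothetical helper name.

```shell
# Report where a compiler symlink resolves, failing if the target is not an
# executable file (hypothetical helper, not from FBGEMM).
check_compiler_link() {
  local link="$1" target
  target="$(readlink -f "${link}")" || return 1
  if [ -x "${target}" ] && [ ! -d "${target}" ]; then
    echo "${link} -> ${target}"
  else
    echo "broken or non-executable: ${link}" >&2
    return 1
  fi
}

# Example usage; conda_prefix is assumed to be set as shown above.
if [ -n "${conda_prefix:-}" ]; then
  check_compiler_link "${conda_prefix}/bin/cc"
  check_compiler_link "${conda_prefix}/bin/c++"
fi
```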

Other Build Tools¶

Install the other necessary build tools, such as ninja and cmake.

conda install -n ${env_name} -c conda-forge --override-channels -y \
    click \
    cmake \
    hypothesis \
    jinja2 \
    make \
    ncurses \
    ninja \
    numpy \
    scikit-build \
    tbb \
    wheel

Install PyTorch¶

The official PyTorch homepage contains the most authoritative instructions for installing PyTorch, whether through Conda or PIP.

Installation Through Conda¶

# Install the latest nightly
conda install -n ${env_name} -y pytorch -c pytorch-nightly

# Install the latest test (RC)
conda install -n ${env_name} -y pytorch -c pytorch-test

# Install a specific version
conda install -n ${env_name} -y pytorch==2.0.0 -c pytorch

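Note that installing PyTorch through Conda without specifying a version (as in the case of nightly builds) may not always be reliable. For example, the GPU builds of PyTorch nightlies are known to arrive in Conda 2 hours later than the CPU-only builds. As a result, a Conda installation of pytorch-nightly within that time window will silently fall back to installing the CPU-only variant.

Also note that, because the GPU and CPU-only versions of PyTorch are placed in the same artifact bucket, the PyTorch variant selected during installation depends on whether CUDA is installed on the system. Thus for GPU builds, it is important to install CUDA / ROCm before installing PyTorch.

Installation Through PyTorch PIP¶

Installing PyTorch through PyTorch PIP is preferable to Conda, as it is more deterministic and therefore more reliable.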

# Install the latest nightly, CPU variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu/

# Install the latest test (RC), CUDA variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/test/cu126/

# Install a specific version, CUDA variant
conda run -n ${env_name} pip install torch==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126/

# Install the latest nightly, ROCm variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3/

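As of this writing, PyTorch PIP is the only available channel for installing the ROCm variant of PyTorch.

Post-Install Checks¶

Verify the PyTorch installation (version and variant) with an import test.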

# Ensure that the package loads properly
conda run -n ${env_name} python -c "import torch.distributed"

# Verify the version and variant of the installation
conda run -n ${env_name} python -c "import torch; print(torch.__version__)"

For the CUDA variant of PyTorch, verify that at least cuda_cmake_macros.h is found.

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)
find "${conda_prefix}" -name cuda_cmake_macros.h
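The version string printed by the check above also indicates the installed variant through its local-version suffix, as in the `torch==2.6.0+cu126` example earlier. The sketch below classifies a version string on that basis. It assumes the wheel carries a `+cpu`, `+cu...`, or `+rocm...` suffix and reports anything else as unknown. `torch_variant` is a hypothetical helper name.

```shell
# Classify a PyTorch version string by its local-version suffix.
# ASSUMPTION: the installed wheel carries a suffix such as +cpu, +cu126 or
# +rocm6.3; a string without such a suffix is reported as unknown.
torch_variant() {
  case "$1" in
    *+cpu*)  echo "cpu" ;;
    *+cu*)   echo "cuda" ;;
    *+rocm*) echo "rocm" ;;
    *)       echo "unknown" ;;
  esac
}

torch_variant "2.6.0+cu126"
```

To apply this to a real environment, the argument could be the output of the `print(torch.__version__)` command shown above.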

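Install PyTorch-Triton¶

This section applies only to building the experimental FBGEMM_GPU Triton-GEMM module. Triton should be installed through pytorch-triton, which is generally installed along with torch, but can also be installed manually.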

# pytorch-triton repos:
# https://download.pytorch.org/whl/nightly/pytorch-triton/
# https://download.pytorch.org/whl/nightly/pytorch-triton-rocm/

# The version SHA should follow the one pinned in PyTorch
# https://github.com/pytorch/pytorch/blob/main/.ci/docker/ci_commit_pins/triton.txt
conda run -n ${env_name} pip install --pre pytorch-triton==3.0.0+dedb7bdf33 --index-url https://download.pytorch.org/whl/nightly/

Verify the PyTorch-Triton installation with an import test.

# Ensure that the package loads properly
conda run -n ${env_name} python -c "import triton"

Other Pre-Build Setup¶

Preparing the Build¶

Clone the repo along with its submodules, and install requirements.txt.

# !! Run inside the Conda environment !!

# Select a version tag
FBGEMM_VERSION=v1.4.0

# Clone the repo along with its submodules
git clone --recursive -b ${FBGEMM_VERSION} https://github.com/pytorch/FBGEMM.git fbgemm_${FBGEMM_VERSION}

# Install additional required packages for building and testing
cd fbgemm_${FBGEMM_VERSION}/fbgemm_gpu
pip install -r requirements.txt

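The Build Process¶

The FBGEMM_GPU build process uses a scikit-build CMake-based build flow, and it keeps state across install runs. As a result, builds can become stale and cause problems when re-run after a previous build failed, for example because of a missing dependency. To address this, simply clear the build cache.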

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

python setup.py clean

Set Wheel Build Variables¶

When building the Python wheel, the package name, Python version tag, and Python platform name must first be set correctly.

# Set the package name depending on the build variant:
# one of fbgemm_gpu_cpu, fbgemm_gpu_cuda, or fbgemm_gpu_rocm
export package_name=fbgemm_gpu_cuda

# Set the Python version tag.  It should follow the convention `py<major><minor>`,
# e.g. Python 3.13 --> py313
export python_tag=py313

# Determine the processor architecture
export ARCH=$(uname -m)

# Set the Python platform name; keep exactly one of the following lines active
# Linux:
export python_plat_name="manylinux_2_28_${ARCH}"
# macOS (x86_64):
# export python_plat_name="macosx_10_9_${ARCH}"
# macOS (arm64):
# export python_plat_name="macosx_11_0_${ARCH}"
# Windows:
# export python_plat_name="win_${ARCH}"
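The tag and platform name can also be derived rather than typed by hand. The sketch below follows the `py<major><minor>` convention and the platform values listed above. The function names `python_tag_for` and `plat_name_for` are hypothetical, and the Windows case is left out because `uname` output there depends on the shell in use.

```shell
# Derive the wheel tag from a Python version, e.g. 3.13 -> py313.
python_tag_for() {
  printf 'py%s\n' "$(printf '%s' "$1" | cut -d. -f1,2 | tr -d '.')"
}

# Pick the platform name from the kernel name and architecture, using the
# same values listed above. Unrecognized systems produce an error.
plat_name_for() {
  local kernel="$1" arch="$2"
  case "${kernel}-${arch}" in
    Linux-*)       echo "manylinux_2_28_${arch}" ;;
    Darwin-x86_64) echo "macosx_10_9_${arch}" ;;
    Darwin-arm64)  echo "macosx_11_0_${arch}" ;;
    *)             echo "unsupported platform: ${kernel}-${arch}" >&2; return 1 ;;
  esac
}

python_tag_for "3.13"
plat_name_for "$(uname -s)" "$(uname -m)" || true
```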

CPU-Only Build¶

For CPU-only builds, the --build-variant=cpu flag needs to be specified.

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Build the wheel artifact only
python setup.py bdist_wheel \
    --build-variant=cpu \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}"

# Build and install the library into the Conda environment (GCC)
python setup.py install \
    --build-variant=cpu

# NOTE: To build the package as part of generating the documentation, use
# `--build-variant=docs` flag instead!

To build using Clang + libstdc++ instead of GCC, simply append the --cxxprefix flag.

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Build the wheel artifact only
python setup.py bdist_wheel \
    --build-variant=cpu \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    --cxxprefix=$CONDA_PREFIX

# Build and install the library into the Conda environment (Clang)
python setup.py install \
    --build-variant=cpu \
    --cxxprefix=$CONDA_PREFIX

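Note that this presumes the Clang toolchain is properly installed and made available as ${cxxprefix}/bin/cc and ${cxxprefix}/bin/c++.

To enable runtime debug features, such as device-side assertions in CUDA and HIP, simply append the --debug flag when invoking setup.py.

CUDA Build¶

Building FBGEMM_GPU for CUDA requires both NVML and cuDNN to be installed and made available to the build through environment variables. However, a CUDA device is not required to build the package.

As with CPU-only builds, building with Clang + libstdc++ can be enabled by appending --cxxprefix=$CONDA_PREFIX to the build command, presuming the toolchains have been properly installed.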

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# [OPTIONAL] Specify the CUDA installation paths
# This may be required if CMake is unable to find nvcc
export CUDACXX=/path/to/nvcc
export CUDA_BIN_PATH=/path/to/cuda/installation

# [OPTIONAL] Provide the CUB installation directory (applicable only to CUDA versions prior to 11.1)
export CUB_DIR=/path/to/cub

# [OPTIONAL] Allow NVCC to use host compilers that are newer than what NVCC officially supports
nvcc_prepend_flags=(
  -allow-unsupported-compiler
)

# [OPTIONAL] If clang is the host compiler, set NVCC to use libstdc++ since libc++ is not supported
nvcc_prepend_flags+=(
  -Xcompiler -stdlib=libstdc++
  -ccbin "/path/to/clang++"
)

# [OPTIONAL] Set NVCC_PREPEND_FLAGS as needed
export NVCC_PREPEND_FLAGS="${nvcc_prepend_flags[*]}"

# [OPTIONAL] Enable verbose NVCC logs
export NVCC_VERBOSE=1

# Specify cuDNN header and library paths
export CUDNN_INCLUDE_DIR=/path/to/cudnn/include
export CUDNN_LIBRARY=/path/to/cudnn/lib

# Specify NVML filepath
export NVML_LIB_PATH=/path/to/libnvidia-ml.so

# Specify NCCL filepath
export NCCL_LIB_PATH=/path/to/libnccl.so.2

# Build for SM70/80 (V100/A100 GPU); update as needed
# If not specified, only the CUDA architecture supported by current system will be targeted
# If not specified and no CUDA device is present either, all CUDA architectures will be targeted
cuda_arch_list="7.0;8.0"

# Unset TORCH_CUDA_ARCH_LIST if it exists, because it takes precedence over
# -DTORCH_CUDA_ARCH_LIST during the invocation of setup.py
unset TORCH_CUDA_ARCH_LIST

# Build the wheel artifact only
python setup.py bdist_wheel \
    --build-variant=cuda \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"

# Build and install the library into the Conda environment
python setup.py install \
    --build-variant=cuda \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"
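The build commands above depend on several environment variables, and a missing one usually surfaces only partway through the CMake run. A small pre-flight check, sketched below, can catch that earlier. `check_required_vars` is a hypothetical helper name, and the variable names passed to it are the ones set in the commands above.

```shell
# Print the names of required variables that are unset or empty and
# return non-zero if any are missing (hypothetical pre-flight helper).
check_required_vars() {
  local missing=0 name value
  for name in "$@"; do
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "not set: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

check_required_vars CUDNN_INCLUDE_DIR CUDNN_LIBRARY NVML_LIB_PATH NCCL_LIB_PATH \
  || echo "set the variables listed above before running setup.py"
```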

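ROCm Build¶

For ROCm builds, ROCM_PATH and PYTORCH_ROCM_ARCH need to be specified. However, a ROCm device is not required to build the package.

As with CPU-only and CUDA builds, building with Clang + libstdc++ can be enabled by appending --cxxprefix=$CONDA_PREFIX to the build command, presuming the toolchains have been properly installed.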

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

export ROCM_PATH=/path/to/rocm

# [OPTIONAL] If libtbb.so is missing, create the symlink (presuming libtbb.so.12 is present)
ln -s "${CONDA_PREFIX}/lib/libtbb.so.12" "${CONDA_PREFIX}/lib/libtbb.so"

# [OPTIONAL] Enable verbose HIPCC logs
export HIPCC_VERBOSE=1

# Build for the target architecture of the ROCm device installed on the machine (e.g. 'gfx908,gfx90a,gfx942')
# See https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html for list
export PYTORCH_ROCM_ARCH=$(${ROCM_PATH}/bin/rocminfo | grep -o -m 1 'gfx.*')

# Build the wheel artifact only
python setup.py bdist_wheel \
    --build-variant=rocm \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
    -DHIP_ROOT_DIR="${ROCM_PATH}" \
    -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
    -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"

# Build and install the library into the Conda environment
python setup.py install \
    --build-variant=rocm \
    -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
    -DHIP_ROOT_DIR="${ROCM_PATH}" \
    -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
    -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
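The `rocminfo` pipeline above keeps only the first `gfx` match. On a machine with several different devices, or when a comma-separated list such as `gfx908,gfx90a,gfx942` is wanted, the targets can be collected and de-duplicated instead. The following is a minimal sketch; `gfx_targets` is a hypothetical helper name and the sample input is illustrative rather than real `rocminfo` output.

```shell
# Collect the unique gfx targets from rocminfo-style output on stdin and
# join them with commas (hypothetical helper, not from FBGEMM).
gfx_targets() {
  grep -o 'gfx[0-9a-f]*' | sort -u | paste -sd, -
}

# Illustrative input only; real rocminfo output contains many more fields.
printf '  Name:  gfx90a\n  Name:  amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-\n  Name:  gfx942\n' | gfx_targets
```

With a real installation, the input would come from `${ROCM_PATH}/bin/rocminfo`, and the result could be exported as `PYTORCH_ROCM_ARCH`.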

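Post-Build Checks (For Developers)¶

After the build completes, it is useful to run some checks to verify that the build is correct.

Undefined Symbols Check¶

Because FBGEMM_GPU contains a large number of Jinja and C++ template instantiations, it is important to make sure that no undefined symbols are accidentally introduced over the course of development.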

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Locate the built .SO file
fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)

# Check that the undefined symbols don't include fbgemm_gpu-defined functions
nm -gDCu "${fbgemm_gpu_lib_path}" | sort
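The `nm` listing above is meant to be inspected for symbols that FBGEMM_GPU itself should define. That inspection can be automated by filtering for the `fbgemm_gpu::` namespace, which is the prefix used by the symbol checks in this document. The sketch below reads `nm -gDCu` output on stdin; `check_undefined_symbols` is a hypothetical helper name and the sample input is illustrative.

```shell
# Read `nm -gDCu` output on stdin and fail if any undefined symbol lives in
# the fbgemm_gpu namespace (hypothetical helper, not from FBGEMM).
check_undefined_symbols() {
  local offenders
  offenders="$(grep 'fbgemm_gpu::' || true)"
  if [ -n "${offenders}" ]; then
    echo "undefined fbgemm_gpu symbols:" >&2
    printf '%s\n' "${offenders}" >&2
    return 1
  fi
  echo "no undefined fbgemm_gpu symbols"
}

# Illustrative input only; in practice pipe in the nm command shown above.
printf '                 U cudaMalloc\n                 U std::terminate()\n' | check_undefined_symbols
```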

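GLIBC Version Compatibility Check¶

It is also useful to verify the GLIBCXX version numbers referenced by the build, as well as the availability of certain function symbols.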

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Locate the built .SO file
fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)

# Note the versions of GLIBCXX referenced by the .SO
# The libstdc++.so.6 available on the install target must support these versions
objdump -TC "${fbgemm_gpu_lib_path}" | grep GLIBCXX | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat

# Test for the existence of a given function symbol in the .SO
nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::merge_pooled_embeddings("
nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::jagged_2d_to_dense("