Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC channel, I decided to try to package PyTorch 2.2.1 with ROCm 6.0.2.
For this, I first unbundled the (many) remaining dependencies of the python-pytorch package and updated it to 2.2.1; the patch series for this can be found at [2,3].

For building ROCm and the remaining packages, I did not apply the same quality standard as for python-pytorch and just tried to get things working at all with ROCm 6.0.2. To reduce the build time, I also only tested them for gfx1101, as set in the %amdgpu-targets variable in amd/rocm-base.scm (which needs to be adjusted for other GPUs). With that, it seemed to work fine on my GPU. The changes for the ROCm packages are at [4] as a modification of Guix-HPC.

There, the python-pytorch-rocm package in amd/machine-learning.scm depends on the python-pytorch-avx package from [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512 instructions, but the latter additionally supports fbgemm and nnpack. I used it over python-pytorch because AVX2 or AVX-512 instructions should be available anyway on a CPU with PCIe atomics, which ROCm requires.

For some packages, such as composable-kernel, the build time and memory requirements are already very high when building for only a single GPU architecture, so maybe it would be best to make a separate package per architecture (a rough sketch of how that could look is at the end of this mail). I'm not sure the architectures can be combined, however, as the GPU code is included in the shared libraries. All dependent packages like python-pytorch-rocm would then need to be built for each architecture as well, which means a lot of duplication for the non-GPU parts.

There were a few other issues as well, some of which should probably be addressed upstream:

- Many tests assume a GPU to be present, so they need to be disabled.

- For several packages (e.g. rocfft), I had to disable the validate-runpath? phase because of an error when reading ELF files (see the second sketch at the end of this mail). I may have disabled it for some packages where it was not necessary, but it was needed for rocblas at least: the generated kernels are contained in ELF files that are detected by elf-file? in guix/build/utils.scm but rejected by has-elf-header? in guix/elf.scm, which leads to an error.

- Dependencies of python-tensile copy source files and later copy them again with shutil.copy, sometimes twice. This leads to permission errors, because the read-only permissions from the store are kept, so I patched them to use shutil.copyfile instead.

- There were a few errors due to using the GCC 11 system headers with rocm-toolchain (which is based on Clang/LLVM). For roctracer, replacing std::experimental::filesystem with std::filesystem suffices, but for rocthrust, the placement new operator is not found. I applied the patch from Gentoo [5], where it is replaced by a simple assignment. That looks like UB to me, though, even if it happens to work. The question is whether this is a bug in libstdc++, Clang or amdclang++...

- rocMLIR also contains a fork of the LLVM source tree, and it is not clear at first glance how exactly it differs from the main ROCm fork of LLVM or from upstream LLVM.

It would be really great to have these packages in Guix proper, but first the base ROCm packages need to be added, after deciding how to deal with the different architectures. Also, are several ROCm versions necessary, or would only one (the current latest) suffice?
Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch