Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC channel, I decided to try to package PyTorch 2.2.1 with ROCm 6.0.2.
For this, I first unbundled the (many) remaining dependencies of the python-pytorch package and updated it to 2.2.1; the patch series for this can be found at [2,3].

For building ROCm and the remaining packages, I did not apply the same quality standard as for python-pytorch and just tried to get things working at all with ROCm 6.0.2. To reduce the build time, I also only tested them for gfx1101, as set in the %amdgpu-targets variable in amd/rocm-base.scm (which needs to be adjusted for other GPUs). With that, it seemed to work fine on my GPU. The changes for the ROCm packages are at [4] as a modification of Guix-HPC.

There, the python-pytorch-rocm package in amd/machine-learning.scm depends on the python-pytorch-avx package from [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512 instructions, but the latter additionally supports fbgemm and nnpack. I used it over python-pytorch because AVX2 or AVX-512 instructions should be available anyway on a CPU with PCIe atomics, which ROCm requires.

For some packages, such as composable-kernel, the build time and memory requirements are already very high when building for only a single GPU architecture, so maybe it would be best to make a separate package per architecture (a rough sketch of how that could look is at the end of this mail). I'm not sure the architectures can be combined, however, as the GPU code is included in the shared libraries. All dependent packages like python-pytorch-rocm would then need to be built for each architecture as well, which means a lot of duplication for the non-GPU parts.

There were a few other issues as well, some of which should probably be addressed upstream:

- Many tests assume a GPU to be present, so they need to be disabled.

- For several packages (e.g. rocfft), I had to disable the validate-runpath? phase because of an error when reading ELF files (see the second sketch at the end of this mail). I may have disabled it for some packages where it was not necessary, but it was needed for rocblas at least: the generated kernels are contained in ELF files that are detected by elf-file? in guix/build/utils.scm but rejected by has-elf-header? in guix/elf.scm, which leads to an error.

- Dependencies of python-tensile copy source files and later copy them again with shutil.copy, sometimes twice. This leads to permission errors, because the read-only permissions from the store are kept, so I patched them to use shutil.copyfile instead.

- There were a few errors due to using the GCC 11 system headers with rocm-toolchain (which is based on Clang/LLVM). For roctracer, replacing std::experimental::filesystem with std::filesystem suffices, but for rocthrust, the placement new operator is not found. I applied the patch from Gentoo [5], where it is replaced by a simple assignment. That looks like UB to me, though, even if it happens to work. The question is whether this is a bug in libstdc++, Clang or amdclang++...

- rocMLIR also contains a fork of the LLVM source tree, and it is not clear at first glance how exactly it differs from the main ROCm fork of LLVM or from upstream LLVM.

It would be really great to have these packages in Guix proper, but first the base ROCm packages need to be added, after deciding how to deal with the different architectures. Also, are several ROCm versions necessary, or would only one (the current latest) suffice?
Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch