Hi!

Ludovic Courtès <ludovic.cour...@inria.fr> writes:
> I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
> being (I can create you an account there if you wish so you can create
> merge requests etc.).  Let me know!

OK, sure, that sounds good!  Note that I made the packages only for
ROCm 6.0.2 so far, though.

> I agree with Ricardo that this should be merged into Guix proper
> eventually.  This is still in flux and we’d need to check what Kjetil
> and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

Yes, I agree; the ROCm packages are not ready to be merged yet.

> Is PyTorch able to build code for several GPU architectures and pick the
> right one at run time?  If it does, that would seem like the better
> option for me, unless that is indeed so computationally expensive that
> it’s not affordable.

It is the same as for other HIP/ROCm libraries: all GPU architectures
chosen at build time are available at run time, and the right one is
picked automatically.  For reference, the Arch Linux package for
PyTorch [1] enables 12 architectures.  I think the set of architectures
that can be chosen at compile time also depends on the ROCm version.

>> I'm not sure they can be combined however, as the GPU code is included
>> in the shared libraries.  Thus all dependent packages like
>> python-pytorch-rocm would need to be built for each architecture as
>> well, which is a large duplication for the non-GPU parts.
>
> Yeah, but maybe that’s OK if we keep the number of supported GPU
> architectures to a minimum?

If it's no problem for the build farm, it would probably be good to
include a set of default architectures (the officially supported ones?)
as you suggested, and to make it easy to rebuild all dependent packages
for other architectures.  Maybe this could be done with a package
transformation, as for ‘--tune’?  (A rough sketch follows at the end of
this message.)  IIRC, building composable-kernel for the default
architectures with 16 threads exceeded 32 GB of memory before I
cancelled the build and restricted it to a single architecture.

>> - Many tests assume a GPU to be present, so they need to be disabled.
>
> Yes.  I/we’d like to eventually support that.  (There’d need to be some
> annotation in derivations or packages specifying what hardware is
> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
> to honor that.)

That sounds like a good idea.  Could such annotations also cover CPU
ISA extensions, such as AVX2 and AVX-512?

>> - For several packages (e.g. rocfft), I had to disable the
>> validate-runpath? phase, as there was an error when reading ELF
>> files.  It is however possible that I also disabled it for packages
>> where it was not necessary, but it was the case for rocblas at
>> least.  Here, the generated kernels are contained in ELF files, which
>> are detected by elf-file? in guix/build/utils.scm, but rejected by
>> has-elf-header? in guix/elf.scm, which leads to an error.
>
> Weird.  We’d need to look more closely into the errors you got.

I think the issue is simply that elf-file? checks only the magic bytes,
while has-elf-header? checks the entire header.  If the former returns
#t and the latter #f, parse-elf in guix/elf.scm raises an error.  It
seems some ROCm (or Tensile?) ELF files use a different header format.
(See the second sketch at the end of this message.)

> Oh, just noticed your patch brings a lot of things beyond PyTorch itself!
> I think there’s some overlap with
> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
> should synchronize.

Ah, I had not seen this before; the overlap seems to be tensile,
roctracer, and rocblas.  For rocblas, I saw that they set
"-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?
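To make the transformation idea concrete, here is a rough sketch of
what I have in mind.  It is only an illustration under assumptions of
mine: the 'amdgpu-targets' property does not exist, and a real version
would want to replace any pre-existing "-DAMDGPU_TARGETS" flag instead
of merely appending a new one:

(use-modules (guix gexp)
             (guix packages)
             (guix utils))

;; Rebuild every package that carries a (hypothetical)
;; 'amdgpu-targets' property, and its dependents, for the given GPU
;; architectures, in the spirit of '--tune'.
(define (with-amdgpu-targets targets)
  (package-mapping
   (lambda (p)
     (if (assq-ref (package-properties p) 'amdgpu-targets)
         (package/inherit p
           (arguments
            (substitute-keyword-arguments (package-arguments p)
              ((#:configure-flags flags #~'())
               ;; Appending lets the new flag take precedence over a
               ;; default one; deduplication is omitted for brevity.
               #~(append #$flags
                         (list #$(string-append
                                  "-DAMDGPU_TARGETS="
                                  (string-join targets ";"))))))))
         p))))

;; Example: a PyTorch variant built only for gfx1030.
;; ((with-amdgpu-targets (list "gfx1030")) python-pytorch-rocm)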
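And to illustrate the elf-file?/has-elf-header? discrepancy, here is a
minimal stand-alone version of a magic-bytes-only check (my own sketch,
not the actual code from guix/build/utils.scm): it succeeds on the
kernel files in question, while the stricter header validation that
precedes parse-elf rejects them.

(use-modules (ice-9 binary-ports)
             (rnrs bytevectors))

;; Only the first four bytes, 0x7f "ELF", are inspected, so AMD GPU
;; code objects pass even though the rest of their header differs.
(define %elf-magic #vu8(#x7f #x45 #x4c #x46))

(define (elf-magic? file)
  (call-with-input-file file
    (lambda (port)
      (let ((header (get-bytevector-n port 4)))
        (and (bytevector? header)
             (bytevector=? header %elf-magic))))
    #:binary #t))

;; Example (hypothetical file name of a generated kernel):
;; (elf-magic? "Kernels.so-000-gfx90a.hsaco")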
Thank you!
David

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/ae90c1e8bdb99af458ca0a545c5736950a747690/PKGBUILD