Hi!

Ludovic Courtès <ludovic.cour...@inria.fr> writes:
> I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
> being (I can create you an account there if you wish so you can create
> merge requests etc.).  Let me know!

OK, sure, that sounds good!  Note that I made the packages only for
ROCm 6.0.2 so far, though.

> I agree with Ricardo that this should be merged into Guix proper
> eventually.  This is still in flux and we’d need to check what Kjetil
> and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

Yes, I agree; the ROCm packages are not ready to be merged yet.

> Is PyTorch able to build code for several GPU architectures and pick the
> right one at run time?  If it does, that would seem like the better
> option for me, unless that is indeed so computationally expensive that
> it’s not affordable.

It is the same as for other HIP/ROCm libraries: all GPU architectures
chosen at build time are available at run time, and the right one is
picked automatically.  For reference, the Arch Linux package for
PyTorch [1] enables 12 architectures.  I think the set of architectures
that can be chosen at compile time also depends on the ROCm version.

>> I'm not sure they can be combined however, as the GPU code is included
>> in the shared libraries.  Thus all dependent packages like
>> python-pytorch-rocm would need to be built for each architecture as
>> well, which is a large duplication for the non-GPU parts.
>
> Yeah, but maybe that’s OK if we keep the number of supported GPU
> architectures to a minimum?

If it's no problem for the build farm, it would probably be good to
include a set of default architectures (the officially supported ones?)
as you suggested, and to make it easy to rebuild all dependent packages
for other architectures.  Maybe this could be done with a package
transformation, as for ‘--tune’?  (A rough sketch follows at the end of
this message.)  IIRC, building composable-kernel for the default
architectures with 16 threads exceeded 32 GB of memory before I
cancelled the build and restricted it to a single architecture.

>> - Many tests assume a GPU to be present, so they need to be disabled.
>
> Yes.  I/we’d like to eventually support that.  (There’d need to be some
> annotation in derivations or packages specifying what hardware is
> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
> to honor that.)

That sounds like a good idea.  Could such annotations also cover CPU
ISA extensions, such as AVX2 and AVX-512?

>> - For several packages (e.g. rocfft), I had to disable the
>> validate-runpath? phase, as there was an error when reading ELF
>> files.  It is however possible that I also disabled it for packages
>> where it was not necessary, but it was the case for rocblas at
>> least.  Here, the generated kernels are contained in ELF files, which
>> are detected by elf-file? in guix/build/utils.scm, but rejected by
>> has-elf-header? in guix/elf.scm, which leads to an error.
>
> Weird.  We’d need to look more closely into the errors you got.

I think the issue is simply that elf-file? checks only the magic bytes,
while has-elf-header? checks the entire header.  If the former returns
#t and the latter #f, parse-elf in guix/elf.scm raises an error.  It
seems some ROCm (or Tensile?) ELF files use a different header format.
(See the second sketch at the end of this message.)

> Oh, just noticed your patch brings a lot of things beyond PyTorch itself!
> I think there’s some overlap with
> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
> should synchronize.

Ah, I had not seen this before; the overlap seems to be tensile,
roctracer, and rocblas.  For rocblas, I saw that they set
"-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?
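To make the transformation idea concrete, here is a rough sketch of
what I have in mind.  It is only an illustration under assumptions of
mine: the 'amdgpu-targets' property does not exist, and a real version
would want to replace any pre-existing "-DAMDGPU_TARGETS" flag instead
of merely appending a new one:

(use-modules (guix gexp)
             (guix packages)
             (guix utils))

;; Rebuild every package that carries a (hypothetical)
;; 'amdgpu-targets' property, and its dependents, for the given GPU
;; architectures, in the spirit of '--tune'.
(define (with-amdgpu-targets targets)
  (package-mapping
   (lambda (p)
     (if (assq-ref (package-properties p) 'amdgpu-targets)
         (package/inherit p
           (arguments
            (substitute-keyword-arguments (package-arguments p)
              ((#:configure-flags flags #~'())
               ;; Appending lets the new flag take precedence over a
               ;; default one; deduplication is omitted for brevity.
               #~(append #$flags
                         (list #$(string-append
                                  "-DAMDGPU_TARGETS="
                                  (string-join targets ";"))))))))
         p))))

;; Example: a PyTorch variant built only for gfx1030.
;; ((with-amdgpu-targets (list "gfx1030")) python-pytorch-rocm)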
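And to illustrate the elf-file?/has-elf-header? discrepancy, here is a
minimal stand-alone version of a magic-bytes-only check (my own sketch,
not the actual code from guix/build/utils.scm): it succeeds on the
kernel files in question, while the stricter header validation that
precedes parse-elf rejects them.

(use-modules (ice-9 binary-ports)
             (rnrs bytevectors))

;; Only the first four bytes, 0x7f "ELF", are inspected, so AMD GPU
;; code objects pass even though the rest of their header differs.
(define %elf-magic #vu8(#x7f #x45 #x4c #x46))

(define (elf-magic? file)
  (call-with-input-file file
    (lambda (port)
      (let ((header (get-bytevector-n port 4)))
        (and (bytevector? header)
             (bytevector=? header %elf-magic))))
    #:binary #t))

;; Example (hypothetical file name of a generated kernel):
;; (elf-magic? "Kernels.so-000-gfx90a.hsaco")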
Thank you!
David

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/ae90c1e8bdb99af458ca0a545c5736950a747690/PKGBUILD