Hello! (Cc’ing my colleague Romain who may work on related things soon.)
David Elsing <david.els...@posteo.net> skribis:

> It is the same as for other HIP/ROCm libraries, so the GPU architectures
> chosen at build time are all available at runtime and automatically
> picked. For reference, the Arch Linux package for PyTorch [1] enables 12
> architectures. I think the architectures which can be chosen at compile
> time also depend on the ROCm version.

Nice.  We’d have to check what the size and build-time tradeoff is, but
it makes sense to enable a bunch of architectures.

>>> I'm not sure they can be combined however, as the GPU code is included
>>> in the shared libraries. Thus all dependent packages like
>>> python-pytorch-rocm would need to be built for each architecture as
>>> well, which is a large duplication for the non-GPU parts.
>>
>> Yeah, but maybe that’s OK if we keep the number of supported GPU
>> architectures to a minimum?
>
> If it's no issue for the build farm it would probably be good to include
> a set of default architectures (the officially supported ones?) like you
> suggested, and make it easy to recompile all dependent packages for
> other architectures. Maybe this can be done with a package
> transformation like for '--tune'?  IIRC, building composable-kernel for
> the default architectures with 16 threads exceeded 32 GB of memory
> before I cancelled the build and set it to only one architecture.

Yeah, we could think about a transformation option.  Maybe
‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
and if not, we can come up with a specific transformation and/or a
procedure that takes a list of architectures and returns a package.

>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>
>> Yes.  I/we’d like to eventually support that.  (There’d need to be some
>> annotation in derivations or packages specifying what hardware is
>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>> to honor that.)
> That sounds like a good idea, could this also include CPU ISA
> extensions, such as AVX2 and AVX-512?

That’d be great, yes.  Don’t hold your breath though, as I/we haven’t
scheduled work on this yet.  If you’re interested in working on it, we
can discuss it, of course.

> I think the issue is simply that elf-file? just checks the magic bytes
> and has-elf-header? checks for the entire header. If the former returns
> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
> It seems some ROCm (or tensile?) ELF files have another header format.

Uh, I never came across such a situation.  What’s so special about those
ELF files?  How are they created?

>> Oh, just noticed your patch brings a lot of things beyond PyTorch
>> itself!  I think there’s some overlap with
>> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
>> should synchronize.
>
> Ah, I did not see this before, the overlap seems to be tensile,
> roctracer and rocblas. For rocblas, I saw that they set
> "-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Could be, we’ll see.

Thanks,
Ludo’.
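PS: For readers following the elf-file?/has-elf-header? discussion, here
is a small Python sketch (not the actual Guile code from guix/elf.scm;
the function names and checks here are illustrative analogues) of the
distinction between inspecting only the 4 magic bytes and validating
more of the ELF identification header.  The constants follow the ELF
specification’s e_ident layout:

```python
# Illustrative analogue of the two checks discussed above:
# a cheap magic-bytes test versus validating the e_ident prefix
# (class, data encoding, version) of the ELF header.

ELF_MAGIC = b"\x7fELF"

def looks_like_elf(data: bytes) -> bool:
    """Analogue of elf-file?: only the 4 magic bytes are inspected."""
    return data[:4] == ELF_MAGIC

def has_valid_ident(data: bytes) -> bool:
    """Analogue of has-elf-header?: also validate e_ident fields
    at offsets 4..6 of the 16-byte identification prefix."""
    if len(data) < 16 or data[:4] != ELF_MAGIC:
        return False
    ei_class, ei_data, ei_version = data[4], data[5], data[6]
    return (ei_class in (1, 2)      # ELFCLASS32 / ELFCLASS64
            and ei_data in (1, 2)   # little- / big-endian
            and ei_version == 1)    # EV_CURRENT

# A file with the right magic but unrecognized class/encoding bytes
# passes the first check and fails the second -- the situation David
# describes, where the full parse then raises an error.
odd = ELF_MAGIC + bytes([0xFF, 0xFF, 0xFF]) + bytes(9)
assert looks_like_elf(odd) and not has_valid_ident(odd)
```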