Hello! (Cc’ing my colleague Romain who may work on related things soon.)
David Elsing <david.els...@posteo.net> skribis:

> It is the same as for other HIP/ROCm libraries, so the GPU architectures
> chosen at build time are all available at runtime and automatically
> picked. For reference, the Arch Linux package for PyTorch [1] enables 12
> architectures. I think the architectures which can be chosen at compile
> time also depend on the ROCm version.

Nice.  We’d have to check what the size and build-time tradeoff is, but
it makes sense to enable a bunch of architectures.

>>> I'm not sure they can be combined however, as the GPU code is included
>>> in the shared libraries. Thus all dependent packages like
>>> python-pytorch-rocm would need to be built for each architecture as
>>> well, which is a large duplication for the non-GPU parts.
>>
>> Yeah, but maybe that’s OK if we keep the number of supported GPU
>> architectures to a minimum?
>
> If it's no issue for the build farm it would probably be good to include
> a set of default architectures (the officially supported ones?) like you
> suggested, and make it easy to recompile all dependent packages for
> other architectures. Maybe this can be done with a package
> transformation like for '--tune'?  IIRC, building composable-kernel for
> the default architectures with 16 threads exceeded 32 GB of memory
> before I cancelled the build and set it to only one architecture.

Yeah, we could think about a transformation option.  Maybe
‘--with-configure-flags=python-pytorch=-DAMDGPU_TARGETS=xyz’ would work,
and if not, we can come up with a specific transformation and/or a
procedure that takes a list of architectures and returns a package.

>>> - Many tests assume a GPU to be present, so they need to be disabled.
>>
>> Yes.  I/we’d like to eventually support that.  (There’d need to be some
>> annotation in derivations or packages specifying what hardware is
>> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
>> to honor that.)
> That sounds like a good idea, could this also include CPU ISA
> extensions, such as AVX2 and AVX-512?

That’d be great, yes.  Don’t hold your breath though, as I/we haven’t
scheduled work on this yet.  If you’re interested in working on it, we
can discuss it, of course.

> I think the issue is simply that elf-file? just checks the magic bytes
> and has-elf-header? checks for the entire header. If the former returns
> #t and the latter #f, an error is raised by parse-elf in guix/elf.scm.
> It seems some ROCm (or tensile?) ELF files have another header format.

Uh, I never came across such a situation.  What’s so special about those
ELF files?  How are they created?

>> Oh, just noticed your patch brings a lot of things beyond PyTorch
>> itself!  I think there’s some overlap with
>> <https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>, we
>> should synchronize.
>
> Ah, I did not see this before, the overlap seems to be tensile,
> roctracer and rocblas. For rocblas, I saw that they set
> "-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Could be, we’ll see.

Thanks,
Ludo’.
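PS: For readers following the elf-file?/has-elf-header? discussion, here
is a small Python sketch (not the actual Guile code from guix/elf.scm;
the function names and checks here are illustrative analogues) of the
distinction between inspecting only the 4 magic bytes and validating
more of the ELF identification header.  The constants follow the ELF
specification’s e_ident layout:

```python
# Illustrative analogue of the two checks discussed above:
# a cheap magic-bytes test versus validating the e_ident prefix
# (class, data encoding, version) of the ELF header.

ELF_MAGIC = b"\x7fELF"

def looks_like_elf(data: bytes) -> bool:
    """Analogue of elf-file?: only the 4 magic bytes are inspected."""
    return data[:4] == ELF_MAGIC

def has_valid_ident(data: bytes) -> bool:
    """Analogue of has-elf-header?: also validate e_ident fields
    at offsets 4..6 of the 16-byte identification prefix."""
    if len(data) < 16 or data[:4] != ELF_MAGIC:
        return False
    ei_class, ei_data, ei_version = data[4], data[5], data[6]
    return (ei_class in (1, 2)      # ELFCLASS32 / ELFCLASS64
            and ei_data in (1, 2)   # little- / big-endian
            and ei_version == 1)    # EV_CURRENT

# A file with the right magic but unrecognized class/encoding bytes
# passes the first check and fails the second -- the situation David
# describes, where the full parse then raises an error.
odd = ELF_MAGIC + bytes([0xFF, 0xFF, 0xFF]) + bytes(9)
assert looks_like_elf(odd) and not has_valid_ident(odd)
```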