On Wed, 5 Feb 2025, Tobias Burnus wrote:

> Hi Andrew,
> 
> Andrew Stubbs wrote:
> > On 05/02/2025 11:14, Tobias Burnus wrote:
> >> Therefore, the following GPUs are now supported in addition: gfx902,
> >> gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101,
> >> gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, the multilib
> >> config has not been touched, hence, those 14 device types and
> >> gfx{9,10-3,11}-generic are not supported by default. Currently, the
> >> following 9 GPUs are enabled by default: gfx900, gfx906, gfx908, gfx90a,
> >> gfx90c, gfx1030, gfx1036, gfx1100, and gfx1103.
> >
> > I'm not too happy about adding a whole list of specific devices that we have
> > not tested. So far, whenever I have added a new device there have been
> > meta-data oddities and such-like that needed to be tweaked. 
> 
> Well, the idea is: If AMD has collected them under the same generic name, the
> ISA must be compatible. The LLVM page lists some restrictions (such as not
> having sramecc support when using generic) but none of the listed items match
> what we have.
> 
> I fail to see how an ISA that works with, e.g., gfx9-generic will suddenly
> fail when compiling for it with gfx902, which except for the ELF flag contains
> identical code.
> 
> > I also don't like adding knowledge of unsupported devices purely for
> > improving diagnostics.
> 
> I think we have the option to delegate the checking purely to ROCm. Then
> gfx9-generic will run on gfx909 – or we do our own checking. But then we need
> to somehow know whether gfx9-generic code will run on gfx909 or not – or we
> bluntly reject it.
> 
> > It's fine for the known-unsupported devices, but wait a month or so and
> > there will be new unknown-unsupported devices, and the message degrades
> > again. Worse, the new diagnostic can recommend trying -march=<name> for
> > devices which the compiler will recognize but have never been tested, and
> > probably don't have multilibs configured.
> 
> The having-no-multilib-configured issue is difficult to get around, unless we
> want to filter them out when building libgomp. We could do so, however, by
> doing some preprocessing.
> 
> The problem is that we then need to have three checks:
> 
> (a) Whether it runs (if we don't relegate it to ROCm) – in that case, gfx902
> hardware with gfx9-generic should just work, even if there is neither a gfx902
> nor gfx9-generic multilib. After all, the user managed to link the executable.
> 
> (b) When recompiling on the same system as running the build, suggesting a
> -march=gfx... that has a multilib would be better, i.e. here the filtered-out
> value could be helpful.
> 
> (c) For suggesting generic, we also would need to check the ROCm version to
> only propose it when ROCm is > 6.3, assuming that's the thing.
> 
> BTW: The issue of having no multilib configured is not really new. We had it
> before with fiji or when the user configured GCC in some non-default way. (As
> we currently enable all GPUs by default. But I think we didn't do so for a
> while for the gfx1... ones. But I don't recall whether we did do so for a
> release or not.)
> 
> And I think the error is also not too illegible:
> 
> ld: error: unable to find library -lgfortran
> 
> gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc
> returned 1 exit status
> 
> 
> > A better approach might be to pattern-match "gfx{9,10,11}" in the name HSA
> > gives you for the physical device and recommend generic
> > -march=gfx{9,10,11}-generic in those cases?
> 
> I think that will be way worse. — gfx908 and gfx90a are *not* compatible with
> gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices but only in
> gfx9-4-generic and not supported by us. And for gfx10, we only support
> gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currently only 0 and 6),
> but not gfx10-1-generic (gfx101{0,1,2,3}).
> 
> Thus, I think it is way better to treat the GPUs listed for each
> gfx*-generic as having an identical ISA than to use any other proposed
> approach. We could hard-code this ourselves (as done in the patch) or let
> ROCm do the job.
> 
> (There are some restrictions listed, like "not all VGPR can be used on
> gfx1100" but as we added gfx1103, we can just use the gfx1103 settings as
> gfx1100 does not have those features, either.)
> 
> Thus, I still regard my proposed approach as superior.
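
To make the difference concrete, here is a minimal Python sketch that
contrasts prefix matching on the device name with a lookup in an explicit
membership table. The table contents are taken from the device lists in
this thread and should be checked against the LLVM AMDGPU documentation;
the function names are hypothetical and not part of GCC or libgomp.

```python
import re

# Generic targets and their members, as listed in this thread
# (assumption: verify against the LLVM AMDGPU usage page before relying on it).
GENERIC_MEMBERS = {
    "gfx9-generic": {"gfx900", "gfx902", "gfx904", "gfx906", "gfx909", "gfx90c"},
    "gfx9-4-generic": {"gfx940", "gfx941", "gfx942", "gfx950"},
    "gfx10-1-generic": {"gfx1010", "gfx1011", "gfx1012", "gfx1013"},
    "gfx10-3-generic": {f"gfx103{x}" for x in range(7)},
    "gfx11-generic": {"gfx1100", "gfx1101", "gfx1102", "gfx1103",
                      "gfx1150", "gfx1151", "gfx1152", "gfx1153"},
}

# Generic targets the thread describes as handled by the compiler.
SUPPORTED_GENERICS = {"gfx9-generic", "gfx10-3-generic", "gfx11-generic"}


def suggest_by_prefix(device):
    """Naive approach: derive the generic name from the gfx{9,10,11} prefix."""
    m = re.match(r"gfx(9|10|11)", device)
    return f"gfx{m.group(1)}-generic" if m else None


def suggest_by_table(device):
    """Table approach: suggest a generic target only for listed members."""
    for generic, members in GENERIC_MEMBERS.items():
        if device in members:
            return generic if generic in SUPPORTED_GENERICS else None
    return None
```

With these definitions, suggest_by_prefix("gfx908") proposes gfx9-generic
although gfx908 is not a member, whereas suggest_by_table("gfx908") gives
no suggestion and suggest_by_table("gfx909") gives gfx9-generic.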

For distributors it might be good to just ship -generic multilibs and
have every specific -march=gfxXYZ map to its respective -generic
variant.  That is, consider the configured multilibs when interpreting
-march=gfxXYZ which probably means always configuring the -generic
multilibs (and back to dependence on llvm19 and recent ROCm for the
runtime ...).
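
As a rough illustration of that idea, the sketch below (Python, purely
hypothetical; it is not how the GCC driver or the multilib machinery is
implemented) picks a multilib for a given -march= value from the set of
configured multilibs, falling back to the matching -generic one.
Preferring an exact multilib when one is configured is an assumption of
the sketch; the device lists are the ones mentioned in this thread.

```python
# Device -> generic target, built from the lists in this thread
# (assumption: check against the LLVM AMDGPU documentation).
GENERIC_OF = {}
for _generic, _devices in {
    "gfx9-generic": "gfx900 gfx902 gfx904 gfx906 gfx909 gfx90c",
    "gfx10-3-generic": "gfx1030 gfx1031 gfx1032 gfx1033 gfx1034 gfx1035 gfx1036",
    "gfx11-generic": "gfx1100 gfx1101 gfx1102 gfx1103 "
                     "gfx1150 gfx1151 gfx1152 gfx1153",
}.items():
    for _dev in _devices.split():
        GENERIC_OF[_dev] = _generic


def pick_multilib(march, configured):
    """Return the multilib name to use for -march=<march>, or None.

    `configured` is the set of multilibs the compiler was built with.
    """
    if march in configured:
        return march                  # exact multilib is available
    generic = GENERIC_OF.get(march)   # None for e.g. gfx908 / gfx90a
    if generic in configured:
        return generic                # fall back to the generic multilib
    return None                       # nothing usable: linking would fail


# A distribution that ships only the three -generic multilibs:
distro = {"gfx9-generic", "gfx10-3-generic", "gfx11-generic"}
```

Here pick_multilib("gfx1034", distro) resolves to gfx10-3-generic, while
pick_multilib("gfx908", distro) returns None, i.e. gfx908 and gfx90a
would still need their own multilibs.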

That said, I'm happy about -generic, and I hope it ends up in GCC 15
in some way.

> 
> > I'm happy to add the new gfx9-generic, and improving the diagnostics is
> > always good, but I'm not convinced about making it look like we support
> > devices we've never tested.
> 
> As mentioned, AMD regards them as compatible. I am happy to add some wording
> like "(unsupported)" to the -march= documentation, in case it helps.

It's also realistically a chicken-and-egg issue - IMO making
-march=gfx1034 available without having tested it (by copy-pasting from
a lower tier?) is going to get it tested quicker than when there's
no way to test.  And there's always bugzilla if it doesn't work ...

Richard.

> Tobias
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
