On Wed, 5 Feb 2025, Tobias Burnus wrote:

> Hi Andrew,
>
> Andrew Stubbs wrote:
> > On 05/02/2025 11:14, Tobias Burnus wrote:
> >> Therefore, the following GPUs are now supported in addition: gfx902,
> >> gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101,
> >> gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, the multilib
> >> config has not been touched, hence, those 14 device types and
> >> gfx{9,10-3,11}-generic are not supported by default. Currently, the
> >> following 9 GPUs are enabled by default: gfx900, gfx906, gfx908, gfx90a,
> >> gfx90c, gfx1030, gfx1036, gfx1100, and gfx1103.
> >
> > I'm not too happy about adding a whole list of specific devices that we have
> > not tested. So far, whenever I have added a new device there have been
> > meta-data oddities and such-like that needed to be tweaked.
>
> Well, the idea is: If AMD has collected them under the same generic name, the
> ISA must be compatible. The LLVM page lists some restrictions (such as not
> having sramecc support when using generic) but none of the listed items match
> what we have.
>
> I fail to see how an ISA that works with, e.g., gfx9-generic will suddenly
> fail when compiling for it with gfx902, which except for the ELF flag contains
> identical code.
>
> > I also don't like adding knowledge of unsupported devices purely for
> > improving diagnostics.
>
> I think we have the option to delegate the checking purely to ROCm. Then
> gfx9-generic will run on gfx909 – or we do our own checking. But then we need
> to somehow know whether gfx9-generic code will run on gfx909 or not – or we
> bluntly reject it.
>
> > It's fine for the known-unsupported devices, but wait a month or so and
> > there will be new unknown-unsupported devices, and the message degrades
> > again. Worse, the new diagnostic can recommend trying -march=<name> for
> > devices which the compiler will recognize but have never been tested, and
> > probably don't have multilibs configured.
>
> The having-no-multilib-configured issue is difficult to come by, unless we
> want to filter them out when building libgomp. We could do so, however, by
> doing some preprocessing.
>
> The problem is that we then need to have two checks:
>
> (a) Whether it runs (if we don't relegate it to ROCm) – in that case, gfx902
> hardware with gfx9-generic should just work, even if there is neither a gfx902
> nor gfx9-generic multilib. After all, the user managed to link the executable.
>
> (b) When recompiling on the same system as running the build, suggesting a
> -march=gfx... that has a multilib would be better, i.e. here the filtered-out
> value could be helpful.
>
> (c) For suggesting generic, we also would need to check the ROCm version to
> only propose it when ROCm is > 6.3, assuming that's the thing.
>
> BTW: The issue of having no multilib configured is not really new. We had it
> before with fiji or when the user configured GCC in some non-default way. (As
> we currently enable all GPUs by default. But I think we didn't do so for a
> while for the gfx1... ones. But I don't recall whether we did do so for a
> release or not.)
>
> And I think the error is also not to illegible:
>
> ld: error: unable to find library -lgfortran
>
> gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc
> returned 1 exit status
>
> > A better approach might be to pattern-match "gfx{9,10,11}" in the name HSA
> > gives you for the physical device and recommend generic
> > -march=gfx{9,10,11}-generic in those cases?
>
> I think that will be way worse. gfx908 and gfx90a are *not* compatible with
> gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices but only in
> gfx9-4-generic and not supported by us. And for gfx10, we only support
> gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currently only 0 and 3),
> but not gfx10-1-generic (gfx101{0,1,2,3}).
>
> Thus, I think it is way better to assume that GPUs listed for each
> gfx*-generic as having identical ISA than any other proposed way. We could
> hard code this ourselves (as done in the patch) or to do it by letting ROCm do
> the job.
>
> (There are some restrictions listed, like "not all VGPR can be used on
> gfx1100" but as we added gfx1103, we can just use the gfx1103 settings as
> gfx1100 does not have those features, either.)
>
> Thus, I still regard my proposed approach as superior.
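
To make the difference between the two lookup strategies concrete, here is a
minimal sketch. It is not code from the patch: the table is assembled from the
device lists quoted in this thread (assuming the newly added and default-enabled
devices group under gfx9-generic, gfx10-3-generic and gfx11-generic as
described), and the function names are hypothetical. The membership should be
checked against the LLVM AMDGPU documentation before relying on it.

```python
# Assumption: family membership as implied by the device lists in this thread.
GENERIC_FAMILIES = {
    "gfx9-generic": {"gfx900", "gfx902", "gfx904", "gfx906", "gfx909", "gfx90c"},
    "gfx10-3-generic": {"gfx1030", "gfx1031", "gfx1032", "gfx1033",
                        "gfx1034", "gfx1035", "gfx1036"},
    "gfx11-generic": {"gfx1100", "gfx1101", "gfx1102", "gfx1103",
                      "gfx1150", "gfx1151", "gfx1152", "gfx1153"},
}


def generic_for(device):
    """Explicit-table lookup (hypothetical helper): generic target or None."""
    for generic, members in GENERIC_FAMILIES.items():
        if device in members:
            return generic
    return None


def naive_prefix_guess(device):
    """Pattern-match on the gfx{9,10,11} prefix only (hypothetical helper)."""
    for prefix in ("gfx10", "gfx11", "gfx9"):
        if device.startswith(prefix):
            return prefix + "-generic"
    return None


if __name__ == "__main__":
    for dev in ("gfx909", "gfx908", "gfx90a", "gfx1034", "gfx1010"):
        print(dev, generic_for(dev), naive_prefix_guess(dev))
```

With the explicit table, gfx908, gfx90a and gfx1010 yield no suggestion, whereas
the prefix match would recommend a generic target they are not compatible with
(or, for gfx10, one that does not exist under that name).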

For distributors it might be good to just ship -generic multilibs and
have all specific -march=gfxXYZ map to their respective -generic
variant. That is, consider the configured multilibs when interpreting
-march=gfxXYZ, which probably means always configuring the -generic
multilibs (and back to dependence on llvm19 and recent ROCm for the
runtime ...).

That said, I'm happy about -generic, and I hope it ends up in GCC 15
in some way.

> > I'm happy to add the new gfx9-generic, and improving the diagnostics is
> > always good, but I'm not convinced about making it look like we support
> > devices we've never tested.
>
> As mentioned, AMD regards them as compatible. I am happy to add some wording
> like "(unsupported)" to the -march= documentation, in case it helps.

It's also realistically a chicken-and-egg issue - IMO making -march=gfx1034
available without having tested it (by copy-pasting from a lower tier?)
is going to get it tested quicker than when there's no way to test.

And there's always bugzilla if it doesn't work ...

Richard.

> Tobias
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)