On 05/02/2025 12:51, Tobias Burnus wrote:
Hi Andrew,
Andrew Stubbs wrote:
On 05/02/2025 11:14, Tobias Burnus wrote:
Therefore, the following GPUs are now supported in addition: gfx902,
gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101,
gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, the
multilib config has not been touched, hence, those 14 device types
and gfx{9,10-3,11}-generic are not supported by default. Currently,
the following 9 GPUs are enabled by default:gfx900, gfx906, gfx908,
gfx90a, gfx90c, gfx1030, gfx1036, gfx1100, andgfx1103.
I'm not too happy about adding a whole list of specific devices that
we have not tested. So far, whenever I have added a new device there
have been meta-data oddities and such-like that needed to be tweaked.
Well, the idea is: If AMD has collected them under the same generic
name, the ISA must be compatible. The LLVM page lists some restrictions
(such as not having sramecc support when using generic) but none of the
listed items match what we have.
I fail to see how an ISA that works with, e.g., gfx9-generic will
suddenly fail when compiling for it with gfx902, which except for the
ELF flag contains identical code.
I can imagine that the "generic" ISA is some safe subset of the actual
ISA, but that's not really the issue here.
The biggest source of incompatibility has been in the metadata that is
attached to each kernel entrypoint. Possibly these new devices would be
fine, maybe even probably, but until we've proven it then I don't like
making assumptions.
I also don't like adding knowledge of unsupported devices purely for
improving diagnostics.
I think we have the option to delegate the checking purely to ROCm. Then
gfx9-generic will run on gfx909 – or we do our own checking. But then we
need to somehow know whether gfx9-generic code will run on gfx909 or not
– or we bluntly reject it.
Yes, it would be a shame to refuse to run on a new device that would
work just fine in generic mode, only because we never added it to a list.
It's fine for the known-unsupported devices, but wait a month or so
and there will be new unknown-unsupported devices, and the message
degrades again. Worse, the new diagnostic can recommend trying -
march=<name> for devices which the compiler will recognize but have
never been tested, and probably don't have multilibs configured.
The having-no-multilib-configured issue is difficult to come by, unless
we want to filter them out when building libgomp. We could do so,
however, by doing some preprocessing.
The problem is that we then need to have two checks:
(a) Whether it runs (if we don't relegate it to ROCm) – in that case,
gfx902 hardware with gfx9-generic should just work, even if there is
neither a gfx902 nor gfx9-generic multilib. After all, the user managed
to link the executable.
The only reason libgomp has a test is because the error message given by
HSA was not friendly ("wrong arguments", IIRC), and we wanted to prove
that. Having a try-and-catch approach for generic devices might be better?
(b) When recompiling on the same system as running the build, suggesting
a -march=gfx... that has a multilib would be better, i.e. here the
filtered-out value could be helpful.
(c) For suggesting generic, we also would need to check the ROCm version
to only propose it when ROCm is > 6.3, assuming that's the thing.
This we can do. Or we can just catch run-failures, probably.
BTW: The issue of having no multilib configured is not really new. We
had it before with fiji or when the user configured GCC in some non-
default way. (As we currently enable all GPUs by default. But I think we
didn't do so for a while for the gfx1... ones. But I don't recall
whether we did do so for a release or not.)
Not new, no, but the Fiji device was a special case. These new devices
are likely to be much more common in the wild.
Depending what the down-sides are, it might even make sense for us to
deliberately *only* support the generic devices for RDNA (once ROCm 6.3+
is widely deployed). We would continue to support the specific CDNA
devices that are deployed in HPC.
And I think the error is also not to illegible:
ld: error: unable to find library -lgfortran
gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-
gcc returned 1 exit status
I believe you get that message when the multilib is configured but
absent. If it's just plain not configured then the compiler will attempt
to link in the default multilib, which results in an ELF arch mismatch
error.
A better approach might be to pattern-match "gfx{9,10,11}" in the name
HSA gives you for the physical device and recommend generic -
march=gfx{9,10,11}-generic in those cases?
I think that will be way worse. — gfx908 and gfx90a are *not* compatible
with gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices but
only in gfx9-4-generic and not supported by us. And for gfx10, we only
support gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currently
only 0 and 3), but not gfx10-1-generic (gfx101{0,1,2,3}).
gfx908 and gfx90a would be on the specific list, and known to libgomp,
so we could easily avoid those exceptions, I suppose, but this whole
thing does seem suboptimal.
However, if the message only suggests *trying* the generic ISA then that
may be OK. I don't like it suggesting options that are not options the
compiler will outright reject.
Thus, I think it is way better to assume that GPUs listed for each gfx*-
generic as having identical ISA than any other proposed way. We could
hard code this ourselves (as done in the patch) or to do it by letting
ROCm do the job.
(There are some restrictions listed, like "not all VGPR can be used on
gfx1100" but as we added gfx1103, we can just use the gfx1103 settings
as gfx1100 does not have those features, either.)
Thus, I still regard my proposed approach as superior.
I'm happy to add the new gfx9-generic, and improving the diagnostics
is always good, but I'm not convinced about making it look like we
support devices we've never tested.
As mentioned, AMD regards them as compatible. I am happy to add some
wording like "(unsupported)" to the -march= documentation, in case it
helps.
"Unsupported" is ambiguous; "experimental" might better explain that
YMMV, but I'm still not convinced about adding them just for the
diagnostics.
Instead of adding "GCN_DEVICE" for all of them, there's nothing stopping
us introducing GCN_UNSUPPORTED_DEVICE which is ignored by the .opt
generation, etc, but can be used for diagnostics. In fact, I had this
concept in an older version of the patch series that introduced
gcn-devices.def, but I removed it because maintaining a list of
unsupported devices seemed like make-work and doomed to be forever out
of date in any toolchain in the wild.
Andrew