On 05/02/2025 12:51, Tobias Burnus wrote:
Hi Andrew,

Andrew Stubbs wrote:
On 05/02/2025 11:14, Tobias Burnus wrote:
Therefore, the following GPUs are now supported in addition: gfx902, gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101, gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, the multilib config has not been touched, hence, those 14 device types and gfx{9,10-3,11}-generic are not supported by default. Currently, the following 9 GPUs are enabled by default:gfx900, gfx906, gfx908, gfx90a, gfx90c, gfx1030, gfx1036, gfx1100, andgfx1103.

I'm not too happy about adding a whole list of specific devices that we have not tested. So far, whenever I have added a new device there have been meta-data oddities and such-like that needed to be tweaked.

Well, the idea is: If AMD has collected them under the same generic name, the ISA must be compatible. The LLVM page lists some restrictions (such as not having sramecc support when using generic) but none of the listed items match what we have.

I fail to see how an ISA that works with, e.g., gfx9-generic will suddenly fail when compiling for it with gfx902, which except for the ELF flag contains identical code.

I can imagine that the "generic" ISA is some safe subset of the actual ISA, but that's not really the issue here.

The biggest source of incompatibility has been in the metadata that is attached to each kernel entrypoint. Possibly these new devices would be fine, maybe even probably, but until we've proven it then I don't like making assumptions.

I also don't like adding knowledge of unsupported devices purely for improving diagnostics.

I think we have the option to delegate the checking purely to ROCm. Then gfx9-generic will run on gfx909 – or we do our own checking. But then we need to somehow know whether gfx9-generic code will run on gfx909 or not – or we bluntly reject it.

Yes, it would be a shame to refuse to run on a new device that would work just fine in generic mode, only because we never added it to a list.

It's fine for the known-unsupported devices, but wait a month or so and there will be new unknown-unsupported devices, and the message degrades again. Worse, the new diagnostic can recommend trying - march=<name> for devices which the compiler will recognize but have never been tested, and probably don't have multilibs configured.

The having-no-multilib-configured issue is difficult to come by, unless we want to filter them out when building libgomp. We could do so, however, by doing some preprocessing.

The problem is that we then need to have two checks:

(a) Whether it runs (if we don't relegate it to ROCm) – in that case, gfx902 hardware with gfx9-generic should just work, even if there is neither a gfx902 nor gfx9-generic multilib. After all, the user managed to link the executable.

The only reason libgomp has a test is because the error message given by HSA was not friendly ("wrong arguments", IIRC), and we wanted to prove that. Having a try-and-catch approach for generic devices might be better?

(b) When recompiling on the same system as running the build, suggesting a -march=gfx... that has a multilib would be better, i.e. here the filtered-out value could be helpful.

(c) For suggesting generic, we also would need to check the ROCm version to only propose it when ROCm is > 6.3, assuming that's the thing.

This we can do. Or we can just catch run-failures, probably.

BTW: The issue of having no multilib configured is not really new. We had it before with fiji or when the user configured GCC in some non- default way. (As we currently enable all GPUs by default. But I think we didn't do so for a while for the gfx1... ones. But I don't recall whether we did do so for a release or not.)

Not new, no, but the Fiji device was a special case. These new devices are likely to be much more common in the wild.

Depending what the down-sides are, it might even make sense for us to deliberately *only* support the generic devices for RDNA (once ROCm 6.3+ is widely deployed). We would continue to support the specific CDNA devices that are deployed in HPC.

And I think the error is also not to illegible:

ld: error: unable to find library -lgfortran

gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa- gcc returned 1 exit status

I believe you get that message when the multilib is configured but absent. If it's just plain not configured then the compiler will attempt to link in the default multilib, which results in an ELF arch mismatch error.

A better approach might be to pattern-match "gfx{9,10,11}" in the name HSA gives you for the physical device and recommend generic - march=gfx{9,10,11}-generic in those cases?

I think that will be way worse. — gfx908 and gfx90a are *not* compatible with gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices but only in gfx9-4-generic and not supported by us. And for gfx10, we only support gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currently only 0 and 3), but not gfx10-1-generic (gfx101{0,1,2,3}).

gfx908 and gfx90a would be on the specific list, and known to libgomp, so we could easily avoid those exceptions, I suppose, but this whole thing does seem suboptimal.

However, if the message only suggests *trying* the generic ISA then that may be OK. I don't like it suggesting options that are not options the compiler will outright reject.

Thus, I think it is way better to assume that GPUs listed for each gfx*- generic as having identical ISA than any other proposed way. We could hard code this ourselves (as done in the patch) or to do it by letting ROCm do the job.

(There are some restrictions listed, like "not all VGPR can be used on gfx1100" but as we added gfx1103, we can just use the gfx1103 settings as gfx1100 does not have those features, either.)

Thus, I still regard my proposed approach as superior.


I'm happy to add the new gfx9-generic, and improving the diagnostics is always good, but I'm not convinced about making it look like we support devices we've never tested.

As mentioned, AMD regards them as compatible. I am happy to add some wording like "(unsupported)" to the -march= documentation, in case it helps.

"Unsupported" is ambiguous; "experimental" might better explain that YMMV, but I'm still not convinced about adding them just for the diagnostics.

Instead of adding "GCN_DEVICE" for all of them, there's nothing stopping us introducing GCN_UNSUPPORTED_DEVICE which is ignored by the .opt generation, etc, but can be used for diagnostics. In fact, I had this concept in an older version of the patch series that introduced gcn-devices.def, but I removed it because maintaining a list of unsupported devices seemed like make-work and doomed to be forever out of date in any toolchain in the wild.

Andrew

Reply via email to