Re: [Patch] [GCN] Handle generic ISA names in libgomp's plugin-gcn.c

Andrew Stubbs Wed, 05 Feb 2025 05:52:13 -0800

On 05/02/2025 12:51, Tobias Burnus wrote:

Hi Andrew,
Andrew Stubbs wrote:
On 05/02/2025 11:14, Tobias Burnus wrote:
Therefore, the following GPUs are now supported in addition: gfx902,gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101,gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, themultilib config has not been touched, hence, those 14 device typesand gfx{9,10-3,11}-generic are not supported by default. Currently,the following 9 GPUs are enabled by default:gfx900, gfx906, gfx908,gfx90a, gfx90c, gfx1030, gfx1036, gfx1100, andgfx1103.
I'm not too happy about adding a whole list of specific devices thatwe have not tested. So far, whenever I have added a new device therehave been meta-data oddities and such-like that needed to be tweaked.
Well, the idea is: If AMD has collected them under the same genericname, the ISA must be compatible. The LLVM page lists some restrictions(such as not having sramecc support when using generic) but none of thelisted items match what we have.
I fail to see how an ISA that works with, e.g., gfx9-generic willsuddenly fail when compiling for it with gfx902, which except for theELF flag contains identical code.

I can imagine that the "generic" ISA is some safe subset of the actualISA, but that's not really the issue here.

The biggest source of incompatibility has been in the metadata that isattached to each kernel entrypoint. Possibly these new devices would befine, maybe even probably, but until we've proven it then I don't likemaking assumptions.

I also don't like adding knowledge of unsupported devices purely forimproving diagnostics.
I think we have the option to delegate the checking purely to ROCm. Thengfx9-generic will run on gfx909 – or we do our own checking. But then weneed to somehow know whether gfx9-generic code will run on gfx909 or not– or we bluntly reject it.

Yes, it would be a shame to refuse to run on a new device that wouldwork just fine in generic mode, only because we never added it to a list.

It's fine for the known-unsupported devices, but wait a month or soand there will be new unknown-unsupported devices, and the messagedegrades again. Worse, the new diagnostic can recommend trying -march=<name> for devices which the compiler will recognize but havenever been tested, and probably don't have multilibs configured.
The having-no-multilib-configured issue is difficult to come by, unlesswe want to filter them out when building libgomp. We could do so,however, by doing some preprocessing.
The problem is that we then need to have two checks:
(a) Whether it runs (if we don't relegate it to ROCm) – in that case,gfx902 hardware with gfx9-generic should just work, even if there isneither a gfx902 nor gfx9-generic multilib. After all, the user managedto link the executable.

The only reason libgomp has a test is because the error message given byHSA was not friendly ("wrong arguments", IIRC), and we wanted to provethat. Having a try-and-catch approach for generic devices might be better?

(b) When recompiling on the same system as running the build, suggestinga -march=gfx... that has a multilib would be better, i.e. here thefiltered-out value could be helpful.
(c) For suggesting generic, we also would need to check the ROCm versionto only propose it when ROCm is > 6.3, assuming that's the thing.


This we can do. Or we can just catch run-failures, probably.

BTW: The issue of having no multilib configured is not really new. Wehad it before with fiji or when the user configured GCC in some non-default way. (As we currently enable all GPUs by default. But I think wedidn't do so for a while for the gfx1... ones. But I don't recallwhether we did do so for a release or not.)

Not new, no, but the Fiji device was a special case. These new devicesare likely to be much more common in the wild.

Depending what the down-sides are, it might even make sense for us todeliberately *only* support the generic devices for RDNA (once ROCm 6.3+is widely deployed). We would continue to support the specific CDNAdevices that are deployed in HPC.

And I think the error is also not to illegible:

ld: error: unable to find library -lgfortran
gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc returned 1 exit status

I believe you get that message when the multilib is configured butabsent. If it's just plain not configured then the compiler will attemptto link in the default multilib, which results in an ELF arch mismatcherror.

A better approach might be to pattern-match "gfx{9,10,11}" in the nameHSA gives you for the physical device and recommend generic -march=gfx{9,10,11}-generic in those cases?
I think that will be way worse. — gfx908 and gfx90a are *not* compatiblewith gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices butonly in gfx9-4-generic and not supported by us. And for gfx10, we onlysupport gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currentlyonly 0 and 3), but not gfx10-1-generic (gfx101{0,1,2,3}).

gfx908 and gfx90a would be on the specific list, and known to libgomp,so we could easily avoid those exceptions, I suppose, but this wholething does seem suboptimal.

However, if the message only suggests *trying* the generic ISA then thatmay be OK. I don't like it suggesting options that are not options thecompiler will outright reject.

Thus, I think it is way better to assume that GPUs listed for each gfx*-generic as having identical ISA than any other proposed way. We couldhard code this ourselves (as done in the patch) or to do it by lettingROCm do the job.
(There are some restrictions listed, like "not all VGPR can be used ongfx1100" but as we added gfx1103, we can just use the gfx1103 settingsas gfx1100 does not have those features, either.)
Thus, I still regard my proposed approach as superior.
I'm happy to add the new gfx9-generic, and improving the diagnosticsis always good, but I'm not convinced about making it look like wesupport devices we've never tested.
As mentioned, AMD regards them as compatible. I am happy to add somewording like "(unsupported)" to the -march= documentation, in case ithelps.

"Unsupported" is ambiguous; "experimental" might better explain thatYMMV, but I'm still not convinced about adding them just for thediagnostics.

Instead of adding "GCN_DEVICE" for all of them, there's nothing stoppingus introducing GCN_UNSUPPORTED_DEVICE which is ignored by the .optgeneration, etc, but can be used for diagnostics. In fact, I had thisconcept in an older version of the patch series that introducedgcn-devices.def, but I removed it because maintaining a list ofunsupported devices seemed like make-work and doomed to be forever outof date in any toolchain in the wild.


Andrew

Re: [Patch] [GCN] Handle generic ISA names in libgomp's plugin-gcn.c

Reply via email to