On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 07:49, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >
> >> On 14/02/2024 13:43, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
> >>>> On 14/02/2024 13:27, Richard Biener wrote:
> >>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>>>
> >>>>>> On 13/02/2024 08:26, Richard Biener wrote:
> >>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> >>>>>>>>> I've committed this patch
> >>>>>>>>
> >>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>>>>>
> >>>>>>>> The RDNA2 ISA variant doesn't support certain instructions previously
> >>>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>>>>>>>
> >>>>>>>>> [...] Vector reductions will need to be reworked for RDNA2. [...]
> >>>>>>>>
> >>>>>>>>>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>>>>>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>>>>>     (subc<mode>3<exec_vcc>): Likewise.
> >>>>>>>>>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>>>>>     (vec_cmp<mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>>>>>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>>>>>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>>>>>
> >>>>>>>> Etc.  The expectation being that GCC middle end copes with this, and
> >>>>>>>> synthesizes some less ideal yet still functional vector code, I
> >>>>>>>> presume.
> >>>>>>>>
> >>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's
> >>>>>>>> what I'm currently working on getting proper GCC/GCN target (not
> >>>>>>>> offloading) results for.
> >>>>>>>>
> >>>>>>>> I'm seeing a good number of execution test FAILs (regressions
> >>>>>>>> compared to my earlier non-gfx1100 testing), and I've now tracked
> >>>>>>>> down where one large class of those comes into existence -- not yet
> >>>>>>>> how to resolve, unfortunately.  But maybe, with you guys' combined
> >>>>>>>> vectorizer and back end experience, the latter will be done quickly?
> >>>>>>>>
> >>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>>>>>> offloading) testing; let me know if you have any questions about
> >>>>>>>> that.
> >>>>>>>
> >>>>>>> I've only done offload testing - in the x86_64 build tree run
> >>>>>>> check-target-libgomp.  If you can tell me how to do GCN target
> >>>>>>> testing (maybe document it on the wiki even!) I can try to do that
> >>>>>>> as well.
> >>>>>>>
> >>>>>>>> Given that (at least largely?) the same patterns etc. are disabled
> >>>>>>>> as in my gfx1100 configuration, I suppose your gfx1030 one would
> >>>>>>>> exhibit the same issues.  You can build GCC/GCN target like you
> >>>>>>>> build the offloading one, just remove
> >>>>>>>> '--enable-as-accelerator-for=[...]'.  Likely, you can even use an
> >>>>>>>> offloading GCC/GCN build to reproduce the issue below.
> >>>>>>>>
> >>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>>>>>
> >>>>>>>>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
> >>>>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
> >>>>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib \
> >>>>>>>>       -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all \
> >>>>>>>>       -save-temps -march=gfx1100 -O1 -ftree-vectorize
> >>>>>>>>
> >>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>>>>>
> >>>>>>>>     +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> >>>>>>>>
> >>>>>>>> ..., and therefore:
> >>>>>>>>
> >>>>>>>>     -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> >>>>>>>>     +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>>>>>     +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>>>>>
> >>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
> >>>>>>>> build a chain of summation of 'VEC_PERM_EXPR's.  However, there's
> >>>>>>>> wrong code generated:
> >>>>>>>>
> >>>>>>>>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>>>>>     i=1, ints[i]=0x1 a=1, b=2
> >>>>>>>>     i=2, ints[i]=0x80000000 a=1, b=2
> >>>>>>>>     i=3, ints[i]=0x2 a=1, b=2
> >>>>>>>>     i=4, ints[i]=0x40000000 a=1, b=2
> >>>>>>>>     i=5, ints[i]=0x10000 a=1, b=2
> >>>>>>>>     i=6, ints[i]=0x8000 a=1, b=2
> >>>>>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>>>>>     i=11, ints[i]=0xcafe a=11, b=22
> >>>>>>>>     i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>>>>
> >>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>>>>>
> >>>>>>>> I don't speak enough "vectorization" to fully understand the
> >>>>>>>> generic vectorized algorithm and its implementation.  It appears
> >>>>>>>> that the "Reduce using vector shifts" code has been around for a
> >>>>>>>> very long time, but also has gone through a number of changes.  I
> >>>>>>>> can't tell which GCC targets/configurations it's actually used for
> >>>>>>>> (in the same way as for GCN gfx1100), and thus whether there's an
> >>>>>>>> issue in that vectorizer code, or rather in the GCN back end, or in
> >>>>>>>> the GCN back end parameterizing the generic code?
> >>>>>>>
> >>>>>>> The "shift" reduction is basically doing reduction by repeatedly
> >>>>>>> adding the upper to the lower half of the vector (each time halving
> >>>>>>> the vector size).
> >>>>>>>
> >>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
> >>>>>>>> code:
> >>>>>>>>
> >>>>>>>>     int my_popcount (unsigned int x)
> >>>>>>>>     {
> >>>>>>>>       int stmp__12.12;
> >>>>>>>>       vector(64) int vect__12.11;
> >>>>>>>>       vector(64) unsigned int vect__1.8;
> >>>>>>>>       vector(64) unsigned int _13;
> >>>>>>>>       vector(64) unsigned int vect_cst__18;
> >>>>>>>>       vector(64) int [all others];
> >>>>>>>>
> >>>>>>>>       <bb 2> [local count: 32534376]:
> >>>>>>>>       vect_cst__18 = { [all 'x_8(D)'] };
> >>>>>>>>       vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>>>>>>>       _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
> >>>>>>>>       vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>>>>>       _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
> >>>>>>>>       _27 = vect__12.11_24 + _26;
> >>>>>>>>       _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
> >>>>>>>>       _29 = _27 + _28;
> >>>>>>>>       _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
> >>>>>>>>       _31 = _29 + _30;
> >>>>>>>>       _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
> >>>>>>>>       _33 = _31 + _32;
> >>>>>>>>       _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
> >>>>>>>>       _35 = _33 + _34;
> >>>>>>>>       _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
> >>>>>>>>       _37 = _35 + _36;
> >>>>>>>>       stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>>>>>       return stmp__12.12_38;
> >>>>>>>>
> >>>>>>>> ..., for example, for 'x = 7', we get:
> >>>>>>>>
> >>>>>>>>     vect_cst__18 = { [all '7'] };
> >>>>>>>>     vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>>>>>     _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _26 = { [all '0'] };
> >>>>>>>>     _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _28 = { [all '0'] };
> >>>>>>>>     _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _30 = { [all '0'] };
> >>>>>>>>     _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _32 = { [all '0'] };
> >>>>>>>>     _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _34 = { 1, 0, 0, 0, [...] };
> >>>>>>>>     _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>>>>>     _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>>>>>     stmp__12.12_38 = 3;
> >>>>>>>>     return 3;
> >>>>>>>>
> >>>>>>>> ..., so the algorithm would appear to synthesize correct code for
> >>>>>>>> that case.
> >>>>>>>> Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>>>>>
> >>>>>>>>     i=13, ints[i]=0x7 a=3, b=6
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>>>>>
> >>>>>>>>     @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
> >>>>>>>>        reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>>>>>        if (!VECTOR_MODE_P (mode1)
> >>>>>>>>            || !directly_supported_p (code, vectype1))
> >>>>>>>>          reduce_with_shift = false;
> >>>>>>>>     +  reduce_with_shift = false;
> >>>>>>>>
> >>>>>>>> ..., I'm able to work around those regressions: by means of forcing
> >>>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> >>>>>>>
> >>>>>>> I would say it somewhere gets broken between the vectorizer and the
> >>>>>>> GPU which means likely in the target?  Can you point out an issue in
> >>>>>>> the actual generated GCN code?
> >>>>>>>
> >>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>>>>>> vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>>>>>> be that the "mask" is still set wrong when doing the reduction
> >>>>>>> steps?
> >>>>>>
> >>>>>> It looks like the ds_bpermute_b32 instruction works differently on
> >>>>>> RDNA3 (vs. GCN/CDNA and even RDNA2).
> >>>>>>
> >>>>>> From the pseudocode in the documentation:
> >>>>>>
> >>>>>>     for i in 0 : WAVE64 ? 63 : 31 do
> >>>>>>       // ADDR needs to be divided by 4.
> >>>>>>       // High-order bits are ignored.
> >>>>>>       // NOTE: destination lane is MOD 32 regardless of wave size.
> >>>>>>       src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>>>>>       // EXEC is applied to the source VGPR reads.
> >>>>>>       if EXEC[src_lane].u1 then
> >>>>>>         tmp[i] = VGPR[src_lane][DATA0]
> >>>>>>       endif
> >>>>>>     endfor;
> >>>>>>
> >>>>>> The key detail is the "mod 32"; the other architectures have
> >>>>>> "mod 64" there.
> >>>>>>
> >>>>>> So, the last 32 lanes are discarded, and the first 32 lanes are
> >>>>>> duplicated into the last, and this explains why my_popcount returns
> >>>>>> double the expected value for smaller inputs.
> >>>>>>
> >>>>>> Richi, can you confirm that this testcase works properly on your
> >>>>>> card, please?
> >>>>>>
> >>>>>> To test, assuming you only have the offload toolchain built, compile
> >>>>>> using x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should
> >>>>>> produce a raw AMD ELF file.  Then you run it using "gcn-run a.out"
> >>>>>> (you can find gcn-run under libexec).
> >>>>>
> >>>>> I'm getting
> >>>>>
> >>>>>     i=1, ints[i]=0x1 a=1, b=2
> >>>>>     i=2, ints[i]=0x80000000 a=1, b=2
> >>>>>     i=3, ints[i]=0x2 a=1, b=2
> >>>>>     i=4, ints[i]=0x40000000 a=1, b=2
> >>>>>     i=5, ints[i]=0x10000 a=1, b=2
> >>>>>     i=6, ints[i]=0x8000 a=1, b=2
> >>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>>     i=11, ints[i]=0xcafe a=11, b=22
> >>>>>     i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>
> >>>>> which I think is the same as Thomas' output and thus wrong?
> >>>>>
> >>>>> When building with -O0 I get no output.
> >>>>>
> >>>>> I'm of course building with -march=gfx1030
> >>>>
> >>>> OK, please try this example, just to check my expectation that your
> >>>> permute works:
> >>>>
> >>>>     typedef int v64si __attribute__ ((vector_size (256)));
> >>>>
> >>>>     int main()
> >>>>     {
> >>>>       v64si permute = {
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40,
> >>>>         40, 40, 40, 40, 40, 40, 40, 40
> >>>>       };
> >>>>       v64si result;
> >>>>
> >>>>       asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), "e"(-1L));
> >>>>
> >>>>       for (int i=0; i<63; i++)
> >>>>         __builtin_printf ("%d ", result[i]);
> >>>>       __builtin_printf ("\n");
> >>>>
> >>>>       return 0;
> >>>>     }
> >>>>
> >>>> On GCN/CDNA devices I expect this to print "10" 64 times.  On RDNA3 it
> >>>> prints "10" 32 times, and "42" 32 times (which doesn't quite match
> >>>> what I'd expect from the pseudocode, but does match the written
> >>>> description).  Which do you get?
> >>>
> >>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
> >>> 10 10 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>> 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>>
> >>> so RDNA2 matches RDNA3 here.
> >>
> >> OK, that probably is the problem with both our reductions then.  The
> >> RDNA2 manual has the 32-lane wording in the description, but the
> >> instruction pseudocode lies. :(
> >>
> >> I'm now not sure how to implement permute without actually hitting
> >> memory?  The permutation vector is exactly what we'd need to do a
> >> gather load from memory (not a coincidence), but we'd need to find a
> >> memory location to do it, ideally in the low-latency LDS memory, and
> >> it'd have to be thread-safe.
> >>
> >> The attached not-well-tested patch should allow only valid permutations.
> >> Hopefully we go back to working code, but there'll be things that won't
> >> vectorize.  That said, the new "dump" output code has fewer and probably
> >> cheaper instructions, so hmmm.
> >
> > This fixes the reduced builtin-bitops-1.c on RDNA2.
> >
> > I suppose if RDNA really only has 32-lane vectors (it sounds like it,
> > even if it can "simulate" 64-lane ones?) then it might make sense to
> > vectorize for 32 lanes?  That said, with variable-length it likely
> > doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
>
> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> instruction twice and presents a pair of hardware registers as a logical
> 64-lane register.  This breaks down for permutations and reductions, and
> is obviously inefficient when the vectors are not fully utilized, but is
> otherwise compatible with the GCN/CDNA compiler.
>
> I didn't want to invest all the effort it would take to support
> wavefrontsize=32, which would be the natural mode for these devices; the
> number of places that have "64" hard-coded is just too big.  Not only
> that, but the EXEC and VCC registers change from DImode to SImode and
> that's going to break a lot of stuff.  (And we have no paying customer
> for this.)
>
> I'm open to patch submissions. :)
OK, I see ;)  As said, for fully masked that's a good answer.  I'd probably
still not expose V64 modes in the RTL expanders for the vect_* patterns?

Or, what happens if you change gcn_vectorize_preferred_simd_mode to return
32-lane modes for RDNA and omit 64-lane modes from
gcn_autovectorize_vector_modes for RDNA?  Does that possibly leave
performance on the table?  (not sure if there are any documents about
choosing wavefrontsize=64 vs. 32 with regard to performance)  Note it
would not entirely forbid the vectorizer from using larger modes, it just
makes it prefer the smaller ones.  OTOH if you then run wavefrontsize=64
on top of it it's probably wasting the 2nd instruction by always masking
it?  So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...

Anyway, the fix works, so that's the most important thing ;)

Richard.