On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 07:49, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:43, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
> >>>> On 14/02/2024 13:27, Richard Biener wrote:
> >>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>>>
> >>>>>> On 13/02/2024 08:26, Richard Biener wrote:
> >>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com>
> >>>>>>>> wrote:
> >>>>>>>>> I've committed this patch
> >>>>>>>>
> >>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>>>>>
> >>>>>>>> The RDNA2 ISA variant doesn't support certain instructions previously
> >>>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
> >>>>>>>> disabled:
> >>>>>>>>
> >>>>>>>>> [...] Vector
> >>>>>>>>> reductions will need to be reworked for RDNA2.  [...]
> >>>>>>>>
> >>>>>>>>>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>>>>>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>>>>>     (subc<mode>3<exec_vcc>): Likewise.
> >>>>>>>>>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>>>>>     (vec_cmp<mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>>>>>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>>>>>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>>>>>
> >>>>>>>> Etc.  The expectation being that the GCC middle end copes with this,
> >>>>>>>> and synthesizes some less ideal yet still functional vector code, I
> >>>>>>>> presume.
> >>>>>>>>
> >>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's
> >>>>>>>> what I'm currently working on getting proper GCC/GCN target (not
> >>>>>>>> offloading) results for.
> >>>>>>>>
> >>>>>>>> I'm seeing a good number of execution test FAILs (regressions
> >>>>>>>> compared to my earlier non-gfx1100 testing), and I've now tracked
> >>>>>>>> down where one large class of those comes into existence -- not yet
> >>>>>>>> how to resolve, unfortunately.  But maybe, with you guys' combined
> >>>>>>>> vectorizer and back end experience, the latter will be done quickly?
> >>>>>>>>
> >>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>>>>>> offloading) testing; let me know if you have any questions about
> >>>>>>>> that.
> >>>>>>>
> >>>>>>> I've only done offload testing - in the x86_64 build tree run
> >>>>>>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>>>>>> (maybe document it on the wiki even!) I can try to do that as well.
> >>>>>>>
> >>>>>>>> Given that (at least largely?) the same patterns etc. are disabled
> >>>>>>>> as in my gfx1100 configuration, I suppose your gfx1030 one would
> >>>>>>>> exhibit the same issues.  You can build GCC/GCN target like you
> >>>>>>>> build the offloading one, just remove
> >>>>>>>> '--enable-as-accelerator-for=[...]'.  Likely, you can even use an
> >>>>>>>> offloading GCC/GCN build to reproduce the issue below.
> >>>>>>>>
> >>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>>>>>
> >>>>>>>>         $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
> >>>>>>>>           -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
> >>>>>>>>           -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib \
> >>>>>>>>           -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all \
> >>>>>>>>           -save-temps -march=gfx1100 -O1 -ftree-vectorize
> >>>>>>>>
> >>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>>>>>
> >>>>>>>>         +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> >>>>>>>>
> >>>>>>>> ..., and therefore:
> >>>>>>>>
> >>>>>>>>         -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> >>>>>>>>         +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>>>>>         +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>>>>>
> >>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
> >>>>>>>> build a chain of summation of 'VEC_PERM_EXPR's.  However, there's
> >>>>>>>> wrong code generated:
> >>>>>>>>
> >>>>>>>>         $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>>>>>         i=1, ints[i]=0x1 a=1, b=2
> >>>>>>>>         i=2, ints[i]=0x80000000 a=1, b=2
> >>>>>>>>         i=3, ints[i]=0x2 a=1, b=2
> >>>>>>>>         i=4, ints[i]=0x40000000 a=1, b=2
> >>>>>>>>         i=5, ints[i]=0x10000 a=1, b=2
> >>>>>>>>         i=6, ints[i]=0x8000 a=1, b=2
> >>>>>>>>         i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>>>>>         i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>>>>>         i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>>>>>         i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>>>>>         i=11, ints[i]=0xcafe a=11, b=22
> >>>>>>>>         i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>>>>
> >>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>>>>>
> >>>>>>>> I don't speak enough "vectorization" to fully understand the generic
> >>>>>>>> vectorized algorithm and its implementation.  It appears that the
> >>>>>>>> "Reduce using vector shifts" code has been around for a very long
> >>>>>>>> time, but also has gone through a number of changes.  I can't tell
> >>>>>>>> which GCC targets/configurations it's actually used for (in the same
> >>>>>>>> way as for GCN gfx1100), and thus whether there's an issue in that
> >>>>>>>> vectorizer code, or rather in the GCN back end, or the GCN back end
> >>>>>>>> parameterizing the generic code?
> >>>>>>>
> >>>>>>> The "shift" reduction is basically doing reduction by repeatedly
> >>>>>>> adding the upper to the lower half of the vector (each time halving
> >>>>>>> the vector size).
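> >>>>>>>
> >>>>>>> For illustration, a minimal scalar C model of that scheme (just a
> >>>>>>> sketch of the idea, not the vectorizer's actual code; the function
> >>>>>>> name is made up):
> >>>>>>>
> >>>>>>>   /* Fold the upper half onto the lower half log2(n) times;
> >>>>>>>      the full sum ends up in element 0.  */
> >>>>>>>   int
> >>>>>>>   reduce_by_halves (int *v, int n)  /* n is a power of two */
> >>>>>>>   {
> >>>>>>>     for (int half = n / 2; half >= 1; half /= 2)
> >>>>>>>       for (int i = 0; i < half; i++)
> >>>>>>>         v[i] += v[i + half];  /* one VEC_PERM_EXPR-plus-add step */
> >>>>>>>     return v[0];
> >>>>>>>   }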
> >>>>>>>
> >>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
> >>>>>>>> code:
> >>>>>>>>
> >>>>>>>>         int my_popcount (unsigned int x)
> >>>>>>>>         {
> >>>>>>>>           int stmp__12.12;
> >>>>>>>>           vector(64) int vect__12.11;
> >>>>>>>>           vector(64) unsigned int vect__1.8;
> >>>>>>>>           vector(64) unsigned int _13;
> >>>>>>>>           vector(64) unsigned int vect_cst__18;
> >>>>>>>>           vector(64) int [all others];
> >>>>>>>>         
> >>>>>>>>           <bb 2> [local count: 32534376]:
> >>>>>>>>           vect_cst__18 = { [all 'x_8(D)'] };
> >>>>>>>>           vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>>>>>>>           _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19,
> >>>>>>>>                            { [all '1'] }, { [all '0'] });
> >>>>>>>>           vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>>>>>           _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] },
> >>>>>>>>                                { 32, 33, 34, [...], 93, 94, 95 }>;
> >>>>>>>>           _27 = vect__12.11_24 + _26;
> >>>>>>>>           _28 = VEC_PERM_EXPR <_27, { [all '0'] },
> >>>>>>>>                                { 16, 17, 18, [...], 77, 78, 79 }>;
> >>>>>>>>           _29 = _27 + _28;
> >>>>>>>>           _30 = VEC_PERM_EXPR <_29, { [all '0'] },
> >>>>>>>>                                { 8, 9, 10, [...], 69, 70, 71 }>;
> >>>>>>>>           _31 = _29 + _30;
> >>>>>>>>           _32 = VEC_PERM_EXPR <_31, { [all '0'] },
> >>>>>>>>                                { 4, 5, 6, [...], 65, 66, 67 }>;
> >>>>>>>>           _33 = _31 + _32;
> >>>>>>>>           _34 = VEC_PERM_EXPR <_33, { [all '0'] },
> >>>>>>>>                                { 2, 3, 4, [...], 63, 64, 65 }>;
> >>>>>>>>           _35 = _33 + _34;
> >>>>>>>>           _36 = VEC_PERM_EXPR <_35, { [all '0'] },
> >>>>>>>>                                { 1, 2, 3, [...], 62, 63, 64 }>;
> >>>>>>>>           _37 = _35 + _36;
> >>>>>>>>           stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>>>>>           return stmp__12.12_38;
> >>>>>>>>
> >>>>>>>> ..., for example, for 'x = 7', we get:
> >>>>>>>>
> >>>>>>>>           vect_cst__18 = { [all '7'] };
> >>>>>>>>           vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>>>>>           _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _26 = { [all '0'] };
> >>>>>>>>           _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _28 = { [all '0'] };
> >>>>>>>>           _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _30 = { [all '0'] };
> >>>>>>>>           _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _32 = { [all '0'] };
> >>>>>>>>           _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _34 = { 1, 0, 0, 0, [...] };
> >>>>>>>>           _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>>>>>           stmp__12.12_38 = 3;
> >>>>>>>>           return 3;
> >>>>>>>>
> >>>>>>>> ..., so the algorithm would appear to synthesize correct code for
> >>>>>>>> that case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>>>>>
> >>>>>>>>         i=13, ints[i]=0x7 a=3, b=6
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>>>>>
> >>>>>>>>         @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
> >>>>>>>>                reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>>>>>                if (!VECTOR_MODE_P (mode1)
> >>>>>>>>                   || !directly_supported_p (code, vectype1))
> >>>>>>>>                 reduce_with_shift = false;
> >>>>>>>>         +      reduce_with_shift = false;
> >>>>>>>>
> >>>>>>>> ..., I'm able to work around those regressions, by forcing
> >>>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
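> >>>>>>>>
> >>>>>>>> (For contrast, "Reduce using scalar code" is morally just this
> >>>>>>>> sequential loop -- a sketch, not the actual epilogue code:
> >>>>>>>>
> >>>>>>>>   int sum = 0;
> >>>>>>>>   for (int i = 0; i < 64; i++)  /* extract lane i, accumulate */
> >>>>>>>>     sum += v[i];
> >>>>>>>>
> >>>>>>>> ..., which only ever extracts single lanes and so doesn't depend on
> >>>>>>>> cross-lane permutes.)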
> >>>>>>>
> >>>>>>> I would say it somewhere gets broken between the vectorizer and the
> >>>>>>> GPU, which means likely in the target?  Can you point out an issue
> >>>>>>> in the actual generated GCN code?
> >>>>>>>
> >>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>>>>>> vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>>>>>> be that the "mask" is still set wrong when doing the reduction
> >>>>>>> steps?
> >>>>>>
> >>>>>> It looks like the ds_bpermute_b32 instruction works differently on
> >>>>>> RDNA3 (vs. GCN/CDNA and even RDNA2).
> >>>>>>
> >>>>>> From the pseudocode in the documentation:
> >>>>>>
> >>>>>>      for i in 0 : WAVE64 ? 63 : 31 do
> >>>>>>        // ADDR needs to be divided by 4.
> >>>>>>        // High-order bits are ignored.
> >>>>>>        // NOTE: destination lane is MOD 32 regardless of wave size.
> >>>>>>        src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>>>>>        // EXEC is applied to the source VGPR reads.
> >>>>>>        if EXEC[src_lane].u1 then
> >>>>>>          tmp[i] = VGPR[src_lane][DATA0]
> >>>>>>        endif
> >>>>>>      endfor;
> >>>>>>
> >>>>>> The key detail is the "mod 32"; the other architectures have "mod 64"
> >>>>>> there.
> >>>>>>
> >>>>>> So, the last 32 lanes are discarded, and the first 32 lanes are
> >>>>>> duplicated into the last, and this explains why my_popcount returns
> >>>>>> double the expected value for smaller inputs.
> >>>>>>
> >>>>>> Richi, can you confirm that this testcase works properly on your card,
> >>>>>> please?
> >>>>>>
> >>>>>> To test, assuming you only have the offload toolchain built, compile
> >>>>>> using x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should
> >>>>>> produce a raw AMD ELF file.  Then you run it using "gcn-run a.out"
> >>>>>> (you can find gcn-run under libexec).
> >>>>>
> >>>>> I'm getting
> >>>>>
> >>>>> i=1, ints[i]=0x1 a=1, b=2
> >>>>> i=2, ints[i]=0x80000000 a=1, b=2
> >>>>> i=3, ints[i]=0x2 a=1, b=2
> >>>>> i=4, ints[i]=0x40000000 a=1, b=2
> >>>>> i=5, ints[i]=0x10000 a=1, b=2
> >>>>> i=6, ints[i]=0x8000 a=1, b=2
> >>>>> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>> i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>> i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>> i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>> i=11, ints[i]=0xcafe a=11, b=22
> >>>>> i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>
> >>>>> which I think is the same as Thomas' output and thus wrong?
> >>>>>
> >>>>> When building with -O0 I get no output.
> >>>>>
> >>>>> I'm of course building with -march=gfx1030
> >>>>
> >>>> OK, please try this example, just to check my expectation that your
> >>>> permute works:
> >>>>
> >>>> typedef int v64si __attribute__ ((vector_size (256)));
> >>>>
> >>>> int main()
> >>>> {
> >>>>     v64si permute = {
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40
> >>>>     };
> >>>>     v64si result;
> >>>>
> >>>>     asm ("ds_bpermute_b32 %0, %1, v1"
> >>>>          : "=v"(result) : "v"(permute), "e"(-1L));
> >>>>
> >>>>     for (int i=0; i<64; i++)
> >>>>       __builtin_printf ("%d ", result[i]);
> >>>>     __builtin_printf ("\n");
> >>>>
> >>>>     return 0;
> >>>> }
> >>>>
> >>>> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it
> >>>> prints "10" 32 times, and "42" 32 times (which doesn't quite match what
> >>>> I'd expect from the pseudocode, but does match the written description).
> >>>> Which do you get?
> >>>
> >>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
> >>> 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>> 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>>
> >>> so RDNA2 matches RDNA3 here.
> >>
> >> OK, that probably is the problem with both our reductions then. The RDNA2
> >> manual has the 32-lane wording in the description, but the instruction
> >> pseudocode lies. :(
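> >>
> >> In C terms, what the experiment suggests (my reading, not anything
> >> vendor-confirmed) is that each 32-lane half permutes only within itself;
> >> with 'addr' holding the per-lane byte addresses and 'data' the source
> >> VGPR:
> >>
> >>   int data[64], addr[64], tmp[64];  /* per-lane VGPR values */
> >>   for (int i = 0; i < 64; i++)
> >>     tmp[i] = data[(i & ~31) | ((addr[i] / 4) % 32)];
> >>
> >> With addr[i] == 40 everywhere that reads lane 10 in the low half and
> >> lane 42 (32 + 10) in the high half, matching the output above.  It also
> >> means the reduction's "add the upper half onto the lower half" permute
> >> never crosses the halves -- the low half just reads itself -- which is
> >> exactly how my_popcount ends up doubled.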
> >>
> >> I'm now not sure how to implement permute without actually hitting
> >> memory?  The permutation vector is exactly what we'd need to do a
> >> gather load from memory (not a coincidence), but we'd need to find a
> >> memory location to do it, ideally in the low-latency LDS memory, and
> >> it'd have to be thread-safe.
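> >>
> >> (Roughly, the memory route would amount to this per-wave model --
> >> hypothetical C, glossing over the LDS allocation and synchronization
> >> that make it awkward:
> >>
> >>   int scratch[64];              /* would have to live in LDS, per wave */
> >>   for (int i = 0; i < 64; i++)  /* each lane scatter-stores its element */
> >>     scratch[i] = src[i];
> >>   /* wait for the stores to complete before reading */
> >>   for (int i = 0; i < 64; i++)  /* then gather by permute index */
> >>     dst[i] = scratch[perm[i]];
> >>
> >> ..., i.e. a scatter-store followed by exactly the gather load mentioned
> >> above.)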
> >>
> >> The attached not-well-tested patch should allow only valid permutations.
> >> Hopefully we go back to working code, but there'll be things that won't
> >> vectorize. That said, the new "dump" output code has fewer and probably
> >> cheaper instructions, so hmmm.
> > 
> > This fixes the reduced builtin-bitops-1.c on RDNA2.
> > 
> > I suppose if RDNA really only has 32 lane vectors (it sounds like it,
> > even if it can "simulate" 64 lane ones?) then it might make sense to
> > vectorize for 32 lanes?  That said, with variable-length it likely
> > doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> 
> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> instruction twice and presents a pair of hardware registers as a logical
> 64-lane register. This breaks down for permutations and reductions, and is
> obviously inefficient when the vectors are not fully utilized, but is
> otherwise compatible with the GCN/CDNA compiler.
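>
> (Schematically, in pseudo-C -- my understanding, hedged, with _lo/_hi
> standing for the two hardware registers backing one logical 64-lane
> register:
>
>   for (int i = 0; i < 32; i++) dst_lo[i] = a_lo[i] + b_lo[i];  /* pass 1 */
>   for (int i = 0; i < 32; i++) dst_hi[i] = a_hi[i] + b_hi[i];  /* pass 2 */
>
> ..., which is fine lane-wise, but gives cross-lane operations no way to
> see the other half.)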
> 
> I didn't want to invest all the effort it would take to support
> wavefrontsize=32, which would be the natural mode for these devices; the
> number of places that have "64" hard-coded is just too big. Not only that, but
> the EXEC and VCC registers change from DImode to SImode and that's going to
> break a lot of stuff. (And we have no paying customer for this.)
> 
> I'm open to patch submissions. :)

OK, I see ;)  As said, for fully masked that's a good answer.  I'd
probably still not expose V64mode modes in the RTL expanders for the
vect_* patterns?  Or, what happens if you change
gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?

Does that possibly leave performance on the table?  (Not sure if there's
any documentation about choosing wavefrontsize=64 vs. 32 with regard to
performance.)

Note it wouldn't entirely forbid the vectorizer from using larger modes,
it just makes it prefer the smaller ones.  OTOH if you then run
wavefrontsize=64 on top of it, it's probably wasting the 2nd instruction
by always masking it?

So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...

Anyway, the fix works, so that's the most important thing ;)

Richard.
