Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Andrew Stubbs Wed, 14 Feb 2024 07:24:22 -0800

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:

Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:

[...] Vector
reductions will need to be reworked for RDNA2.  [...]

   * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
   (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
   (subc<mode>3<exec_vcc>): Likewise.
   (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
   (vec_cmp<mode>di): Likewise.
   (vec_cmp<u><mode>di): Likewise.
   (vec_cmp<mode>di_exec): Likewise.
   (vec_cmp<u><mode>di_exec): Likewise.
   (vec_cmp<mode>di_dup): Likewise.
   (vec_cmp<mode>di_dup_exec): Likewise.
   (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
   (*<reduc_op>_dpp_shr_<mode>): Likewise.
   (*plus_carry_dpp_shr_<mode>): Likewise.
   (*plus_carry_in_dpp_shr_<mode>): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.

Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
           -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
           -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all \
           -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 \
           -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

       +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

       -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
       +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
       +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
       i=1, ints[i]=0x1 a=1, b=2
       i=2, ints[i]=0x80000000 a=1, b=2
       i=3, ints[i]=0x2 a=1, b=2
       i=4, ints[i]=0x40000000 a=1, b=2
       i=5, ints[i]=0x10000 a=1, b=2
       i=6, ints[i]=0x8000 a=1, b=2
       i=7, ints[i]=0xa5a5a5a5 a=16, b=32
       i=8, ints[i]=0x5a5a5a5a a=16, b=32
       i=9, ints[i]=0xcafe0000 a=11, b=22
       i=10, ints[i]=0xcafe00 a=11, b=22
       i=11, ints[i]=0xcafe a=11, b=22
       i=12, ints[i]=0xffffffff a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).

Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

       int my_popcount (unsigned int x)
       {
         int stmp__12.12;
         vector(64) int vect__12.11;
         vector(64) unsigned int vect__1.8;
         vector(64) unsigned int _13;
         vector(64) unsigned int vect_cst__18;
         vector(64) int [all others];

<bb 2> [local count: 32534376]:

         vect_cst__18 = { [all 'x_8(D)'] };
         vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
         _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
         vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
         _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
         _27 = vect__12.11_24 + _26;
         _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
         _29 = _27 + _28;
         _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
         _31 = _29 + _30;
         _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
         _33 = _31 + _32;
         _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
         _35 = _33 + _34;
         _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
         _37 = _35 + _36;
         stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
         return stmp__12.12_38;

..., for example, for 'x = 7', we get:

         vect_cst__18 = { [all '7'] };
         vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
         _13 = { 1, 1, 1, 0, 0, 0, [...] };
         vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
         _26 = { [all '0'] };
         _27 = { 1, 1, 1, 0, 0, 0, [...] };
         _28 = { [all '0'] };
         _29 = { 1, 1, 1, 0, 0, 0, [...] };
         _30 = { [all '0'] };
         _31 = { 1, 1, 1, 0, 0, 0, [...] };
         _32 = { [all '0'] };
         _33 = { 1, 1, 1, 0, 0, 0, [...] };
         _34 = { 1, 0, 0, 0, [...] };
         _35 = { 2, 1, 1, 0, 0, 0, [...] };
         _36 = { 1, 1, 0, 0, 0, [...] };
         _37 = { 3, 2, 1, 0, 0, 0, [...] };
         stmp__12.12_38 = 3;
         return 3;

..., so the algorithm would appear to synthesize correct code for that
case.  Adding '7' to 'builtin-bitops-1.c', we however again get:

       i=13, ints[i]=0x7 a=3, b=6


With the following hack applied to 'gcc/tree-vect-loop.cc':

       @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
              reduce_with_shift = have_whole_vector_shift (mode1);
              if (!VECTOR_MODE_P (mode1)
                 || !directly_supported_p (code, vectype1))
               reduce_with_shift = false;
       +      reduce_with_shift = false;

..., I'm able to work around those regressions: by means of forcing
"Reduce using scalar code" instead of "Reduce using vector shifts".


I would say it somewhere gets broken between the vectorizer and the GPU
which means likely in the target?  Can you point out an issue in the
actual generated GCN code?

Iff this kind of reduction is the issue you'd see quite a lot of
vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
be that the "mask" is still set wrong when doing the reduction
steps?


It looks like the ds_bpermute_b32 instruction works differently on RDNA3
(vs. GCN/CDNA and even RDNA2).

  From the pseudocode in the documentation:

    for i in 0 : WAVE64 ? 63 : 31 do
      // ADDR needs to be divided by 4.
      // High-order bits are ignored.
      // NOTE: destination lane is MOD 32 regardless of wave size.
      src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
      // EXEC is applied to the source VGPR reads.
      if EXEC[src_lane].u1 then
        tmp[i] = VGPR[src_lane][DATA0]
      endif
    endfor;

The key detail is the "mod 32"; the other architectures have "mod 64"
there.

So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
into the last, and this explains why my_popcount returns double the expected
value for smaller inputs.

Richi, can you confirm that this testcase works properly on your card,
please?

To test, assuming you only have the offload toolchain built, compile using
x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw AMD
ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
libexec).


I'm getting

i=1, ints[i]=0x1 a=1, b=2
i=2, ints[i]=0x80000000 a=1, b=2
i=3, ints[i]=0x2 a=1, b=2
i=4, ints[i]=0x40000000 a=1, b=2
i=5, ints[i]=0x10000 a=1, b=2
i=6, ints[i]=0x8000 a=1, b=2
i=7, ints[i]=0xa5a5a5a5 a=16, b=32
i=8, ints[i]=0x5a5a5a5a a=16, b=32
i=9, ints[i]=0xcafe0000 a=11, b=22
i=10, ints[i]=0xcafe00 a=11, b=22
i=11, ints[i]=0xcafe a=11, b=22
i=12, ints[i]=0xffffffff a=32, b=64

which I think is the same as Thomas' output and thus wrong?

When building with -O0 I get no output.

I'm of course building with -march=gfx1030


OK, please try this example, just to check my expectation that your permute
works:

typedef int v64si __attribute__ ((vector_size (256)));

int main()
{
   v64si permute = {
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40
   };
   v64si result;

   asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), "e"(-1L));

   for (int i=0; i<63; i++)
     __builtin_printf ("%d ", result[i]);
   __builtin_printf ("\n");

   return 0;
}

On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it prints
"10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
from the pseudocode, but does match the written description). Which do you
get?


10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
42 42 42 42 42 42 42 42 42 42 42 42 42

so RDNA2 matches RDNA3 here.

OK, that probably is the problem with both our reductions then. The RDNA2
manual has the 32-lane wording in the description, but the instruction
pseudocode lies. :(

I'm now not sure how to implement permute without actually hitting memory?
The permutation vector is exactly what we'd need to do a gather load from
memory (not a coincidence), but we'd need to find a memory location to do
it, ideally in the low-latency LDS memory, and it'd have to be thread-safe.

The attached not-well-tested patch should allow only valid permutations.
Hopefully we go back to working code, but there'll be things that won't
vectorize. That said, the new "dump" output code has fewer and probably
cheaper instructions, so hmmm.


Andrew

amdgcn: Disallow unsupported permute on RDNA devices

The RDNA architecture has limited support for permute operations.  This should
allow use of the permutations that do work, and fall back to linear code for
other cases.

gcc/ChangeLog:

        * config/gcn/gcn-valu.md
        (vec_extract<V_MOV:mode><V_MOV_ALT:mode>): Add conditions for RDNA.
        * config/gcn/gcn.cc (gcn_vectorize_vec_perm_const): Check permutation
        details are supported on RDNA devices.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 23b441f8e8b..59e27d0aed7 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -982,7 +982,8 @@
    (match_operand:V_MOV 1 "register_operand")
    (match_operand 2 "immediate_operand")]
   "MODE_VF (<V_MOV_ALT:MODE>mode) < MODE_VF (<V_MOV:MODE>mode)
-   && <V_MOV_ALT:SCALAR_MODE>mode == <V_MOV:SCALAR_MODE>mode"
+   && <V_MOV_ALT:SCALAR_MODE>mode == <V_MOV:SCALAR_MODE>mode
+   && (!TARGET_RDNA2_PLUS || MODE_VF (<V_MOV:MODE>mode) <= 32)"
   {
     int numlanes = GET_MODE_NUNITS (<V_MOV_ALT:MODE>mode);
     int firstlane = INTVAL (operands[2]) * numlanes;
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index e80de2ce056..f067743e31a 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5110,19 +5110,24 @@ gcn_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
   gcc_assert (nelt <= 64);
   gcc_assert (sel.length () == nelt);
 
-  if (!dst)
-    {
-      /* All vector permutations are possible on this architecture,
-         with varying degrees of efficiency depending on the permutation. */
-      return true;
-    }
-
   unsigned int perm[64];
   for (unsigned int i = 0; i < nelt; ++i)
     perm[i] = sel[i] & (2 * nelt - 1);
   for (unsigned int i = nelt; i < 64; ++i)
     perm[i] = 0;
 
+  /* RDNA devices can only do permutations within each group of 32 lanes.
+     Reject permutations that cross the boundary.  */
+  if (TARGET_RDNA2_PLUS)
+    for (unsigned int i = 0; i < nelt; i++)
+      if (i < 32 ? perm[i] > 31 : perm[i] < 32)
+       return false;
+
+  /* All vector permutations are possible on other architectures,
+     with varying degrees of efficiency depending on the permutation. */
+  if (!dst)
+    return true;
+
   src0 = force_reg (vmode, src0);
   src1 = force_reg (vmode, src1);
