Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> I've committed this patch
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>
> The RDNA2 ISA variant doesn't support certain instructions previously
> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
>
> [...] Vector
> reductions will need to be reworked for RDNA2. [...]
>
>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>     (subc<mode>3<exec_vcc>): Likewise.
>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>     (vec_cmp<mode>di): Likewise.
>     (vec_cmp<u><mode>di): Likewise.
>     (vec_cmp<mode>di_exec): Likewise.
>     (vec_cmp<u><mode>di_exec): Likewise.
>     (vec_cmp<mode>di_dup): Likewise.
>     (vec_cmp<mode>di_dup_exec): Likewise.
>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
>     (*plus_carry_dpp_shr_<mode>): Likewise.
>     (*plus_carry_in_dpp_shr_<mode>): Likewise.

Etc.

The expectation being that the GCC middle end copes with this, and synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what I'm currently working on getting proper GCC/GCN target (not offloading) results for. I'm seeing a good number of execution test FAILs (regressions compared to my earlier non-gfx1100 testing), and I've now tracked down where one large class of those comes into existence -- not yet how to resolve it, unfortunately. But maybe, with you guys' combined vectorizer and back end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not offloading) testing; let me know if you have any questions about that. Given that (at least largely?) the same patterns etc. are disabled as in my gfx1100 configuration, I suppose your gfx1030 one would exhibit the same issues. You can build the GCC/GCN target toolchain like you build the offloading one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is miscompiled as soon as '-ftree-vectorize' is effective:

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
        -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
        -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib \
        -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all \
        -save-temps -march=gfx1100 -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for '-march=gfx90a' vs. '-march=gfx1100', we see:

    +builtin-bitops-1.c:7:17: missed: reduc op not supported by target.

..., and therefore:

    -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction.
    +builtin-bitops-1.c:7:17: note: Reduce using vector shifts
    +builtin-bitops-1.c:7:17: note: extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a chain of summations of 'VEC_PERM_EXPR's.

However, there's wrong code generated:

    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
    i=1, ints[i]=0x1 a=1, b=2
    i=2, ints[i]=0x80000000 a=1, b=2
    i=3, ints[i]=0x2 a=1, b=2
    i=4, ints[i]=0x40000000 a=1, b=2
    i=5, ints[i]=0x10000 a=1, b=2
    i=6, ints[i]=0x8000 a=1, b=2
    i=7, ints[i]=0xa5a5a5a5 a=16, b=32
    i=8, ints[i]=0x5a5a5a5a a=16, b=32
    i=9, ints[i]=0xcafe0000 a=11, b=22
    i=10, ints[i]=0xcafe00 a=11, b=22
    i=11, ints[i]=0xcafe a=11, b=22
    i=12, ints[i]=0xffffffff a=32, b=64

(I can't tell whether the 'b = 2 * a' pattern is purely coincidental.)
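To make the fallback concrete: as far as I understand the dump, the "Reduce using vector shifts" epilogue on a 64-lane vector amounts to the following scalar C model -- just my sketch of the intended semantics ('reduce_with_shifts' and the plain arrays are my invention), not the code the vectorizer actually emits:

    #define LANES 64

    /* Model of the "Reduce using vector shifts" epilogue: each round
       shifts the whole vector down by half the previous distance
       (zeros shifted in) and adds it back onto itself.  */
    int
    reduce_with_shifts (int v[LANES])
    {
      for (int shift = LANES / 2; shift >= 1; shift /= 2)
        {
          int t[LANES];
          /* VEC_PERM_EXPR <v, { [all '0'] }, { shift, shift + 1, ... }>  */
          for (int i = 0; i < LANES; i++)
            t[i] = i + shift < LANES ? v[i + shift] : 0;
          for (int i = 0; i < LANES; i++)
            v[i] += t[i];
        }
      /* BIT_FIELD_REF: the scalar result ends up in lane 0.  */
      return v[0];
    }

After log2(64) = 6 rounds, lane 0 holds the sum of all 64 lanes.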
I don't speak enough "vectorization" to fully understand the generic vectorized algorithm and its implementation. It appears that the "Reduce using vector shifts" code has been around for a very long time, but has also gone through a number of changes. I can't tell which GCC targets/configurations it's actually used for (in the same way as for GCN gfx1100), and thus whether there's an issue in that vectorizer code, or rather in the GCN back end, or in the GCN back end's parameterization of the generic code.

Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

    int my_popcount (unsigned int x)
    {
      int stmp__12.12;
      vector(64) int vect__12.11;
      vector(64) unsigned int vect__1.8;
      vector(64) unsigned int _13;
      vector(64) unsigned int vect_cst__18;
      vector(64) int [all others];

      <bb 2> [local count: 32534376]:
      vect_cst__18 = { [all 'x_8(D)'] };
      vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
      _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
      vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
      _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
      _27 = vect__12.11_24 + _26;
      _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
      _29 = _27 + _28;
      _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
      _31 = _29 + _30;
      _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
      _33 = _31 + _32;
      _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
      _35 = _33 + _34;
      _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
      _37 = _35 + _36;
      stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
      return stmp__12.12_38;
    }

..., for example, for 'x = 7', we get:

    vect_cst__18 = { [all '7'] };
    vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
    _13 = { 1, 1, 1, 0, 0, 0, [...] };
    vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
    _26 = { [all '0'] };
    _27 = { 1, 1, 1, 0, 0, 0, [...] };
    _28 = { [all '0'] };
    _29 = { 1, 1, 1, 0, 0, 0, [...] };
    _30 = { [all '0'] };
    _31 = { 1, 1, 1, 0, 0, 0, [...] };
    _32 = { [all '0'] };
    _33 = { 1, 1, 1, 0, 0, 0, [...] };
    _34 = { 1, 0, 0, 0, [...] };
    _35 = { 2, 1, 1, 0, 0, 0, [...] };
    _36 = { 1, 1, 0, 0, 0, [...] };
    _37 = { 3, 2, 1, 0, 0, 0, [...] };
    stmp__12.12_38 = 3;
    return 3;

..., so the algorithm would appear to synthesize correct code for that case. Adding '7' to 'builtin-bitops-1.c', we however again get:

    i=13, ints[i]=0x7 a=3, b=6

With the following hack applied to 'gcc/tree-vect-loop.cc':

    @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
           reduce_with_shift = have_whole_vector_shift (mode1);
           if (!VECTOR_MODE_P (mode1)
               || !directly_supported_p (code, vectype1))
             reduce_with_shift = false;
    +      reduce_with_shift = false;

..., I'm able to work around those regressions, by means of forcing "Reduce using scalar code" instead of "Reduce using vector shifts".

Regards,
Thomas
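P.S. For completeness, what that hack forces instead -- "Reduce using scalar code" -- amounts to extracting every lane and accumulating with scalar additions, along these lines (again just my illustration, not the actual emitted GIMPLE):

    /* "Reduce using scalar code": one per-lane extract plus one scalar
       add per lane; no cross-lane vector operations involved.  */
    int
    reduce_with_scalar_code (const int v[64])
    {
      int sum = v[0];
      for (int i = 1; i < 64; i++)
        sum += v[i];
      return sum;
    }

That avoids all cross-lane vector operations, which is presumably why it sidesteps whatever goes wrong here.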
// Reduced from 'gcc.c-torture/execute/builtin-bitops-1.c'.

int
my_popcount (unsigned x)
{
  int i;
  int count = 0;

  for (i = 0; i < 8 * sizeof (unsigned); i++)
    if (x & ((unsigned) 1 << i))
      count++;

  return count;
}

unsigned int ints[] =
  { 0x00000000UL, 0x00000001UL, 0x80000000UL, 0x00000002UL, 0x40000000UL,
    0x00010000UL, 0x00008000UL, 0xa5a5a5a5UL, 0x5a5a5a5aUL, 0xcafe0000UL,
    0x00cafe00UL, 0x0000cafeUL, 0xffffffffUL };

int
main (void)
{
  int i;

  for (i = 0; i < (sizeof (ints) / sizeof (ints[0])); i++)
    {
      int a, b;
      if ((a = __builtin_popcount (ints[i])) != (b = my_popcount (ints[i])))
        __builtin_printf ("i=%d, ints[i]=0x%x a=%d, b=%d\n", i, ints[i], a, b);
    }

  return 0;
}