Re: GCN RDNA2+ vs. GCC SLP vectorizer

Richard Biener Fri, 16 Feb 2024 04:26:44 -0800

On Fri, 16 Feb 2024, Andrew Stubbs wrote:

> On 16/02/2024 10:17, Richard Biener wrote:
> > On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> > 
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >> support builds on top of, and that's what I'm currently working on
> >> getting proper GCC/GCN target (not offloading) results for.
> >>
> >> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >> and hopefully representative for other SLP execution test FAILs
> >> (regressions compared to my earlier non-gfx1100 testing).
> >>
> >>      $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >>      source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >>      --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >>      -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
> >>      -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> >>      build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >>      source-gcc/newlib/libc/include
> >>      -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>      -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >>      setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
> >>      -fdump-rtl-all-all -save-temps -march=gfx1100
> >>
> >> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >> suppose it will also exhibit the same failure mode?
> >>
> >> Compared to '-march=gfx90a', the differences begin in
> >> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> >>
> >> Changed like:
> >>
> >>      @@ -38,10 +38,10 @@ int main ()
> >>       #pragma GCC novector
> >>         for (i = 1; i < N; i++)
> >>           if (a[i] != i%4 + 1)
> >>      -      abort ();
> >>      +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >>       
> >>         if (a[0] != 5)
> >>      -    abort ();
> >>      +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >>
> >> ..., we see:
> >>
> >>      $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>      40 5 != 1
> >>      41 6 != 2
> >>      42 7 != 3
> >>      43 8 != 4
> >>      44 5 != 1
> >>      45 6 != 2
> >>      46 7 != 3
> >>      47 8 != 4
> >>
> >> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> >> scribbled zero values over these (vector lane masking issue, perhaps?),
> >> or some other code generation issue?
> > 
> > So we're indeed BB vectorizing this to
> > 
> >    _54 = MEM <vector(4) int> [(int *)_14];
> >    vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
> > 7, 8 }, 115);
> >    MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;
> > 
> > I don't understand the assembly very well but it might be that
> > the mask computation for the .VCOND scribbles the mask used
> > to constrain operation to 4 lanes?
> > 
> > .L3:
> >          s_mov_b64       exec, 15
> >          v_add_co_u32    v4, s[22:23], s32, v3
> >          v_mov_b32       v5, s33
> >          v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
> >          flat_load_dword v7, v[4:5] offset:0
> >          s_waitcnt       0
> >          flat_load_dword v0, v[10:11] offset:0
> >          s_waitcnt       0
> >          flat_load_dword v6, v[8:9] offset:0
> >          s_waitcnt       0
> >          v_cmp_ne_u32    s[18:19], v7, 0
> >          v_cndmask_b32   v0, v6, v0, s[18:19]
> >          flat_store_dword        v[4:5], v0 offset:0
> >          s_add_i32       s12, s12, 1
> >          s_add_u32       s32, s32, s28
> >          s_addc_u32      s33, s33, s29
> >          s_cmp_lg_u32    s12, s13
> >          s_cbranch_scc1  .L3
> 
> This basic block has EXEC set to 15 (4 lanes) throughout. The mask for the
> VCOND a.k.a. v_cndmask_b32 is in s[18:19]. Those things seem OK.
> 
> I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even though
> it would be a no-op conversion, because the general case requires a permute
> instruction and named pattern insns can't have non-constant conditions. Is
> vec_extract allowed to FAIL? That might give a better result in this case.
> 
> However, I must be doing something different because vect/bb-slp-cond-1.c
> passes for me, on gfx1100.
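
[Illustration, not part of the original messages.] The lane-masking
behaviour discussed above can be modelled in plain C. This is only a
sketch under stated assumptions: a 64-lane wavefront, a 64-bit EXEC
mask, and a hypothetical helper name 'masked_select_store' that does
not exist in GCC or in the test case. It mimics the v_cmp_ne_u32 /
v_cndmask_b32 / store sequence from '.L3' and shows that, with EXEC
set to 15, only lanes 0..3 should ever be read or written.

```c
#include <stdint.h>

/* Assumption: one wavefront has 64 lanes and EXEC is a 64-bit mask. */
#define WAVE_LANES 64

/* Hypothetical model of one iteration of the .L3 block.
   'mem' stands for the vector at _14, 'if_nonzero' for { 1, 2, 3, 4 }
   and 'if_zero' for { 5, 6, 7, 8 }.  Lanes whose EXEC bit is clear
   are neither read nor written.  */
void masked_select_store(uint64_t exec,
                         const int32_t *if_nonzero,
                         const int32_t *if_zero,
                         int32_t *mem)
{
    uint64_t cmp_mask = 0;

    /* Like v_cmp_ne_u32: build a per-lane mask of "value != 0",
       considering active lanes only.  */
    for (int lane = 0; lane < WAVE_LANES; lane++)
        if (((exec >> lane) & 1) && mem[lane] != 0)
            cmp_mask |= (uint64_t)1 << lane;

    /* Like v_cndmask_b32 followed by the store: pick per lane,
       then write back active lanes only.  */
    for (int lane = 0; lane < WAVE_LANES; lane++)
        if ((exec >> lane) & 1)
            mem[lane] = ((cmp_mask >> lane) & 1) ? if_nonzero[lane]
                                                 : if_zero[lane];
}
```

Calling it with exec = 15 on an array holding { 7, 0, -1, 0, ... }
rewrites only the first four elements; anything beyond lane 3 keeps
its old value. If the real hardware sequence touched more lanes than
that, neighbouring elements would be overwritten, which is the kind
of symptom speculated about earlier in the thread.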


I didn't try to run it. When I do 'make check-gcc', it fails to use
gcn-run for test invocation; what's the trick to make it do that?

Richard.
