https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116571

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86-64 with -march=cascadelake the issue for slp-12a.c is that the SLP
store is fed from

   lane permutation { 0[0] 0[1] 0[2] 0[3] 0[4] 0[5] 1[0] 1[1] }

which, with 8-lane vectors, runs into the three-vector permutation limit:

gcc.dg/vect/slp-12a.c:16:17: note:   vectorizing permutation op0[0] op0[1]
op0[2] op0[3] op0[4] op0[5] op1[0] op1[1]
gcc.dg/vect/slp-12a.c:16:17: note:   as vops0[0][0] vops0[0][1] vops0[0][2]
vops0[0][3] vops0[0][4] vops0[0][5] vops1[0][0] vops1[0][1], vops0[0][6]
vops0[0][7] vops0[1][0] vops0[1][1] vops0[1][2] vops0[1][3] vops1[0][2]
vops1[0][3], vops0[1][4] vops0[1][5] vops0[1][6] vops0[1][7] vops0[2][0]
vops0[2][1] vops1[0][4] vops1[0][5], vops0[2][2] vops0[2][3] vops0[2][4]
vops0[2][5] vops0[2][6] vops0[2][7] vops1[0][6] vops1[0][7], vops0[3][0]
vops0[3][1] vops0[3][2] vops0[3][3] vops0[3][4] vops0[3][5] vops1[1][0]
vops1[1][1], vops0[3][6] vops0[3][7] vops0[4][0] vops0[4][1] vops0[4][2]
vops0[4][3] vops1[1][2] vops1[1][3], vops0[4][4] vops0[4][5] vops0[4][6]
vops0[4][7] vops0[5][0] vops0[5][1] vops1[1][4] vops1[1][5], vops0[5][2]
vops0[5][3] vops0[5][4] vops0[5][5] vops0[5][6] vops0[5][7] vops1[1][6]
vops1[1][7]
gcc.dg/vect/slp-12a.c:16:17: missed:   permutation requires at least three
vectors

We do not yet lower this when two sub-SLP graphs are fed into the store,
but we would have to. I'm not sure we can find a suitable way in the end,
at least not a vector-length-agnostic one.

The testcase will "fix" itself once a failure to SLP with 8-element
vectors no longer causes non-SLP to be used and we instead re-try with
4-element vectors.

Not so for GCN of course.
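
To make the lane bookkeeping above easier to follow, here is a rough
Python sketch (not vectorizer code) that counts how many distinct input
vectors each output vector draws from. The function name, the
(operand, lane) encoding, and the choice of 8 scalar iterations are
assumptions made for illustration; operand widths are inferred as
"highest lane used + 1", which matches the dump (6 lanes for op0, 2 for
op1).

```python
def source_vectors(perm, nlanes, iterations):
    """Return, per output vector, the set of (operand, vector index) inputs.

    perm is the per-iteration lane permutation as (operand, lane) pairs.
    Hypothetical helper for illustration only.
    """
    # Assumption: each operand supplies (highest lane used + 1) scalars
    # per scalar iteration.
    width = {}
    for op, lane in perm:
        width[op] = max(width.get(op, 0), lane + 1)

    # Flatten the permutation over all iterations into
    # (operand, input vector index, lane within that vector).
    flat = []
    for it in range(iterations):
        for op, lane in perm:
            scalar = it * width[op] + lane
            flat.append((op, scalar // nlanes, scalar % nlanes))

    # Chop the flat stream into output vectors and collect the distinct
    # input vectors each one reads from.
    return [
        {(op, vec) for op, vec, _ in flat[i:i + nlanes]}
        for i in range(0, len(flat), nlanes)
    ]


# { 0[0] 0[1] 0[2] 0[3] 0[4] 0[5] 1[0] 1[1] }
PERM = [(0, i) for i in range(6)] + [(1, 0), (1, 1)]

counts8 = [len(s) for s in source_vectors(PERM, 8, 8)]
counts4 = [len(s) for s in source_vectors(PERM, 4, 8)]
print("8 lanes:", counts8)
print("4 lanes:", counts4)
```

Under these assumptions the 8-lane case gives 2, 3, 3, 2, 2, 3, 3, 2
input vectors for the eight output vectors, consistent with the dump,
while the 4-lane case never needs more than two.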

Possibly for GCN the issue is the vect_strided8 which is implemented as

foreach N {2 3 4 5 6 7 8} {
    eval [string map [list N $N] {
        # Return 1 if the target supports 2-vector interleaving
        proc check_effective_target_vect_stridedN { } {
            return [check_cached_effective_target_indexed vect_stridedN {
                if { (N & -N) == N
                     && [check_effective_target_vect_interleave]
                     && [check_effective_target_vect_extract_even_odd] } {
                    return 1
                }
                if { ([istarget arm*-*-*]
                      || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
                    return 1
                }
                if { ([istarget riscv*-*-*]) && N >= 2 && N <= 8 } {
                    return 1
                }
                if [check_effective_target_vect_fully_masked] {
                    return 1
                }

Not sure if GCN really supports a load/store-lane with 8 elements.
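
For reference, the (N & -N) == N condition in the quoted proc is a
power-of-two test, so within N = 2..8 the first branch can only apply to
2, 4 and 8, and then only if the target also claims vect_interleave and
vect_extract_even_odd. A minimal Python sketch of just that arithmetic
(the helper name is made up, and none of the target checks are modelled):

```python
def is_power_of_two(n):
    # n & -n isolates the lowest set bit; it equals n only when exactly
    # one bit is set.
    return n > 0 and (n & -n) == n


# Values of N from the foreach list that satisfy the first condition.
passing = [n for n in range(2, 9) if is_power_of_two(n)]
print(passing)
```

So vect_strided8 can be reported as supported purely through the
interleave/extract-even-odd branch or the fully-masked branch, without
any target-specific statement about 8-element load/store-lanes.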
