https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116571
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86-64 with -march=cascadelake the issue for slp-12a.c is that the SLP
store is fed from lane permutation { 0[0] 0[1] 0[2] 0[3] 0[4] 0[5] 1[0] 1[1] }
which with 8 lane vectors runs into the three vector permutation limit:

gcc.dg/vect/slp-12a.c:16:17: note:   vectorizing permutation op0[0] op0[1] op0[2] op0[3] op0[4] op0[5] op1[0] op1[1]
gcc.dg/vect/slp-12a.c:16:17: note:   as
    vops0[0][0] vops0[0][1] vops0[0][2] vops0[0][3] vops0[0][4] vops0[0][5] vops1[0][0] vops1[0][1],
    vops0[0][6] vops0[0][7] vops0[1][0] vops0[1][1] vops0[1][2] vops0[1][3] vops1[0][2] vops1[0][3],
    vops0[1][4] vops0[1][5] vops0[1][6] vops0[1][7] vops0[2][0] vops0[2][1] vops1[0][4] vops1[0][5],
    vops0[2][2] vops0[2][3] vops0[2][4] vops0[2][5] vops0[2][6] vops0[2][7] vops1[0][6] vops1[0][7],
    vops0[3][0] vops0[3][1] vops0[3][2] vops0[3][3] vops0[3][4] vops0[3][5] vops1[1][0] vops1[1][1],
    vops0[3][6] vops0[3][7] vops0[4][0] vops0[4][1] vops0[4][2] vops0[4][3] vops1[1][2] vops1[1][3],
    vops0[4][4] vops0[4][5] vops0[4][6] vops0[4][7] vops0[5][0] vops0[5][1] vops1[1][4] vops1[1][5],
    vops0[5][2] vops0[5][3] vops0[5][4] vops0[5][5] vops0[5][6] vops0[5][7] vops1[1][6] vops1[1][7]
gcc.dg/vect/slp-12a.c:16:17: missed:   permutation requires at least three vectors

we do not yet lower this when two sub-SLP graphs are fed into the store but
we'd have to - but I'm not sure we can in the end find a suitable way - at
least in a vector length agnostic way.

The testcase will "fix" itself when the failure to SLP with 8 element vectors
will not cause non-SLP to be used but instead we re-try with 4 elements.  Not
so for GCN of course.
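To make the "three vectors" diagnosis concrete, here is a small Python model of how the lane permutation unrolls over fixed-width vectors. It is only a sketch and not GCC's implementation: the function names and the `(operand, vector, lane)` tuple layout are made up for illustration, and it assumes op0 contributes 6 lanes and op1 contributes 2 lanes per scalar iteration, as the dump above suggests.

```python
from typing import Dict, List, Tuple

# (operand number, input vector index, lane within that vector)
Lane = Tuple[int, int, int]


def expand_permutation(perm: List[Tuple[int, int]],
                       group_sizes: Dict[int, int],
                       nunits: int,
                       n_iters: int) -> List[List[Lane]]:
    """Unroll a per-iteration lane permutation into output vectors.

    perm        -- list of (operand, lane) pairs for one scalar iteration
    group_sizes -- lanes each operand produces per scalar iteration (assumed)
    nunits      -- lanes per vector
    n_iters     -- number of scalar iterations to unroll
    """
    flat: List[Lane] = []
    for it in range(n_iters):
        for op, lane in perm:
            elem = it * group_sizes[op] + lane      # linear element index
            flat.append((op, elem // nunits, elem % nunits))
    return [flat[i:i + nunits] for i in range(0, len(flat), nunits)]


def inputs_per_output(vectors: List[List[Lane]]) -> List[int]:
    """Count the distinct input vectors each output vector draws from."""
    return [len({(op, vec) for op, vec, _ in v}) for v in vectors]


# { 0[0] 0[1] 0[2] 0[3] 0[4] 0[5] 1[0] 1[1] }
PERM = [(0, i) for i in range(6)] + [(1, 0), (1, 1)]
SIZES = {0: 6, 1: 2}

eight = expand_permutation(PERM, SIZES, nunits=8, n_iters=8)
four = expand_permutation(PERM, SIZES, nunits=4, n_iters=4)

print(inputs_per_output(eight))  # [2, 3, 3, 2, 2, 3, 3, 2]
print(inputs_per_output(four))   # [1, 2, 2, 2, 1, 2, 2, 2]
```

In this model the second output vector with 8 lanes already reads from vops0[0], vops0[1] and vops1[0], which is the three-input case the dump rejects, while with 4 lanes no output vector needs more than two inputs.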

Possibly for GCN the issue is the vect_strided8 which is implemented as

foreach N {2 3 4 5 6 7 8} {
    eval [string map [list N $N] {
        # Return 1 if the target supports 2-vector interleaving
        proc check_effective_target_vect_stridedN { } {
            return [check_cached_effective_target_indexed vect_stridedN {
                if { (N & -N) == N
                     && [check_effective_target_vect_interleave]
                     && [check_effective_target_vect_extract_even_odd] } {
                    return 1
                }
                if { ([istarget arm*-*-*]
                      || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
                    return 1
                }
                if { ([istarget riscv*-*-*]) && N >= 2 && N <= 8 } {
                    return 1
                }
                if [check_effective_target_vect_fully_masked] {
                    return 1
                }

not sure if gcn really supports a load/store-lane with 8 elements.
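
The decision logic in the quoted excerpt can be restated as a small Python function to see which branch lets vect_strided8 pass. This is only a sketch: the function name and the boolean parameters are hypothetical stand-ins for the effective-target checks, the values passed in are inputs and not facts about any real target, and the final `return False` is an assumption because the quoted excerpt stops before the fall-through case.

```python
from fnmatch import fnmatchcase


def vect_strided(n: int, triplet: str, interleave: bool,
                 extract_even_odd: bool, fully_masked: bool) -> bool:
    """Model of the quoted vect_stridedN checks for a given stride n."""
    # (n & -n) == n holds exactly when n is a power of two (for n > 0).
    if (n & -n) == n and interleave and extract_even_odd:
        return True
    # arm/aarch64 are accepted for strides 2..4.
    if (fnmatchcase(triplet, "arm*-*-*")
            or fnmatchcase(triplet, "aarch64*-*-*")) and 2 <= n <= 4:
        return True
    # riscv is accepted for strides 2..8.
    if fnmatchcase(triplet, "riscv*-*-*") and 2 <= n <= 8:
        return True
    if fully_masked:
        return True
    # Assumed fall-through; not shown in the quoted excerpt.
    return False


# Power-of-two strides pass via the interleave/even-odd branch alone.
print(vect_strided(8, "amdgcn-amdhsa", True, True, False))   # True
# A non-power-of-two stride needs one of the other branches.
print(vect_strided(6, "amdgcn-amdhsa", True, True, False))   # False
```

Note that none of the branches asks whether the target has an 8-element load/store-lane operation: in this model, stride 8 passes whenever interleave and extract-even/odd are both available, or the target is fully masked.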