https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
>
> --- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #6)
>
> > t.c:47:21: missed: the size of the group of accesses is not a power of 2
> > or not equal to 3
> > t.c:47:21: missed: not falling back to elementwise accesses
> > t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 =
> > *_3;
> > t.c:47:21: missed: bad operation or unsupported loop bound.
> >
> > where we don't consider using gather because we have a known constant
> > stride (20). Since the stores are really scatters we don't attempt
> > to SLP either.
> >
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
>
> I suppose you're referring to this?
>
>   /* FIXME: At the moment the cost model seems to underestimate the
>      cost of using elementwise accesses. This check preserves the
>      traditional behavior until that can be fixed. */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
>     first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>       && !STMT_VINFO_STRIDED_P (first_stmt_info)
>       && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>            && !DR_GROUP_NEXT_ELEMENT (stmt_info)
>            && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not falling back to elementwise accesses\n");
>       return false;
>     }
>
>
> I did some more tests on my laptop. As said above the whole loop in lbm is
> larger and contains two ifs. The first one prevents clang and GCC from
> vectorizing the loop, the second one
>
>   if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
>     ux = 0.005;
>     uy = 0.002;
>     uz = 0.000;
>   }
>
> seems to be if-converted? by clang or at least doesn't inhibit vectorization.
>
> Now if I comment out the first, larger if clang does vectorize the loop. With
> the return false commented out in the above GCC snippet GCC also vectorizes,
> but only when both ifs are commented out.
>
> Results (with both ifs commented out), -march=native (resulting in avx2), best
> of 3 as lbm is notoriously fickle:
>
> gcc trunk vanilla: 156.04s
> gcc trunk with elementwise: 132.10s
> clang 17: 143.06s
>
> Of course even the comment already said that costing is difficult and the
> change will surely cause regressions elsewhere. However the 15% improvement
> with vectorization (or the 9% improvement of clang) IMHO show that it's surely
> useful to look into this further. On top, the riscv clang seems to not care
> about the first if either and still vectorize. I haven't looked closer what
> happens there, though.

Yes. I think this shows we should remove the above hack and instead
try to fix the costing next stage1.
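
For readers without the lbm source at hand, the kind of access pattern
discussed above can be sketched roughly as follows. This is only an
illustration, not the actual lbm loop or the t.c testcase: the function
name strided_update, the macro N_CELL_ENTRIES and the arithmetic are
made up here, and only the constant stride of 20 is taken from the
thread.

```c
#include <stddef.h>

/* Assumption for illustration: each cell holds 20 doubles, which gives
   the known constant stride of 20 mentioned in the discussion.  */
#define N_CELL_ENTRIES 20

/* Hypothetical kernel.  Every iteration loads a few fields of one cell
   and stores a single field of the matching destination cell, so both
   loads and stores step through memory with a stride of 20 elements.
   To vectorize this, a compiler needs gather/scatter instructions or
   has to assemble vectors from scalar (elementwise) accesses.  */
void
strided_update (double *restrict dst, const double *restrict src,
                size_t n_cells)
{
  for (size_t i = 0; i < n_cells; i++)
    {
      double rho = src[i * N_CELL_ENTRIES + 0]
                   + src[i * N_CELL_ENTRIES + 1]
                   + src[i * N_CELL_ENTRIES + 2];
      dst[i * N_CELL_ENTRIES + 0] = rho * 0.5;
    }
}
```

Compiling such a file with something like
"gcc -O3 -march=native -fopt-info-vec-missed" prints the vectorizer's
missed-optimization notes. Whether this reduced sketch produces the same
"not falling back to elementwise accesses" message as the real testcase
has not been checked.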