https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #6)
> 
> > t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> > or not equal to 3
> > t.c:47:21: missed:   not falling back to elementwise accesses
> > t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> > *_3;
> > t.c:47:21: missed:  bad operation or unsupported loop bound.
> > 
> > where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters, we don't attempt
> SLP either.
> > 
> Disabling the above heuristic, we get this vectorized as well, avoiding
> gather/scatter by implementing them manually and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).
> 
> I suppose you're referring to this?
> 
>   /* FIXME: At the moment the cost model seems to underestimate the
>      cost of using elementwise accesses.  This check preserves the
>      traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
>     first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>       && !STMT_VINFO_STRIDED_P (first_stmt_info)
>       && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>            && !DR_GROUP_NEXT_ELEMENT (stmt_info)
>            && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not falling back to elementwise accesses\n");
>       return false;
>     }
> 
> 
> I did some more tests on my laptop.  As said above, the whole loop in lbm is
> larger and contains two ifs.  The first one prevents both clang and GCC from
> vectorizing the loop; the second one
> 
>                 if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
>                         ux = 0.005;
>                         uy = 0.002;
>                         uz = 0.000;
>                 }
> 
> seems to be if-converted by clang, or at least doesn't inhibit vectorization.
> 
> Now if I comment out the first, larger if, clang does vectorize the loop.  With
> the return false commented out in the above GCC snippet, GCC also vectorizes,
> but only when both ifs are commented out.
> 
> Results (with both ifs commented out), -march=native (resulting in avx2), best
> of 3 as lbm is notoriously fickle:
> 
> gcc trunk vanilla: 156.04s
> gcc trunk with elementwise: 132.10s
> clang 17: 143.06s
> 
> Of course, even the comment already said that costing is difficult and the
> change will surely cause regressions elsewhere.  However, the 15% improvement
> from vectorization (or the 9% improvement of clang) IMHO shows that it's
> worth looking into this further.  On top of that, the riscv clang seems not
> to care about the first if either and still vectorizes.  I haven't looked
> closely at what happens there, though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.
