On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> I am investigating a degradation in SPEC2017 exchange2_r: with loop
> vectorization enabled at -O2, it degrades by 6%.  After some
> isolation, I found it isn't directly caused by vectorization itself
> but is exposed by it: some statements for the vectorization
> condition checks are hoisted out and they increase the register
> pressure, which finally results in more spilling than before.  If I
> simply disable tree lim4, the gap becomes smaller (just 40%+ of
> the original); if I further disable rtl lim, it drops to 30% of
> the original.  This seems to indicate there is some room for
> improvement in both LIMs.
>
> From a quick scan of tree LIM, I noticed that it does not seem to
> consider register pressure at all.  Is that intentional?  I am
> wondering what the design philosophy behind it is.  Is it because
> it's hard to model register pressure well here?  If so, it seems to
> put the burden onto the later RA, which needs to have good
> rematerialization support.

Yes, it is "intentional" in that doing any kind of prioritization based
on register pressure is hard at the GIMPLE level, since most
high-level transforms try to expose follow-up transforms which you'd
somehow have to anticipate.  Note that LIM's "cost model" (if you can
call it such...) is too simplistic to be a good basis for deciding which
10 of the 20 candidates you want to move (and I've repeatedly pondered
removing it completely).

As to putting the burden on RA - yes, that's one possibility.  The other
possibility is to use the register-pressure aware scheduler, though I'm
not sure whether it will ever move things into loop bodies.
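
To make the trade-off concrete, here is a hand-written C sketch (not
taken from exchange2 or from GCC; the function names and the constants
are made up for illustration).  The first function is what hoisting
gives you: three derived values stay live across the whole loop in
addition to 'base'.  The second is roughly what rematerialization would
amount to: only 'base' has to stay live and the cheap derived values
are recomputed where they are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example.  Hoisted form: k1, k2 and k3 are computed once,
   but all three stay live across every iteration of the loop.  */
uint64_t sum_hoisted(const uint64_t *a, size_t n, uint64_t base)
{
    assert(a != NULL || n == 0);
    uint64_t k1 = base + 12;
    uint64_t k2 = base * 12;
    uint64_t k3 = k2 + 12;
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * k1 + k2 + k3;
    return sum;
}

/* Rematerialized form: only 'base' is live across iterations; the
   derived values are recomputed at their point of use.  */
uint64_t sum_remat(const uint64_t *a, size_t n, uint64_t base)
{
    assert(a != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t k2 = base * 12;
        sum += a[i] * (base + 12) + k2 + (k2 + 12);
    }
    return sum;
}
```

Both functions compute the same result; which form is faster depends on
how many registers are free around the loop, which is exactly what is
hard to know at the GIMPLE level.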

> btw, the example loop is at line 1150 of the source file exchange2.fppized.f90
>
>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>
> The extra hoisted statements after the vectorization on this loop
> (cheap cost model btw) are:
>
>     _686 = (integer(kind=8)) rnext_679;
>     _1111 = (sizetype) _19;
>     _1112 = _1111 * 12;
>     _1927 = _1112 + 12;
>   * _1895 = _1927 - _2650;
>     _1113 = (unsigned long) rnext_679;
>   * niters.6220_1128 = 10 - _1113;
>   * _1021 = 9 - _1113;
>   * bnd.6221_940 = niters.6220_1128 >> 2;
>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>     tmp.6223_934 = (integer(kind=8)) _144;
>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>
> PS: * marks the statements whose results have a long live interval.

Note that for the vectorizer-generated conditions there's quite some room
for improvement in reducing the amount of semi-redundant computation.  I've
pointed out some of it to Andre, in particular suggesting that we maintain
a single "remaining scalar iterations" IV across all the checks, to avoid
keeping 'niters' live and doing all the above masking & shifting repeatedly
before the prologue/main/vectorized epilogue/epilogue loops.  Not sure how
far he got with that idea.
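
As a rough illustration only (plain C, not vectorizer code; the struct
and function names are hypothetical, and VF = 4 is inferred from the
">> 2" and "& 18446744073709551612" in the dump), the first function
below mirrors the hoisted statements, where niters, bnd and
niters_vector_mult_vf are separate values that can all be live at once.
The second sketches the single-counter idea: one 'remaining' value is
decremented as iterations are consumed, and each loop only compares
against it.

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 4 };  /* assumed vectorization factor */

typedef struct {
    uint64_t niters;                 /* 10 - rnext */
    uint64_t bnd;                    /* niters >> 2 */
    uint64_t niters_vector_mult_vf;  /* niters & ~3 */
} split_counts;

/* Mirrors the hoisted statements: three separate derived values.  */
split_counts split_like_dump(uint64_t rnext)
{
    assert(rnext <= 10);
    split_counts s;
    s.niters = 10 - rnext;
    s.bnd = s.niters >> 2;
    s.niters_vector_mult_vf = s.niters & ~(uint64_t)(VF - 1);
    return s;
}

typedef struct {
    uint64_t vector_iters;  /* iterations run by the vector loop */
    uint64_t scalar_iters;  /* iterations left for the scalar epilogue */
} loop_plan;

/* Single "remaining scalar iterations" counter shared by all loops.  */
loop_plan plan_with_remaining(uint64_t rnext)
{
    assert(rnext <= 10);
    uint64_t remaining = 10 - rnext;
    loop_plan p = { 0, 0 };
    while (remaining >= VF) {   /* main vector loop */
        remaining -= VF;
        p.vector_iters++;
    }
    p.scalar_iters = remaining; /* epilogue consumes the rest */
    return p;
}
```

Both forms agree on the split (vector_iters equals bnd, and
vector_iters * VF equals niters_vector_mult_vf), but the second needs
only one value carried across the checks.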

Richard.

>
> BR,
> Kewen
