Re: Question on tree LIM

Kewen.Lin via Gcc Sun, 04 Jul 2021 19:30:12 -0700

on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating a degradation in SPEC2017 exchange2_r: with
>>>> loop vectorization enabled at -O2, it degrades by 6%.  After some
>>>> isolation, I found it isn't directly caused by vectorization itself
>>>> but is exposed by it: some statements for the vectorization
>>>> condition checks are hoisted out of the loop and increase register
>>>> pressure, which finally results in more spilling than before.  If I
>>>> simply disable tree lim4, the gap becomes smaller (just 40%+ of
>>>> the original); if I further disable rtl lim, it drops to 30% of
>>>> the original.  This seems to indicate there is some room for
>>>> improvement in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, I noticed that there seems to be no
>>>> consideration of register pressure.  Is that intentional?  I am
>>>> wondering what the design philosophy behind it is.  Is it because
>>>> it's hard to model register pressure well here?  If so, it seems to
>>>> put the burden onto the later RA, which needs to have good
>>>> rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard on the GIMPLE level since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base to decide which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?  :-\
> 
> ;)  For example the LIM cost model makes it not hoist an invariant (int)x
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
>


Got it, thanks for further clarification. :)

>> There are some PRs (PR96825, PR98782) related to exchange2_r, which
>> seems to suffer from high register pressure and bad spilling.  Not sure
>> whether they are also somehow related to the pressure coming from LIM, but
>> the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
>> frequency.  Maybe it's worth revisiting this idea of considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Note most "problems", and those which are harder to undo, stem from
> LIM's store motion, which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
> 

Yeah, I noticed it at least excludes "cheap" ones.
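
To make the store-motion point above concrete, here is a toy Python model
(not GCC code; the function names are made up for illustration).  The first
function updates memory on every iteration.  The second is what the loop
looks like after the store is moved out: the value now lives in a local,
standing in for a register, that is carried from one iteration to the next,
which is the kind of loop-carried dependence that adds pressure inside the
loop.

```python
def accumulate_in_memory(mem, key, values):
    """Before store motion: load and store mem[key] on every iteration."""
    for v in values:
        mem[key] = mem[key] + v


def accumulate_with_store_motion(mem, key, values):
    """After store motion: one load before the loop, one store after it.

    'acc' is live across the whole loop and is carried between
    iterations (it would be a PHI node in SSA form).
    """
    acc = mem[key]          # load hoisted in front of the loop
    for v in values:
        acc = acc + v       # loop-carried: each iteration needs the last acc
    mem[key] = acc          # store sunk after the loop
```

Both versions leave the same value in memory; only the second keeps an extra
value live through the loop.  A real compiler additionally has to prove that
nothing else reads or writes the location inside the loop, which this sketch
ignores.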

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
> 
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
> 
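As a rough sketch of the hard limit on loop-carried dependences suggested
above, the selection could look like the Python below.  This is not how LIM
is implemented; the class, the function, the "one PHI per moved store"
simplification and the use of an estimated benefit (for example BB
frequency) as the ordering key are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreMotionCandidate:
    name: str
    est_benefit: float  # hypothetical ranking key, e.g. BB frequency


def select_store_motion(candidates, existing_phis, max_phis):
    """Pick candidates without exceeding a cap on loop-carried PHIs.

    The cap counts PHIs the loop already has plus the ones added here,
    assuming each moved store adds exactly one PHI.
    """
    budget = max(0, max_phis - existing_phis)
    ranked = sorted(candidates, key=lambda c: c.est_benefit, reverse=True)
    return ranked[:budget]
```

Counting the existing PHIs means a loop that is already over the cap gets no
store motion at all, whichever of the several LIM/store-motion passes happens
to look at it.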

Thanks for all the notes and thoughts.  I had better look at RA remat
first.  Xionghu has some interest in investigating how to consider BB freq
in LIM; I will check its effect and look further into these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> Brand new idea!  IIUC it requires a global scheduler, and I'm not sure how
>> well GCC's global scheduler performs.  Generally speaking, the
>> register-pressure aware scheduler prefers the insn which has more register
>> deaths (for the pressure-intensive regclass).  For this problem the modeling
>> seems a bit different: it has to care about the total interference numbers
>> between two "equivalent" blocks (src/dest).  Not sure if it's easier to do
>> than rematerialization.
> 
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
> 
>>>> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>>>>
>>>>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>>>
>>>> The extra hoisted statements after the vectorization on this loop
>>>> (cheap cost model btw) are:
>>>>
>>>>     _686 = (integer(kind=8)) rnext_679;
>>>>     _1111 = (sizetype) _19;
>>>>     _1112 = _1111 * 12;
>>>>     _1927 = _1112 + 12;
>>>>   * _1895 = _1927 - _2650;
>>>>     _1113 = (unsigned long) rnext_679;
>>>>   * niters.6220_1128 = 10 - _1113;
>>>>   * _1021 = 9 - _1113;
>>>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>>>     tmp.6223_934 = (integer(kind=8)) _144;
>>>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>>>
>>>> PS: * indicates the one has a long live interval.
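
For readers following the arithmetic in the hoisted statements above, here
is a small Python sketch that replays it for concrete values of rnext.  It is
only an illustration, not GCC output.  The function name, the reading of the
vectorization factor as 4 (inferred from the ">> 2" and from the mask
18446744073709551612, which is ~3 in 64 bits) and the interpretation of
S.823 as the index where the scalar loop resumes are assumptions.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wrap-around
VF = 4                  # assumed vectorization factor


def hoisted_values(rnext):
    """Replay the hoisted GIMPLE statements for a given rnext."""
    r = rnext & MASK64                          # _1113
    niters = (10 - r) & MASK64                  # niters.6220_1128
    niters_m1 = (9 - r) & MASK64                # _1021
    bnd = niters >> 2                           # bnd.6221_940
    niters_vf = niters & (~(VF - 1) & MASK64)   # niters_vector_mult_vf
    resume = (niters_vf + r) & MASK64           # _144
    # S.823: with too few iterations start at rnext, else after the
    # vectorized part (assumed meaning).
    start = rnext if niters_m1 <= 2 else resume
    return niters, bnd, niters_vf, start
```

With rnext = 3 the loop has 7 scalar iterations (indices 3 to 9): one vector
iteration covers 4 of them and the remaining 3 start at index 7.  Note that
niters, bnd and niters_vf are all derived from the same input, which is why
keeping each of them live across the loop nest is costly.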
>>>
>>> Note for the vectorizer generated conditions there's quite some room for
>>> improvements to reduce the amount of semi-redundant computations.  I've
>>> pointed out some to Andre, in particular suggesting to maintain a single
>>> "remaining scalar iterations" IV across all the checks to avoid keeping
>>> 'niters' live and doing all the above masking & shifting repeatedly before
>>> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
>>> he got with that idea.
>>>
>>
>> Great, it definitely helps to mitigate this problem.  Thanks for the 
>> information.
>>
>>
>> BR,
>> Kewen
