on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating one degradation related to SPEC2017 exchange2_r:
>>>> with loop vectorization on at -O2, it degraded by 6%.  By some
>>>> isolation, I found it isn't directly caused by vectorization itself
>>>> but exposed by it: some statements for the vectorization condition
>>>> checks are hoisted out and they increase the register pressure,
>>>> which finally results in more spilling than before.  If I simply
>>>> disable tree lim4, I can see the gap becomes smaller (just 40%+ of
>>>> the original); if I further disable rtl lim, it drops to 30% of
>>>> the original.  This seems to indicate there is some room to improve
>>>> in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, I noticed that there seems to be no
>>>> consideration of register pressure; is that intentional?  I am
>>>> wondering what the design philosophy behind it is.  Is it because
>>>> it's hard to model register pressure well here?  If so, it seems to
>>>> put the burden onto late RA, which needs to have good
>>>> rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard on the GIMPLE level since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base to decide which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?
>> :-\
>
> ;) For example the LIM cost model makes it not hoist an invariant (int)x
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
>
Got it, thanks for the further clarification. :)

>> Here are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling.  Not sure
>> whether they are also somehow related to the pressure coming from LIM,
>> but the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92 which adjusts prediction
>> frequency, so maybe it's worth revisiting this idea about considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Note most "problems", and those which are harder to undo, stem from
> LIM's store-motion, which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
>

Yeah, I noticed it at least excludes "cheap" ones.

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop-carried dependences (PHIs), counting both
> existing and added ones (to avoid the surprise).
>
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms, but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area, I think.
>
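
As a rough illustration of the cap idea quoted above (order candidates by
BB frequency, then move at most a fixed number of them), here is a minimal
sketch in Python.  Everything in it is hypothetical: `Candidate`,
`bb_frequency`, `select_candidates` and `max_refs` are made-up names, not
GCC internals or an existing --param, and the sketch ignores the loop-nest
and dependent-transform issues raised above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one invariant ref that LIM could move."""
    name: str
    bb_frequency: float  # assumed: estimated execution frequency of its block


def select_candidates(candidates, max_refs):
    """Keep at most max_refs candidates, hottest blocks first.

    sorted() is stable (also with reverse=True), so candidates with equal
    frequency keep their original relative order.
    """
    if max_refs < 0:
        raise ValueError("max_refs must be non-negative")
    ranked = sorted(candidates, key=lambda c: c.bb_frequency, reverse=True)
    return ranked[:max_refs]
```

With a cap of 2, only the two candidates from the hottest blocks would be
moved and the rest would stay inside the loop.
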
Thanks for all the notes and thoughts.  I had better look at RA remat
first.  Xionghu is interested in investigating how to consider BB
frequency in the LIMs; I will check its effect and then look further into
these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> Brand new idea!  IIUC it requires a global scheduler; not sure how well
>> the GCC global scheduler performs.  Generally speaking, the
>> register-pressure aware scheduler will prefer the insn which has more
>> deads (for that intensive regclass).  For this problem the modeling
>> seems a bit different: it has to care about the total number of
>> interferences between two "equivalent" blocks (src/dest), and I am not
>> sure if that's easier to do than rematerialization.
>
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
>
>>>> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>>>>
>>>> 1150    block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>>>
>>>> The extra hoisted statements after the vectorization on this loop
>>>> (cheap cost model btw) are:
>>>>
>>>>     _686 = (integer(kind=8)) rnext_679;
>>>>     _1111 = (sizetype) _19;
>>>>     _1112 = _1111 * 12;
>>>>     _1927 = _1112 + 12;
>>>>   * _1895 = _1927 - _2650;
>>>>     _1113 = (unsigned long) rnext_679;
>>>>   * niters.6220_1128 = 10 - _1113;
>>>>   * _1021 = 9 - _1113;
>>>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>>>     tmp.6223_934 = (integer(kind=8)) _144;
>>>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>>>
>>>> PS: * indicates the one has a long live interval.
>>>
>>> Note for the vectorizer generated conditions there's quite some room for
>>> improvements to reduce the amount of semi-redundant computations.  I've
>>> pointed out some to Andre, in particular suggesting to maintain a single
>>> "remaining scalar iterations" IV across all the checks to avoid keeping
>>> 'niters' live and doing all the above masking & shifting repeatedly before
>>> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
>>> he got with that idea.
>>>
>>
>> Great, it definitely helps to mitigate this problem.  Thanks for the
>> information.
>>
>>
>> BR,
>> Kewen
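
To make the hoisted trip-count arithmetic quoted above easier to follow,
here is a small Python replay of those statements.  It is only one reading
of the dump, not vectorizer code: the function and key names are made up,
a vectorization factor of 4 is inferred from the `>> 2` and the
`& 18446744073709551612` mask, 64-bit unsigned wraparound is emulated with
a mask, and the signed (integer(kind=8)) casts are ignored because the
values involved are small.

```python
MASK64 = (1 << 64) - 1  # emulate 'unsigned long' wraparound (assumed 64-bit)


def hoisted_values(rnext):
    """Replay the hoisted niters computations for the loop over rnext:9."""
    r = rnext & MASK64                              # _1113
    niters = (10 - r) & MASK64                      # niters.6220_1128
    niters_m1 = (9 - r) & MASK64                    # _1021
    bnd = niters >> 2                               # bnd.6221_940
    niters_mult_vf = niters & 18446744073709551612  # niters_vector_mult_vf.6222_939
    # S.823_1004: when _1021 <= 2 the dump selects rnext itself, otherwise
    # rnext plus the iterations rounded down to a multiple of 4.
    scalar_start = r if niters_m1 <= 2 else (niters_mult_vf + r) & MASK64
    return {
        "niters": niters,
        "niters_m1": niters_m1,
        "bnd": bnd,
        "niters_mult_vf": niters_mult_vf,
        "scalar_start": scalar_start,
    }
```

For rnext = 1 this gives niters = 9, bnd = 2, niters_vector_mult_vf = 8
and a scalar start of 9; for rnext = 7 it gives niters = 3, bnd = 0 and a
scalar start of 7.  All of these values are derived from niters, which is
why a single "remaining scalar iterations" IV could replace several of
them.
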