On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
>
> I am investigating one degradation related to SPEC2017 exchange2_r,
> with loop vectorization on at -O2, it degraded by 6%.  By some
> isolation, I found it isn't directly caused by vectorization itself,
> but exposed by vectorization, some stuffs for vectorization
> condition checks are hoisted out and they increase the register
> pressure, finally results in more spillings than before.  If I simply
> disable tree lim4, I can see the gap becomes smaller (just 40%+ of
> the original), if further disable rtl lim, it just becomes to 30% of
> the original.  It seems to indicate there is some room to improve in
> both LIMs.
>
> By quick scanning in tree LIM, I noticed that there seems no any
> considerations on register pressure, it looked intentional?  I am
> wondering what's the design philosophy behind it?  Is it because that
> it's hard to model register pressure well here?  If so, it seems to
> put the burden onto late RA, which needs to have a good
> rematerialization support.

Yes, it is "intentional" in that doing any kind of prioritization
based on register pressure is hard on the GIMPLE level, since most
high-level transforms try to expose followup transforms which you'd
somehow have to anticipate.  Note that LIM's "cost model" (if you can
call it such...) is too simplistic to be a good base to decide which
10 of the 20 candidates you want to move (and I've repeatedly pondered
removing it completely).

As to putting the burden on RA - yes, that's one possibility.  The
other possibility is to use the register-pressure aware scheduler,
though I'm not sure if that will ever move things into loop bodies.

> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>
> 1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>
> The extra hoisted statements after the vectorization on this loop
> (cheap cost model btw) are:
>
>     _686 = (integer(kind=8)) rnext_679;
>     _1111 = (sizetype) _19;
>     _1112 = _1111 * 12;
>     _1927 = _1112 + 12;
>   * _1895 = _1927 - _2650;
>     _1113 = (unsigned long) rnext_679;
>   * niters.6220_1128 = 10 - _1113;
>   * _1021 = 9 - _1113;
>   * bnd.6221_940 = niters.6220_1128 >> 2;
>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>     tmp.6223_934 = (integer(kind=8)) _144;
>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>
> PS: * indicates the one has a long live interval.

Note that for the vectorizer-generated conditions there's quite some
room for improvement to reduce the amount of semi-redundant
computations.  I've pointed out some to Andre, in particular
suggesting to maintain a single "remaining scalar iterations" IV
across all the checks, to avoid keeping 'niters' live and doing all
the above masking & shifting repeatedly before the
prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
he got with that idea.

Richard.

>
> BR,
> Kewen
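
To make the "single remaining scalar iterations IV" suggestion more
concrete, here is a minimal C sketch.  It is a hand-written model and
not what the vectorizer emits.  The function names, the `LAST` and `VF`
constants and the flat `col` array are assumptions for illustration;
the real loop is a strided Fortran array section, and prologue peeling
and the vectorized epilogue are left out.  A vectorization factor of 4
is assumed because it matches the `>> 2` and the `& 18446744073709551612`
(that is, `& ~3`) in the dump above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: section is col[rnext..9], VF = 4. */
enum { VF = 4, LAST = 9 };

/* Shape A: niters and the values derived from it (bnd,
   niters_vector_mult_vf) are all computed up front and stay live
   until the scalar epilogue has run. */
void add10_split_counts(int *col, size_t rnext)
{
    assert(rnext >= 1 && rnext <= LAST);
    size_t niters = (LAST + 1) - rnext;                 /* 10 - rnext */
    size_t bnd = niters >> 2;                           /* vector iterations */
    size_t niters_vector_mult_vf = niters & ~(size_t)3; /* niters rounded down to VF */
    size_t i = rnext;

    for (size_t v = 0; v < bnd; v++, i += VF)
        for (size_t k = 0; k < VF; k++)  /* stands in for one vector op */
            col[i + k] += 10;

    for (size_t done = niters_vector_mult_vf; done < niters; done++, i++)
        col[i] += 10;                    /* scalar epilogue */
}

/* Shape B: one "remaining scalar iterations" counter is threaded
   through both loops, so no separate bnd or masked count is needed. */
void add10_single_remaining(int *col, size_t rnext)
{
    assert(rnext >= 1 && rnext <= LAST);
    size_t remaining = (LAST + 1) - rnext;
    size_t i = rnext;

    for (; remaining >= VF; remaining -= VF, i += VF)
        for (size_t k = 0; k < VF; k++)  /* stands in for one vector op */
            col[i + k] += 10;

    for (; remaining > 0; remaining--, i++)
        col[i] += 10;                    /* scalar epilogue */
}
```

Both shapes update the same elements.  The difference is only in how
many loop-invariant values have to be kept alive around the loops,
which is what shows up as extra register pressure once they are
hoisted.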