On Fri, Nov 29, 2024 at 2:30 AM Hongtao Liu <crazy...@gmail.com> wrote: > > On Thu, Nov 28, 2024 at 4:57 PM Richard Biener > <richard.guent...@gmail.com> wrote: > > > > On Thu, Nov 28, 2024 at 3:04 AM Hongtao Liu <crazy...@gmail.com> wrote: > > > > > > On Wed, Nov 27, 2024 at 9:43 PM Richard Biener > > > <richard.guent...@gmail.com> wrote: > > > > > > > > On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao....@intel.com> wrote: > > > > > > > > > > When loop requires any kind of versioning which could increase > > > > > register > > > > > pressure too much, and it's in a deeply nest big loop, don't do > > > > > vectorization. > > > > > > > > > > I tested the patch with both Ofast and O2 for SPEC2017, besides > > > > > 548.exchange_r, > > > > > other benchmarks are same binary. > > > > > > > > > > Bootstrapped and regtested 0on x86_64-pc-linux-gnu{-m32,} > > > > > Any comments? > > > > > > > > The vectorizer tries to version an outer loop when vectorizing a loop > > > > nest > > > > and the versioning condition is invariant. See vect_loop_versioning. > > > > This > > > > tries to handle such cases. Often the generated runtime alias checks > > > > are > > > > not invariant because we do not consider the outer evolutions. I think > > > > we > > > > should instead fix this there. > > > > > > > > Question below ... > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > pr target/117088 > > > > > * config/i386/i386.cc > > > > > (ix86_vector_costs::ix86_vect_in_deep_nested_loop_p): New > > > > > function. > > > > > (ix86_vector_costs::finish_cost): Prevent loop vectorization > > > > > if it's in a deeply nested loop and require versioning. > > > > > * config/i386/i386.opt (--param=vect-max-loop-depth=): New > > > > > param. > > > > > --- > > > > > gcc/config/i386/i386.cc | 89 > > > > > ++++++++++++++++++++++++++++++++++++++++ > > > > > gcc/config/i386/i386.opt | 4 ++ > > > > > 2 files changed, 93 insertions(+) > > > > > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > > > index 526c9df7618..608f40413d2 100644 > > > > > --- a/gcc/config/i386/i386.cc > > > > > +++ b/gcc/config/i386/i386.cc > > > > > @@ -25019,6 +25019,8 @@ private: > > > > > > > > > > /* Estimate register pressure of the vectorized code. */ > > > > > void ix86_vect_estimate_reg_pressure (); > > > > > + /* Check if vect_loop is in a deeply-nested loop. */ > > > > > + bool ix86_vect_in_deep_nested_loop_p (class loop *vect_loop); > > > > > /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's > > > > > used for > > > > > estimation of register pressure. > > > > > ??? Currently it's only used by vec_construct/scalar_to_vec > > > > > @@ -25324,6 +25326,84 @@ > > > > > ix86_vector_costs::ix86_vect_estimate_reg_pressure () > > > > > } > > > > > } > > > > > > > > > > +/* Return true if vect_loop is in a deeply-nested loop. > > > > > + .i.e vect_loop_n in below loop structure. > > > > > +loop1 > > > > > +{ > > > > > + loop2 > > > > > + { > > > > > + loop3 > > > > > + { > > > > > + vect_loop_1; > > > > > + loop4 > > > > > + { > > > > > + vect_loop_2; > > > > > + loop5 > > > > > + { > > > > > + vect_loop_3; > > > > > + loop6 > > > > > + { > > > > > + vect_loop_4; > > > > > + loop7 > > > > > + { > > > > > + vect_loop_5; > > > > > + loop8 > > > > > + { > > > > > + loop9 > > > > > + } > > > > > + vect_loop_6; > > > > > + } > > > > > + vect_loop_7; > > > > > + } > > > > > + } > > > > > + } > > > > > + } > > > > > + It's a big hammer to fix O2 regression for 548.exchange_r after > > > > > vectorization > > > > > + is enhanced by (r15-4225-g70c3db511ba14f) */ > > > > > +bool > > > > > +ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop > > > > > *vect_loop) > > > > > +{ > > > > > + if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth) > > > > > + return true; > > > > > + > > > > > + if (loop_depth (vect_loop) < 2) > > > > > + return false; > > > > > + > > > > > > > > while the above two are "obvious", what you check below isn't clear to > > > > me. > > > > Is this trying to compute whether 'vect_loop' is inside of a loop nest > > > > which > > > > at any sibling of vect_loop (or even sibling of an outer loop of > > > > vect_loop, > > > > recursively) is a sub-nest with a loop depth (relative to what?) exceeds > > > > ix86_vect_max_loop_depth? > > > Yes, the function tries to find if the vect_loop is in a "big outer > > > loop" which contains an innermost loop with loop_depth > > > > ix86_vect_max_loop_depth. > > > If yes, then prevent vectorization for the loop if its tripcount is > > > not constant VF-times.(requires any kind of versioning is not > > > accurate, and yes it's a big hammer.) > > > > I'll note it also doesn't seem to look at register pressure at all or limit > > the cut-off to the very-cheap cost model? > The default parameter ix86_vect_max_loop_depth implies the register > pressure, for each layer of loop, it generally needs 2 registers: 1 iv > + 1 tripcount. > ix86_vect_max_loop_depth > 8 will run out of 16 registers. The > vectoriation for unknown tripcountl increases 1 register for the "new > tripcount of main vectorized loop", and it causes extra spill in the > outer loop.
Hmm, yeah - one of my long-term TODO is to re-do how the vectorizer generates the main IV code to use a decrement to zero remaining scalar iterations for prologue/vector/epilogue loops. The disadvantage is we'd get a decrement by VF and test >= VF in the vector loop rather than decrement by one and test for zero, but I think IVOPTs will eventually recover this IV. My experience with exchange is that LRA spills the "wrong" regs here and that IVOPTs only considers a loop in isolation rather than a set of sibling loops within an outer loop for example. At some point I thought of doing aggressive unswitching of outer loops on 'mod(row,3)' somehow, that might reduce the complexity of the nests considerably, enabling IRA/LRA to do a better job eventually. But anyway, I think exchange is special enough that we shouldn't worry about the regression too much - at least not as to invent these kind of odd heuristics for it. > if the tripcount of vect_loop is big, then the penalty can be > compensated by the vectorization, but for exchange_r, the tripcount is > small, that's why it regresses performance. > The range info didn't get real range for the tripcount since fortran > allows negative index for the array. > I'm not sure if ivopt or ira could be improved for this case, so maybe > we should live with this regression. > Restricting the cut off with the very-cheap cost model would also be a > good idea. > > > > That said, it feels like a hack specifically for 548.exchange_r, in > > particular > > vectorization by itself shouldn't increase register pressure (much), but > > exchange is known to operate on the bounds of "awful" with regard to > > register pressure. If you'd enable APX would exchange benefit from > > vectorizing? > APX would fix this(it even improves the exchange w/o vectorization) > > > > That said, I think we have to live with the regression, the change feels > > odd and a strcmp (main_file_name, "exchange") would be similar. So > > we either need to make the pattern matching more precise, like counting > > live IVs from the loop nest plus applying heuristics on how vectorization > > increases register pressure (maybe it's an IVOPTs issue in the end?), > > or defer a solution to GCC 16. > > > > Richard. > > > > > > > > > > > + class loop* outer_loop = loop_outer (vect_loop); > > > > > + > > > > > + auto_vec<class loop*> m_loop_stack; > > > > > + auto_sbitmap m_visited_loops (number_of_loops (cfun)); > > > > > + > > > > > + /* Get all sibling loops for vect_loop. */ > > > > > + class loop* next_loop = outer_loop->inner; > > > > > + for (; next_loop; next_loop = next_loop->next) > > > > > + { > > > > > + m_loop_stack.safe_push (next_loop); > > > > > + bitmap_set_bit (m_visited_loops, next_loop->num); > > > > > + } > > > > > + > > > > > + /* DFS the max depth of all sibling loop. */ > > > > > + while (!m_loop_stack.is_empty ()) > > > > > + { > > > > > + next_loop = m_loop_stack.pop (); > > > > > + if (loop_depth (next_loop) > (unsigned) > > > > > ix86_vect_max_loop_depth) > > > > > + return true; > > > > > + > > > > > + class loop* inner_loop = next_loop->inner; > > > > > + while (inner_loop) > > > > > + { > > > > > + if (!bitmap_bit_p (m_visited_loops, inner_loop->num)) > > > > > + { > > > > > + m_loop_stack.safe_push (inner_loop); > > > > > + bitmap_set_bit (m_visited_loops, inner_loop->num); > > > > > + } > > > > > + inner_loop = inner_loop->next; > > > > > + } > > > > > + } > > > > > + > > > > > + return false; > > > > > +} > > > > > + > > > > > void > > > > > ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > > > > > { > > > > > @@ -25344,6 +25424,15 @@ ix86_vector_costs::finish_cost (const > > > > > vector_costs *scalar_costs) > > > > > && (exact_log2 (LOOP_VINFO_VECT_FACTOR > > > > > (loop_vinfo).to_constant ()) > > > > > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > > > > > m_costs[vect_body] = INT_MAX; > > > > > + > > > > > + /* Prohibit vectorization when the loop requires versioning > > > > > + and loop_depth exceeds threshold. */ > > > > > + if ((LOOP_REQUIRES_VERSIONING (loop_vinfo) > > > > > + || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) > > > > > + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > > > > > + || vect_apply_runtime_profitability_check_p (loop_vinfo)) > > > > > > > > this also applies to loops not requiring versioning - it practically > > > > applies to > > > > all loops that do not run for a constant VF-times times. > > > > > > > > > + && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP > > > > > (loop_vinfo))) > > > > > + m_costs[vect_body] = INT_MAX; > > > > > } > > > > > > > > > > ix86_vect_estimate_reg_pressure (); > > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt > > > > > index 99e86f545e8..c5abf83473d 100644 > > > > > --- a/gcc/config/i386/i386.opt > > > > > +++ b/gcc/config/i386/i386.opt > > > > > @@ -1414,3 +1414,7 @@ Support MOVRS built-in functions and code > > > > > generation. > > > > > mamx-movrs > > > > > Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save > > > > > Support AMX-MOVRS built-in functions and code generation. > > > > > + > > > > > +-param=vect-max-loop-depth= > > > > > +Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param > > > > > +Preversion loop vectorization when it's in a deeply nested loop and > > > > > requires versioning, since it may increase register pressure too much. > > > > > -- > > > > > 2.34.1 > > > > > > > > > > > > > > > > > -- > > > BR, > > > Hongtao > > > > -- > BR, > Hongtao