On Fri, 23 Aug 2019, Andre Vieira (lists) wrote: > Hi Richard, > > I have come up with a way to teach the vectorizer to handle sign-changing > reductions, restricted to SUM operations as I'm not sure other reductions are > equivalent with different signs. > > The main nature of this approach is to let it recognize reductions of the > form: Phi->NopConversion?->Plus/Minus-reduction->NopConversion?->Phi. Then > vectorize the statements normally, with some extra workarounds to handle the > conversions. This is mainly needed where it looks for uses of the result of > the reduction, we now need to check the uses of the result of the conversion > instead. > > I am curious to know what you think of this approach. I have regression tested > this on aarch64 and x86_64 with AVX512 and it shows no regressions. On the 1 > month old version of trunk I tested on it even seems to make > gcc.dg/vect/pr89268.c pass, where it used to fail with an ICE complaining > about a definition not dominating a use.
Aww. Yeah, I had a half-way working patch along this line as well and threw it away because of ugliness. So I was hoping we can at some point refactor the reduction detection code to use the path discovery in check_reduction_path (which is basically a lame SCC finding algorithm), massage the detected reduction path and in the reduction PHI meta-data record something like "this reduction SUMs _1, _4, _3 and _5" plus for the conversions "do the reduction in SIGN" and during code-generation just look at the PHI node and the backedge def which we'd replace. But of course I stopped short trying that because the reduction code is a big mess. And I threw away the attempt that looked like yours because I didn't want to make an even bigger mess out of it :/ On the branch throwing away the non-SLP paths I started to refactor^Wrewrite all this but got stuck as well. One thing I realized on the branch is that nested cycle handling should be more straight-forward and done in a separate vectorizable_* routine. Not sure it simplified things a lot, but well. Maybe also simply always building a SLP graph for reductions only helps. Well. Maybe you can try experimenting with amending check_reduction_path with conversion support - from a quick look your patch wouldn't handle _1 = PHI <.., _4> _2 = (unsigned) _1; _3 = _2 + ...; _4 = (signed) _3; since the last stmt you expect is still a PLUS? > The initial benchmarks I did also show a 14% improvement on x264_r on SPEC2017 > for aarch64. Interesting, on x86 IIRC I didn't see any such big effect on x264_r but it was the testcase I ran into this first. Richard.