On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
> I wouldn't expose the "fake" larger modes to the vectorizer but rather
> adjust m_suggested_unroll_factor (which you already do to some extent).

Thanks.  I figure I need to fix the shuffle bytes issue first and get a
clean test run (with the flag enabled by default) before delving into the
vectorization issues.

But testing has shown that, at least in the loop I was looking at, using
vector pair instructions (either through the built-ins I had previously posted
or with these patches) is faster than unrolling the loop 4 times using vector
types (or auto vectorization), even if I turn off unrolling completely for the
vector pair case (0.52224 vs. 0.55483 in the timings below).  Note, of course,
that the margin is much smaller in this case.

vector double:           (a * b) + c, unroll 4         loop time: 0.55483
vector double:           (a * b) + c, unroll default   loop time: 0.55638
vector double:           (a * b) + c, unroll 0         loop time: 0.55686
vector double:           (a * b) + c, unroll 2         loop time: 0.55772

vector32, w/vector pair: (a * b) + c, unroll 4         loop time: 0.48257
vector32, w/vector pair: (a * b) + c, unroll 2         loop time: 0.50782
vector32, w/vector pair: (a * b) + c, unroll default   loop time: 0.50864
vector32, w/vector pair: (a * b) + c, unroll 0         loop time: 0.52224
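
For anyone who wants to reproduce the shape of this test, here is a minimal
sketch of the kind of loop being timed.  It is not the actual benchmark source:
the function name fma_loop, the use of restrict, and the use of
"#pragma GCC unroll" to set the unroll factor are all assumptions made for
illustration.

```c
#include <stddef.h>

/* Hypothetical kernel: r[i] = (a[i] * b[i]) + c[i].
   The restrict qualifiers tell the compiler the arrays do not overlap,
   which lets the auto-vectorizer use vector (or vector pair) loads.  */
void
fma_loop (double *restrict r,
          const double *restrict a,
          const double *restrict b,
          const double *restrict c,
          size_t n)
{
  /* Assumption: the unroll factor is set with GCC's loop pragma.
     A value of 0 or 1 disables unrolling for this loop; removing the
     pragma leaves the decision to the compiler's default heuristics.  */
#pragma GCC unroll 4
  for (size_t i = 0; i < n; i++)
    r[i] = (a[i] * b[i]) + c[i];
}
```

Changing the pragma value to 0 or 2, or dropping it, would correspond to the
"unroll 0", "unroll 2" and "unroll default" rows, if the factor was indeed set
this way.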

Of course, these are micro-benchmarks, so the results may not translate to the
behavior of actual code.


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com
