On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
> I wouldn't expose the "fake" larger modes to the vectorizer but rather
> adjust m_suggested_unroll_factor (which you already do to some extent).

Thanks.  I figure I need to fix the shuffle bytes issue first and get a
clean test run (with the flag enabled by default) before delving into the
vectorization issues.

But testing has shown that, at least in the loop I was looking at, using
vector pair instructions (either through the built-ins I had previously
posted or with these patches) is still faster, even with unrolling turned
off completely for the vector pair case, than unrolling the loop 4 times
using vector types (or auto vectorization).  Note, of course, that the
margin is much smaller in this case.

vector double: (a * b) + c, unroll 4                  loop time: 0.55483
vector double: (a * b) + c, unroll default            loop time: 0.55638
vector double: (a * b) + c, unroll 0                  loop time: 0.55686
vector double: (a * b) + c, unroll 2                  loop time: 0.55772

vector32, w/vector pair: (a * b) + c, unroll 4        loop time: 0.48257
vector32, w/vector pair: (a * b) + c, unroll 2        loop time: 0.50782
vector32, w/vector pair: (a * b) + c, unroll default  loop time: 0.50864
vector32, w/vector pair: (a * b) + c, unroll 0        loop time: 0.52224

Of course, these being micro-benchmarks, it doesn't mean that this
translates to the behavior on actual code.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com