On 2023/11/20 16:56, Michael Meissner wrote:
> On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
>> I wouldn't expose the "fake" larger modes to the vectorizer but rather
>> adjust m_suggested_unroll_factor (which you already do to some extent).
>
> Thanks.  I figure I first need to fix the shuffle bytes issue and get a
> clean test run (with the flag enabled by default) before delving into the
> vectorization issues.
>
> But testing has shown that, at least in the loop I was looking at, using
> vector pair instructions (either through the built-ins I had previously
> posted or with these patches) is still faster than unrolling the loop 4
> times with plain vector types (or auto-vectorization), even if I turn off
> unrolling completely for the vector pair case.  Of course, the margin is
> much smaller in that case.
>
> vector double: (a * b) + c, unroll 4        loop time: 0.55483
> vector double: (a * b) + c, unroll default  loop time: 0.55638
> vector double: (a * b) + c, unroll 0        loop time: 0.55686
> vector double: (a * b) + c, unroll 2        loop time: 0.55772
>
> vector32, w/vector pair: (a * b) + c, unroll 4        loop time: 0.48257
> vector32, w/vector pair: (a * b) + c, unroll 2        loop time: 0.50782
> vector32, w/vector pair: (a * b) + c, unroll default  loop time: 0.50864
> vector32, w/vector pair: (a * b) + c, unroll 0        loop time: 0.52224
>
> Of course, these being micro-benchmarks, that doesn't mean the results
> translate to the behavior of actual code.
I noticed that Ajit posted a patch adding a new pass that replaces vector
loads lxv from contiguous addresses with lxvp:

https://inbox.sourceware.org/gcc-patches/ef0c54a5-c35c-3519-f062-9ac78ee66...@linux.ibm.com/

How about making this kind of rs6000-specific pass pair both vector loads
and stores?  Users can then request more unrolling with parameters, and
since the memory accesses produced by unrolling should be laid out neatly,
I'd expect the pass to detect and pair the candidates easily.

BR,
Kewen