[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 05 Sep 2024 23:01:27 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611


--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
There will be a bunch of FAILs attributable to this "bug" once I merge the
next SLP enablement patches.  There is definitely a heuristic at play on the
vectorizer side, and the "old" heuristic cannot be transferred identically
(not easily, I think ...), so I opted for a slightly different one that
"made sense".

I do expect we're going to iterate a bit on that heuristic.

But I think evaluating options on the target side would be good as well, given
that I don't think the RVV designers envisioned RVV without a fast generic
permute mechanism; they just took the unusual route of calling it "gather".

I'll note that distinguishing permutes that can be implemented with
"compress", slide{up,down} and blend from those requiring a general "gather"
might make sense in case some uarchs have fast compress but slow gather.

Extract even/odd and interleave lo/hi are all quite fundamental building
blocks.
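
For reference, a scalar model of what those four permutes compute, as a
minimal sketch.  The function names and the int element type are made up
for illustration and are not GCC or RVV identifiers; both inputs are
assumed to have n lanes with n even.

```c
#include <stddef.h>

/* Scalar models of the four two-input permutes.  a and b each hold n
   lanes (n even, by assumption); out receives n lanes.  All names are
   hypothetical.  */

/* Even-numbered lanes of the concatenation a:b.  */
void extract_even(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t src = 2 * i;
        out[i] = src < n ? a[src] : b[src - n];
    }
}

/* Odd-numbered lanes of the concatenation a:b.  */
void extract_odd(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t src = 2 * i + 1;
        out[i] = src < n ? a[src] : b[src - n];
    }
}

/* Low halves interleaved: a0 b0 a1 b1 ...  */
void interleave_lo(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* High halves interleaved: a[n/2] b[n/2] a[n/2+1] b[n/2+1] ...  */
void interleave_hi(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        out[2 * i] = a[n / 2 + i];
        out[2 * i + 1] = b[n / 2 + i];
    }
}
```

With a = {0,1,2,3} and b = {4,5,6,7}, extract even/odd give {0,2,4,6} and
{1,3,5,7}, and interleave lo/hi give {0,4,1,5} and {2,6,3,7}.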

Note that ideally vectorizing the ia[i] store would not require increasing
the VF iff we can arrange for its store to use a vector of 1/8 the size.
You'd gather every (2 + 8*n)-th element from the in[] vector load into a
smaller vector and store that (well, you'd need to somehow reduce the
operating mask/len accordingly).
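
A scalar sketch of that selection, only to make the index pattern concrete.
The helper name and signature are hypothetical, and this models the data
movement rather than any actual vectorizer or RVV code.

```c
#include <stddef.h>

/* Copy lanes offset, offset + stride, offset + 2*stride, ... of
   in[0..n_in) into out and return how many lanes were written.
   Hypothetical helper; with offset 2 and stride 8 it selects the
   (2 + 8*n)-th lanes, so the result has 1/8 as many lanes as the
   input when n_in is a multiple of 8.  */
size_t gather_strided(const int *in, size_t n_in, int *out,
                      size_t offset, size_t stride)
{
    size_t count = 0;

    if (stride == 0)
        return 0;  /* avoid looping forever on a degenerate stride */

    for (size_t src = offset; src < n_in; src += stride)
        out[count++] = in[src];

    return count;
}
```

For a 16-lane input holding 0..15, gather_strided (in, 16, out, 2, 8)
writes the two lanes {2, 10}; the returned count is what the reduced
mask/len of the narrower store would have to reflect.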
