https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753
Bill Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|INVALID                     |---

--- Comment #4 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
OK, I see.  We optimize swapped vperm in most cases as part of a general
swap-optimization algorithm.  However, this algorithm is defeated when there
is a mix of loads/stores accompanied by swaps and loads/stores that are not
accompanied by swaps.  The "big-endian" loads that are used with vshasigmaw
and friends are the problem here.  (This problem goes away with Power9, but
doesn't help you here.)

There is a slight possibility we can address this in GCC 8, but it is
unlikely, as the code base is closed except for regression fixes.  In any
case, a solution would still keep some swap instructions in place, and thus
would not be ideal.  (I.e., we can fold a swap and a vperm when the result of
the swap is not used elsewhere, but other swaps associated with loads and
stores will still be present.)  So I don't think we should go this route.

The best performance will be achieved by writing this loop entirely using
inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
swaps), thus in "big-endian element order" (element 0 in the high-order
position of the register).  Because of the big-endian nature of vshasigmaw,
this is always going to be the best approach.

I am still poking the bushes for a reference implementation; I thought of
another person to ask while writing this note.  Will let you know what I find
out.
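
For readers following along, here is a rough portable C model of what "folding a swap and a vperm" can mean. It is only a sketch under stated assumptions, not GCC's actual swap-optimization pass and not real VSX code: the type vec128 and the helpers swap_dw, vperm_model, fold_swaps_into_selector and fold_is_equivalent are hypothetical names made up for illustration, and registers are modeled as 16 bytes numbered in big-endian order. The idea shown is that a doubleword swap on both vperm inputs can be absorbed by flipping bit 3 of each selector index, so the explicit swaps are no longer needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 16-byte vector register, bytes numbered 0..15
   in big-endian order.  Not a real VSX type. */
typedef struct { uint8_t b[16]; } vec128;

/* Model of a doubleword swap (the effect of xxswapd): exchange the two
   8-byte halves, i.e. byte i of the result is byte (i ^ 8) of the input. */
static vec128 swap_dw(vec128 v) {
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = v.b[i ^ 8];
    return r;
}

/* Model of vperm: each selector byte picks one byte out of the 32-byte
   concatenation a||b, using only its low 5 bits. */
static vec128 vperm_model(vec128 a, vec128 b, vec128 sel) {
    vec128 r;
    for (int i = 0; i < 16; i++) {
        uint8_t idx = sel.b[i] & 0x1f;
        r.b[i] = (idx < 16) ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Absorb a doubleword swap of both inputs into the selector by flipping
   bit 3 of every index (assumption: both inputs were swapped). */
static vec128 fold_swaps_into_selector(vec128 sel) {
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = sel.b[i] ^ 0x08;
    return r;
}

/* Returns 1 if vperm on swapped inputs matches vperm on the original
   inputs with the adjusted selector, for one arbitrary test pattern. */
static int fold_is_equivalent(void) {
    vec128 a, b, sel;
    for (int i = 0; i < 16; i++) {
        a.b[i] = (uint8_t)i;
        b.b[i] = (uint8_t)(0x80 + i);
        sel.b[i] = (uint8_t)((i * 7 + 3) & 0x1f);
    }
    vec128 with_swaps = vperm_model(swap_dw(a), swap_dw(b), sel);
    vec128 folded = vperm_model(a, b, fold_swaps_into_selector(sel));
    return memcmp(with_swaps.b, folded.b, 16) == 0;
}
```

Note that this only removes swaps whose sole consumer is the permute. As the comment says, a swap whose result is also used elsewhere, or one attached to an unrelated load or store, would have to stay.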