https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|INVALID                     |---

--- Comment #4 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
OK, I see.  We optimize swapped vperm in most cases as part of a general
swap-optimization algorithm.  However, this algorithm is defeated when there is
a mix of loads/stores accompanied by swaps and loads/stores that are not
accompanied by swaps.  The "big-endian" loads that are used with vshasigmaw and
friends are the problem here.  (This problem goes away with Power9, but that
doesn't help you here.)

There is a slight possibility we can address this in GCC 8, but it is unlikely,
as the code base is closed except for regression fixes.  In any case, a
solution would still keep some swap instructions in place, and thus would not
be ideal.  (I.e., we can fold a swap and a vperm when the result of the swap is
not used elsewhere, but other swaps associated with loads and stores will still
be present.)  So I don't think we should go this route.
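
To illustrate the folding mentioned above, here is a minimal sketch in plain
C.  It is a simplified model, not GCC's swap-optimization pass and not the
exact little-endian semantics of vperm: bytes are numbered 0..15 within each
vector, a "swap" exchanges the two 8-byte halves, and the permute selects from
the 32-byte concatenation of its two inputs.  All function names are
hypothetical.  Under those assumptions, swapping both inputs and then
permuting gives the same result as one permute whose control indices have
bit 3 flipped.

```c
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16

/* Model of a doubleword swap: exchange the two 8-byte halves of a vector. */
static void swap_doublewords(const uint8_t in[VEC_BYTES], uint8_t out[VEC_BYTES])
{
    for (int i = 0; i < VEC_BYTES; i++)
        out[i] = in[i ^ 8];
}

/* Model of a two-input byte permute: the low 5 bits of each control byte
   index into the 32-byte concatenation a || b. */
static void permute(const uint8_t a[VEC_BYTES], const uint8_t b[VEC_BYTES],
                    const uint8_t ctl[VEC_BYTES], uint8_t out[VEC_BYTES])
{
    for (int i = 0; i < VEC_BYTES; i++) {
        unsigned idx = ctl[i] & 31u;
        out[i] = (idx < VEC_BYTES) ? a[idx] : b[idx - VEC_BYTES];
    }
}

/* Fold the swaps on both inputs into the control vector.  Flipping bit 3 of
   an index selects the other doubleword of the same input. */
static void fold_swap_into_control(const uint8_t ctl[VEC_BYTES],
                                   uint8_t folded[VEC_BYTES])
{
    for (int i = 0; i < VEC_BYTES; i++)
        folded[i] = (uint8_t)((ctl[i] & 31u) ^ 8u);
}

/* Returns 1 when swap-then-permute equals the single folded permute. */
static int fold_matches(void)
{
    uint8_t a[VEC_BYTES], b[VEC_BYTES], ctl[VEC_BYTES];
    uint8_t sa[VEC_BYTES], sb[VEC_BYTES], folded[VEC_BYTES];
    uint8_t unfolded_result[VEC_BYTES], folded_result[VEC_BYTES];

    for (int i = 0; i < VEC_BYTES; i++) {
        a[i] = (uint8_t)(0x10 + i);        /* distinct bytes for input a */
        b[i] = (uint8_t)(0x80 + i);        /* distinct bytes for input b */
        ctl[i] = (uint8_t)((i * 7 + 3) & 31); /* arbitrary mix of a and b */
    }

    /* Unfolded form: two swaps followed by a permute. */
    swap_doublewords(a, sa);
    swap_doublewords(b, sb);
    permute(sa, sb, ctl, unfolded_result);

    /* Folded form: one permute on the original inputs. */
    fold_swap_into_control(ctl, folded);
    permute(a, b, folded, folded_result);

    return memcmp(unfolded_result, folded_result, VEC_BYTES) == 0;
}
```

In this model fold_matches() returns 1.  The fold is only valid when the
swapped values feed nothing but the permute, which is the "not used elsewhere"
condition above; swaps attached to other loads and stores are untouched.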

The best performance will be achieved by writing this loop entirely using
inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
swaps), thus in "big-endian element order" (element 0 in the high-order
position of the register).  Because of the big-endian nature of vshasigmaw,
this is always going to be the best approach.

I am still poking the bushes for a reference implementation; I thought of
another person to ask while writing this note.  Will let you know what I find
out.
