https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753
--- Comment #7 from Jeffrey Walton <noloader at gmail dot com> ---
(In reply to Bill Schmidt from comment #4)
> ...
>
> The best performance will be achieved by writing this loop entirely using
> inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
> swaps), thus in "big-endian element order" (element 0 in the high-order
> position of the register). Because of the big-endian nature of vshasigmaw,
> this is always going to be the best approach.

Thanks Bill.

We are working on your lxvd2x suggestion using inline assembly. Related, see
"GCC vec_xl_be replacement using inline assembly",
https://stackoverflow.com/q/49215090/608639.

-----

I'm not sure if I am doing something wrong, or this is a new issue:

$ cat test.cxx
...
typedef __vector unsigned int uint32x4_p8;

uint32x4_p8 VEC_XL_BE(const uint8_t* data, int offset)
{
#if defined(__xlc__) || defined(__xlC__)
    return (uint32x4_p8)vec_xl_be(offset, (uint8_t*)data);
#else
    uint32x4_p8 res;
    __asm(" lxvd2x %x0, %1, %2 \n\t"
          : "=wa" (res)
          : "g" (data), "g" (offset));
    return res;
#endif
}

When I use VEC_XL_BE in real life it results in:

$ g++ -DTEST_MAIN -g3 -O3 -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe
/home/noloader/tmp/ccbDnfFr.s: Assembler messages:
/home/noloader/tmp/ccbDnfFr.s:758: Error: operand out of range (32 is not between 0 and 31)
/home/noloader/tmp/ccbDnfFr.s:983: Error: operand out of range (48 is not between 0 and 31)
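
One plausible reading of the assembler messages, offered as a sketch and not as a confirmed diagnosis: the "g" constraint allows a general register, a memory operand, or an immediate integer, while lxvd2x expects register numbers (0 to 31) in its second and third operands. If the function is inlined with a constant offset such as 32 or 48, the compiler may print that constant where a register number belongs. The variant below uses "b" (address base register) for the pointer and "r" for the index so both operands are forced into registers. The name VEC_XL_BE_sketch, the "memory" clobber, and the non-POWER fallback are assumptions added for illustration; the fallback is a plain byte copy so the sketch compiles elsewhere, and it does not reproduce lxvd2x element order.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__GNUC__) && defined(__powerpc64__) && defined(__VSX__)
typedef __vector unsigned int uint32x4_p8;
#else
// Hypothetical stand-in so the sketch builds on non-POWER targets.
struct uint32x4_p8 { uint32_t w[4]; };
#endif

// Hypothetical variant of VEC_XL_BE with register-only constraints.
inline uint32x4_p8 VEC_XL_BE_sketch(const uint8_t* data, int offset)
{
    uint32x4_p8 res;
#if defined(__GNUC__) && defined(__powerpc64__) && defined(__VSX__)
    // "b": base register (never r0, which lxvd2x reads as zero in RA).
    // "r": any general register; the offset is widened to 64 bits first.
    // "memory": the asm reads memory that is not listed as an operand.
    __asm__ ("lxvd2x %x0, %1, %2"
             : "=wa" (res)
             : "b" (data), "r" (static_cast<long>(offset))
             : "memory");
#else
    // Assumption: byte copy only; lane order differs from lxvd2x.
    std::memcpy(&res, data + offset, sizeof(res));
#endif
    return res;
}
```

With this form, a call such as VEC_XL_BE_sketch(buf, 32) should leave the compiler no way to substitute the literal 32 into the instruction; it has to materialize the value in a register first.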