https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718
--- Comment #6 from Steven Munroe <munroesj at gcc dot gnu.org> ---
Another issue with vector loads from .rodata.

Sometimes the compiler will generate this sequence for power8:

        addis 9,2,.LC69@toc@ha
        addi 9,9,.LC69@toc@l
        rldicr 9,9,0,59
        lxvd2x 12,0,9
        xxpermdi 12,12,12,2

GCC seems to generate this when it wants to load into a VSR (0-31) vs a VR.
The latency is 13 cycles!

The rldicr 9,9,0,59 (clrrdi r9,r9,4) is not required. The data is already
aligned! The compiler should know this because this is a vector constant and
TOC relative. It is not random user data!

The xxpermdi (xxswapd) is needed because lxvd2x is:
- Endian enabled within the element, but
- Array order across elements

The swap is not needed when the data is splatted (DW0 == DW1). The compiler
could know this. Likely the compiler generated this via constant propagation.
The compiler should know!

Finally, why is the address calculation a dependent sequence that guarantees
the worst possible latency? Compare:

        addis 9,2,.LC69@toc@ha
        li 0,.LC69@toc@l
        lxvd2x 12,9,0
        xxpermdi 12,12,12,2

This allows the addis/li to execute in parallel and enables instruction
fusion. This sequence is 9 cycles (7 cycles without the xxswapd).

See Section 10.1.12 Instruction Fusion of the POWER8 Processor User’s Manual.
The li/lxvd2x pair (li is addi with RA=0) can be treated as a (Power8 Tuned)
prefix instruction which is effectively a D-form lxvd2x. This fusion form
applies to the {lxvd2x, lxvw4x, lxvdsx, lvebx, lvehx, lvewx, lvx, lxsdx}
instructions.

Yes, this clobbers another register (R0 for 2 instructions), but the faster
sequence can actually reduce register pressure.
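
To make the three claims above concrete, here is a small C model of the arithmetic. It is only a sketch and not anything GCC emits: the type vsr_t and the helpers xxswapd, clrrdi4, toc_l, toc_ha, ea_dependent and ea_indexed are hypothetical names invented for illustration, and the model assumes a non-negative TOC offset below 2^31. It checks that the doubleword swap is a no-op on a splat, that clearing the low four bits is a no-op on a 16-byte aligned address, and that the addis/addi and addis/li forms compute the same effective address.

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit doublewords standing in for one 128-bit VSR (hypothetical type). */
typedef struct { uint64_t dw[2]; } vsr_t;

/* Model of xxpermdi XT,XA,XA,2 (xxswapd): exchange the two doublewords. */
static vsr_t xxswapd(vsr_t v)
{
    vsr_t r = { { v.dw[1], v.dw[0] } };
    return r;
}

/* Model of rldicr RA,RS,0,59 (clrrdi RA,RS,4): clear the low four bits. */
static uint64_t clrrdi4(uint64_t ea)
{
    return ea & ~(uint64_t)0xF;
}

/* Model of @toc@l: low 16 bits of the offset, read as a signed value. */
static int64_t toc_l(uint32_t off)
{
    int64_t lo = off & 0xFFFFu;
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* Model of @toc@ha: high part, adjusted so that ha * 65536 + l == off. */
static int64_t toc_ha(uint32_t off)
{
    return ((int64_t)off - toc_l(off)) / 0x10000;
}

/* Dependent form: the addi must wait for the addis result. */
static uint64_t ea_dependent(uint64_t toc, uint32_t off)
{
    uint64_t r9 = toc + (uint64_t)(toc_ha(off) * 0x10000); /* addis 9,2,sym@toc@ha */
    r9 += (uint64_t)toc_l(off);                            /* addi 9,9,sym@toc@l */
    return r9;                                             /* lxvd2x 12,0,9 */
}

/* Indexed form: addis and li are independent; the load adds them. */
static uint64_t ea_indexed(uint64_t toc, uint32_t off)
{
    uint64_t r9 = toc + (uint64_t)(toc_ha(off) * 0x10000); /* addis 9,2,sym@toc@ha */
    uint64_t r0 = (uint64_t)toc_l(off);                    /* li 0,sym@toc@l */
    return r9 + r0;                                        /* lxvd2x 12,9,0 */
}

/* Checks the three claims on made-up sample values. */
static void self_check(void)
{
    vsr_t splat = { { 0x0123456789abcdefULL, 0x0123456789abcdefULL } };
    vsr_t mixed = { { 1, 2 } };
    uint64_t toc = 0x10000000;  /* made-up, 16-byte aligned TOC base */
    uint32_t off = 0x18a30;     /* made-up, 16-byte aligned offset */

    /* Swapping a splat changes nothing; swapping a non-splat does. */
    assert(xxswapd(splat).dw[0] == splat.dw[0]);
    assert(xxswapd(splat).dw[1] == splat.dw[1]);
    assert(xxswapd(mixed).dw[0] == 2 && xxswapd(mixed).dw[1] == 1);

    /* Both address forms agree, and the clrrdi leaves the result unchanged. */
    assert(ea_dependent(toc, off) == ea_indexed(toc, off));
    assert(clrrdi4(ea_indexed(toc, off)) == ea_indexed(toc, off));
}
```

Calling self_check() from a main() on any host is enough to run the model. It says nothing about cycle counts; those come from the POWER8 manual, not from this arithmetic.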