https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718

--- Comment #6 from Steven Munroe <munroesj at gcc dot gnu.org> ---
Another issue with vector loads from .rodata:

Sometimes the compiler will generate this sequence for POWER8:

        addis 9,2,.LC69@toc@ha
        addi 9,9,.LC69@toc@l
        rldicr 9,9,0,59
        lxvd2x 12,0,9
        xxpermdi 12,12,12,2

GCC seems to generate this when it wants to load into a VSR (0-31) vs a VR. The
latency is 13 cycles!

The rldicr 9,9,0,59 (clrrdi  r9,r9,4) is not required. The data is already
aligned!

The compiler should know this because this is a vector constant and
TOC-relative. It is not random user data!
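As a rough illustration of why the rldicr is redundant for aligned data, here is a minimal C sketch of what the instruction computes. The names rldicr_model and clear_is_redundant are hypothetical helpers for this note, not anything from GCC or the ISA headers, and the model assumes sh < 64 and me <= 63.

```c
#include <stdint.h>

/* Hypothetical model of rldicr RA,RS,SH,ME: rotate RS left by SH bits, then
   keep IBM-numbered bits 0..ME (bit 0 is the most significant bit).
   Assumes sh < 64 and me <= 63. */
static uint64_t rldicr_model(uint64_t rs, unsigned sh, unsigned me)
{
    uint64_t rotated = sh ? (rs << sh) | (rs >> (64 - sh)) : rs;
    uint64_t mask = ~UINT64_C(0) << (63 - me);
    return rotated & mask;
}

/* Nonzero when "rldicr rX,rX,0,59" would leave addr unchanged,
   i.e. when addr is already 16-byte aligned. */
static int clear_is_redundant(uint64_t addr)
{
    return rldicr_model(addr, 0, 59) == addr;
}
```

With SH=0 and ME=59 the mask only clears the low four bits, so for any 16-byte aligned address the instruction is a no-op that still costs a cycle on the dependent path.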

The xxpermdi (xxswapd) is needed because lxvd2x is:
 - Endian enabled within the element, but
 - Array order across elements

The exception is splatted data (DW0 == DW1), where the swap is a no-op. The
compiler could know this. Likely the compiler generated this constant via
constant propagation. The compiler should know!
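The splat condition can be sketched in C as follows. This is only a model of the doubleword swap, assuming the 128-bit constant is viewed as two 64-bit halves; vec128_model, make_vec, xxswapd_model and swap_is_noop are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical view of a 128-bit vector constant as two doublewords. */
typedef struct { uint64_t dw[2]; } vec128_model;

static vec128_model make_vec(uint64_t dw0, uint64_t dw1)
{
    vec128_model v = { { dw0, dw1 } };
    return v;
}

/* Models xxpermdi XT,XA,XA,2 (xxswapd): exchange the two doublewords. */
static vec128_model xxswapd_model(vec128_model v)
{
    return make_vec(v.dw[1], v.dw[0]);
}

/* Nonzero when swapping the doublewords cannot change the value,
   which is exactly the DW0 == DW1 case. */
static int swap_is_noop(vec128_model v)
{
    return v.dw[0] == v.dw[1];
}
```

A check of this kind on the constant pool entry would be enough to decide at compile time whether the xxswapd can be dropped.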

Finally, why is the address calculation a dependent sequence that guarantees
the worst possible latency? A better sequence would be:

        addis 9,2,.LC69@toc@ha
        li 0,.LC69@toc@l
        lxvd2x 12,9,0
        xxpermdi 12,12,12,2

This allows the addis and the li (an addi) to execute in parallel and enables
instruction fusion. This sequence is 9 cycles (7 cycles without the xxswapd).
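To illustrate why the two halves of the address can be computed independently, here is a minimal C sketch of the usual @ha/@l split of a 32-bit TOC offset and the indexed effective address the load then forms. The names toc_split, split_toc_offset and indexed_ea are hypothetical, and the sketch assumes the offset fits the range a single addis can reach.

```c
#include <stdint.h>

/* Hypothetical holder for the two relocation halves of a TOC offset. */
typedef struct { int64_t ha; int64_t l; } toc_split;

/* Split off into @l (sign-extended low 16 bits) and @ha (the high part,
   adjusted so that ha * 65536 + l == off). */
static toc_split split_toc_offset(int32_t off)
{
    toc_split s;
    s.l = (int64_t)((uint32_t)off & 0xFFFFu);
    if (s.l >= 0x8000)
        s.l -= 0x10000;
    s.ha = ((int64_t)off - s.l) / 65536;
    return s;
}

/* Models the proposed sequence: the two register values depend only on
   the TOC pointer and the constant, not on each other. */
static uint64_t indexed_ea(uint64_t toc, int32_t off)
{
    toc_split s = split_toc_offset(off);
    uint64_t r9 = toc + (uint64_t)(s.ha * 65536); /* addis 9,2,sym@toc@ha */
    uint64_t r0 = (uint64_t)s.l;                  /* li 0,sym@toc@l */
    return r9 + r0;                               /* EA of lxvd2x 12,9,0 */
}
```

Because the load adds RA and RB itself, the low half never has to be folded into r9 first, which is what removes the addis-to-addi dependency.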

See Section 10.1.12, Instruction Fusion, of the POWER8 Processor User’s
Manual.

The addi/lxvd2x pair can be treated as a (POWER8-tuned) prefix instruction,
which is effectively a D-form lxvd2x. This fusion form applies to the {lxvd2x,
lxvw4x, lxvdsx, lvebx, lvehx, lvewx, lvx, lxsdx} instructions.

Yes, this clobbers another register (R0 for 2 instructions), but the faster
sequence can actually reduce register pressure.
