https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770
--- Comment #11 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Jens Seifert from comment #6)
> The left part of VSX registers overlaps with floating point registers, that
> is why no register xxpermdi is required and mfvsrd can access all (left)
> parts of VSX registers directly.

The mfvsrd instruction was invented before ELFv2 (at the same time as
mfvsrwz).  Everything in common use was big-endian then.  The insns to move
GPR->VSR that initially existed were mtvsrd and mtvsrw[az], all of which
write to dword 0 of the target VSR.

Dword 0 of vector regs is where 64-bit entities in vector regs are stored in
the ABIs, sure, and that corresponds to the FPRs in the ISA.

mtvsrdd and mtvsrws were added in ISA 3.0 (p9), together with mfvsrld, to
make things work better with little-endian ELFv2.

> The xxpermdi x,y,y,3 indicates to me that gcc prefers right part of register
> which might also cause the xxpermdi at the beginning.

And with -mbig you get ,2 here.  It is accidental.

> At the end the mystery
> is why gcc adds 3 xxpermdi to the code.

As I said, this is constructed during expand, to make correct code.  That is
all that expand should do: make correct (and well-optimisable, "open
structured", easy to transform) code.

We should be able to optimise this to something better in later passes that
*are* supposed to make faster code.  Like the p8 swaps pass, which mostly zaps
unnecessary pairs of swaps, or the swiss army bazooka combine, or even many
earlier passes if such an xxpermdi insn is truly superfluous.  It usually is
not: we are dealing with the full 128-bit VSRs there, and there is no way of
saying we do not care about part of the register contents.  Making infra for
that is big work.

We can make things easier by expressing things as 64 bit earlier.  We can (and
should) also investigate why the mfvsrd is not combined (as in, what the
instruction combiner pass does) with the xxpermdi.  There are many things not
quite perfect here.
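
For readers following the dword numbering, here is a minimal C sketch of the register moves discussed above.  It is a toy model, not GCC code and not taken from the bug report: the type vsr_t and the C functions are hypothetical names chosen to mirror the mnemonics, and each VSR is modelled as two doublewords in ISA (big-endian) numbering, so dw[0] is the half that overlaps the FPRs.  The model ignores the RA=0 encoding detail of mtvsrdd, and it zeroes dword 1 in mtvsrd even though the ISA leaves that half undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one 128-bit VSR: dw[0] is dword 0 (the "left" half that
   overlaps the FPRs), dw[1] is dword 1.  Hypothetical type, for
   illustration only.  */
typedef struct { uint64_t dw[2]; } vsr_t;

/* xxpermdi XT,XA,XB,DM: dword 0 of the result is selected from XA by the
   high bit of DM, dword 1 is selected from XB by the low bit of DM.  */
static vsr_t xxpermdi(vsr_t xa, vsr_t xb, unsigned dm)
{
    assert(dm <= 3);            /* DM is a 2-bit field */
    vsr_t xt;
    xt.dw[0] = xa.dw[(dm >> 1) & 1];
    xt.dw[1] = xb.dw[dm & 1];
    return xt;
}

/* mfvsrd reads dword 0; mfvsrld (ISA 3.0) reads dword 1.  */
static uint64_t mfvsrd(vsr_t xs)  { return xs.dw[0]; }
static uint64_t mfvsrld(vsr_t xs) { return xs.dw[1]; }

/* mtvsrd writes dword 0.  The ISA leaves dword 1 undefined; using zero
   here is an arbitrary choice of this model.  */
static vsr_t mtvsrd(uint64_t ra)
{
    vsr_t xt = { { ra, 0 } };
    return xt;
}

/* mtvsrdd (ISA 3.0) writes both halves: dword 0 from the first operand,
   dword 1 from the second.  */
static vsr_t mtvsrdd(uint64_t ra, uint64_t rb)
{
    vsr_t xt = { { ra, rb } };
    return xt;
}
```

In this model, for y = {A, B}, xxpermdi(y, y, 3) gives {B, B} and xxpermdi(y, y, 2) gives {B, A}.  Either way the old dword 1 ends up in dword 0, which is the half mfvsrd reads; the two forms differ only in what is left in dword 1.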