https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, both the following patches should fix it IMHO, but no idea which one if any
is right.

With
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md    2019-01-18 18:07:37.194899062 +0100
@@ -4356,7 +4356,9 @@
   ""
   [(const_int 0)]
 {
-  rtx hi = gen_highpart (DFmode, operands[1]);
+  rtx hi = (BYTES_BIG_ENDIAN
+            ? gen_highpart (DFmode, operands[1])
+            : gen_lowpart (DFmode, operands[1]));
   rtx lo = (GET_CODE (operands[2]) == SCRATCH)
             ? gen_reg_rtx (DFmode)
             : operands[2];
the assembly changes:
--- reduction-3.s1      2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s2      2019-01-18 18:10:20.617233358 +0100
@@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
        addi 9,9,16
        bdnz .L2
         # vec_extract to same register
-       lfd 12,-8(1)
+       lfd 12,-16(1)
        xsmaxdp 0,12,0
        stfd 0,0(10)
        blr

with:
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md    2019-01-18 18:16:30.680186709 +0100
@@ -4361,7 +4361,9 @@
             ? gen_reg_rtx (DFmode)
             : operands[2];
 
-  emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
+  emit_insn (gen_vsx_extract_v2df (lo, operands[1],
+                                   BYTES_BIG_ENDIAN
+                                   ? const1_rtx : const0_rtx));
   emit_insn (gen_<VEC_reduc_rtx>df3 (operands[0], hi, lo));
   DONE;
 }
the assembly changes:
--- reduction-3.s1      2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s3      2019-01-18 18:17:18.977397458 +0100
@@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
        xxpermdi 0,0,0,2
        addi 9,9,16
        bdnz .L2
-        # vec_extract to same register
+       xxpermdi 0,0,0,3
        lfd 12,-8(1)
        xsmaxdp 0,12,0
        stfd 0,0(10)

So just judging from this exact testcase, the first patch seems to be more
efficient, though still unsure about that, because it goes through memory in
either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that swaps
the two elements instead of loading it from memory?
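
To make the difference between the two patches easier to see, here is a toy C model of the element selection. It is only a sketch, not GCC code. It assumes that element 0 of a V2DF sits at the lower address, and that the "high part" of the vector is element 0 on big-endian and element 1 on little-endian. That assumption is inferred from the lfd offset moving from -8(1) to -16(1) in the first assembly diff. The names v2df_model, highpart_index, lowpart_index and the reduc_max_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: a V2DF value as two doubles, v[0] at the lower address. */
typedef struct { double v[2]; } v2df_model;

/* Assumption: the high part is element 0 on big-endian, element 1 on
   little-endian; the low part is the other element. */
static int highpart_index(bool big_endian) { return big_endian ? 0 : 1; }
static int lowpart_index(bool big_endian)  { return big_endian ? 1 : 0; }

static double max2(double a, double b) { return a > b ? a : b; }

/* Unpatched splitter: hi is always the high part, lo is always element 1.
   On little-endian both pick element 1, so element 0 is ignored. */
static double reduc_max_unpatched(v2df_model x, bool big_endian)
{
    double hi = x.v[highpart_index(big_endian)];
    double lo = x.v[1];
    return max2(hi, lo);
}

/* First patch: hi is the high part on big-endian and the low part on
   little-endian; lo is still element 1. */
static double reduc_max_patch1(v2df_model x, bool big_endian)
{
    double hi = x.v[big_endian ? highpart_index(big_endian)
                               : lowpart_index(big_endian)];
    double lo = x.v[1];
    return max2(hi, lo);
}

/* Second patch: hi is still the high part; lo is element 1 on big-endian
   and element 0 on little-endian. */
static double reduc_max_patch2(v2df_model x, bool big_endian)
{
    double hi = x.v[highpart_index(big_endian)];
    double lo = x.v[big_endian ? 1 : 0];
    return max2(hi, lo);
}

/* Small self-check of the model with the larger value in element 0. */
static void model_selftest(void)
{
    v2df_model x = { { 5.0, 3.0 } };
    assert(reduc_max_unpatched(x, true) == 5.0);   /* big-endian was fine */
    assert(reduc_max_unpatched(x, false) == 3.0);  /* little-endian loses v[0] */
    assert(reduc_max_patch1(x, false) == 5.0);
    assert(reduc_max_patch2(x, false) == 5.0);
}
```

Under this model both patches read each element exactly once on either endianness; they differ only in which of hi and lo is redirected, which matches the two different assembly diffs above.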