https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298
--- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
The situation is a bit more complicated.

IRA DTRT:

    8: r85:V2DF=[r86:DI+`y']
      REG_EQUIV [r86:DI+`y']
   11: r89:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
   12: r90:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
      REG_DEAD r85:V2DF

Later, LRA propagates the memory operand into the insn. Since the insn clobbers its input, multiple loads are emitted:

   26: xmm1:V2DF=[ax:DI+`y']
   11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,[ax:DI+`y']),parallel)
   28: xmm0:V2DF=[ax:DI+`y']
   12: xmm0:V2DF=vec_select(vec_concat([ax:DI+`y'],xmm0:V2DF),parallel)

which is further "optimized" in the postreload pass:

   26: xmm1:V2DF=[ax:DI+`y']
   11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,xmm1:V2DF),parallel)
   28: xmm0:V2DF=[ax:DI+`y']
   12: xmm0:V2DF=vec_select(vec_concat(xmm0:V2DF,xmm0:V2DF),parallel)

It looks to me that a heuristic is missing in LRA: a memory operand shouldn't be propagated into an insn if there are multiple uses of the register.
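
To make the suggested heuristic concrete, here is a minimal sketch of the decision as a toy cost model. It is not LRA code: the function names and the "one memory read per use" cost are assumptions made purely for illustration, and the model ignores register pressure and everything else LRA actually weighs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost model for illustration only; none of these
   functions exist in GCC. */

/* Memory reads if the equivalent memory operand is substituted into
   every insn that uses the pseudo: one read per use. */
int mem_reads_if_propagated(int n_uses)
{
    assert(n_uses >= 0);
    return n_uses;
}

/* Memory reads if the pseudo is loaded once and kept in a hard
   register: a single load, provided the pseudo is used at all. */
int mem_reads_if_kept_in_reg(int n_uses)
{
    assert(n_uses >= 0);
    return n_uses > 0 ? 1 : 0;
}

/* Suggested heuristic: propagate the memory operand only when doing
   so does not add memory reads, i.e. when there is at most one use. */
bool should_propagate_mem(int n_uses)
{
    return mem_reads_if_propagated(n_uses) <= mem_reads_if_kept_in_reg(n_uses);
}
```

For r85 in the dump above, which has two uses (insns 11 and 12), should_propagate_mem(2) is false in this model, so the single load in insn 8 would be kept instead of being duplicated.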