https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93943
            Bug ID: 93943
           Summary: IRA/LRA happily rematerialize (un-CSEs) loads without
                    register pressure
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

long a[1024], b[512], c[512];

void foo ()
{
  for (int i = 0; i < 256; ++i)
    {
      b[2*i] = a[4*i];
      b[2*i+1] = a[4*i+2];
      c[2*i] = a[4*i+1];
      c[2*i+1] = a[4*i+3];
    }
}

at -O3 is vectorized with SSE2 V2DImode vectors doing two vector loads,
two shuffles and two vector stores.  But we then emit

.L2:
        movdqa  a(%rax,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        punpckhqdq      a+16(%rax,%rax), %xmm0
        punpcklqdq      a+16(%rax,%rax), %xmm1
        addq    $16, %rax
        movaps  %xmm1, b-16(%rax)
        movaps  %xmm0, c-16(%rax)
        cmpq    $4096, %rax
        jne     .L2

so the second vector load (a+16) is issued twice: we took advantage of the
memory-operand variant of the punpck instructions, enlarging the code and
using more load uops.
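
For reference, the dataflow described above (two vector loads, two shuffles,
two vector stores per iteration) can be sketched by hand with SSE2
intrinsics.  This is only an illustration, not what the vectorizer emits
internally: foo_sse2, b2, c2 and outputs_match are hypothetical names that
are not part of the testcase, long is assumed to be 64 bits (LP64), and
unaligned load/store intrinsics are used so the sketch does not depend on
the alignment of the arrays.  Writing each load once in the source does not
by itself stop the register allocator from folding it into a memory operand.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

long a[1024], b[512], c[512];
long b2[512], c2[512];  /* hypothetical: outputs of the hand-written variant */

/* Scalar loop from the report. */
void foo (void)
{
  for (int i = 0; i < 256; ++i)
    {
      b[2*i] = a[4*i];
      b[2*i+1] = a[4*i+2];
      c[2*i] = a[4*i+1];
      c[2*i+1] = a[4*i+3];
    }
}

/* Hypothetical hand-vectorized variant; assumes 64-bit long.  */
void foo_sse2 (void)
{
  for (int i = 0; i < 256; ++i)
    {
      /* Two vector loads: { a[4i], a[4i+1] } and { a[4i+2], a[4i+3] }.  */
      __m128i v0 = _mm_loadu_si128 ((const __m128i *) &a[4*i]);
      __m128i v1 = _mm_loadu_si128 ((const __m128i *) &a[4*i+2]);
      /* Two shuffles: low halves go to b, high halves go to c.  */
      __m128i lo = _mm_unpacklo_epi64 (v0, v1);  /* { a[4i],   a[4i+2] } */
      __m128i hi = _mm_unpackhi_epi64 (v0, v1);  /* { a[4i+1], a[4i+3] } */
      /* Two vector stores.  */
      _mm_storeu_si128 ((__m128i *) &b2[2*i], lo);
      _mm_storeu_si128 ((__m128i *) &c2[2*i], hi);
    }
}

/* Returns 1 if both variants produce the same b and c contents.  */
int outputs_match (void)
{
  for (int i = 0; i < 1024; ++i)
    a[i] = 3L * i + 1;
  foo ();
  foo_sse2 ();
  for (int i = 0; i < 512; ++i)
    if (b[i] != b2[i] || c[i] != c2[i])
      return 0;
  return 1;
}
```

With both inputs held in registers, each shuffle would use the
register-register form of punpcklqdq/punpckhqdq, so the a+16 vector is
loaded once per iteration instead of twice.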