https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93943

            Bug ID: 93943
           Summary: IRA/LRA happily rematerialize (un-CSEs) loads without
                    register pressure
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

long a[1024], b[512], c[512];

void foo ()
{
  for (int i = 0; i < 256; ++i)
    {
      b[2*i] = a[4*i];
      b[2*i+1] = a[4*i+2];
      c[2*i] = a[4*i+1];
      c[2*i+1] = a[4*i+3];
    }
}

At -O3 this loop is vectorized with SSE2 V2DImode vectors, using two vector
loads, two shuffles and two vector stores.  But we then emit

.L2:
        movdqa  a(%rax,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        punpckhqdq      a+16(%rax,%rax), %xmm0
        punpcklqdq      a+16(%rax,%rax), %xmm1
        addq    $16, %rax
        movaps  %xmm1, b-16(%rax)
        movaps  %xmm0, c-16(%rax)
        cmpq    $4096, %rax
        jne     .L2

so register allocation took advantage of the memory-operand variant of the
punpck instructions: the vector at a+16(%rax,%rax) is loaded twice instead
of once, enlarging the code and using more load uops.
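
For reference, the intended shape of the vectorized loop body (two vector
loads, two shuffles, two vector stores) can be sketched by hand with SSE2
intrinsics.  This is only an illustration, not the vectorizer's actual
output: foo_intrin, foo_scalar and foo_intrin_matches_scalar are
hypothetical names, the b2/c2 arrays exist only for the comparison, and
the code assumes an x86-64 LP64 target where long is 64 bits wide.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcmp */

long a[1024], b[512], c[512];
long b2[512], c2[512];  /* hypothetical: scalar reference results */

/* Hand-written sketch of the vectorized loop; assumes 64-bit long.  */
void foo_intrin (void)
{
  for (int i = 0; i < 256; ++i)
    {
      /* Two vector loads: {a[4i], a[4i+1]} and {a[4i+2], a[4i+3]}.  */
      __m128i lo = _mm_loadu_si128 ((const __m128i *) &a[4*i]);
      __m128i hi = _mm_loadu_si128 ((const __m128i *) &a[4*i+2]);
      /* Two shuffles, each using both loaded vectors
         (punpcklqdq / punpckhqdq).  */
      __m128i even = _mm_unpacklo_epi64 (lo, hi); /* {a[4i], a[4i+2]} */
      __m128i odd = _mm_unpackhi_epi64 (lo, hi);  /* {a[4i+1], a[4i+3]} */
      /* Two vector stores.  */
      _mm_storeu_si128 ((__m128i *) &b[2*i], even);
      _mm_storeu_si128 ((__m128i *) &c[2*i], odd);
    }
}

/* The scalar loop from the testcase, writing to b2/c2 instead.  */
void foo_scalar (void)
{
  for (int i = 0; i < 256; ++i)
    {
      b2[2*i] = a[4*i];
      b2[2*i+1] = a[4*i+2];
      c2[2*i] = a[4*i+1];
      c2[2*i+1] = a[4*i+3];
    }
}

/* Returns 1 if the intrinsic version matches the scalar loop.  */
int foo_intrin_matches_scalar (void)
{
  for (int i = 0; i < 1024; ++i)
    a[i] = 3L * i + 1;
  foo_intrin ();
  foo_scalar ();
  return memcmp (b, b2, sizeof b) == 0 && memcmp (c, c2, sizeof c) == 0;
}
```

Each loaded vector feeds both shuffles, so ideally each is loaded into a
register once and reused rather than being re-read from memory by the
second punpck.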
