https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718

            Bug ID: 117718
           Summary: Inefficient address computation for d-form vector
                    loads
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bergner at gcc dot gnu.org
  Target Milestone: ---

If we compile some simple test cases that return a value from a global vector
array, we fail to fold the low 16 bits of the address (the @toc@l part) into the
displacement of the lxv instruction (new in Power9). Instead, a separate addi
completes the address computation before the load, so the lxv displacement
never carries the @toc@l part.

bergner@c643n10lp1:~$ cat vectorlong.c
#include <altivec.h>

vector long var[16];

vector long
foo (void)
{
  return var[0];
}

vector long
bar (void)
{
  return var[1];
}
bergner@c643n10lp1:~$ gcc -S -O2 -mcpu=power9 vectorlong.c
bergner@c643n10lp1:~$ cat vectorlong.s
foo:
        [snip toc setup]
        addis 9,2,.LANCHOR0@toc@ha
        addi 9,9,.LANCHOR0@toc@l
        lxv 34,0(9)
        blr
bar:
        [snip toc setup]
        addis 9,2,.LANCHOR0@toc@ha
        addi 9,9,.LANCHOR0@toc@l
        lxv 34,16(9)
        blr

However, for an equivalent test case using scalars (integer or fp), we do fold
the low 16 bits of the address (@toc@l) into the load's displacement, reducing
the address computation plus load from three instructions to two:

bergner@c643n10lp1:~$ cat long.c
long var[16];

long
foo (void)
{
  return var[0];
}

long
bar (void)
{
  return var[1];
}
bergner@c643n10lp1:~$ gcc -S -O2 -mcpu=power9 long.c
bergner@c643n10lp1:~$ cat long.s
foo:
        [snip toc setup]
        addis 9,2,.LANCHOR0@toc@ha
        ld 3,.LANCHOR0@toc@l(9)
        blr
bar:
        [snip toc setup]
        addis 9,2,.LANCHOR0+8@toc@ha
        ld 3,.LANCHOR0+8@toc@l(9)
        blr
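
For reference, the addis/ld pair above works because @toc@ha and @toc@l split
the TOC-relative offset into a high-adjusted part and a sign-extended low
16-bit part. The following is a minimal sketch of that arithmetic, not GCC
code; the helper names are made up for illustration. It also includes the
extra constraint lxv has as a DQ-form instruction: its displacement must be a
multiple of 16.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only (not taken from GCC). */

/* Value contributed by @toc@l: the low 16 bits of the offset,
   sign-extended, as a D-form displacement is interpreted.  */
static int64_t
low_part (int64_t off)
{
  return ((off & 0xffff) ^ 0x8000) - 0x8000;
}

/* Value contributed by @toc@ha: the high part, adjusted so that
   (ha << 16) + low_part (off) reproduces the offset exactly.  */
static int64_t
high_adjusted_part (int64_t off)
{
  return (off - low_part (off)) / 65536;
}

/* lxv is DQ-form: the displacement is a signed 16-bit value whose
   low four bits must be zero.  */
static int
fits_lxv_displacement (int64_t disp)
{
  return disp >= -32768 && disp <= 32767 && (disp % 16) == 0;
}
```

For example, an offset of 0x18010 splits into a high-adjusted part of 2 and a
low part of -32752, which is a multiple of 16 and so could be used as an lxv
displacement. An offset of 0x12348 has a low part of 9032, which is not a
multiple of 16 and could not.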

We should perform the same optimization for vector loads/stores as we do for
scalar loads/stores.
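
The test cases above only exercise loads. A store variant might look like the
sketch below. It is a hypothetical addition, not part of the original report,
and its generated code has not been checked. It assumes GCC's generic
vector_size attribute as a portable stand-in for `vector long` from altivec.h,
and the function names are made up.

```c
/* Assumption: generic GCC vector type standing in for `vector long`.  */
typedef long v2di __attribute__ ((vector_size (16)));

v2di var[16];

/* Store to the first element: the displacement relative to var is 0.  */
void
store_first (v2di v)
{
  var[0] = v;
}

/* Store to the second element: the displacement relative to var is 16.  */
void
store_second (v2di v)
{
  var[1] = v;
}
```

Compiling this with the same options as above (-S -O2 -mcpu=power9) would show
whether the address computation for the store has the same shape as for the
loads.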
