https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67323

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
I note that the efficiency you gain comes only from a reduced number of load/store
instructions: one vld3 instead of six vldr (huh, apparently vld3 can load
16-byte vectors but vldr only 8-byte ones?).  I assume vld3 has no penalty
for the lane split itself, so the code-size reduction is always wanted.
Thus we'd want to always use a lane load/store even if the permutation is
pointless, as soon as we'd otherwise issue more than one SLP load, say for

void
t5 (int len, int * __restrict p, int * __restrict q)
{
  for (int i = 0; i < len; i+=8) {
      p[i] = q[i] * 2;
      p[i+1] = q[i+1] * 2;
      p[i+2] = q[i+2] * 2;
      p[i+3] = q[i+3] * 2;
      p[i+4] = q[i+4] * 2;
      p[i+5] = q[i+5] * 2;
      p[i+6] = q[i+6] * 2;
      p[i+7] = q[i+7] * 2;
  }
}

instead of

.L4:
        vldr    d18, [r2, #-16]
        vldr    d19, [r2, #-8]
        vldr    d16, [r2, #-32]
        vldr    d17, [r2, #-24]
        vshl.i32        q9, q9, #1
        vshl.i32        q8, q8, #1
        add     r3, r3, #1
        cmp     r0, r3
        vstr    d18, [r1, #-16]
        vstr    d19, [r1, #-8]
        vstr    d16, [r1, #-32]
        vstr    d17, [r1, #-24]
        add     r2, r2, #32
        add     r1, r1, #32
        bhi     .L4

use vld2.32 / vst2.32?  More generally, for SLP the implicit permute performed
by those instructions could be modeled properly (and the SLP chain
permuted accordingly).
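
For illustration, a hand-written sketch of what the lane-load/store variant of
the loop above might look like (not actual compiler output; register choices
and the post-increment addressing are assumptions):

.L4:
        vld2.32 {d16-d19}, [r2]!        @ deinterleave: q8 = even lanes, q9 = odd lanes
        vshl.i32        q8, q8, #1
        vshl.i32        q9, q9, #1
        add     r3, r3, #1
        cmp     r0, r3
        vst2.32 {d16-d19}, [r1]!        @ interleaving store restores the original element order
        bhi     .L4

Since the same shift is applied to every lane, the deinterleave on load and
the interleave on store cancel out, so the permute is indeed pointless but the
loop body shrinks from eight memory instructions to two.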
