[Bug target/116611] New: Inefficient mix of contiguous and load-lane vectorization due to missing permutes

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 05 Sep 2024 05:17:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611


            Bug ID: 116611
           Summary: Inefficient mix of contiguous and load-lane
                    vectorization due to missing permutes
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

It appears multiple times in the testsuite that on RISC-V we vectorize
a contiguous load both as such and using load-lanes because we cannot
permute.  For example gcc.dg/vect/slp-19a.c is

  for (i = 0; i < N; i++)
    {
      out[i*8] = in[i*8];
      out[i*8 + 1] = in[i*8 + 1];
      out[i*8 + 2] = in[i*8 + 2];
      out[i*8 + 3] = in[i*8 + 3];
      out[i*8 + 4] = in[i*8 + 4];
      out[i*8 + 5] = in[i*8 + 5];
      out[i*8 + 6] = in[i*8 + 6];
      out[i*8 + 7] = in[i*8 + 7];

      ia[i] = in[i*8 + 2];
    }

and re-loads in[i*8 + 2] using .MASK_LEN_LOAD_LANES.

When trying to use only SLP for this loop we do not consider using
load-lanes here because two SLP instances use the same load.

Instead we try to extract the single vector using an interleaving scheme
that mimics what the hardware would need to be able to do with load-lanes,
namely reduce the { 0 1 2 3 4 5 6 7 } vector in three steps via
{ 0 2 4 6 } and { 0 2 } to { 2 } (another possibility would be to use
{ 2 6 } in the second step).  But the constant permutes RISC-V can do
are overly restrictive.  As far as I can see, RISC-V can do arbitrary
permutes using the register gather instruction, and the only "problem" is
constructing the constant permutation vector, which for extract-even and
extract-odd should be able to use a simple series.  You'd have one
0 + 2*n and one VL/2 + 2*n to combine the even elements of two registers
into one full vector.
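
To illustrate the idea, here is a minimal scalar sketch, not RVV code and
not what the vectorizer emits.  It assumes a hypothetical fixed VL of 4
lanes, and the helper names make_series, permute2 and extract_lane2 are
made up for this example.  permute2 only emulates a generic two-input
permute whose indices select from the concatenation of both inputs; how
that maps onto the register gather instruction is left open.  The sketch
follows the { 2 6 } variant mentioned above: extract-even, then
extract-odd, then extract-even, each driven by a linear index series.

```c
#include <assert.h>

#define VL 4 /* assumed number of lanes per vector, for illustration only */

/* Build the linear series idx[n] = base + step * n.  */
void make_series (unsigned idx[VL], unsigned base, unsigned step)
{
  for (unsigned n = 0; n < VL; n++)
    idx[n] = base + step * n;
}

/* Emulated two-input permute: index k selects lane k of the
   concatenation of a and b.  dst must not alias a or b.  */
void permute2 (int dst[VL], const int a[VL], const int b[VL],
               const unsigned idx[VL])
{
  for (unsigned n = 0; n < VL; n++)
    {
      assert (idx[n] < 2 * VL);
      dst[n] = idx[n] < VL ? a[idx[n]] : b[idx[n] - VL];
    }
}

/* From 8 contiguous vectors (VL groups of 8 elements), compute
   dst[i] = in[i*8 + 2] in three halving steps.  */
void extract_lane2 (int dst[VL], const int in[8 * VL])
{
  unsigned even[VL], odd[VL];
  int s1[4][VL], s2[2][VL];

  make_series (even, 0, 2); /* 0 + 2*n: extract-even */
  make_series (odd, 1, 2);  /* 1 + 2*n: extract-odd */

  /* Step 1: 8 vectors -> 4, keeping group lanes { 0 2 4 6 }.  */
  for (int v = 0; v < 4; v++)
    permute2 (s1[v], in + (2 * v) * VL, in + (2 * v + 1) * VL, even);

  /* Step 2: 4 vectors -> 2, keeping group lanes { 2 6 }.  */
  for (int v = 0; v < 2; v++)
    permute2 (s2[v], s1[2 * v], s1[2 * v + 1], odd);

  /* Step 3: 2 vectors -> 1, keeping group lane { 2 }.  */
  permute2 (dst, s2[0], s2[1], even);
}
```

Calling extract_lane2 (out, in) on a buffer of 8 * VL ints leaves
out[i] equal to in[i*8 + 2], which is the value the ia[i] store in the
testcase needs, without loading in[] a second time.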
