https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

            Bug ID: 116573
           Summary: [15 Regression] Recent SLP work appears to generate
                    significantly worse code on RISC-V
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
  Target Milestone: ---

This change:

commit 9aaedfc4146c5e4b8412913a6ca4092a2731c35c (HEAD)
Author: Richard Biener <rguent...@suse.de>
Date:   Fri Jul 5 10:35:08 2024 +0200

    load and store-lanes with SLP

    The following is a prototype for how to represent load/store-lanes
    within SLP.  I've for now settled with having a single load node
    with multiple permute nodes acting as selection, one for each loaded lane
    and a single store node fed from all stored lanes.  For
[ ... ]

Is causing multiple scan failures in the testsuite.  Several are "don't care"
changes in code generation.  But one class seems to indicate a notable
regression in the quality of the generated code:

Before the change for the attached testcase we would generate:

.L3:
        vsetvli a5,a3,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        slli    a4,a5,1
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a4
        bne     a3,zero,.L3

Nothing really of note there.  Load up two values, then store them elsewhere
with a segmented store and the usual pointer updates.
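
For readers not familiar with RVV segmented stores, here is a minimal scalar C
sketch of what one vsseg2e8.v does for the "vl" elements selected by vsetvli:
the two loaded registers (v2 from a, v3 from b) are interleaved into the
destination.  The function and variable names are purely illustrative, not
taken from anything the compiler emits:

static void
seg2_store (unsigned char *dst, const unsigned char *v2_lanes,
            const unsigned char *v3_lanes, long vl)
{
  /* Even lanes come from v2, odd lanes from v3; the outer loop then
     advances dst by 2*vl bytes, as the slli/add pair above does.  */
  for (long i = 0; i < vl; ++i)
    {
      dst[2 * i]     = v2_lanes[i];
      dst[2 * i + 1] = v3_lanes[i];
    }
}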

After the change:

.L4:
        mv      a6,a3
        mv      a4,a3
        bleu    a3,a5,.L3
        csrr    a4,vlenb
.L3:
        vsetvli zero,a4,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a7
        bgtu    a6,a5,.L4

Ugh.  We've got a conditional branch in the middle of the loop, a CSR read and
a bit of silliness with those extra move instructions.  Not really sure what
went wrong, but it's reasonable to assume this code is less performant than
the original.
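
For what it's worth, one plausible reading of the new loop head is that the
active vector length is now being clamped by hand before the vsetvli.  A rough
C rendering of those extra instructions (my interpretation of the assembly,
not a compiler dump; pick_avl, step and vlenb are illustrative names):

static unsigned long
pick_avl (unsigned long remaining, unsigned long step, unsigned long vlenb)
{
  unsigned long avl = remaining;   /* mv a4,a3        */
  if (remaining > step)            /* bleu a3,a5,.L3  */
    avl = vlenb;                   /* csrr a4,vlenb   */
  return avl;                      /* fed to vsetvli  */
}

If that reading is right, the compare/branch/csrr sequence is redundant with
what vsetvli would do anyway when handed the remaining element count as its
AVL.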

Compile with "-O3 -ftree-vectorize -std=c99 -march=rv32gcv_zvfh -mabi=ilp32d
-mrvv-vector-bits=scalable -fno-vect-cost-model"



typedef unsigned char uint8_t;
typedef signed char int8_t;
#ifndef TYPE
#define TYPE uint8_t
#define ITYPE int8_t
#endif

void __attribute__ ((noinline, noclone))
g2 (TYPE *__restrict a, TYPE *__restrict b, TYPE *__restrict c, ITYPE n)
{
  for (ITYPE i = 0; i < n; ++i)
    {
      c[i * 2] = a[i];
      c[i * 2 + 1] = b[i];
    }
}
