https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573
Bug ID: 116573
Summary: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
Target Milestone: ---

This change:

commit 9aaedfc4146c5e4b8412913a6ca4092a2731c35c (HEAD)
Author: Richard Biener <rguent...@suse.de>
Date:   Fri Jul 5 10:35:08 2024 +0200

    load and store-lanes with SLP

    The following is a prototype for how to represent load/store-lanes
    within SLP.  I've for now settled with having a single load node
    with multiple permute nodes acting as selection, one for each loaded
    lane and a single store node fed from all stored lanes.  For

    [ ... ]

is causing multiple scan failures in the testsuite.  Several are "don't care"
changes in code generation.  But one class seems to indicate a notable
regression in the quality of the generated code.

Before the change, for the attached testcase we would generate:

.L3:
        vsetvli a5,a3,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        slli    a4,a5,1
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a4
        bne     a3,zero,.L3

Nothing really of note there.  Load up two values, then store them elsewhere
with a segmented store and the usual pointer updates.

After the change:

.L4:
        mv      a6,a3
        mv      a4,a3
        bleu    a3,a5,.L3
        csrr    a4,vlenb
.L3:
        vsetvli zero,a4,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a7
        bgtu    a6,a5,.L4

Ugh.  We've got a conditional branch in the middle of the loop, a CSR read and
a bit of silliness with those extra move instructions.  Not really sure what
went wrong, but it's a reasonable assumption that this code is less performant
than the original.

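To make the difference in loop control easier to see, here is a rough C model of the two loops. It is only a sketch under assumptions the excerpt does not confirm: that a5 in the "after" code holds the vlenb value (VLMAX for e8,m1), and that vsetvli returns min(remaining, VLMAX), which the RVV spec permits but does not require when the remaining count is between VLMAX and 2*VLMAX. The function names are made up for illustration. The model only counts loop trips and does not touch memory.

```c
#include <stddef.h>

/* "Before" loop: vsetvli a5,a3 picks the length for each iteration.
   Assumption: the hardware returns min(remaining, vlmax).  */
size_t
trips_vsetvli_avl (size_t n, size_t vlmax)
{
  size_t trips = 0, remaining = n;
  if (n == 0)
    return 0;                   /* the excerpt starts inside the loop */
  do
    {
      size_t vl = remaining < vlmax ? remaining : vlmax;  /* vsetvli a5,a3 */
      remaining -= vl;                                    /* sub a3,a3,a5 */
      trips++;
    }
  while (remaining != 0);                                 /* bne a3,zero,.L3 */
  return trips;
}

/* "After" loop: the minimum is computed with a branch and a CSR read,
   and the counter always steps by vlmax.  Assumption: a5 == vlenb.  */
size_t
trips_min_branch (size_t n, size_t vlmax)
{
  size_t trips = 0, remaining = n;
  if (n == 0)
    return 0;
  for (;;)
    {
      size_t prev = remaining;  /* mv a6,a3 */
      size_t vl = remaining;    /* mv a4,a3 */
      if (remaining > vlmax)    /* bleu a3,a5,.L3 skips the next line */
        vl = vlmax;             /* csrr a4,vlenb */
      (void) vl;                /* vsetvli zero,a4,... consumes it */
      remaining -= vlmax;       /* sub a3,a3,a5; may wrap on the last trip */
      trips++;
      if (!(prev > vlmax))      /* bgtu a6,a5,.L4 */
        break;
    }
  return trips;
}
```

Under these assumptions both forms run the same number of iterations. The second one spends extra scalar instructions and an in-loop branch to compute a value that vsetvli produces directly in the first.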
Compile with "-O3 -ftree-vectorize -std=c99 -march=rv32gcv_zvfh -mabi=ilp32d
-mrvv-vector-bits=scalable -fno-vect-cost-model"

typedef unsigned char uint8_t;
typedef signed char int8_t;

#ifndef TYPE
#define TYPE uint8_t
#define ITYPE int8_t
#endif

void __attribute__ ((noinline, noclone))
g2 (TYPE *__restrict a, TYPE *__restrict b, TYPE *__restrict c, ITYPE n)
{
  for (ITYPE i = 0; i < n; ++i)
    {
      c[i * 2] = a[i];
      c[i * 2 + 1] = b[i];
    }
}
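For anyone wanting to confirm that both code sequences still compute the same result, a small self-checking wrapper around the testcase might look like the following. It is a sketch, not part of the original report: the helper g2_interleaves, the buffer sizes and the 0xff sentinel are made up for illustration, and <stdint.h> replaces the hand-written typedefs.

```c
#include <stdint.h>

#define TYPE uint8_t
#define ITYPE int8_t

/* Function under test, copied from the report.  */
void __attribute__ ((noinline, noclone))
g2 (TYPE *__restrict a, TYPE *__restrict b, TYPE *__restrict c, ITYPE n)
{
  for (ITYPE i = 0; i < n; ++i)
    {
      c[i * 2] = a[i];
      c[i * 2 + 1] = b[i];
    }
}

/* Hypothetical helper: returns 1 if g2 interleaved the first n elements
   of a and b into c and left the rest of c untouched, 0 otherwise.
   n is limited to 127 because ITYPE is a signed 8-bit type.  */
int
g2_interleaves (ITYPE n)
{
  TYPE a[127], b[127], c[254];

  for (int i = 0; i < 127; ++i)
    {
      a[i] = (TYPE) i;          /* 0..126, never the sentinel */
      b[i] = (TYPE) (200 - i);  /* 200..74, never the sentinel */
    }
  for (int i = 0; i < 254; ++i)
    c[i] = 0xff;                /* sentinel marks untouched output */

  g2 (a, b, c, n);

  for (int i = 0; i < n; ++i)
    if (c[2 * i] != a[i] || c[2 * i + 1] != b[i])
      return 0;
  for (int i = 2 * (n > 0 ? n : 0); i < 254; ++i)
    if (c[i] != 0xff)
      return 0;
  return 1;
}
```

Calling g2_interleaves from a main with a few lengths (for example 0, 1, 17 and 127) checks both a partial final vector and the maximum trip count. It says nothing about code quality, which still has to be judged from the assembly.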