https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611
Bug ID: 116611
Summary: Inefficient mix of contiguous and load-lane vectorization due to missing permutes
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

It appears multiple times in the testsuite that on RISC-V we vectorize a contiguous load both as such and using load-lanes because we cannot permute. For example, gcc.dg/vect/slp-19a.c is

  for (i = 0; i < N; i++)
    {
      out[i*8] = in[i*8];
      out[i*8 + 1] = in[i*8 + 1];
      out[i*8 + 2] = in[i*8 + 2];
      out[i*8 + 3] = in[i*8 + 3];
      out[i*8 + 4] = in[i*8 + 4];
      out[i*8 + 5] = in[i*8 + 5];
      out[i*8 + 6] = in[i*8 + 6];
      out[i*8 + 7] = in[i*8 + 7];
      ia[i] = in[i*8 + 2];
    }

and re-loads in[i*8 + 2] using .MASK_LEN_LOAD_LANES.

When trying to use only SLP for this loop, we do not consider using load-lanes here because two SLP instances are using the same load. Instead we try to extract the single vector using an interleaving scheme that mimics what the hardware would need to be able to do with load-lanes, namely reduce the { 0 1 2 3 4 5 6 7 } vector in three steps via { 0 2 4 6 } and { 0 2 } to { 2 } (another possibility would be to use { 2 6 } in the second step).

But the constant permutes RISC-V can do are overly restrictive - as far as I can see, RISC-V can do arbitrary permutes using the register gather instruction, and the only "problem" is constructing the constant permutation vector, which for extract-even and extract-odd should be able to use a simple series. You'd have one 0 + 2*n and one VL/2 + 2*n to combine the even elements of two registers into one full element.
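To make the interleaving scheme concrete, here is a minimal scalar sketch in C of the { 2 6 } variant of the three-step reduction mentioned above: extract-even, then extract-odd, then extract-even. It is not vectorizer or RISC-V backend code. The helper names (gather, make_series, extract_half, extract_lane2_of_8) and the MAX_LEN bound are made up for illustration, and the gather is modelled as a plain indexed copy over one flat buffer rather than over actual vector registers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEN 64  /* illustrative bound on the buffer size, not a hardware limit */

/* Scalar model of a gather-style permute: dst[n] = src[idx[n]]. */
static void gather(int *dst, const int *src, const size_t *idx, size_t len)
{
    for (size_t n = 0; n < len; n++)
        dst[n] = src[idx[n]];
}

/* Build the index series base + 2*n.
   base == 0 selects the even elements, base == 1 the odd ones. */
static void make_series(size_t *idx, size_t base, size_t len)
{
    for (size_t n = 0; n < len; n++)
        idx[n] = base + 2 * n;
}

/* Keep every second element of buf[0..len), starting at `base`.
   The result is written back to the front of buf; returns the new length. */
static size_t extract_half(int *buf, size_t len, size_t base)
{
    size_t idx[MAX_LEN / 2];
    int tmp[MAX_LEN / 2];
    size_t half = len / 2;

    assert(len <= MAX_LEN);
    make_series(idx, base, half);
    gather(tmp, buf, idx, half);
    for (size_t n = 0; n < half; n++)
        buf[n] = tmp[n];
    return half;
}

/* Reduce groups of 8 contiguous elements to element 2 of each group:
   { 0 1 2 3 4 5 6 7 } -> { 0 2 4 6 } -> { 2 6 } -> { 2 }. */
size_t extract_lane2_of_8(int *buf, size_t len)
{
    len = extract_half(buf, len, 0);  /* extract-even: { 0 2 4 6 } */
    len = extract_half(buf, len, 1);  /* extract-odd:  { 2 6 }     */
    len = extract_half(buf, len, 0);  /* extract-even: { 2 }       */
    return len;
}
```

For a 32-element buffer filled with buf[i] = 100 + i, extract_lane2_of_8 (buf, 32) returns 4 and leaves 102, 110, 118, 126 in the first four slots, i.e. in[i*8 + 2] for i = 0..3. The only constant data each step needs is the base + 2*n series, which is the point made above about building the permutation vector.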