https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373
--- Comment #5 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> The analysis of SPEC2017's 510.parest_r shows that the topmost basic block
> is a tight loop (see attached reducer). Once vectorised, by unrolling and
> mutualising 4 instructions, AArch64 achieves a 22% reduction in dynamic
> instruction count (DIC) within the block. However, RISC-V still vectorises
> but misses the opportunity to further unroll.
>
> The vectoriser dump for RISC-V shows the analysis fails for the natural mode
> RVVM1DF (and chooses RVVMF8QI instead) because it requires a "conversion not
> supported by target". It turns out this is caused by two missing standard
> named patterns: vec_unpacku_hi and vec_unpacku_lo.

Why do you consider RVVM1DF a "natural" mode and not RVVMF8QI?  As far as I
can see we do vectorize at full vector size:

  vsetvli a5,a4,e64,m1,tu,ma

(Tail undisturbed is unexpected as there is no masked operation in the loop
body, but that looks like a separate issue.)  Apart from that I don't see too
many redundant instructions.

We deliberately don't define unpack_hi and unpack_lo because we don't have
directly matching instructions and because we prefer widening/narrowing with
the same number of elements rather than the same vector size.

I suppose much of the icount difference is due to aarch64's complex addressing
modes.  All of the loads here include an offset and a shift, while we need to
do that explicitly.  If we had similar addressing modes our icount would
surely be reduced by >30%.

Regarding unrolling: We cannot/do not unroll those length-controlled VLA
loops.  If we wanted unrolling we would need a VLS-like loop.  Could you
detail what aarch64 gains by unrolling, i.e. which instructions get elided?
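
For readers less familiar with the two widening styles contrasted in this comment, the following is a minimal scalar C sketch, not GCC or RVV code. The names `unpacku_lo`, `unpacku_hi`, `zext_same_count` and the width `N` are hypothetical and chosen only for illustration. It assumes that "lo" means the lower-indexed half of the vector, as on a little-endian target. The first two functions model widening at the same vector size, where one narrow vector yields two wide vectors of half as many elements each. The third models widening with the same number of elements, where the result simply occupies twice the storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative element count of one narrow "vector"; hypothetical,
   not the size of any real target register. */
enum { N = 4 };

/* Same-vector-size widening, low half: zero-extend elements
   0 .. N/2-1 of the narrow vector into N/2 wide elements.
   Assumption: "lo" is the lower-indexed half (little-endian view). */
void unpacku_lo(const uint32_t src[N], uint64_t dst[N / 2])
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < N / 2; i++)
        dst[i] = src[i];            /* uint32_t -> uint64_t zero-extends */
}

/* Same-vector-size widening, high half: zero-extend elements
   N/2 .. N-1 of the narrow vector into N/2 wide elements. */
void unpacku_hi(const uint32_t src[N], uint64_t dst[N / 2])
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < N / 2; i++)
        dst[i] = src[N / 2 + i];
}

/* Same-element-count widening: n narrow elements become n wide
   elements in a single step; the destination needs twice the bytes. */
void zext_same_count(const uint32_t *src, uint64_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both styles produce the same values. The difference is that the first needs two operations and two result vectors per narrow input vector, while the second needs one operation whose result spans a larger register group. If I read the mode names correctly, this is also why RVVMF8QI pairs with RVVM1DF: both hold the same number of elements.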