https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
We seem to decide that a slightly more expensive loop (one instruction more)
without an epilogue is better than a loop with an epilogue.  This looks
intentional in the vectorizer cost estimation and is not specific to our lack
of a costing model.

Hmm..

The main loops are (VLA):

.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

vs (VLS):

.L4:
        vle32.v v1,0(a5)
        vle32.v v2,0(sp)
        addi    a5,a5,16
        vadd.vv v1,v2,v1
        vse32.v v1,0(sp)
        bne     a4,a5,.L4

This is doubly weird because of the spill of the accumulator.  We shouldn't be
generating this sequence but even if so, it should be more expensive.  This can
be achieved e.g. by the following example vectorizer cost function:

static int
riscv_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                  tree vectype,
                                  int misalign ATTRIBUTE_UNUSED)
{
  unsigned elements;

  switch (type_of_cost)
    {
    case scalar_stmt:
    case scalar_load:
    case scalar_store:
    case vector_stmt:
    case vector_gather_load:
    case vector_scatter_store:
    case vec_to_scalar:
    case scalar_to_vec:
    case cond_branch_not_taken:
    case vec_perm:
    case vec_promote_demote:
    case unaligned_load:
    case unaligned_store:
      return 1;

    case vector_load:
    case vector_store:
      return 3;

    case cond_branch_taken:
      return 3;

    case vec_construct:
      elements = estimated_poly_value (TYPE_VECTOR_SUBPARTS (vectype));
      return elements / 2 + 1;

    default:
      gcc_unreachable ();
    }
}

For a proper loop like

        vle32.v v2,0(sp)
.L4:
        vle32.v v1,0(a5)
        addi    a5,a5,16
        vadd.vv v1,v2,v1
        bne     a4,a5,.L4
        vse32.v v1,0(sp)

I'm not so sure anymore.  For large n this could be preferable depending on the
vectorization factor and other things.
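
As a rough illustration of how the example weights above would separate the three loop bodies, here is a standalone sketch. It is not how the vectorizer actually accumulates costs (that happens on GIMPLE statements, before register allocation introduces any spill). The enum, the helper names and the mapping of each assembly instruction to a cost kind are all assumptions made for this sketch, and the scalar address arithmetic and vsetvli are ignored.

```c
#include <assert.h>

/* Hypothetical stand-in for GCC's vect_cost_for_stmt; only the kinds
   needed for the three loop bodies are listed.  */
enum sketch_cost_kind
{
  SK_VECTOR_STMT,
  SK_VECTOR_LOAD,
  SK_VECTOR_STORE,
  SK_COND_BRANCH_TAKEN
};

/* Same weights as the example cost function in the comment.  */
static int
sketch_cost (enum sketch_cost_kind kind)
{
  switch (kind)
    {
    case SK_VECTOR_STMT:
      return 1;
    case SK_VECTOR_LOAD:
    case SK_VECTOR_STORE:
      return 3;
    case SK_COND_BRANCH_TAKEN:
      return 3;
    }
  assert (!"unhandled cost kind");
  return 0;
}

/* Sum the weights of one loop body given as an array of cost kinds.  */
static int
sketch_body_cost (const enum sketch_cost_kind *body, int n)
{
  int total = 0;
  for (int i = 0; i < n; i++)
    total += sketch_cost (body[i]);
  return total;
}

/* VLA loop (.L3): vle32.v, vadd.vv, bne.  */
static const enum sketch_cost_kind vla_body[] =
  { SK_VECTOR_LOAD, SK_VECTOR_STMT, SK_COND_BRANCH_TAKEN };

/* VLS loop with the accumulator spilled (.L4):
   vle32.v, vle32.v (reload), vadd.vv, vse32.v (spill), bne.  */
static const enum sketch_cost_kind vls_spill_body[] =
  { SK_VECTOR_LOAD, SK_VECTOR_LOAD, SK_VECTOR_STMT,
    SK_VECTOR_STORE, SK_COND_BRANCH_TAKEN };

/* VLS loop with the accumulator load/store hoisted out of the loop:
   vle32.v, vadd.vv, bne.  */
static const enum sketch_cost_kind vls_hoisted_body[] =
  { SK_VECTOR_LOAD, SK_VECTOR_STMT, SK_COND_BRANCH_TAKEN };
```

Under these assumptions the spilling VLS body sums to 13 while the VLA body and the hoisted VLS body both sum to 7, so the weights penalize the spill but do not by themselves decide between the VLA loop and the proper VLS loop.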