https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
The vectorizer seems to decide that a slightly more expensive loop (one more
instruction) without an epilogue is better than a cheaper loop that needs an
epilogue.  This looks intentional in the vectorizer cost estimation and is not
specific to our lack of a costing model.  Hmm..

The main loops are (VLA):
.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

vs (VLS):
.L4:
        vle32.v v1,0(a5)
        vle32.v v2,0(sp)
        addi    a5,a5,16
        vadd.vv v1,v2,v1
        vse32.v v1,0(sp)
        bne     a4,a5,.L4

This is doubly weird because the VLS loop spills the accumulator: it is
reloaded from and stored back to the stack on every iteration.  We shouldn't be
generating this sequence at all, but even if we do, it should be costed as more
expensive.  One way to achieve that is a vectorizer cost function along these
lines:

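/* Example only: vector loads/stores and taken branches cost 3, vec_construct
   scales with the element count, everything else costs 1.  */
static int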
riscv_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                 tree vectype,
                                 int misalign ATTRIBUTE_UNUSED)
{
  unsigned elements;

  switch (type_of_cost)
    {
      case scalar_stmt:
      case scalar_load:
      case scalar_store:
      case vector_stmt:
      case vector_gather_load:
      case vector_scatter_store:
      case vec_to_scalar:
      case scalar_to_vec:
      case cond_branch_not_taken:
      case vec_perm:
      case vec_promote_demote:
      case unaligned_load:
      case unaligned_store:
        return 1;

      case vector_load:
      case vector_store:
        return 3;

      case cond_branch_taken:
        return 3;

      case vec_construct:
        elements = estimated_poly_value (TYPE_VECTOR_SUBPARTS (vectype));
        return elements / 2 + 1;

      default:
        gcc_unreachable ();
    }
}
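
To make the effect of these numbers concrete, here is a rough per-iteration
tally of the two main loops above under the example costs.  This is only a
sketch and not how the vectorizer actually accumulates costs: it assumes every
scalar instruction (including vsetvli) counts as one scalar_stmt and every bne
as cond_branch_taken, and the names stmt_kind, stmt_cost and loop_body_cost are
made up for the illustration.

```c
#include <stddef.h>

/* Hypothetical mirror of the statement kinds the two loops need.  */
enum stmt_kind
{
  SCALAR_STMT,
  VECTOR_STMT,
  VECTOR_LOAD,
  VECTOR_STORE,
  COND_BRANCH_TAKEN
};

/* Same numbers as the example cost function: 1 for plain statements,
   3 for vector memory accesses and taken branches.  */
int
stmt_cost (enum stmt_kind kind)
{
  switch (kind)
    {
    case SCALAR_STMT:
    case VECTOR_STMT:
      return 1;
    case VECTOR_LOAD:
    case VECTOR_STORE:
    case COND_BRANCH_TAKEN:
      return 3;
    }
  return 0;
}

/* Sum the cost of one loop iteration.  */
int
loop_body_cost (const enum stmt_kind *body, size_t len)
{
  int cost = 0;
  for (size_t i = 0; i < len; i++)
    cost += stmt_cost (body[i]);
  return cost;
}

/* VLA loop .L3: vsetvli, slli, sub, vle32.v, add, vadd.vv, bne.
   Assumption: vsetvli is counted as an ordinary scalar statement.  */
const enum stmt_kind vla_body[] = {
  SCALAR_STMT, SCALAR_STMT, SCALAR_STMT, VECTOR_LOAD,
  SCALAR_STMT, VECTOR_STMT, COND_BRANCH_TAKEN
};

/* VLS loop .L4 with the spilled accumulator:
   vle32.v, vle32.v, addi, vadd.vv, vse32.v, bne.  */
const enum stmt_kind vls_spill_body[] = {
  VECTOR_LOAD, VECTOR_LOAD, SCALAR_STMT,
  VECTOR_STMT, VECTOR_STORE, COND_BRANCH_TAKEN
};
```

Under these assumptions the VLA body comes out at 11 and the spilling VLS body
at 14, so the extra load/store pair outweighs the VLA loop's additional
instruction.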

For a proper loop that keeps the accumulator in a register, like
        vle32.v v2,0(sp)
.L4:
        vle32.v v1,0(a5)
        addi    a5,a5,16
        vadd.vv v2,v2,v1
        bne     a4,a5,.L4
        vse32.v v2,0(sp)
I'm not so sure anymore.  For large n this could be preferable to the VLA loop,
depending on the vectorization factor and other things.
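
As a back-of-the-envelope illustration of that dependence on n, the sketch
below compares total costs using the example cost numbers.  Everything here is
an assumption made for the illustration: each scalar instruction costs 1, the
VLS variant needs a scalar epilogue of n % vf iterations at a guessed cost of 5
each (scalar_load + scalar_stmt + cond_branch_taken), the VLA loop handles vl
elements per iteration, and the function names are hypothetical.

```c
/* Per-iteration and setup costs derived from the example cost function.  */
enum
{
  /* vsetvli, slli, sub, add (4 x 1) + vle32.v (3) + vadd.vv (1) + bne (3).  */
  VLA_ITER_COST = 11,
  /* vle32.v (3) + addi (1) + vadd.vv (1) + bne (3).  */
  VLS_ITER_COST = 8,
  /* Accumulator load before and store after the loop (3 + 3).  */
  VLS_SETUP_COST = 6,
  /* Assumed scalar epilogue iteration: load (1) + add (1) + branch (3).  */
  SCALAR_ITER_COST = 5
};

/* VLA loop: no epilogue, the last iteration simply handles fewer elements.  */
long
vla_total_cost (long n, long vl)
{
  return (n + vl - 1) / vl * VLA_ITER_COST;
}

/* VLS loop with hoisted accumulator plus a scalar epilogue for the tail.  */
long
vls_total_cost (long n, long vf)
{
  return VLS_SETUP_COST
	 + n / vf * VLS_ITER_COST
	 + n % vf * SCALAR_ITER_COST;
}
```

With vl = vf = 4 (a 16-byte vector of 32-bit elements, as the addi by 16
suggests for VLS; for VLA this assumes VLEN = 128), n = 3 gives 11 for VLA
against 21 for VLS, while n = 1024 gives 2816 against 2054.  In this simplified
model the fixed setup and epilogue costs dominate for small n and the cheaper
loop body wins for large n.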
