https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
To my understanding, "reduction latency" is the minimum number of cycles needed
to do the reduction calculation for one iteration of the loop.  It is calculated
from the extra instruction issue info in the new AArch64 backend cost models.

Usually, the reduction latency of the vectorized loop should be smaller than
that of the scalar loop.  If the vectorized loop's latency is larger, the cost
model assumes vectorization may not be beneficial, so it scales the vect-body
cost by vect_reduct_latency/scalar_reduct_latency, as happens in the above
case.
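
A minimal sketch of the scaling rule as I describe it above, not the actual
backend code.  The function name and the use of plain division are my
assumptions; the real cost model may round or represent costs differently.

```python
def scale_body_cost(body_cost, scalar_reduct_latency, vect_reduct_latency):
    """Hypothetical helper: penalize the vector body cost when the
    vectorized reduction latency exceeds the scalar one."""
    if vect_reduct_latency <= scalar_reduct_latency:
        # Vector reduction is no slower than scalar: leave the cost alone.
        return body_cost
    # Otherwise scale by vect_reduct_latency / scalar_reduct_latency.
    return body_cost * vect_reduct_latency / scalar_reduct_latency
```

With made-up inputs, scale_body_cost(10, 3, 6) doubles the cost, while
scale_body_cost(10, 6, 3) leaves it unchanged.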

For the above case, the cost model thinks the scalar loop needs 4 cycles
(2*VF=4) to calculate "results.m += rhs", while the vectorized loop needs 8
cycles (2*count=8).  As a result, the vect-body cost is doubled from its
original value of 51 to 102.  That does not seem right for the vectorized loop,
which should only need 2 cycles to calculate the SIMD version of
"results.m += rhs".
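
The arithmetic for this case can be reproduced as follows.  VF = 2 and
count = 4 are inferred from "2*VF=4" and "2*count=8"; the variable names are
only illustrative and are not taken from the backend.

```python
OP_LATENCY = 2  # cycles for one "+=" step, the factor 2 in both formulas
VF = 2          # inferred from 2*VF = 4
COUNT = 4       # inferred from 2*count = 8

scalar_latency = OP_LATENCY * VF     # what the model assumes for the scalar loop
vect_latency = OP_LATENCY * COUNT    # what the model assumes for the vector loop

# Body cost scaled by vect_latency / scalar_latency (exact here, no rounding).
scaled_cost = 51 * vect_latency // scalar_latency

# Latency I would expect for the SIMD "results.m += rhs".
expected_vect_latency = OP_LATENCY

print(scalar_latency, vect_latency, scaled_cost)
```

If the vector latency were taken as 2 cycles instead of 8, it would be below
the scalar latency of 4 and the body cost of 51 would not be scaled up.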
