https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
To my understanding, "reduction latency" is the minimum number of cycles needed to perform the reduction calculation for one iteration of the loop. It is calculated from the extra instruction issue information in the new cost models of the AArch64 backend.

Usually, the reduction latency of the vectorized loop should be smaller than that of the scalar loop. If the latency of the vectorized loop is larger, the cost model assumes vectorization may not be beneficial, so it scales the vect-body cost by vect_reduct_latency/scalar_reduct_latency, as in the above case.

For the above case, the cost model thinks the scalar loop needs 4 cycles (2*VF=4) to calculate "results.m += rhs", while the vectorized loop needs 8 cycles (2*count=8). As a result, the vect-body cost is doubled from its original value of 51 to 102. That does not seem right for the vectorized loop, which should need only 2 cycles to calculate the SIMD version of "results.m += rhs".
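
To make the arithmetic concrete, here is a small Python sketch of the scaling described in this comment. It is not the GCC implementation: the function name `scale_body_cost`, the rounding behaviour, and the reading of "2" as a per-operation latency (with VF = 2 and count = 4 inferred from "2*VF=4" and "2*count=8") are assumptions made for illustration only.

```python
from fractions import Fraction


def scale_body_cost(body_cost, scalar_reduct_latency, vect_reduct_latency):
    """Hypothetical model of the scaling described in the comment.

    If the vector reduction latency exceeds the scalar one, scale the
    body cost by vect_reduct_latency / scalar_reduct_latency; otherwise
    leave it unchanged. Truncating to int is an assumption.
    """
    if vect_reduct_latency <= scalar_reduct_latency:
        return body_cost
    return int(body_cost * Fraction(vect_reduct_latency, scalar_reduct_latency))


# Values inferred from the comment (assumptions, see lead-in).
op_latency = 2   # cycles for one "results.m += rhs" operation
vf = 2           # vectorization factor, from "2*VF=4"
count = 4        # from "2*count=8"

scalar_latency = op_latency * vf            # 4 cycles
reported_vect_latency = op_latency * count  # 8 cycles, as currently computed
expected_vect_latency = op_latency          # 2 cycles, as the comment argues

reported_cost = scale_body_cost(51, scalar_latency, reported_vect_latency)
expected_cost = scale_body_cost(51, scalar_latency, expected_vect_latency)

print(reported_cost)  # 102: the doubled vect-body cost from the dump
print(expected_cost)  # 51: no penalty if the vector latency were 2 cycles
```

Under these assumptions, the penalty comes entirely from the vector latency being computed as 8 cycles; with the 2 cycles argued for above, the vect-body cost would stay at 51.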