https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625
--- Comment #6 from Hao Liu <hliu at amperecomputing dot com> ---
Thanks for the confirmation about the reduction latency. I'll create a simple
patch to fix this.
> Discounting the loads, we do have 15 general operations.
That's true, and there are indeed 8 general operations for the scalar loop.
Since count_ops() is accurate, it seems the vector body cost may be too large
(Vector inside of loop cost: 51):
*k_48 4 times vec_perm costs 12 in body
*k_48 1 times unaligned_load (misalign -1) costs 4 in body
_5->m1 1 times vec_perm costs 3 in body
_5->m4 1 times unaligned_load (misalign -1) costs 4 in body
(int) _24 2 times vec_promote_demote costs 4 in body
(double) _25 4 times vec_promote_demote costs 8 in body
_2 * _26 4 times vector_stmt costs 8 in body
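
As a quick cross-check of the entries quoted above, here is a minimal sketch
that tallies them by kind. It is not part of GCC; the regular expression and
the helper name tally() are assumptions made for illustration, and the input
is only the seven lines quoted in this comment. Those lines sum to 43, so the
reported 51 presumably includes entries not quoted here. Discounting the
loads, the counts add up to the 15 general operations mentioned above.

```python
import re

# Cost lines copied verbatim from the dump quoted in this comment.
DUMP = """\
*k_48 4 times vec_perm costs 12 in body
*k_48 1 times unaligned_load (misalign -1) costs 4 in body
_5->m1 1 times vec_perm costs 3 in body
_5->m4 1 times unaligned_load (misalign -1) costs 4 in body
(int) _24 2 times vec_promote_demote costs 4 in body
(double) _25 4 times vec_promote_demote costs 8 in body
_2 * _26 4 times vector_stmt costs 8 in body
"""

# Assumed line shape (inferred from the lines above, not from GCC sources):
#   <stmt> <count> times <kind> [(misalign N)] costs <cost> in body
LINE_RE = re.compile(
    r"^(?P<stmt>.+?) (?P<count>\d+) times (?P<kind>\w+)"
    r"(?: \(misalign -?\d+\))? costs (?P<cost>\d+) in body$"
)


def tally(dump):
    """Return {kind: (total_count, total_cost)} for every line that matches."""
    by_kind = {}
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a cost line
        count, cost = by_kind.get(m.group("kind"), (0, 0))
        by_kind[m.group("kind")] = (
            count + int(m.group("count")),
            cost + int(m.group("cost")),
        )
    return by_kind


by_kind = tally(DUMP)
total_cost = sum(cost for _, cost in by_kind.values())
# Operations other than loads, i.e. the "general operations" discussed above.
general_ops = sum(
    count for kind, (count, _) in by_kind.items() if kind != "unaligned_load"
)

for kind, (count, cost) in sorted(by_kind.items()):
    print(f"{kind}: {count} ops, cost {cost}")
print(f"listed cost: {total_cost}, general ops: {general_ops}")
```

The per-kind view shows that vec_perm (cost 15) and vec_promote_demote
(cost 12) account for most of the listed body cost.
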
If the vector body cost were small enough, SLP would still be profitable even
after the vect-body cost is increased according to the issue info. I'm not
very familiar with this part, and a change here may affect all aarch64
targets, so I think it would be hard for me to fix. It would be great if you
could look at how to fix this.