https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #11 from ktkachov at gcc dot gnu.org ---
Thank you all for the input.

Just to add a bit of data.
I've instrumented 510.parest_r to count the number of loop iterations, to get a
feel for how much of the loop's execution is spent in the actual unrolled part
rather than the prologue/peeled part. Overall, the hot function itself is
entered 290M times. The distribution of loop iteration counts is:

Frequency iter:
92438870  36
87028560  54
20404571  24
17312960  62
14237184  72
13403904  108
7574437   102
7574420   70
5564881   40
4328249   64
4328240   46
3142656   48
2666496   124
1248176   8
1236641   16
1166592   204
1166592   140
1134392   4
 857088   80
 666624   92
 666624   128
 618320   30
 613056   1
 234464   2
 190464   32
  95232   60
  84476   20
  48272   10
   6896   5

So the two most common iteration counts are 36 and 54. For an 8x unrolled loop
that's 36 = 4*8 + 4 and 54 = 6*8 + 6, i.e. 4 and 6 iterations spent in the
prologue, with 4 and 6 trips around the 8x unrolled loop respectively.
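
To make the arithmetic explicit, here is a minimal sketch of how an iteration
count splits between the prologue and the unrolled body. The struct and the
function name split_iterations are hypothetical, introduced only for
illustration; they are not part of the instrumentation described above.

```c
/* Hypothetical helper: how n iterations divide between a peeled
   prologue and an unrolled body for a given unroll factor.  */
struct iter_split
{
  unsigned prologue;        /* iterations run before the unrolled loop */
  unsigned unrolled_trips;  /* trips around the unrolled loop */
};

static struct iter_split
split_iterations (unsigned n, unsigned factor)
{
  struct iter_split s;
  s.prologue = n % factor;        /* leftover iterations */
  s.unrolled_trips = n / factor;  /* each trip covers 'factor' iterations */
  return s;
}
```

With a factor of 8, split_iterations (36, 8) gives a prologue of 4 and 4
unrolled trips, and split_iterations (54, 8) gives 6 and 6, matching the
figures above.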

As an experiment I hacked the AArch64 assembly of the function generated with
-funroll-loops to replace the peeled prologue version with a simple
non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.

So beyond the vectorisation point Richard S. made above, maybe it's worth
considering replacing the peeled prologue with a simple loop instead?
Or at least add that as a distinct unrolling strategy and work to come up with
an analysis that would allow us to choose one over the other?
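
For illustration, the two strategies can be sketched at the C source level as
follows. This is only an approximation under stated assumptions: the integer
sum kernel and both function names are hypothetical stand-ins (not the
parest hot loop), the unroll factor is fixed at 8, and the switch
fall-through merely imitates straight-line peeled copies rather than
reproducing what the RTL unroller actually emits.

```c
#include <stddef.h>

/* Strategy A (sketch): remainder handled by straight-line peeled
   copies, entered at the right point via switch fall-through.  */
long
sum_peeled_prologue (const long *a, size_t n)
{
  long s = 0;
  size_t i = 0;

  switch (n % 8)
    {
    case 7: s += a[i++]; /* fall through */
    case 6: s += a[i++]; /* fall through */
    case 5: s += a[i++]; /* fall through */
    case 4: s += a[i++]; /* fall through */
    case 3: s += a[i++]; /* fall through */
    case 2: s += a[i++]; /* fall through */
    case 1: s += a[i++]; /* fall through */
    default: break;
    }

  /* The remaining n - i iterations are a multiple of 8.  */
  for (; i < n; i += 8)
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
         + a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
  return s;
}

/* Strategy B (sketch): remainder handled by a simple non-unrolled
   loop, followed by the same 8x unrolled body.  */
long
sum_simple_prologue (const long *a, size_t n)
{
  long s = 0;
  size_t i = 0;
  size_t rem = n % 8;

  for (; i < rem; i++)
    s += a[i];

  for (; i < n; i += 8)
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
         + a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
  return s;
}
```

Both functions compute the same result; they differ only in the code shape of
the remainder handling, which is the choice a cost analysis would have to make.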
