Hi all,

in the course of doing some benchmarks on ARM with -Os, I noticed that
some list traversal code has become significantly slower since gcc 5.3
when the instruction caches are cold.

Reproducer with relevant defines copied from the Linux kernel:

  struct list_head {
        struct list_head *next, *prev;
  };

  #define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)


  void g(struct list_head *l);
  void h(void);

  void f(struct list_head *head)
  {
    struct list_head *l;

    list_for_each(l, head) {
      g(l);
    }

    h();
  }
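
For what it's worth, here is the self-contained harness I used to
sanity-check the reproducer before benchmarking; the list construction
and the run_traversal() driver are mine, not kernel code:

```c
#include <stdlib.h>

/* Defines copied from the reproducer above. */
struct list_head {
        struct list_head *next, *prev;
};

#define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)

static int visited;              /* incremented once per list entry */

void g(struct list_head *l) { (void)l; visited++; }
void h(void) { }

void f(struct list_head *head)
{
        struct list_head *l;

        list_for_each(l, head)
                g(l);
        h();
}

/*
 * Build a circular list of n entries behind head, traverse it with
 * f(), and return how many entries g() saw.
 */
int run_traversal(int n)
{
        struct list_head head, *nodes = malloc(n * sizeof(*nodes));
        struct list_head *prev = &head;

        head.next = head.prev = &head;
        for (int i = 0; i < n; i++) {
                nodes[i].prev = prev;
                nodes[i].next = &head;
                prev->next = &nodes[i];
                head.prev = &nodes[i];
                prev = &nodes[i];
        }

        visited = 0;
        f(&head);
        free(nodes);
        return visited;
}
```

run_traversal(n) returns n for any non-negative n, confirming that f()
visits every entry exactly once.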


This behaviour can be tracked down to
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up")

Since this commit up to and including svn trunk, the output of the
above snippet compiled with -Os looks like

  00000000 <f>:
     0: e92d4070        push    {r4, r5, r6, lr}
     4: e1a05000        mov     r5, r0
     8: e5904000        ldr     r4, [r0]
     c: e1540005        cmp     r4, r5
    10: 1a000001        bne     1c <f+0x1c>
    14: e8bd4070        pop     {r4, r5, r6, lr}
    18: eafffffe        b       0 <h>
    1c: e1a00004        mov     r0, r4
    20: ebfffffe        bl      0 <g>
    24: e5944000        ldr     r4, [r4]
    28: eafffff7        b       c <f+0xc>


Observe how the bb corresponding to the loop's body, i.e. the one
invoking g(), is reached via a conditional *forward* jump, which is
bad for static branch prediction and thus, prefetching.

Compare this to gcc 4.9's output with -Os or alternatively, trunk's
output with -Os -freorder-blocks-algorithm=stc:

  00000000 <f>:
     0: e92d4070        push    {r4, r5, r6, lr}
     4: e1a05000        mov     r5, r0
     8: e5904000        ldr     r4, [r0]
     c: e1540005        cmp     r4, r5
    10: 0a000003        beq     24 <f+0x24>
    14: e1a00004        mov     r0, r4
    18: ebfffffe        bl      0 <g>
    1c: e5944000        ldr     r4, [r4]
    20: eafffff9        b       c <f+0xc>
    24: e8bd4070        pop     {r4, r5, r6, lr}
    28: eafffffe        b       0 <h>

Here, the loop body's bb is the fallthrough of the loop's conditional
branch, which is good.


Measurements on a memory stressed system, i.e. one where the instruction
caches can be considered cold at the loop's first iteration, do indeed
show runtime differences of ~0.5us on average.

I believe that this drop in performance is due to the code of g() not
getting prefetched properly.


That being said, I could certainly go and submit a patch to the Linux
kernel setting -freorder-blocks-algorithm=stc for the -Os case.
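
Concretely, I have a one-liner along these lines in mind, hooking into
the existing CONFIG_CC_OPTIMIZE_FOR_SIZE handling; the exact location
in the top-level Makefile is an assumption, and the cc-option guard is
there because compilers older than gcc 5 don't know the flag:

  # Hypothetical top-level Makefile fragment: when optimizing for
  # size, ask gcc >= 5 for the old STC block-reordering algorithm.
  ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
  KBUILD_CFLAGS += $(call cc-option,-freorder-blocks-algorithm=stc)
  endif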

Before I do this, I'd be glad to hear your opinions about this though as
I can't really quantify the implications.

From the discussion on gcc-patches [1] of what is now the aforementioned
r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
it is not clear to me whether this change can actually reduce code size
beyond those 0.1% given there for -Os.

A sloppy test with an (admittedly heavily stripped-down) Linux kernel
gives me exactly those 0.1%, too.

So first question:
Do you guys know of any code where more significant code size savings
are achieved?

And second question:
If that isn't the case, would it possibly make sense to partly revert
gcc's behaviour and set -freorder-blocks-algorithm=stc at -Os?

Thank you very much!

Nicolai


[1] https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01869.html
