Hi all, in the course of doing some benchmarks on arm with -Os, I noticed that some list traversion code became significantly slower since gcc 5.3 when instruction caches are cold.
Reproducer with relevant defines copied from the Linux kernel: struct list_head { struct list_head *next, *prev; }; #define list_for_each(pos, head) \ for (pos = (head)->next; pos != (head); pos = pos->next) void g(struct list_head *l); void h(void); void f(struct list_head *head) { struct list_head *l; list_for_each(l, head) { g(l); } h(); } This behaviour can be tracked down to r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up") Since this commit up to and including svn trunk, the output of the above snippet compiled with -Os looks like 00000000 <f>: 0: e92d4070 push {r4, r5, r6, lr} 4: e1a05000 mov r5, r0 8: e5904000 ldr r4, [r0] c: e1540005 cmp r4, r5 10: 1a000001 bne 1c <f+0x1c> 14: e8bd4070 pop {r4, r5, r6, lr} 18: eafffffe b 0 <h> 1c: e1a00004 mov r0, r4 20: ebfffffe bl 0 <g> 24: e5944000 ldr r4, [r4] 28: eafffff7 b c <f+0xc> Observe how the bb corresponding to the loop's body, i.e. the one invoking g(), gets jumped to by a conditional *forward* jump which is bad for static branch prediction and thus, prefetching. Compare this to gcc 4.9's output with -Os or alternatively, trunk's output with -Os -freorder-blocks-algorithm=stc: 00000000 <f>: 0: e92d4070 push {r4, r5, r6, lr} 4: e1a05000 mov r5, r0 8: e5904000 ldr r4, [r0] c: e1540005 cmp r4, r5 10: 0a000003 beq 24 <f+0x24> 14: e1a00004 mov r0, r4 18: ebfffffe bl 0 <g> 1c: e5944000 ldr r4, [r4] 20: eafffff9 b c <f+0xc> 24: e8bd4070 pop {r4, r5, r6, lr} 28: eafffffe b 0 <h> Here, the loop body's bb is the fallthrough of the loop's conditional branch which is good. Measurements on a memory stressed system, i.e. one where the instruction caches can be considered cold at the loop's first iteration, do indeed show runtime differences of ~0.5us on average. I believe that this drop in performance is due to the code of g() not getting prefetched properly. That being said, I could certainly go and submit a patch to the Linux kernel setting -freorder-blocks-algorithm=stc for the -Os case. Before I do this, I'd be glad to hear your opinions about this though as I can't really quantify the implications. >From the discussion on gcc-patches [1] of what is now the aforementioned r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"), it is not clear to me whether this change can actually reduce code size beyond those 0.1% given there for -Os. A sloppy test with an, admittedly heavily stripped down, Linux kernel gives me exactly these 0.1%, too. So first question: Do you guys know of any code where there are more significant code size savings achieved? And second question: If that isn't the case, would it possibly make sense to partly revert gcc's behaviour and set -freorder-blocks-algorithm=stc at -Os? Thank you very much! Nicolai [1] https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01869.html