Hi Segher,

Segher Boessenkool <seg...@kernel.crashing.org> writes:
> On Fri, Nov 11, 2016 at 02:16:18AM +0100, Nicolai Stange wrote:
>> >> From the discussion on gcc-patches [1] of what is now the aforementioned
>> >> r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
>> >> it is not clear to me whether this change can actually reduce code size
>> >> beyond those 0.1% given there for -Os.
>> >
>> > There is r228692 as well.
>>
>> Ok, summarizing, that changelog says that the simple algorithm
>> potentially produced even bigger code with -Os than stc did. From that
>> commit on, this remains true only on x86 and mn10300. Right?
>
> x86 and mn10300 use STC at -Os by default.

Ah! This explains why I didn't see such a BB ordering on x86.

>> Sure. Let me restate my original question: assume for a moment that it
>> is true that -Os with simple never produces code smaller than 0.1% of
>> what is created by -Os with stc. I haven't got any idea what the "other
>> things" are able to achieve w.r.t. code size savings, but to me, 0.1%
>> doesn't appear to be that huge. Don't get me wrong: I *really* can't
>> judge on whether 0.1% is a significant improvement or not. I'm just
>> assuming that it's not. With this assumption, the question of whether
>> those saved 0.1% are really worth the significantly decreased
>> performance encountered in some situations seemed just natural...
>
> It all depends on the tradeoff you want. There are many knobs you can
> turn -- for example the inlining params, that has quite some effect on
> code size.
>
> -Os is mostly -O2 except those things that increase code size.
>
> What is the tradeoff in your case? What is a realistic number for the
> slowdown of your kernel? Do you see hotspots in there that should be
> handled better anyhow? Etc.

Well, I do see hotspots in my crappy RFC code not yet upstreamed, namely
those 0.5 us extra latency on a memory-stressed system as initially
mentioned. Problem is, this is in interrupt context... I'm quite certain
that this is due to a mispredicted branch in for(...) { g() } -- g()
lives on another page and the TLB is cold.

However, once my test hardware is free again, I'll try to identify some
hotspots and get some numbers for a vanilla kernel. For example,
__hrtimer_run_queues() suffers from the same BB ordering, but its code
doesn't cross pages. I'd really have to measure it...

>> No, I want small, possibly at the cost of performance to the extent of
>> what's sensible. What "sensible" actually is, is what my question is
>> about.
>
> It is different for every use case I'm afraid.
>
>> So, summarizing, I'm not asking whether I should use -O2 or -Os or
>> whatever, but whether the current behaviour I'm seeing with -Os is
>> intended/expected quantitatively.
>
> With simple you get smaller code than with STC, so -Os uses simple.
> If that is ridiculously slower then you won't hear me complaining if
> you propose defaulting it the other way; but you haven't shown any
> convincing numbers yet?

Hmm, maybe there's a better way than changing the default.

If I read r228692 correctly, for the -Os case,
reorder_basic_blocks_simple() is made to retain the edge order as given
by the BB order (whatever this is?) because this reduces code size
(because of reasons).

I think the code size savings are due to favoring the total fallthrough
edge count -- fallthrough edges won't lead to jump insns in the output.
Is this correct?

However, I believe that this edge ordering can be relaxed: it's only a
certain type of edges that must come before all the others. The "all
the others" can then be sorted by frequency again, thus leading to
better static branch prediction, especially in the loop case.

Thinking locally, if we have

  A
  |\
  | B
  |/
  C

we want to have the output order A, B, C, not A, C, B, because:
- A will have a single branch insn at its end either way,
- with the second ordering, B will need one, too.

Now, if B-C is considered first by the chain assembling part in
reorder_basic_blocks_simple(), this will produce the desired output.
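To make the diamond concrete, here is a minimal compilable example (my
own illustration, not from the thread; do_b()/do_c() are made-up
names) -- a plain if without an else produces exactly this CFG:

  extern void do_b (void);
  extern void do_c (void);

  void
  f (int x)
  {
    /* block A: ends in one conditional branch either way */
    if (x)
      do_b ();  /* block B */
    do_c ();    /* block C */
  }

With the A B C layout, A's conditional branch skips B when x is zero
and B simply falls through into C. With A C B, A still needs its
branch, but B additionally has to end in an unconditional jump back to
C -- one jump insn more.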
The "all the others" can then be sorted by frequency again, thus leading to better static branch prediction, especially in the loop case. Thinking locally, if we have A |\ | B |/ C we want to have the output order A, B, C, not A, C, B, because: - A will have a single branch insn at its end either way, - with the second ordering, B will need one, too. Now, if B-C is considered first by the chain assembling part in reorder_basic_blocks_simple(), this will produce the desired output. I've appended a patch that does just this: it puts all the edges originating from BB's with only one outgoing edge first and the rest, sorted by frequency, after them. The results for that single test case I've tried, namely the kernel with some config on ARM, look surprisingly good: W/o patch: 0 .head.text 0000026c c0008000 c0008000 00008000 2**2 1 .text 002ab370 c0100000 c0100000 00010000 2**5 16 .init.text 00027270 c06002e0 c06002e0 003f02e0 2**5 17 .exit.text 00000594 c0627550 c0627550 00417550 2**2 W/ patch: 0 .head.text 0000026c c0008000 c0008000 00008000 2**2 1 .text 002aaeac c0100000 c0100000 00010000 2**5 16 .init.text 0002723c c06002e0 c06002e0 003f02e0 2**5 17 .exit.text 00000594 c062751c c062751c 0041751c 2**2 So even slightly smaller code is produced. But more importantly, it gets the fallthrough for the loops right, thus improving static branch prediction. I have to admit that this POC patch is the result of some guesswork and that I don't understand everything fully: for example, I don't know what the order given by FOR_EACH_BB_FN() actually is and how retaining it helped reducing code size in r228692. Nevertheless, with the above results, can I convince you to look into this and tell me whether it's actually as promising as I think it is? Of course I'm willing to run further tests, but if this approach is just obviously wrong, it would be good if I could avoid this in the first place. In case I should indeed do some further tests, do you have some set of standard packages to do those -Os tests against? Thank you very much! Nicolai
The results for that single test case I've tried, namely the kernel
with some config on ARM, look surprisingly good:

W/o patch:
  0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
  1 .text         002ab370  c0100000  c0100000  00010000  2**5
 16 .init.text    00027270  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text    00000594  c0627550  c0627550  00417550  2**2

W/ patch:
  0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
  1 .text         002aaeac  c0100000  c0100000  00010000  2**5
 16 .init.text    0002723c  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text    00000594  c062751c  c062751c  0041751c  2**2

So even slightly smaller code is produced (.text shrinks by 0x4c4 =
1220 bytes). But more importantly, it gets the fallthrough for the
loops right, thus improving static branch prediction.

I have to admit that this POC patch is the result of some guesswork and
that I don't understand everything fully: for example, I don't know
what the order given by FOR_EACH_BB_FN() actually is and how retaining
it helped reduce code size in r228692.

Nevertheless, with the above results, can I convince you to look into
this and tell me whether it's actually as promising as I think it is?
Of course I'm willing to run further tests, but if this approach is
just obviously wrong, it would be good if I could avoid that in the
first place.

In case I should indeed do some further tests, do you have some set of
standard packages to run those -Os tests against?

Thank you very much!

Nicolai


Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c	(revision 242047)
+++ gcc/bb-reorder.c	(working copy)
@@ -2305,14 +2305,42 @@
   if (dump_file)
     fprintf (dump_file, "\nReordering with the \"simple\" algorithm.\n\n");
 
-  edge *edges = new edge[2 * n_basic_blocks_for_fn (cfun)];
+  int edges_size = 2 * n_basic_blocks_for_fn (cfun);
+  edge *edges = new edge[edges_size];
 
   /* First, collect all edges that can be optimized by reordering blocks:
-     simple jumps and conditional jumps, as well as the function entry edge.  */
+     simple jumps and conditional jumps, as well as the function entry edge.
 
-  int n = 0;
-  edges[n++] = EDGE_SUCC (ENTRY_BLOCK_PTR_FOR_FN (cfun), 0);
+     When optimizing for size, certain edges should come first in the
+     greedy chain assembling algorithm below: this will allow for more
+     fallthrough and thus, jump elision.
+     Consider
+       A
+       | \
+       |  B
+       | /
+       C
+     This shall be ordered as A B C, not A C B.
+
+     Considering all edges of the B-C type first, this will force the
+     greedy algorithm to do just the right thing.
+
+     OTOH, all other edges shall be sorted by their frequency in order
+     to favor branch prediction.
+
+     In order to achieve this, the edges array is split into two
+     regions, the first growing upwards and containing all edges which
+     are a BB's only ones, i.e. those possibly of the B-C type, and the
+     other region growing downwards and containing the rest.
+
+     When optimizing for size, only the second region will be sorted by
+     edge frequency.
+  */
+
+  int n_high_priority = 0, n = 0;
+  edges[edges_size - ++n] = EDGE_SUCC (ENTRY_BLOCK_PTR_FOR_FN (cfun), 0);
+
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
     {
@@ -2326,24 +2354,29 @@
 	continue;
 
       if (single_succ_p (bb))
-	edges[n++] = EDGE_SUCC (bb, 0);
+	edges[n_high_priority++] = EDGE_SUCC (bb, 0);
       else if (any_condjump_p (end))
 	{
 	  edge e0 = EDGE_SUCC (bb, 0);
 	  edge e1 = EDGE_SUCC (bb, 1);
 
-	  /* When optimizing for size it is best to keep the original
-	     fallthrough edges.  */
-	  if (e1->flags & EDGE_FALLTHRU)
-	    std::swap (e0, e1);
-	  edges[n++] = e0;
-	  edges[n++] = e1;
+	  edges[edges_size - ++n] = e0;
+	  edges[edges_size - ++n] = e1;
 	}
     }
 
-  /* Sort the edges, the most desirable first.  When optimizing for size
-     all edges are equally desirable.  */
+  /* Move the second region to the end of the first one.  */
+  if (n_high_priority + n != edges_size)
+    {
+      std::copy (edges + (edges_size - n), edges + edges_size,
+		 edges + n_high_priority);
+    }
+  n += n_high_priority;
 
-  if (optimize_function_for_speed_p (cfun))
+  /* Sort the edges, the most desirable first.  When optimizing for size,
+     the first, high priority region must not get sorted.  */
+  if (optimize_function_for_size_p (cfun))
+    std::stable_sort (edges + n_high_priority, edges + n, edge_order);
+  else
     std::stable_sort (edges, edges + n, edge_order);
 
   /* Now decide which of those edges to make fallthrough edges.  We set