Hi Segher,

Segher Boessenkool <seg...@kernel.crashing.org> writes:
> On Fri, Nov 11, 2016 at 02:16:18AM +0100, Nicolai Stange wrote:
>> >> From the discussion on gcc-patches [1] of what is now the aforementioned
>> >> r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
>> >> it is not clear to me whether this change can actually reduce code size
>> >> beyond those 0.1% given there for -Os.
>> >
>> > There is r228692 as well.
>>
>> Ok, summarizing, that changelog says that the simple algorithm
>> potentially produced even bigger code with -Os than stc did. From that
>> commit on, this remains true only on x86 and mn10300. Right?
>
> x86 and mn10300 use STC at -Os by default.

Ah! This explains why I didn't see such a BB ordering on x86.

>> Sure. Let me restate my original question: assume for a moment that it
>> is true that -Os with simple never produces code smaller than 0.1% of
>> what is created by -Os with stc. I haven't got any idea what the "other
>> things" are able to achieve w.r.t. code size savings, but to me, 0.1%
>> doesn't appear to be that huge. Don't get me wrong: I *really* can't
>> judge on whether 0.1% is a significant improvement or not. I'm just
>> assuming that it's not. With this assumption, the question of whether
>> those saved 0.1% are really worth the significantly decreased
>> performance encountered in some situations seemed just natural...
>
> It all depends on the tradeoff you want. There are many knobs you can
> turn -- for example the inlining params, that has quite some effect on
> code size.
>
> -Os is mostly -O2 except those things that increase code size.
>
> What is the tradeoff in your case? What is a realistic number for the
> slowdown of your kernel? Do you see hotspots in there that should be
> handled better anyhow? Etc.

Well, I do see hotspots in my crappy RFC code not yet upstreamed, namely
those 0.5 us extra latency on a memory-stressed system as initially
mentioned. Problem is, this is in interrupt context... I'm quite certain
that this is due to a mispredicted branch in for(...) { g() } -- g()
lives on another page and the TLB is cold.

However, once my test hardware is free again, I'll try to identify some
hotspots and get some numbers for a vanilla kernel. For example,
__hrtimer_run_queues() suffers from the same BB ordering, but its code
doesn't cross pages. I'd really have to measure it...

>> No, I want small, possibly at the cost of performance to the extent of
>> what's sensible. What "sensible" actually is, is what my question is
>> about.
>
> It is different for every use case I'm afraid.
>
>> So, summarizing, I'm not asking whether I should use -O2 or -Os or
>> whatever, but whether the current behaviour I'm seeing with -Os is
>> intended/expected quantitatively.
>
> With simple you get smaller code than with STC, so -Os uses simple.
> If that is ridiculously slower then you won't hear me complaining if
> you propose defaulting it the other way; but you haven't shown any
> convincing numbers yet?

Hmm, maybe there's a better way than changing the default.

If I read r228692 correctly, for the -Os case,
reorder_basic_blocks_simple() is made to retain the edge order as given
by the BB order (whatever this is?) because this reduces code size
(because of reasons).

I think the code size savings are due to favoring the total fallthrough
edge count -- fallthrough edges won't lead to jump insns in the output.
Is this correct?

However, I believe that this edge ordering can be relaxed: it's only a
certain type of edges that must come before all the others. The "all
the others" can then be sorted by frequency again, thus leading to
better static branch prediction, especially in the loop case.

Thinking locally, if we have

  A
  |\
  | B
  |/
  C

we want to have the output order A, B, C, not A, C, B, because:
- A will have a single branch insn at its end either way,
- with the second ordering, B will need one, too.

Now, if B-C is considered first by the chain assembling part in
reorder_basic_blocks_simple(), this will produce the desired output.
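To make the diamond concrete, here is a minimal compilable example (my
own illustration, not from the thread; do_b()/do_c() are made-up
names) -- a plain if without an else produces exactly this CFG:

  extern void do_b (void);
  extern void do_c (void);

  void
  f (int x)
  {
    /* block A: ends in one conditional branch either way */
    if (x)
      do_b ();  /* block B */
    do_c ();    /* block C */
  }

With the A B C layout, A's conditional branch skips B when x is zero
and B simply falls through into C. With A C B, A still needs its
branch, but B additionally has to end in an unconditional jump back to
C -- one jump insn more.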
The "all the others" can then be sorted by frequency again, thus leading to better static branch prediction, especially in the loop case. Thinking locally, if we have A |\ | B |/ C we want to have the output order A, B, C, not A, C, B, because: - A will have a single branch insn at its end either way, - with the second ordering, B will need one, too. Now, if B-C is considered first by the chain assembling part in reorder_basic_blocks_simple(), this will produce the desired output. I've appended a patch that does just this: it puts all the edges originating from BB's with only one outgoing edge first and the rest, sorted by frequency, after them. The results for that single test case I've tried, namely the kernel with some config on ARM, look surprisingly good: W/o patch: 0 .head.text 0000026c c0008000 c0008000 00008000 2**2 1 .text 002ab370 c0100000 c0100000 00010000 2**5 16 .init.text 00027270 c06002e0 c06002e0 003f02e0 2**5 17 .exit.text 00000594 c0627550 c0627550 00417550 2**2 W/ patch: 0 .head.text 0000026c c0008000 c0008000 00008000 2**2 1 .text 002aaeac c0100000 c0100000 00010000 2**5 16 .init.text 0002723c c06002e0 c06002e0 003f02e0 2**5 17 .exit.text 00000594 c062751c c062751c 0041751c 2**2 So even slightly smaller code is produced. But more importantly, it gets the fallthrough for the loops right, thus improving static branch prediction. I have to admit that this POC patch is the result of some guesswork and that I don't understand everything fully: for example, I don't know what the order given by FOR_EACH_BB_FN() actually is and how retaining it helped reducing code size in r228692. Nevertheless, with the above results, can I convince you to look into this and tell me whether it's actually as promising as I think it is? Of course I'm willing to run further tests, but if this approach is just obviously wrong, it would be good if I could avoid this in the first place. In case I should indeed do some further tests, do you have some set of standard packages to do those -Os tests against? Thank you very much! Nicolai
The results for that single test case I've tried, namely the kernel
with some config on ARM, look surprisingly good:

W/o patch:
  0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
  1 .text         002ab370  c0100000  c0100000  00010000  2**5
 16 .init.text    00027270  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text    00000594  c0627550  c0627550  00417550  2**2

W/ patch:
  0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
  1 .text         002aaeac  c0100000  c0100000  00010000  2**5
 16 .init.text    0002723c  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text    00000594  c062751c  c062751c  0041751c  2**2

So even slightly smaller code is produced (.text shrinks by 0x4c4 =
1220 bytes). But more importantly, it gets the fallthrough for the
loops right, thus improving static branch prediction.

I have to admit that this POC patch is the result of some guesswork and
that I don't understand everything fully: for example, I don't know
what the order given by FOR_EACH_BB_FN() actually is and how retaining
it helped reduce code size in r228692.

Nevertheless, with the above results, can I convince you to look into
this and tell me whether it's actually as promising as I think it is?
Of course I'm willing to run further tests, but if this approach is
just obviously wrong, it would be good if I could avoid that in the
first place.

In case I should indeed do some further tests, do you have some set of
standard packages to run those -Os tests against?

Thank you very much!

Nicolai


Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c	(revision 242047)
+++ gcc/bb-reorder.c	(working copy)
@@ -2305,14 +2305,42 @@
   if (dump_file)
     fprintf (dump_file, "\nReordering with the \"simple\" algorithm.\n\n");
 
-  edge *edges = new edge[2 * n_basic_blocks_for_fn (cfun)];
+  int edges_size = 2 * n_basic_blocks_for_fn (cfun);
+  edge *edges = new edge[edges_size];
 
   /* First, collect all edges that can be optimized by reordering blocks:
-     simple jumps and conditional jumps, as well as the function entry edge.  */
+     simple jumps and conditional jumps, as well as the function entry edge.
 
-  int n = 0;
-  edges[n++] = EDGE_SUCC (ENTRY_BLOCK_PTR_FOR_FN (cfun), 0);
+     When optimizing for size, certain edges should come first in the
+     greedy chain assembling algorithm below: this will allow for more
+     fallthrough and thus, jump elision.
+     Consider
+       A
+       | \
+       |  B
+       | /
+       C
+     This shall be ordered as A B C, not A C B.
+
+     Considering all edges of the B-C type first, this will force the
+     greedy algorithm to do just the right thing.
+
+     OTOH, all other edges shall be sorted by their frequency in order
+     to favor branch prediction.
+
+     In order to achieve this, the edges array is split into two
+     regions, the first growing upwards and containing all edges which
+     are a BB's only ones, i.e. those possibly of the B-C type, and the
+     other region growing downwards and containing the rest.
+
+     When optimizing for size, only the second region will be sorted by
+     edge frequency.
+  */
+
+  int n_high_priority = 0, n = 0;
+  edges[edges_size - ++n] = EDGE_SUCC (ENTRY_BLOCK_PTR_FOR_FN (cfun), 0);
+
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
     {
@@ -2326,24 +2354,29 @@
 	continue;
 
       if (single_succ_p (bb))
-	edges[n++] = EDGE_SUCC (bb, 0);
+	edges[n_high_priority++] = EDGE_SUCC (bb, 0);
       else if (any_condjump_p (end))
 	{
 	  edge e0 = EDGE_SUCC (bb, 0);
 	  edge e1 = EDGE_SUCC (bb, 1);
 
-	  /* When optimizing for size it is best to keep the original
-	     fallthrough edges.  */
-	  if (e1->flags & EDGE_FALLTHRU)
-	    std::swap (e0, e1);
-	  edges[n++] = e0;
-	  edges[n++] = e1;
+	  edges[edges_size - ++n] = e0;
+	  edges[edges_size - ++n] = e1;
 	}
     }
 
-  /* Sort the edges, the most desirable first.  When optimizing for size
-     all edges are equally desirable.  */
+  /* Move the second region to the end of the first one.  */
+  if (n_high_priority + n != edges_size)
+    {
+      std::copy (edges + (edges_size - n), edges + edges_size,
+		 edges + n_high_priority);
+    }
+  n += n_high_priority;
 
-  if (optimize_function_for_speed_p (cfun))
+  /* Sort the edges, the most desirable first.  When optimizing for size,
+     the first, high priority region must not get sorted.  */
+  if (optimize_function_for_size_p (cfun))
+    std::stable_sort (edges + n_high_priority, edges + n, edge_order);
+  else
     std::stable_sort (edges, edges + n, edge_order);
 
   /* Now decide which of those edges to make fallthrough edges.  We set