On 12-10-05 5:53 PM, Steven Bosscher wrote:
On Thu, Oct 4, 2012 at 2:59 PM, Steven Bosscher <stevenb....@gmail.com> wrote:
On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
<richard.guent...@gmail.com> wrote:
Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?
Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.

  Thus, doesn't
the above show there exists an optimal order for processing which we could use?
There may be a smarter order: you could even walk the blocks in that
order if you knew a priori what path through the CFG minimizes the
length of the live range chains. But to know that order, you have to
build the chains. So it's chicken-and-egg...
Richi, you inspired me, and I remembered your (still disgusting
;-)) my_rev_post_order_compute hack.

A better order than walking basic blocks in reverse is to walk them
in topological order on the reverse CFG. This results in very long
live ranges being compressed into very short chains, because the
program points become almost sequential for them.

So here's one more patch:

@@ -994,8 +1092,16 @@ lra_create_live_ranges (bool all_p)
    curr_point = 0;
    point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
    lra_point_freq = VEC_address (int, point_freq_vec);
-  FOR_EACH_BB (bb)
-    process_bb_lives (bb);
+  int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
+  int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
+  lra_assert (n_blocks_inverted == n_basic_blocks);
+  for (i = n_blocks_inverted - 1; i >= 0; --i)
+    {
+      bb = BASIC_BLOCK (post_order_rev_cfg[i]);
+      if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
+       continue;
+      process_bb_lives (bb, curr_point);
+    }
+  free (post_order_rev_cfg);
    lra_live_max_point = curr_point;
    create_start_finish_chains ();
    if (lra_dump_file != NULL)


Without this patch:
Compressing live ranges: from 700458 to 391665 - 55%, pre_count
40730653, post_count 34363983
max per-reg pre_count 12978 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
max per-reg post_count 10967 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])

With this patch:
Compressing live ranges: from 700458 to 372585 - 53%, pre_count
283937, post_count 271120
max per-reg pre_count 545 (230653, 542 defs, 542 uses) (reg/f:DI
230653 [ SR.13303 ])
max per-reg post_count 544 (230649, 542 defs, 542 uses) (reg/f:DI
230649 [ SR.13305 ])

(the per-reg counts are the lengths of the live range chains for the
mentioned regno).
Yes, that is impressive. But I think the number of points in a live range is a real parameter of the complexity.
So the patch results in fewer program points and way, waaaaay shorter
live range chains. Needless to say, compile time also benefits
significantly. Time spent in create_start_finish_chains:
Without this patch, 19.75s (2%) usr
With this patch, 0.14s (0%)

Vlad, is this something you already tried? If not, I'm going
to bootstrap & test this (on top of my other pending patches to
lra-lives.c).


No, I was busy fixing a bug in LRA. I played a bit with your patch (including FOR_EACH_BB_REVERSE) on an Intel machine, but I did not see a visible improvement on slow.cc.

The code size is a bit different too, which I cannot explain, because neither the number of points nor the number of elements in the live range lists enters into the assignment decisions. I have no time to find out why it differs. In any case I think it is OK and there is a reasonable explanation for it.

On AMD machines, the improvement is really significant. I have no objections to the previous patch (including FOR_EACH_BB_REVERSE) or to the patch you posted here, provided of course it is committed with a proper ChangeLog.

Thanks for working on this, Steven. I'd like to attack find_hard_regno_for when I have time. I should say that working on improving compilation speed for this particular case is a bit addictive for me.
