About the number of DOM's iterations.

Kazu Hirata Tue, 12 Apr 2005 11:37:12 -0700

Hi,

Note that with -O1, we limit the number of iterations of DOM to 1 so
that we can get resonable compile time performance.  I was wondering
what happens if we do the same with -O2 now that we have passes like
copy-prop, VRP, and improved FRE as well as good and old CCP.


Numbers go first.  All of the following numbers are obtained from
compiling cc1-i files with -O2.

version #iters #blocks  %cov #thread
------------------------------------
     v0   9097  514501 94.5%    9438
     v1   8602  553685 95.5%   20868
     v2   8609  510618 96.3%   23236
     v3   8490  510734 97.2%   23262
     v4   7193  188793   N/A   18450

Legend
------

v0 : Vanilla mainline with the second and third runs of DOM disabled.
v1 : v0 with Jeff's patch in PR 19794
v2 : v1 with the following TCB-like tweak to the pass ordering

  NEXT_PASS (pass_ccp);
  NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_fre);
  NEXT_PASS (pass_dce);
  NEXT_PASS (pass_vrp);
  NEXT_PASS (pass_merge_phi);
  NEXT_PASS (pass_dominator);

v3 : v2 with my queued VRP improvements
v4 : v3 with the number of DOM's iterations limited to 1

#iters : the number of times we iterate in the do-while loop in
tree-ssa-dom.c:tree_ssa_dominator_optimize.

#blocks : the number of blocks scanned by DOM.  This is done by adding
one line at the entrance of the aforementioned do-while loop.

  num_cfg_blocks += n_basic_blocks;

%cov : the percentage of invocations of DOM that finished its work
within 2 iterations.

#threaded : the number of edges threaded

Disclaimers
-----------

o I have not measured compile time performance.
o I have not done benchmark on generated code.
o I have not tried any other pass ordering.
o I have not considered other testcases or languages.

Justifications for v0, ..., v4 and columns
------------------------------------------

v0 is because I wanted to stay simple and focus on the initial basic
propagation.  v1 is because Jeff will soon remove various restrictions
on the jump threader that have been limiting jump threading.  For v2,
note that ccp, copy-prop, FRE, and VRP are all something that DOM does
to some extent.  I put my merge_phi right before DOM so that DOM can
find more jump threading opportunities in its first iteration.

#iter is mostly out of curiosity.  Functions that we compile come in
different size, so the number isn't that important.  For the same
reason, #blocks is more interesting.  It's a rough measure of how much
time DOM takes.

Now %cov.  Why 2 iterations?  Well, DOM has a fixed point operation,
meaning that the last iteration does not do anything.  More
specifically, the last iteration does not perform any jump threading.
Whenever jump threading opportunities are taken, cleanup_tree_cfg
called from tree_ssa_dominator_optimize always returns 1, triggering
the next iteration.

Observations
------------

Even with v0, the vast majority of functions require at most two
iterations, so we are in diminishing returns territory to begin with.
With v3, we get impressive 97.2%, meaning that 97.2% of time, DOM
iterates at most two times, which in turn means that 97.2% of time, we
don't take any jump threading opportunity in DOM's second iteration.

Note that with v4, the number of blocks that DOM looks at goes down
significantly.  At the same time, the number of edges threaded also
goes down but is still close to the level of v1.

It's interesting that even if we let DOM iterate as many times as it
likes to iterate, there is a sizable difference between v1 and v2 as
far as the number of edges threaded.  Certain things cannot be
propagated using DOM.

Future work
-----------

o Address things listed in Disclaimers.
o Submit my queued VRP patches as soon as mainline gets a bit more
  stable.
o Speed up things like the incremental SSA update, and the propagator.
o Try out some ideas to get more out of the first iteration of DOM.
o Take advantage of value ranges in thread_across_edge.

Kazu Hirata

About the number of DOM's iterations.

Reply via email to