Hi,
while looking into problems of current and pretty-ipa's inlining heuristics
implementation, I noticed that we have relatively important problems with EH
overhead that confuse inliner code size metrics. 

Looking deeper into EH problems, I think main issues are:
  - Inliner tends to produce exponential amount of destructor calls on programs
    with high abstraction penalty (pooma/boost/DLV all seem to have this).
    Every scope block such as { class a a;.....} contains implicit cleanup.
    Destructor of class A is called twice, once for ned of block, once for
    cleanup.

    Now it is common for destructors to call destuctors of contained objects
    and local iterators, this all happens in same way causing EH cleanup code
    that often exceed main function body.
  - Not inlining destructors in cleanups cause otherwise fully inlined object
    to miss SRA/store sinking and other optimizations.
  - We tend to block optimization because variable along hot path has abnormal
    PHI.  Interestingly this usually limits to creating more copies, PHIs,
    conditionals and BBs that gets almost completely cleaned up later on RTL
    land, but I don't think it is good excuse to ignore these ;)
  - Cleanups calling destructors are often fully removable but we
    don't do that very well.  We remove completely dead destructor by means
    of new local-pureconst pass.  As soon as destructor has some stores
    or loops, we realize the fact that they are dead very late in compilation.
    Also destructors calling delete() on some fields are not simplified.  This
    is partly doable via Martin's IPA-SRA.

I thus started to look into improvements in EH machinery and elimination
of code in the cleanups

What is already done on mainline:
  - New ehcleanup pass:  when EH handler becomes empty by
    local pureconst+DCE+DSE+complette unrolling combo, we can safely eliminate
    it.  This further eliminate local nothrow region handlers that are
    numberous.

  - There is new local pureconst pass done during early optimization.
    It tries to prove stuff nothrow and pure.  This leads to DCEing the
    destructors before they are accounted in inline metricts.

  - Some of bits from RTL EH lowering are gone now.  More should come
    We really have here partial transition from RTL EH code to gimple.
    This will hopefully simplify RTL EH code to basic part adding
    landing pads.  

What is already done on pretty-ipa:

  - Inliner herusitics can be bit smarter about code size estimates on
    how code will look after inlining.  In particular MUST_NOT_THROW
    receivers are most likely
    optimized away or commonized and can be ignored.  Also all the
    FILTER_EXPR/OBJ_REF_EXPR
    sets/reads are probably going to disappear after lowering.

What I have implementation for and would like to push to pretty-ipa and
later for mainline after some more evaulation:

  - EH edge redirection.  This is conceptually easy thing to do.  When EH edge
    needs to be redirected and it is only edge to BB, we just update label in
    tree of EH handlers.

    When it is not only edge, we walk from the throwing statement region up to
    the outermost handler region in the EH tree and duplicate everything on a
    way.  This breaks some assumptions EH code have.  In particular on EH
    handler can do RESX in other handler and handlers can share labels.
    These assumptions are made in except.c, but they don't have to be: EH tree
    is really just on-side representaiton of decision tree of what happens after
    expression and is lowered to such for for dwarf.  There is no need to know
    what happens after EH is delivered.

    While I don't expect immediate improvements in C++ codegen. 
    I benchmarked it in C++ testers and there is some improvment in
    libstdc++ tests, little improvement in boot's wave and cwcessboard.  Partly
    it is because all testcases we have do pretty much no EH except for one
    done implicitly by cleanup regions.  The testcases improving in libstdc++
    do have try....catch constructs in them.  We probably could try to find
    some new benchmarks for C++ testsuite to cover this area as well as cases
    where it is not best solution to inline everything completely.  Probably
    something like mozilla would do the job, but I don't know if there is easy
    to compile benchmark aplication for this.  Our C++ testsuite is pretty much
    testsuite for specific kind of C++ coding requiring inliner to do a lot
    of job.  This is not surprising because it was constructed with inliner 
    tunning in mind.

    I think this is important infrastructure bit.  In particular it seems to
    me that since this leaves us with abnormal edges produced by
    setjmp/longjmp, nonlocal labels and computed goto and because of computed
    goto factoring, only longjmp/setjmp edges are really unsplittable and thus
    we can get rid of abnormal phi handling and teach out-of-ssa to insert
    conditionals into abnormal edge receiving BB to resolve the unlikely case
    we will make copy on abnormal edge neccesary.

    Redirecting EH edges also allows sinking code needed only on EH path down
    to the edges that is quite common because of inlining differences above.
  - EH region merging:  after critical edge splitting we produce a lot of copies
    of EH regions by EH edge redirecition.  It is simple to merge them again
    in ehcleanup by hashing them and proving equivalence.
  - It is possible to DCE FILTER_EXPR (and OBJ_REF_EXPR).  I use simple code
    that try to verify that value stored to FILTER_EXPR is read in same BB (and
    thus provably don't change value as described bellow) and do not set the 
magic
    needed bit on it in that case.  This kills most of FILTER_EXPR sets.

Some more or less experimental stuff I have:

  - We should finish the transition and move as much as possible of EH lowering
    up to gimple world.  Producing post-landing pads in gimple is easy (they
    are really just conditionals on FILTER_EXPR/OBJ_REF_EXPR) and lowering
    RESX to goto is easy too.

    This allows scalar optimizations on the produced code.
    There is problem that after lowering it is no longer easy to do dead
    cleanup elimination.  So we probably should do this transform somewhere
    in half of our optimization queue.

    I also experimented with lowering only the trivial cases pre-inlining.

Some remaining issues:
  - FILTER_EXPR/OBJ_REF_EXPR is currently handled in quite dangerous way.
    Original Rth's code made them quite 100% volatile.  Now we can PRE them.
    The FILTER_EXPR/OBJ_REF_EXPR are really hard registers in RTL world that
    are set by EH runtime on EH edges from throwing statement to handler
    (not at RESX edges) and they are used by pre-landing pads and also by
    RESX code.

    It would be more precise if RESX instruction took FILTER_EXPR/OBJ_REF_EXPR
    value as argument (since it really uses the values) so we can kill magicness
    of the sets. Problem is that I don't think there is good SSA representation
    for register that implicitly change over edges.

    Take the example

     call ();  EH edge to handler 1:
     ....

     receiver 1:
       tmp1 = filter_expr;
       tmp2 = obj_ref_expr;
       call2 (); EH edge to handler 2
     label1:
       filter_expr = tmp1
       obj_ref_expr = tmp2
       resx (tmp1, tmp2)


     handler 2:
       tmp3 = filter_expr;
       tmp4 = obj_ref_expr;
       if (conditional)
         goto label1:
       else
         filter_expr = tmp3
         obj_ref_expr = tmp4
         resx (tmp3, tmp4);

    In this case tmp1 != tmp3 and tmp2 != tmp4 and thus it is invalid to 
optimize
    second resx to (tmp1, tmp3).  There is nothing to model this.
    I wonder if FILTER_EXPR/OBJ_REF_EXPR can't be best handled by just being
    volatile to majority of optimizations (i.e. assumed to change value all the 
time)
    and handle this in copyprop/PRE as specal case invalidating the value across
    EH edges?
  - The nature of code duplication in between cleanup at end of block and
    cleanup in EH actually brings a lot of tail merging possibilities.
    I wonder if we can handle this somehow effectivly on SSA. In RTL world
    crossjumping would catch thse cases if it was not almost 100% ineffective
    by fact that we hardly re-use same register for temporaries.

    I wonder if there is resonable SSA optimization that would have similar
    effect as tail merging here or if we want to implement tail merging on
    gimple.
  - Can we somehow conclude that structure being desturcted dies after the
    destructors so all writes to it are dead already in early optimizations?
    That would allow a lot more DSE and cheaper inlining.
  - It is possible to make EH edges redirectable on RTL too.  I wonder
    if it is worth the effort however.
  - We ought to be able to prove finitarity of simple loops so we can
    DCE more early and won't rely on full loop unrolling to get rid of
    empty loops originally initializing dead arrays.

Honza

Reply via email to