On Apr 18, 2005 07:41 AM, Roger Sayle <[EMAIL PROTECTED]> wrote:

> 
> On Sat, 16 Apr 2005, Richard Kenner wrote:
> >     Although, RTL expansion may introduce new loops, these tend to be
> >     rare, and the expanders have all the information they need to
> >     hoist/sink invariant expressions and unroll/peel themselves.
> >
> > I disagree.  In order to make the proper decisions about merging givs
> > and choosing which giv should represent a biv, you have to know a lot
> > about the valid addressing modes on the machine and this isn't something
> > the tree level optimizers should have to deal with.
> > ...
> >
> >     Similarly, CSE shouldn't need to process more than a single basic
> >     block,
> >
> > Again, not clear.  Certainly the costly stuff I put in ages ago to
> > walk through comparisons and around loops needs to go, but there's
> > no reason to tie CSE to a basic block: it can operate until the next
> > label, like it does now.  Admittedly, the number of CSE opportunities
> > won't be great, but why restrict them to a basic block?
> >
> >     and GCSE shouldn't need to move anything other than simple
> >     expressions.
> >
> > Why would we need a GCSE at the RTL level at all?  I'd guess the number
> > of wins it would produce would be very small.
> >
> >     The quality of alias analysis at the RTL-level shouldn't be an issue.
> >
> > Here I disagree the strongest!  Instruction scheduling is rapidly
> > becoming one of the most critical optimizations and must be done at
> > the RTL level.  The quality of instruction scheduling depends quite
> > heavily on the quality of the aliasing information.
> 
> 
> Thanks for your feedback.  I agree with all your points.

But of course you do.  Unfortunately you appear to have little clue what you
are really talking about.  So let me provide you with some loud feedback as
well.


>  I had
> greatly underestimated the importance of RTL alias analysis, especially
> with respect to scheduling.

Which does not mean RTL alias analysis is important.  It just means that
having alias information available in RTL is important.  How you get that 
information there is a separate issue.

You simply cannot do much better for alias analysis on RTL.  There is almost
no array information, almost no type information, it is almost impossible to
do points-to analysis, and it is definitely impossible to do flow-sensitive
alias analysis.  Also, many of the interesting aliases GCC cannot disambiguate
now are interprocedural ones.

The best approach seems to be to preserve the alias analysis done at the
tree level to assist the current RTL alias analysis in disambiguating memory
references.  Examples of how this can be done exist in the tree-profiling
branch, if you want to look.
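
To make the problem concrete, here is a trivial sketch of mine (not taken
from the tree-profiling branch):

    /* At the tree level the C type rules alone prove that *p and *q
       cannot alias, so the two stores could be reordered or scheduled
       freely.  At the RTL level both are just MEMs through a pointer,
       and without the tree-level information the scheduler has to keep
       them ordered.  */
    void
    f (int *p, float *q)
    {
      *p = 1;
      *q = 2.0f;
    }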

You should also read the GCC summit paper from two years ago about Debray
alias analysis (oh, horror!), and then compare its results (both accuracy and
complexity) with the ideas for propagating tree alias info to RTL.


>  The compile-time hog issues with CSE are
> primarily due to deep path-following (AFAIU).

Path following, and running four times (worst case) at -O2.


>  Simple/cheap tricks
> such as extended basic blocks are clearly a win, but their benefit depends
> on whether we keep GCSE.  If we don't, then EBBs of course; if we do,
> the EBBs are probably subsumed by GCSE.

You seem to have a misguided view of what CSE does.  What we call CSE is not
a common subexpression elimination pass.  It just happens to catch some CSEs
as well, but the most important thing it does right now is propagating things
forward to select better addressing modes (essentially a kind of instruction
selection), plus limited folding/simplifying to clean up the sometimes terrible
things 'expand' can do.
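
To make that concrete, here is a trivial example of mine (not from any PR):

    /* For something like this, the initial RTL often computes the
       address with a separate shift and add; it is CSE and combine
       that fold those back into a single base + index*scale memory
       operand on targets that have such an addressing mode.  */
    int
    get (int *a, int i)
    {
      return a[i];
    }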

Also, CSE does things across basic block boundaries that GCSE will _never_
be able to do.  GCSE is a PRE implementation: it performs code motion in
addition to redundancy elimination, but it cannot simplify expressions while
eliminating redundancies.  CSE, on the other hand, uses value numbering and
can only remove fully redundant expressions.

(Kenner mentioned CSE around loops, but that is already gone.)


> And it's very true that at the RTL level
> addressing modes have always been an Achilles' heel.  But whilst
> induction variable selection is very dependent upon target parameters,
> as is prefetching, it's not yet clear whether uglifying tree-ssa or

I agree it is not clear yet, but at least we are now trying both approaches.
It looks like the TARGET_MEM_REF patch _does_ work, while clearly RTL
addressing mode selection does not (there are many open PRs about that).

It is interesting that you call Zdenek's work "uglifying". I agree that some
of the details of the machine-dependent parts of the tree optimizers are not
the most elegant.  But the reason for that is mostly the complete lack of
target modelling for trees.  If you run things like IVopts and TARGET_MEM_REF
late enough, it looks like these passes can do things no RTL optimizer can,
while catching almost everything the RTL passes do catch.
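
As a sketch of the kind of loop I mean (my example, nothing fancy):

    /* Whether this should use base + i*4 addressing, a separate
       post-incremented pointer, or an unrolled stride is a purely
       target-dependent decision, and TARGET_MEM_REF lets it be made
       on trees instead of being reverse-engineered from RTL.  */
    void
    scale (int *a, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        a[i] *= 2;
    }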

Besides, the RTL optimizers are not exactly a part of GCC to be proud of if
"ugliness" is a measure.


> continuing to improve RTL loop analysis is the better long-term strategy.

This will have to be done in any case for things like SMS and unrolling at
least. (FWIW I'm working on better RTL IV analysis with Zdenek, to simplify
the IV splitting in the unroller, and for a new RTL level prefetching pass). 


> I take it from your comments, that you are in the camp that believes
> that "the sun has not yet set" on the need for RTL optimizers. :-)

I don't think anyone has claimed that.  I certainly never did.  The question
is how complicated they should be, and when we can safely say that a part of
an RTL optimizer is no longer worthwhile.


> The intent of my post was to defend what was seen as my pro-RTL
> stance,

I don't care about your pro or contra RTL stance.  What matters is that you
literally asked for proof that all the code a cleanup patch may remove is
never executed for any port, after calling me a liar.  Is it really so
surprising that this would upset me?


> it's interesting to see that the loudest feedback is that
> I'm underestimating/down-playing the importance of the RTL optimizers.

It is also feedback from the person who wrote or approved most of that work
in the old, pre-egcs days ;-)  I actually agree that you are down-playing the
RTL _optimizations_, but I think you are overrating the RTL _optimizers_.
Those are two different things.


Let me reply to parts of your original mail:

> It's unlikely that RTL, or an instruction-based, representation of
> a program will ever go away.  And the RTL passes corresponding to
> instruction selection (combine), register allocation (reload) and
> scheduling will almost certainly be RTL-level tasks/passes.

Of course GCC will always need a low-level IR.  But combine is instruction
selection done in the worst possible way; reload is register allocation done
in the worst possible way, and reload is single-handedly responsible for the
fact that many target details cannot be exposed well before register
allocation (e.g. register pressure, stack slots on short-displacement targets,
predicated execution, etc.); and our scheduler sucks in every respect for 21st
century architectures (it is not even aware of the CFG, scheduling anything
larger than a single basic block is hard, and likewise for e.g. speculation).


> One might argue that the shifts/adds should be exposed at the
> tree-level.  But not only does this require detailed knowledge
> of the instructions available on a machine,

You are overestimating the "detailed knowledge" the RTL optimizers have.

You still only have named patterns, and they often still produce crap initial
RTL for e.g. multiplication.  Remember one of your patches to remedy one of
the earlier outcomes of my "let's simplify CSE" project?  Likewise for your
earlier patch to produce better RTL for modulo-power-of-2: the initial RTL
after that patch was still crap, but combine and CSE turned it into something
nice.

Other compilers have "target hooks" to lower their high-level representation
to code that will be easier to "expand" to the machine instruction intermediate
representation (which RTL really is not, btw), and there is no reason why GCC
can't have that.  Just do it late in the high level optimizers, and make them
less high-level.  Lowering is the keyword.  Look at WHIRL for examples.
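
To illustrate what I mean by lowering, here is a hand-written sketch (not
what any existing hook does): a late, target-aware lowering pass could
expose a multiplication by a constant as shifts and adds on the trees
themselves, when the target costs say it is a win, instead of leaving it
to expand and hoping combine cleans up whatever comes out:

    /* x * 10 rewritten by hand as shifts and adds; this is the kind of
       rewrite a late lowering pass could do on the high-level IR when
       the target's costs favour it.  */
    unsigned int
    mul10 (unsigned int x)
    {
      return (x << 3) + (x << 1);  /* == x * 10 */
    }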


> but it would also
> obfuscate the multiplication from the tree-ssa optimizers, such
> as iv-opts, etc... 

You seem to think that there can be only one "tree level".  Other compilers
get away Just Fine with lowering their high level intermediate representation.

Again, the only reason GCC can't have that right now is that there is no
target cost information available for the tree optimizers, except for what is
gathered by actually producing RTL (which is what IVopts and the TARGET_MEM_REF
patch do).  That is, in fact, one of the weak points of the tree optimizers at
the moment, IMVHO.


> One goal is to have
> the initial RTL that we generate be pretty close to its
> final form.  Perhaps only requiring minor clean-ups.

Then you have to do things like addressing mode selection at the tree level.
Otherwise, you are _always_ going to see major changes in CSE and combine.


> I foresee a world in which RTL transformations become far more
> local.

The view in your crystal ball must be clouded.  The most interesting target
IL transformations are highly global in nature.


>  Perhaps the RTL-level loop optimizers will disappear
> completely.

Not unless we replace RTL.
(And some have suggested this is the _only_ way to get a better back end
with proper instruction selection, scheduling, and register allocation).


>   Although, RTL expansion may introduce new loops,
> these tend to be rare,

Why do you think these tend to be rare enough that you don't need RTL loop
optimizers anymore?  Are they more rare than non-word arithmetic in CSE?  How
are you going to show for all targets that they are really rare?  Did you 
actually _count_ them, or are you guessing?


> and the expanders have all the information
> they need to hoist/sink invariant expressions and unroll/peel
> themselves.

Some people consider this to be a bug, you know.  You cannot replace expand
with a proper instruction selection pass, because expand is yet another example
of an I-do-everything pass, like so many others in GCC (CSE, flow, etc.).


>  Certainly there should be no need for RTL-level
> loop optimizations to do loop unrolling or other large scale
> reorganization.

Loop unrolling is exactly one of those things that probably will have to
stay in the low-level optimizers!  Think icache, register pressure, rotating 
register windows, etc.  That is the kind of target information that is
essential in deciding the profitability of unrolling.  Unrolling is not an
optimization in itself, you know.  It just enables further, often highly
target specific optimizations.
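
Just to be explicit about what is being decided, here is my sketch of a
4x unrolling with a remainder loop; whether it is a win depends on the
i-cache, the register pressure of the bigger body, and what the scheduler
can then do with it:

    /* 4x unrolling with a remainder loop.  Profitability depends on
       i-cache size, register pressure, rotating register files, and
       whether the scheduler can exploit the larger body.  */
    void
    saxpy (float *y, const float *x, float a, int n)
    {
      int i;
      for (i = 0; i + 3 < n; i += 4)
        {
          y[i]     += a * x[i];
          y[i + 1] += a * x[i + 1];
          y[i + 2] += a * x[i + 2];
          y[i + 3] += a * x[i + 3];
        }
      for (; i < n; i++)
        y[i] += a * x[i];
    }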


> With tree-ssa, however, these design decisions can and need to be
> re-evaluated.  Now that a "long long" for-loop will be reversed at
> tree-level,

Not true.  Wtf are you talking about?  Loops aren't reversed at the tree level.


> Whilst I have no issue with deleting cse.c in its entirety, I believe
> that the majority of its interesting transformations are encoded in
> cselib.c,

Not true.  For example cselib doesn't do any simplifications after substituting.


>  and GCC's GCSE pass performs a cselib sweep over instructions
> whilst doing its thing.

Not true.  GCSE only uses cselib for local constant and copy propagation.


> The reason why we can even talk about removing CSE is
> because so much effort has gone into centralizing transformations so
> they can be performed by cselib, GCSE and combine.

Not true.  Those centralized transformations are useful, that is true.  But
that does not give magic powers to the passes using them.  GCSE and combine do
completely different things than CSE.


> Whilst performing operations in narrower modes than word mode might
> seem like an unnecessary optimization, or one that's already performed
> at the tree-level, there are even several open PRs about them.

That doesn't say anything about the code that was proposed for removal in
the patch that started this discussion: if that code caught the problems in
those open PRs, you would not have those PRs.  So what is your point here?


> In all of your testing on s390*, ppc, ppc64, i386, i686,
> amd64, alpha, hppa and ia64, you never tried an 8bit or 16bit
> target for which this transformation may be more than just "nice".

With your total lack of testing on anything, you have not found a single
target for which this transformation is more than just "nice".  I am at
least trying to show that the parts I want to clean up really do nothing;
you are only guessing that they do something.


> Notice also there is nothing CSE specific about the code that is
> being deleted.

Look again.  It is completely CSE specific.


> What it does
> reveal is that (i) the old RTL loop optimizer did a better job, and
> was easier to modify for this task than the new one,

Well, d'oh.  It is always easier to tune an already highly tuned pass than
to help finish a new one.  But as I have pointed out so many times now (and
as you have conveniently ignored all the time), the old loop optimizer is a
blocker for many other new and important optimizations.  Some of these are a
lot more important than prefetching.


>  and (ii) all
> attempts to move prefetching to the tree-level have (allegedly)
> done a worse job than the current aging implementation.

No, just not better.  They need tuning too.  As you said, "allegedly",
because you have clearly never looked into this for real.


> Being a GCC maintainer is difficult.  If simply pushing patches on
> a RedHat or SuSE build farm was sufficient, life would be much
> simpler.  Instead, because automated testing of the possible interactions
> can never be sufficient, our development practices rely on intelligent
> individuals to give each change the consideration it deserves.  The
> fact that I've been slow in doing so (or that nobody else has reviewed
> it) is an issue.  But to claim that such diligence is unnecessary is
> one of the things that distinguishes maintainers from contributors.

:-)

After calling me a liar, and literally asking me to do the impossible to
get any cleanup patches approved, now I am apparently also an unintelligent
individual and a mere mindless contributor (even though my name appears in
MAINTAINERS as well).  Let me know if you have any further insults to add;
you are even better at this than I am.  (It is interesting to be the person
on the receiving end of the insults for once.  I seem to be the one on the
sending end most of the time. ;-)

In reality, a lot more has happened than just pushing patches on automated
testers.  You again conveniently ignore my earlier explanations about that,
as well as the evidence for it, such as a number of fold patches, including a
patch by you to fix expand_mult for negative constants.  These patches all
fixed missed optimizations I identified by looking at what CSE was still
catching.  There was a lot more to it than pushing patches on test boxes.


> I may decide the patch is perfect as written.

I don't see how you can with such an ill-informed view of the RTL optimizers.
It's a rather arrogant thing to say anyway.  Apparently you think that only
you are diligent enough to decide what parts of the RTL optimizations must
stay, and what parts can go away.  That surprises me, given the completely
false statements you make about the RTL optimizers.

But anyway, since you are the Superior Being with approval rights here and I
don't want to be in this kind of discussion all the time, I am just not going
to work on simplifying CSE anymore.

Gr.
Steven


