On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
> (Sorry for the late response; yesterday was a holiday here.)
>
> On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> <wschm...@linux.vnet.ibm.com> wrote:
>> > This is the first of three patches related to lowering addressing
>> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple.  This patch
>> > contains the new pass together with supporting changes in existing
>> > modules.  The second patch contains an independent change to the RTL
>> > forward propagator to keep it from undoing an optimization made in the
>> > first patch.  The third patch contains new test cases and changes to
>> > existing test cases.
>> >
>> > Although I've broken it up into three patches to make the review easier,
>> > it would be best to commit at least the first and third together to
>> > avoid regressions.  The second can stand alone.
>> >
>> > I've done regression tests on powerpc64 and x86_64, and have asked
>> > Andreas Krebbel to test against the IBM z (390) platform.  I've done
>> > performance regression testing on powerpc64.  The only performance
>> > regression of note is the 2% degradation to 188.ammp due to loss of
>> > field disambiguation information.  As discussed in another thread,
>> > fixing this introduces more complexity than it's worth.
>>
>> Are there also performance improvements?  What about code size?
>
> Yes, there are performance improvements.  I've been running cpu2000 on
> 32- and 64-bit powerpc64.  Thirteen tests show measurable improvements
> between 1% and 9%, with 187.facerec showing the largest improvements for
> both 32 and 64.
>
> I don't have formal code size results, but anecdotally from code
> crawling, I have seen code size either neutral or slightly improved.
> The largest code size improvements I've seen were on 32-bit code where
> the commoning allowed removal of a number of sign-extend and zero-extend
> instructions that were otherwise not seen to be redundant.
>>
>> I tried to get an understanding to what kind of optimizations this patch
>> produces based on the set of testcases you added, but I have a hard
>> time here.  Can you outline some please?
>>
>
> The primary goal is to clean up code such as is shown in the original
> post of PR46556.  In late 2007 there were some changes made to address
> canonicalization that caused the code gen to be suboptimal on powerpc64.
> At that time you and others suggested a pattern recognizer prior to
> expand as probably the best solution, similar to what IVopts is doing.

The PR46556 case looks quite simple.

> By using the same mem_ref generation machinery used by IVopts, together
> with local CSE, the goal was to ensure base registers are properly
> shared so that optimal code is generated, particularly for cases of
> shared addressability to structures and arrays.  I also observed cases
> where it was useful to extend the sharing across the dominator tree.

As you are doing IV selection per individual statement only, using
the affine combination machinery looks like quite a big hammer to me.
Especially as it is hard to imagine what the side-effects are, apart
from re-associating dependencies that do not fit the MEM-REF and
making the MEM-REF as complicated as permitted by the target.

What I thought originally when suggesting to do something similar
to IVOPTs was to build a list of candidates and uses and optimize
that set using a cost function similar to how IVOPTs does.

Doing addressing-mode selection locally per statement seems like
more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
(for its last invocation).  One pattern would be that of PR46556,
MEM[(p + (n + 16)*4)] which we can transform to
TARGET_MEM_REF[x + 64] with x = p + n*4, provided (n + 16)*4 has
a single use.  The TARGET_MEM_REF generation can easily
re-use the address-description and target-availability checks from
tree-ssa-address.c.  I would be at least interested in whether
handling the pattern from PR46556 alone (or maybe with a few
similar other cases) is responsible for the performance improvements.

Ideally we'd of course have a cost driven machinery that considers
(part of) the whole function.

> Secondarily, I noticed that once this was cleaned up we still had the
> suboptimal code generation for the zero-offset mem refs scenario
> outlined in the code commentary.  The extra logic to clean this up helps
> keep register pressure down.  I've observed some spill code reduction
> when this is in place.  Again, using expression availability from
> dominating blocks helps here in some cases as well.

Yeah, the case is quite odd and doesn't really fit existing optimizers
given that the CSE opportunity is hidden within the TARGET_MEM_REF ...

>> I still do not like the implementation of yet another CSE machinery
>> given that we already have two.  I think most of the need for CSE
>> comes from the use of the affine combination framework and
>> force_gimple_operand.  In fact I'd be interested to see cases that
>> are optimized here but could not be handled by a combine-like
>> pattern matcher.
>>
>
> I'm somewhat puzzled by this comment.  Back on 2/4, I posted a skeletal
> outline for this pass and asked for comments.  You indicated a concern
> that the affine combination expansion would un-CSE a lot of expressions,
> and that I should keep track of local candidates during this pass to
> ensure that base registers, etc., would be properly shared.  It seemed
> to me that a quick and dirty CSE of addressing expressions would be the
> best way to handle that.  I posted another RFC on 2/24 with an early
> implementation of CSE constrained to a single block, and indicated
> shortly thereafter my intent to extend this to a dominator walk.  I
> didn't receive any negative comments about using CSE at that time; but
> then I didn't get much response at all.

Sorry about that - I agree that doing a local CSE is one possible
solution, but when considering that we are only considering a statement
at a time it looks like a quite bloated approach.  A local CSE
capability would indeed be useful for other passes in GCC as well,
for example loop unrolling, which exposes lots of garbage to later
passes.  I thought of re-using the machinery that DOM provides, avoiding
yet another CSE implementation.

>
> I probably should have pushed harder to get comments at that point, but
> I was very new to the community and was feeling my way through the
> protocols; I didn't feel comfortable pushing.

And pushing doesn't always help either.

> In any case, yes, the primary need for CSE is as a result of the affine
> combination framework, which creates a large amount of redundant code.
> I can certainly take a look at Michael's suggestion to move the pass
> earlier and allow pass-dominator to handle the cleanup, and add the
> zero-offset mem-ref optimization to that or a subsequent pass.  Do you
> agree that this would be a preferable approach?

It would definitely be preferable to a new CSE implementation.  I'm not
sure the zero-offset optimization easily fits in DOM though.

What I'd really like to better understand is what transformations are
the profitable ones and if we can handle them statement-locally
within our combine machinery (thus, tree-ssa-forwprop.c).  I can
cook up a forwprop patch to fix PR46556 in case you are interested
in more details.

Thanks,
Richard.

>> Thanks,
>> Richard.