On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
> On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
>> On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
>> <wschm...@linux.vnet.ibm.com> wrote:
>> > (Sorry for the late response; yesterday was a holiday here.)
>> >
>> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> >> <wschm...@linux.vnet.ibm.com> wrote:
>> >> > This is the first of three patches related to lowering addressing
>> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple. This patch
>> >> > contains the new pass together with supporting changes in existing
>> >> > modules. The second patch contains an independent change to the RTL
>> >> > forward propagator to keep it from undoing an optimization made in the
>> >> > first patch. The third patch contains new test cases and changes to
>> >> > existing test cases.
>> >> >
>> >> > Although I've broken it up into three patches to make the review easier,
>> >> > it would be best to commit at least the first and third together to
>> >> > avoid regressions. The second can stand alone.
>> >> >
>> >> > I've done regression tests on powerpc64 and x86_64, and have asked
>> >> > Andreas Krebbel to test against the IBM z (390) platform. I've done
>> >> > performance regression testing on powerpc64. The only performance
>> >> > regression of note is the 2% degradation to 188.ammp due to loss of
>> >> > field disambiguation information. As discussed in another thread,
>> >> > fixing this introduces more complexity than it's worth.
>> >>
>> >> Are there also performance improvements? What about code size?
>> >
>> > Yes, there are performance improvements. I've been running cpu2000 on
>> > 32- and 64-bit powerpc64. Thirteen tests show measurable improvements
>> > between 1% and 9%, with 187.facerec showing the largest improvements for
>> > both 32 and 64.
>> >
>> > I don't have formal code size results, but anecdotally from code
>> > crawling, I have seen code size either neutral or slightly improved.
>> > The largest code size improvements I've seen were on 32-bit code where
>> > the commoning allowed removal of a number of sign-extend and zero-extend
>> > instructions that were otherwise not seen to be redundant.
>>
>> >> I tried to get an understanding of what kind of optimizations this patch
>> >> produces based on the set of testcases you added, but I have a hard
>> >> time here. Can you outline some please?
>> >>
>> >
>> > The primary goal is to clean up code such as is shown in the original
>> > post of PR46556. In late 2007 there were some changes made to address
>> > canonicalization that caused the code gen to be suboptimal on powerpc64.
>> > At that time you and others suggested a pattern recognizer prior to
>> > expand as probably the best solution, similar to what IVopts is doing.
>>
>> The PR46556 case looks quite simple.
>
> It certainly is. I was personally curious whether there were other
> suboptimal sequences that might be hiding out there, that a more general
> approach might expose. There was a comment at the end of the bugzilla
> about a pass to expose target addressing modes in gimple for this
> purpose. When I first started looking at this, I looked for some
> feedback from the community about whether that should be done, and got a
> few favorable comments along with one negative one. So that's how we
> got on this road...
>
>>
>> > By using the same mem_ref generation machinery used by IVopts, together
>> > with local CSE, the goal was to ensure base registers are properly
>> > shared so that optimal code is generated, particularly for cases of
>> > shared addressability to structures and arrays. I also observed cases
>> > where it was useful to extend the sharing across the dominator tree.
>>
>> As you are doing IV selection per individual statement only, using
>> the affine combination machinery looks quite a big hammer to me.
>> Especially as it is hard to imagine what the side-effects are, apart
>> from re-associating dependencies that do not fit the MEM-REF and
>> making the MEM-REF as complicated as permitted by the target.
>>
>> What I thought originally when suggesting to do something similar
>> to IVOPTs was to build a list of candidates and uses and optimize
>> that set using a cost function similar to how IVOPTs does.
>>
> OK, reading back I can see that now...
>
>> Doing addressing-mode selection locally per statement seems like
>> more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
>> (for its last invocation). One pattern would be that of PR46556,
>> MEM[(p + ((n + 16)*4))], which we can transform to
>> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
>> single-use. The TARGET_MEM_REF generation can easily
>> re-use the address-description and target-availability checks from
>> tree-ssa-address.c. I would be at least interested in whether
>> handling the pattern from PR46556 alone (or maybe with a few
>> similar other cases) is responsible for the performance improvements.
>>
> Hm, but I don't think forwprop sees the code in this form. At the time
> the last pass of forwprop runs, the gimple for the original problem is:
>
>   D.1997_3 = p_1(D)->a[n_2(D)];
>   D.1998_4 = p_1(D)->c[n_2(D)];
>   D.1999_5 = p_1(D)->b[n_2(D)];
>   foo (D.1997_3, D.1998_4, D.1999_5);
>
> But perhaps this is just a matter of changing the patterns to recognize?
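For context, a testcase along these lines reproduces the GIMPLE quoted
above. It is a reconstruction from the dump and from the
MEM[(p + ((n + 16)*4))] pattern cited earlier, not copied from the PR;
the int element type and the array bound of 16 are assumptions, chosen
so that p->b[n] sits at byte offset (n + 16)*4:

  /* Reconstructed PR46556-style testcase; field types and array bound
     are assumptions, not taken from the PR itself.  */
  struct x
  {
    int a[16];
    int b[16];
    int c[16];
  };

  extern void foo (int, int, int);

  void
  bar (struct x *p, int n)
  {
    foo (p->a[n], p->c[n], p->b[n]);
  }

All three accesses share the base p and the index n, which is what makes
the base-register sharing discussed above profitable.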
Ah, so we still have the ARRAY_REFs here. Yeah, well - then the issue
boils down to get_inner_reference canonicalizing the offset according to
what fold-const.c implements, and we simply emit the offset in that form
(if you look at the normal_inner_ref: case in expand_expr_real*). Thus,
with the above form at expansion time, the matching would be applied
there (or during our RTL address-generation in fwprop.c ...).

Another idea is of course to lower all handled_component_p memory
accesses - something I did on the old mem-ref branch and I believe I
also suggested at some point. Without such lowering, none of the address
generation is exposed to the GIMPLE optimizers. The simplest way of
doing the lowering is to do something like

  base = get_inner_reference (rhs, ..., &bitpos, &offset, ...);
  base = fold_build2 (POINTER_PLUS_EXPR, ..., base, offset);
  base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
  tem = fold_build2 (MEM_REF, TREE_TYPE (rhs), base,
                     build_int_cst (get_alias_ptr_type (rhs), bitpos));

at some point. I didn't end up doing that because of the loss of alias
information.
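To make the quoted sketch concrete, a fleshed-out version might look
like the following. This is only an illustration against the 4.6-era
internal APIs, not the pass under review: the helper name lower_mem_ref
is invented, reference_alias_ptr_type is my reading of the
get_alias_ptr_type shorthand above, and the bit position is assumed to
be byte-aligned (bitfields would need extra care):

  /* Illustrative lowering of a handled_component_p reference REF to a
     MEM_REF, inserting address computations before GSI.  A sketch
     only; error handling and the bitfield case are omitted.  */

  static tree
  lower_mem_ref (tree ref, gimple_stmt_iterator *gsi)
  {
    HOST_WIDE_INT bitsize, bitpos;
    tree offset, base, addr, mem;
    enum machine_mode mode;
    int unsignedp, volatilep;
    gimple_seq stmts = NULL;

    /* Split REF into a base object, a variable byte offset, and a
       constant bit position.  */
    base = get_inner_reference (ref, &bitsize, &bitpos, &offset,
                                &mode, &unsignedp, &volatilep, false);

    /* Take the address of the base and add in the variable part.  */
    addr = build_fold_addr_expr (base);
    if (offset)
      addr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (addr),
                          addr, fold_convert (sizetype, offset));

    /* Gimplify the address so it is valid as a MEM_REF operand; the
       preparation statements go before the statement being lowered.  */
    addr = force_gimple_operand (addr, &stmts, true, NULL_TREE);
    gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);

    /* Fold the constant byte offset into the MEM_REF, keeping the
       alias pointer type of the original reference.  */
    mem = fold_build2 (MEM_REF, TREE_TYPE (ref), addr,
                       build_int_cst (reference_alias_ptr_type (ref),
                                      bitpos / BITS_PER_UNIT));
    return mem;
  }

The last step is where the loss of alias information mentioned above
shows up: once the component path is flattened to a pointer and a byte
offset, field-based disambiguation is gone.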
> Without running experiments yet, my guess is that this would pick up
> some of the benefit, but not all. As you say, the effects are difficult
> to predict; I've seen some surprisingly good results from CSEing some
> complex addressing, but I also had to implement several tweaks to avoid
> detrimental effects (such as stepping on what IVopts already got right).
> It may be that some of the surprising positives (such as removal of
> extra sign-extend instructions) indicate some cleanup work that should
> be done in the back end instead.

Indeed - or at some other point in the tree middle-end. For example, we
do not re-associate POINTER_PLUS_EXPR at all, which loses quite some CSE
opportunities for addresses (tree-ssa-reassoc.c would be the place to
enhance).

>> Ideally we'd of course have a cost-driven machinery that considers
>> (part of) the whole function.
>>
>> > Secondarily, I noticed that once this was cleaned up we still had the
>> > suboptimal code generation for the zero-offset mem refs scenario
>> > outlined in the code commentary. The extra logic to clean this up
>> > helps keep register pressure down. I've observed some spill code
>> > reduction when this is in place. Again, using expression availability
>> > from dominating blocks helps here in some cases as well.
>>
>> Yeah, the case is quite odd and doesn't really fit existing optimizers
>> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
>>
>> >> I still do not like the implementation of yet another CSE machinery
>> >> given that we already have two. I think most of the need for CSE
>> >> comes from the use of the affine combination framework and
>> >> force_gimple_operand. In fact I'd be interested to see cases that
>> >> are optimized that could not be handled by a combine-like
>> >> pattern matcher?
>> >>
>> >
>> > I'm somewhat puzzled by this comment. Back on 2/4, I posted a skeletal
>> > outline for this pass and asked for comments. You indicated a concern
>> > that the affine combination expansion would un-CSE a lot of
>> > expressions, and that I should keep track of local candidates during
>> > this pass to ensure that base registers, etc., would be properly
>> > shared. It seemed to me that a quick and dirty CSE of addressing
>> > expressions would be the best way to handle that. I posted another
>> > RFC on 2/24 with an early implementation of CSE constrained to a
>> > single block, and indicated shortly thereafter my intent to extend
>> > this to a dominator walk. I didn't receive any negative comments
>> > about using CSE at that time; but then I didn't get much response
>> > at all.
>>
>> Sorry about that - I agree that doing a local CSE is one possible
>> solution, but given that we are only considering a statement at a time
>> it looks like quite a bloated approach. A local CSE capability would
>> indeed be useful for other passes in GCC as well, for example loop
>> unrolling, which exposes lots of garbage to later passes. I thought of
>> re-using the machinery that DOM provides, avoiding yet another CSE
>> implementation.
>>
>> >
>> > I probably should have pushed harder to get comments at that point,
>> > but I was very new to the community and was feeling my way through
>> > the protocols; I didn't feel comfortable pushing.
>>
>> And pushing doesn't always help either.
>
> That's been my general experience; I try to be careful about nagging
> busy people. I think my timing was poor at the beginning of this
> project, as you guys were quite busy with closing off the 4.6 release.
>
>>
>> > In any case, yes, the primary need for CSE is as a result of the
>> > affine combination framework, which creates a large amount of
>> > redundant code. I can certainly take a look at Michael's suggestion
>> > to move the pass earlier and allow pass-dominator to handle the
>> > cleanup, and add the zero-offset mem-ref optimization to that or a
>> > subsequent pass. Do you agree that this would be a preferable
>> > approach?
>>
>> It would definitely be preferable to a new CSE implementation. I'm not
>> sure the zero-offset optimization easily fits in DOM though.
>
> I took a look at this yesterday and I think it fits easily enough; about
> the same as it fit into my prototype pass. The available expressions
> logic is pretty similar to what I was doing, so transplanting the code
> wouldn't be difficult. The only issue I saw is that there isn't a pure
> lookup function that doesn't place an expression on the avail-exprs
> stack, but that's not hard to add.
>
>> What I'd really like to better understand is what transformations are
>> the profitable ones and if we can handle them statement-locally
>> within our combine machinery (thus, tree-ssa-forwprop.c). I can
>> cook up a forwprop patch to fix PR46556 in case you are interested
>> in more details.
>>
> Thanks...let me study forwprop a bit first -- I'm trying to get my head
> wrapped around as much of the middle end as I can, and this is a good
> excuse to delve into another piece.
>
> I think perhaps the best way forward may be to use my prototype as a
> baseline to compare against additions to the combine logic, so we can
> tease out what some of the unexpected gains are, and think about how
> best to achieve them more simply.

Yeah. I will resurrect the simple pass I had to lower all memory
accesses; that will give us another toy to play with.

Thanks,
Richard.

> I'll have a look at forwprop and pester you (lightly :) if I have
> questions.
>
> Thanks,
> Bill
>
>> Thanks,
>> Richard.
>>
>> >> Thanks,
>> >> Richard.