On Wed, Dec 4, 2013 at 2:27 PM, Bill Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
> On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
>> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
>> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
>> > <richard.guent...@gmail.com> wrote:
>> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> > > <wschm...@linux.vnet.ibm.com> wrote:
>> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> > >>> Yufeng Zhang <yufeng.zh...@arm.com> wrote:
>> > >>> >On 12/03/13 14:20, Richard Biener wrote:
>> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <yufeng.zh...@arm.com>
>> > >>> >> wrote:
>> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> > >>> >>>>
>> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> > >>> >>>>>
>> > >>> >>>>> Ping~
>> > >>> >>>>>
>> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> > >>> >>>>
>> > >>> >>>>
>> > >>> >>>>>
>> > >>> >>>>> Thanks,
>> > >>> >>>>> Yufeng
>> > >>> >>>>>
>> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> > >>> >>>>>>
>> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> > >>> >>>>>>>
>> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang
>> > >>> >>>>>>> <yufeng.zh...@arm.com> wrote:
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> > >>> >>>>>>>>>
>> > >>> >>>>>>>>> The second version of your original patch is ok with me
>> > >>> >>>>>>>>> with the following changes.  Sorry for the little side
>> > >>> >>>>>>>>> adventure into the next-interp logic; in the end that's
>> > >>> >>>>>>>>> going to hurt more than it helps in this case.  Thanks for
>> > >>> >>>>>>>>> having a look at it, anyway.  Thanks also for cleaning up
>> > >>> >>>>>>>>> this version to be less intrusive to common interfaces; I
>> > >>> >>>>>>>>> appreciate it.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> > >>> >>>>>>>> with the suggested changes incorporated.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>> > >>> >>>>>>>> experiment; it was a good chance to gain insight into the
>> > >>> >>>>>>>> pass.  Many thanks for your prompt replies and patience in
>> > >>> >>>>>>>> guiding me!
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> > >>> >>>>>>>>> approval,
>> > >>> >>>>>>>>> as I'm not a maintainer.
>> > >>> >>>>
>> > >>> >>>> First a note, I need to check with the steering committee on the
>> > >>> >>>> voting for Bill as the slsr maintainer.  Voting was in progress
>> > >>> >>>> just before the close of stage1 development, so I haven't
>> > >>> >>>> tallied the results :-)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> Looking forward to some good news! :)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>>>
>> > >>> >>>>>> Yes, you are right that non-trivial 'base' trees are rarely
>> > >>> >>>>>> shared.  The cache is introduced mainly because
>> > >>> >>>>>> get_alternative_base () may be called twice on the same 'base'
>> > >>> >>>>>> tree, once in find_basis_for_candidate () for the look-up and
>> > >>> >>>>>> again in alloc_cand_and_find_basis () for
>> > >>> >>>>>> record_potential_basis ().  I'm happy to leave out the cache
>> > >>> >>>>>> if you think the benefit is trivial.
>> > >>> >>>>
>> > >>> >>>> Without some sense of how expensive the lookups are vs. how
>> > >>> >>>> often the cache hits, it's awfully hard to know if the cache is
>> > >>> >>>> worth it.
>> > >>> >>>>
>> > >>> >>>> I'd say take it out unless you have some sense it's really
>> > >>> >>>> saving time.  It's a pretty minor implementation detail either
>> > >>> >>>> way.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I think the affine tree routines are generally expensive; it is
>> > >>> >>> worth having a cache to avoid calling them too many times.  I ran
>> > >>> >>> the slsr-*.c tests under gcc.dg/tree-ssa/ and found that the
>> > >>> >>> cache hit rates range from 55.6% to 90%, with an average of
>> > >>> >>> 73.5%.  The samples may not represent real-world scenarios well,
>> > >>> >>> but they do show that the 'base' tree can be shared to some
>> > >>> >>> extent.  So I'd like to keep the cache in the patch.
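
For reference, a minimal sketch of the cached look-up under discussion
(assuming GCC 4.9-era interfaces: a pointer_map keyed on the original
'base' tree, and a name_expansions cache shared with the tree-affine
machinery; illustrative only, not the exact patch):

  /* Map from base trees to their alternative (expanded) bases.  */
  static struct pointer_map_t *alt_base_map;

  /* Expansion cache for tree_to_aff_combination_expand ().  */
  static struct pointer_map_t *name_expansions;

  static tree
  get_alternative_base (tree base)
  {
    tree *result = (tree *) pointer_map_contains (alt_base_map, base);

    if (result == NULL)
      {
        aff_tree aff;
        tree expr;

        /* Expand BASE into an affine combination, following SSA_NAME
           definitions, then drop the constant offset and rebuild a
           tree from what remains.  */
        tree_to_aff_combination_expand (base, TREE_TYPE (base),
                                        &aff, &name_expansions);
        aff.offset = double_int_zero;
        expr = aff_combination_to_tree (&aff);

        /* Cache the result, keyed on the original BASE.  */
        *pointer_map_insert (alt_base_map, base) = expr;

        /* Only report an alternative base if the expansion found
           something different.  */
        return expr == base ? NULL_TREE : expr;
      }

    return *result;
  }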
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>>>>
>> > >>> >>>>>>> +/* { dg-do compile } */
>> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +typedef int arr_2[50][50];
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
>> > >>> >>>>>>> +{
>> > >>> >>>>>>> +  int i, j;
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +  i = v1 + 5;
>> > >>> >>>>>>> +  j = i;
>> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
>> > >>> >>>>>>> +  a2 [i] [j++] = i;
>> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
>> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> > >>> >>>>>>> +  return;
>> > >>> >>>>>>> +}
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> > >>> >>>>>>>
>> > >>> >>>>>>> Scanning for 5 MEMs looks nonsensical.  What transform do
>> > >>> >>>>>>> you expect?  I see other slsr testcases doing similar
>> > >>> >>>>>>> nonsensical checking, which is bad, too.
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>> As SLSR optimizes CAND_REF candidates by simply lowering them
>> > >>> >>>>>> from e.g. ARRAY_REF to MEM_REF, I think scanning for the
>> > >>> >>>>>> number of MEM_REFs is an effective check.  Alternatively, I
>> > >>> >>>>>> can add a follow-up patch that adds a dumping facility to
>> > >>> >>>>>> replace_ref () to print out the replacement actions when
>> > >>> >>>>>> -fdump-tree-slsr-details is on.
>> > >>> >>>>
>> > >>> >>>> I think adding some details to the dump and scanning for them
>> > >>> >>>> would be better.  That's the only change that is required for
>> > >>> >>>> this to move forward.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I've updated the patch to dump more details when
>> > >>> >>> -fdump-tree-slsr-details is on.  The tests have also been updated
>> > >>> >>> to scan for these new dump lines instead of MEMs.
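
For illustration, a details-based scan might look like the following
(the marker string "Replacing reference: " is hypothetical here -- the
actual text depends on what replace_ref () prints):

  /* { dg-options "-O2 -fdump-tree-slsr-details" } */
  /* { dg-final { scan-tree-dump-times "Replacing reference: " 4 "slsr" } } */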
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at
>> > >>> >>>> this point.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if
>> > >>> >>> it succeeds?
>> > >>> >>
>> > >>> >> I still don't like it.  It's using the wrong, and too expensive,
>> > >>> >> tools to do stuff.  What kind of bases are we ultimately
>> > >>> >> interested in?  Browsing the code, it looks like we have
>> > >>> >>
>> > >>> >>    /* Base expression for the chain of candidates:  often, but not
>> > >>> >>       always, an SSA name.  */
>> > >>> >>    tree base_expr;
>> > >>> >>
>> > >>> >> which isn't really too informative, but I suppose they are all
>> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply use
>> > >>> >> get_addr_base_and_unit_offset in place of get_alternative_base (),
>> > >>> >> ignoring the returned offset.
>> > >>> >
>> > >>> >'base_expr' is essentially the base address of a handled_component_p,
>> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address
>> > >>> >of the object returned by get_inner_reference ().
>> > >>> >
>> > >>> >Given a test case like the following:
>> > >>> >
>> > >>> >typedef int arr_2[20][20];
>> > >>> >
>> > >>> >void foo (arr_2 a2, int i, int j)
>> > >>> >{
>> > >>> >   a2[i+10][j] = 1;
>> > >>> >   a2[i+10][j+1] = 1;
>> > >>> >   a2[i+20][j] = 1;
>> > >>> >}
>> > >>> >
>> > >>> >The IR before SLSR is (on x86_64):
>> > >>> >
>> > >>> >   _2 = (long unsigned int) i_1(D);
>> > >>> >   _3 = _2 * 80;
>> > >>> >   _4 = _3 + 800;
>> > >>> >   _6 = a2_5(D) + _4;
>> > >>> >   *_6[j_8(D)] = 1;
>> > >>> >   _10 = j_8(D) + 1;
>> > >>> >   *_6[_10] = 1;
>> > >>> >   _12 = _3 + 1600;
>> > >>> >   _13 = a2_5(D) + _12;
>> > >>> >   *_13[j_8(D)] = 1;
>> > >>> >
>> > >>> >The base_expr for the 1st and 2nd memory references is the same,
>> > >>> >i.e. _6, while the base_expr for a2[i+20][j] is _13.
>> > >>> >
>> > >>> >_13 is essentially (_6 + 800), so all three memory references
>> > >>> >essentially share the same base address.  As their strides are also
>> > >>> >the same (MULT_EXPR (j, 4)), the three references can all be lowered
>> > >>> >to MEM_REFs.  What this patch does is use the tree-affine tools to
>> > >>> >help recognize the underlying base address expression; as this
>> > >>> >requires looking into the definitions of SSA_NAMEs,
>> > >>> >get_addr_base_and_unit_offset () won't help here.
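
Concretely, the expansion performed on _13 in the IR above is
(informally):

  _13 = a2_5(D) + _12
      = a2_5(D) + (_3 + 1600)
      = a2_5(D) + ((sizetype) i_1(D) * 80 + 1600)

Dropping the constant 1600 leaves a2_5(D) + (sizetype) i_1(D) * 80,
which is the same base that expanding _6 yields once its constant 800
is dropped.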
>> > >>> >
>> > >>> >Bill has helped me explore other ways of achieving this in SLSR, but
>> > >>> >so far we think this is the best way to proceed.  The use of the
>> > >>> >tree-affine routines has been restricted to CAND_REFs only, and there
>> > >>> >is the aforementioned cache facility to help reduce the overhead.
>> > >>> >
>> > >>> >Thanks,
>> > >>> >Yufeng
>> > >>> >
>> > >>> >P.S. some more details what the patch does:
>> > >>> >
>> > >>> >The CAND_REFs for the three memory references are:
>> > >>> >
>> > >>> >  6  [2] *_6[j_8(D)] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 0  dependent: 8  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >   8  [2] *_6[_10] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> > >>> >      basis: 6  dependent: 11  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >  11  [2] *_13[j_8(D)] = 1;
>> > >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 8  dependent: 0  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >Before the patch, the strength reduction candidate chains for the
>> > >>> >three CAND_REFs are:
>> > >>> >
>> > >>> >   _6 -> 6 -> 8
>> > >>> >   _13 -> 11
>> > >>> >
>> > >>> >i.e. SLSR recognizes the first two references share the same basis,
>> > >>> >while the last one is on its own.
>> > >>> >
>> > >>> >With the patch, an extra candidate chain can be recognized:
>> > >>> >
>> > >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> > >>> >
>> > >>> >i.e. all three references are found to have the same basis
>> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the
>> > >>> >expanded _6 or _13 with the immediate offset removed.  The pass is
>> > >>> >now able to lower all three references, instead of only the first
>> > >>> >two, to MEM_REFs.
>> > >>>
>> > >>> Ok, so slsr handles arbitrarily complex bases and figures out common
>> > >>> components?  If so, then why not just use get_inner_reference?  After
>> > >>> all, slsr does not use tree-affine as the representation for bases
>> > >>> (which it could?)
>> > >>
>> > >> I think that's overstating SLSR's current capabilities a bit. :)  We do
>> > >> use get_inner_reference to come up with the base expression for
>> > >> reference candidates (based on some of your suggestions a couple of
>> > >> years back).  However, in the case of multiple levels of array
>> > >> references, we miss opportunities because get_inner_reference stops at
>> > >> an SSA name that could be further expanded by following its definition
>> > >> back to a more fundamental base expression.
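
(Concretely, in Yufeng's example above: for *_6[j_8(D)],
get_inner_reference () stops at the SSA name _6, yielding base_expr _6
plus the variable offset (sizetype) j_8(D) * 4; it does not follow
_6 = a2_5(D) + _4 back to a2_5(D).)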
>> > >
>> > > Using tree-affine.c's tree_to_aff_combination / aff_combination_to_tree
>> > > has exactly the same problem.
>> >
>> > Oh, you're using affine combination expansion ... which is even more
>> > expensive.  So why isn't that then done for all ref candidates?  That is,
>> > why do two different things, the get_inner_reference _and_ the
>> > affine-combination dance?  And why build trees back from that instead of
>> > storing the affine combination?
>>
>> Well, the original design had no desire to use the expensive machinery
>> of affine combination expansion.  For what was envisioned, the simpler
>> mechanisms of get_inner_reference have been plenty.
>>
>> My thought, and please correct me if I'm wrong, is that once we've
>> already reduced to an SSA name via get_inner_reference, the affine
>> machinery will terminate fairly quickly -- we shouldn't get into too
>> deep a search through the underlying pointer arithmetic in most cases.
>> But compile-time testing will tell us whether this is reasonable.

Indeed.  Doing this still feels backward and odd -- the pass should be
able to determine this globally instead of repeating local analysis
(even with a cache).

> As a middle ground, may I suggest that we only do the extra tree-affine
> expansion at -O2 and above?  Any extra compile time should be a blip at
> those levels.  At -O1 there could be legitimate issues, though.

You should check flag_expensive_optimizations here, I think.
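
Something along these lines, say, where the pass sets up and tears down
the alternative-base machinery (a sketch only; alt_base_map and the
placement inside execute_strength_reduction () follow the patch under
discussion):

  /* Gate the extra affine-expansion analysis on
     flag_expensive_optimizations (true at -O2 and above) rather than
     on the optimization level directly.  */
  if (flag_expensive_optimizations)
    alt_base_map = pointer_map_create ();

  /* ... existing candidate analysis and replacement ... */

  if (flag_expensive_optimizations)
    {
      pointer_map_destroy (alt_base_map);
      alt_base_map = NULL;
    }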

Richard.

> Bill
>
>>
>> Bill
>>
>> >
>> > I'll bet we come back with compile-time issues after this patch
>> > goes in.  I'll count on you two to fix them then.
>> >
>> > Richard.
>> >
>> > >> Part of the issue here is that reference candidates are the basis
>> > >> for a more specific optimization than the mult and add candidates.
>> > >> The latter have a more general framework for building up a record of
>> > >> simple affine expressions that can be strength-reduced.  Ultimately
>> > >> we ought to be
>> > >> able to do something similar for reference candidates, building up
>> > >> simple affine expressions from base expressions, so that everything is
>> > >> done in a forward order and the tree-affine interfaces aren't needed.
>> > >> But that will take some more fundamental design changes, and since this
>> > >> provides some good improvements for important cases, I feel it's
>> > >> reasonable to get this into the release.
>> > >
>> > > But I fail to see what is special about doing the dance to affine and
>> > > then back to trees just to drop the constant offset, which
>> > > get_inner_reference would do as well, and more cheaply, if you just
>> > > ignore bitpos.
>> > >
>> > > ?!
>> > >
>> > > Richard.
>> > >
>> > >> Thanks,
>> > >> Bill
>> > >>
>> > >>>
>> > >>> Richard.
>> > >>>
>> > >>>
>> > >>
>> >
>>
>
>
