On Wed, Dec 4, 2013 at 2:27 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote:
> On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
>> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
>> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
>> > <richard.guent...@gmail.com> wrote:
>> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> > > <wschm...@linux.vnet.ibm.com> wrote:
>> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> > >>> Yufeng Zhang <yufeng.zh...@arm.com> wrote:
>> > >>> >On 12/03/13 14:20, Richard Biener wrote:
>> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <yufeng.zh...@arm.com> wrote:
>> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> > >>> >>>>> Ping~
>> > >>> >>>>>
>> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> > >>> >>>>>
>> > >>> >>>>> Thanks,
>> > >>> >>>>> Yufeng
>> > >>> >>>>>
>> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang
>> > >>> >>>>>>> <yufeng.zh...@arm.com> wrote:
>> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> > >>> >>>>>>>>> The second version of your original patch is ok with me with the
>> > >>> >>>>>>>>> following changes.  Sorry for the little side adventure into the
>> > >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than it
>> > >>> >>>>>>>>> helps in this case.  Thanks for having a look at it, anyway.
>> > >>> >>>>>>>>> Thanks also for cleaning up this version to be less intrusive to
>> > >>> >>>>>>>>> common interfaces; I appreciate it.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> Thanks a lot for the review.
>> > >>> >>>>>>>> I've attached an updated patch with the suggested changes
>> > >>> >>>>>>>> incorporated.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> As for the next-interp adventure, I was quite happy to do the
>> > >>> >>>>>>>> experiment; it was a good chance to gain insight into the pass.
>> > >>> >>>>>>>> Many thanks for your prompt replies and patience in guiding!
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> > >>> >>>>>>>>> approval, as I'm not a maintainer.
>> > >>> >>>>
>> > >>> >>>> First a note: I need to check on voting for Bill as the slsr
>> > >>> >>>> maintainer from the steering committee.  Voting was in progress just
>> > >>> >>>> before the close of stage1 development so I haven't tallied the
>> > >>> >>>> results :-)
>> > >>> >>>
>> > >>> >>> Looking forward to some good news! :)
>> > >>> >>>
>> > >>> >>>>>> Yes, you are right that non-trivial 'base' trees are rarely shared.
>> > >>> >>>>>> The cache is introduced mainly because get_alternative_base () may
>> > >>> >>>>>> be called twice on the same 'base' tree: once in
>> > >>> >>>>>> find_basis_for_candidate () for the look-up, and again in
>> > >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> > >>> >>>>>> happy to leave out the cache if you think the benefit is trivial.
>> > >>> >>>>
>> > >>> >>>> Without some sense of how expensive the lookups are vs. how often the
>> > >>> >>>> cache hits, it's awfully hard to know if the cache is worth it.
>> > >>> >>>>
>> > >>> >>>> I'd say take it out unless you have some sense it's really saving
>> > >>> >>>> time.  It's a pretty minor implementation detail either way.
>> > >>> >>>
>> > >>> >>> I think the affine tree routines are generally expensive; it is worth
>> > >>> >>> having a cache to avoid calling them too many times.  I ran the
>> > >>> >>> slsr-*.c tests under gcc.dg/tree-ssa/ and found that the cache hit
>> > >>> >>> rates range from 55.6% to 90%, with 73.5% as the average.  The samples
>> > >>> >>> may not represent real-world scenarios well, but they do show that the
>> > >>> >>> 'base' tree can be shared to some extent.  So I'd like to keep the
>> > >>> >>> cache in the patch.
>> > >>> >>>
>> > >>> >>>>>>> +/* { dg-do compile } */
>> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +typedef int arr_2[50][50];
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
>> > >>> >>>>>>> +{
>> > >>> >>>>>>> +  int i, j;
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +  i = v1 + 5;
>> > >>> >>>>>>> +  j = i;
>> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
>> > >>> >>>>>>> +  a2 [i] [j++] = i;
>> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
>> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> > >>> >>>>>>> +  return;
>> > >>> >>>>>>> +}
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> > >>> >>>>>>>
>> > >>> >>>>>>> Scanning for 5 MEMs looks non-sensical.  What transform do you
>> > >>> >>>>>>> expect?  I see other slsr testcases do similar non-sensical
>> > >>> >>>>>>> checking, which is bad, too.
>> > >>> >>>>>>
>> > >>> >>>>>> As slsr optimizes CAND_REF candidates by simply lowering them to
>> > >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> > >>> >>>>>> MEM_REFs is an effective check.
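[Editorial aside: the caching idea discussed above -- the same 'base' tree queried once for the look-up and once when recording, so a hit avoids re-running the expensive expansion -- can be sketched as below. This is a toy illustration only, not the patch's code: the opaque-pointer key, the slot hashing, and the stand-in `expand_base` are all my inventions; GCC's actual cache and tree representation differ.]

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 64

/* One direct-mapped slot per hash bucket: key is the (opaque) 'base'
   pointer, value is its cached alternative base.  */
struct cache_entry { const void *key; const void *value; };
static struct cache_entry cache[CACHE_SIZE];
static unsigned long hits, lookups;

/* Stand-in for the costly affine expansion; here it is the identity.  */
static const void *expand_base (const void *base) {
  return base;
}

/* Look up 'base', computing and caching its expansion on a miss.  */
const void *get_alternative_base_cached (const void *base) {
  size_t slot = ((size_t) base >> 3) % CACHE_SIZE;
  lookups++;
  if (cache[slot].key == base) {
    hits++;                       /* second query on the same tree */
    return cache[slot].value;
  }
  cache[slot].key = base;
  cache[slot].value = expand_base (base);
  return cache[slot].value;
}
```

With the two-phase usage pattern described above (look-up, then record), every candidate whose base repeats turns the second expansion into a pointer compare, which is where the reported 55-90% hit rates come from.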
>> > >>> >>>>>> Alternatively, I can add a follow-up patch to add a dumping
>> > >>> >>>>>> facility to replace_ref () to print out the replacement actions
>> > >>> >>>>>> when -fdump-tree-slsr-details is on.
>> > >>> >>>>
>> > >>> >>>> I think adding some details to the dump and scanning for them would
>> > >>> >>>> be better.  That's the only change that is required for this to move
>> > >>> >>>> forward.
>> > >>> >>>
>> > >>> >>> I've updated the patch to dump more details when
>> > >>> >>> -fdump-tree-slsr-details is on.  The tests have also been updated to
>> > >>> >>> scan for these new dumps instead of MEMs.
>> > >>> >>>
>> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>> > >>> >>>> point.
>> > >>> >>>
>> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> > >>> >>> succeeds?
>> > >>> >>
>> > >>> >> I still don't like it.  It's using the wrong, and too expensive, tools
>> > >>> >> to do stuff.  What kind of bases are we ultimately interested in?
>> > >>> >> Browsing the code it looks like we're having
>> > >>> >>
>> > >>> >>   /* Base expression for the chain of candidates:  often, but not
>> > >>> >>      always, an SSA name.  */
>> > >>> >>   tree base_expr;
>> > >>> >>
>> > >>> >> which isn't really too informative, but I suppose they are all
>> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply use
>> > >>> >> get_addr_base_and_unit_offset in place of get_alternative_base (),
>> > >>> >> ignoring the returned offset.
>> > >>> >
>> > >>> >'base_expr' is essentially the base address of a handled_component_p,
>> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address
>> > >>> >of the object returned by get_inner_reference ().
>> > >>> >
>> > >>> >Given a test case like the following:
>> > >>> >
>> > >>> >typedef int arr_2[20][20];
>> > >>> >
>> > >>> >void foo (arr_2 a2, int i, int j)
>> > >>> >{
>> > >>> >  a2[i+10][j] = 1;
>> > >>> >  a2[i+10][j+1] = 1;
>> > >>> >  a2[i+20][j] = 1;
>> > >>> >}
>> > >>> >
>> > >>> >the IR before SLSR is (on x86_64):
>> > >>> >
>> > >>> >  _2 = (long unsigned int) i_1(D);
>> > >>> >  _3 = _2 * 80;
>> > >>> >  _4 = _3 + 800;
>> > >>> >  _6 = a2_5(D) + _4;
>> > >>> >  *_6[j_8(D)] = 1;
>> > >>> >  _10 = j_8(D) + 1;
>> > >>> >  *_6[_10] = 1;
>> > >>> >  _12 = _3 + 1600;
>> > >>> >  _13 = a2_5(D) + _12;
>> > >>> >  *_13[j_8(D)] = 1;
>> > >>> >
>> > >>> >The base_expr for the 1st and 2nd memory references is the same, i.e.
>> > >>> >_6, while the base_expr for a2[i+20][j] is _13.
>> > >>> >
>> > >>> >_13 is essentially (_6 + 800), so all three memory references
>> > >>> >essentially share the same base address.  As their strides are also
>> > >>> >the same (MULT_EXPR (j, 4)), the three references can all be lowered
>> > >>> >to MEM_REFs.  What this patch does is use the tree affine tools to
>> > >>> >help recognize the underlying base address expression; as this
>> > >>> >requires looking into the definitions of SSA_NAMEs,
>> > >>> >get_addr_base_and_unit_offset () won't help here.
>> > >>> >
>> > >>> >Bill has helped me explore other ways of achieving this in SLSR, but
>> > >>> >so far we think this is the best way to proceed.  The use of the tree
>> > >>> >affine routines has been restricted to CAND_REFs only, and there is
>> > >>> >the aforementioned cache facility to help reduce the overhead.
>> > >>> >
>> > >>> >Thanks,
>> > >>> >Yufeng
>> > >>> >
>> > >>> >P.S.
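[Editorial aside: the constant relationship between the two bases in the IR above -- _13 = _6 + 800 -- is easy to verify directly. The sketch below is my own check of that arithmetic, not part of the patch: for `int a2[20][20]`, one row is 20 * 4 = 80 bytes, so `&a2[i+10][0]` (the _6 base) and `&a2[i+20][0]` (the _13 base) differ by exactly 10 * 80 = 800 bytes, independent of `i`.]

```c
#include <assert.h>
#include <stddef.h>

typedef int arr_2[20][20];

/* Byte distance between the _13-style base and the _6-style base of the
   example: &a2[i+20][0] minus &a2[i+10][0].  */
ptrdiff_t base_delta (arr_2 a2, int i) {
  char *b1 = (char *) &a2[i + 10][0];   /* _6  = a2 + i*80 + 800  */
  char *b2 = (char *) &a2[i + 20][0];   /* _13 = a2 + i*80 + 1600 */
  return b2 - b1;
}
```

Because the delta is a compile-time constant, dropping the constant part of the expanded address leaves one shared basis for all three references, which is exactly what get_alternative_base () is after.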
>> > >>> >Some more details on what the patch does:
>> > >>> >
>> > >>> >The CAND_REFs for the three memory references are:
>> > >>> >
>> > >>> >  6  [2] *_6[j_8(D)] = 1;
>> > >>> >     REF : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >     basis: 0  dependent: 8  sibling: 0
>> > >>> >     next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >  8  [2] *_6[_10] = 1;
>> > >>> >     REF : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> > >>> >     basis: 6  dependent: 11  sibling: 0
>> > >>> >     next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >  11 [2] *_13[j_8(D)] = 1;
>> > >>> >     REF : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >     basis: 8  dependent: 0  sibling: 0
>> > >>> >     next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >Before the patch, the strength reduction candidate chains for the
>> > >>> >three CAND_REFs are:
>> > >>> >
>> > >>> >  _6 -> 6 -> 8
>> > >>> >  _13 -> 11
>> > >>> >
>> > >>> >i.e. SLSR recognizes that the first two references share the same
>> > >>> >basis, while the last one is on its own.
>> > >>> >
>> > >>> >With the patch, an extra candidate chain can be recognized:
>> > >>> >
>> > >>> >  a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> > >>> >
>> > >>> >i.e. all three references are found to have the same basis
>> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> > >>> >_6 or _13 with the immediate offset removed.  The pass is now able to
>> > >>> >lower all three references, instead of only the first two, to
>> > >>> >MEM_REFs.
>> > >>>
>> > >>> Ok, so slsr handles arbitrarily complex bases and figures out common
>> > >>> components?  If so, then why not just use get_inner_reference?  After
>> > >>> all, slsr does not use tree-affine as the representation for bases
>> > >>> (which it could?)
>> > >>
>> > >> I think that's overstating SLSR's current capabilities a bit.
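[Editorial aside: to make the lowering concrete, here is a hand-written C rendering of what the shared-basis form looks like -- my own sketch of the assumed shape, not compiler output. Once all three stores share the basis `a2 + i*80`, each access becomes a dereference off that one base plus a constant offset, with the common stride `j * 4`.]

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

typedef int arr_2[20][20];

/* The original source form from the example.  */
void foo_orig (arr_2 a2, int i, int j) {
  a2[i + 10][j] = 1;
  a2[i + 10][j + 1] = 1;
  a2[i + 20][j] = 1;
}

/* Hand-lowered equivalent: one shared basis, constant offsets.  */
void foo_lowered (arr_2 a2, int i, int j) {
  char *base = (char *) a2 + (size_t) i * 80;   /* shared basis        */
  size_t col = (size_t) j * 4;                  /* common stride j * 4 */
  *(int *) (base + 800 + col) = 1;              /* a2[i+10][j]         */
  *(int *) (base + 800 + col + 4) = 1;          /* a2[i+10][j+1]       */
  *(int *) (base + 1600 + col) = 1;             /* a2[i+20][j]         */
}
```

The win is that the `i*80` multiply is computed once and the three addresses reduce to base-plus-constant MEM_REF-style accesses.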
>> > >> :)  We do use get_inner_reference to come up with the base expression
>> > >> for reference candidates (based on some of your suggestions a couple
>> > >> of years back).  However, in the case of multiple levels of array
>> > >> references, we miss opportunities because get_inner_reference stops at
>> > >> an SSA name that could be further expanded by following its definition
>> > >> back to a more fundamental base expression.
>> > >
>> > > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly
>> > > the same problem.
>> >
>> > Oh, you're using affine combination expansion ... which is even more
>> > expensive.  So why isn't that then done for all ref candidates?  That
>> > is, why do two different things, get_inner_reference _and_
>> > affine-combination dances?  And why build trees back from that instead
>> > of storing the affine combination?
>>
>> Well, the original design had no desire to use the expensive machinery
>> of affine combination expansion.  For what was envisioned, the simpler
>> mechanisms of get_inner_reference have been plenty.
>>
>> My thought, and please correct me if I'm wrong, is that once we've
>> already reduced to an SSA name from get_inner_reference, the affine
>> machinery will terminate fairly quickly -- we shouldn't get into too
>> deep a search on the underlying pointer arithmetic in most cases.  But
>> compile-time testing will tell us whether this is reasonable.
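[Editorial aside: a toy model of the tree-affine idea being debated -- my own sketch, far simpler than GCC's aff_tree. An address is represented as a constant offset plus coefficient*variable terms; the "alternative base" is the same combination with the constant part zeroed, so `_6 = a2 + i*80 + 800` and `_13 = a2 + i*80 + 1600` reduce to the same shared base `a2 + i*80`.]

```c
#include <assert.h>
#include <string.h>

#define MAX_TERMS 4

/* An affine combination: offset + sum of coef*var terms, with variables
   named by strings for illustration.  */
struct aff {
  long offset;
  int nterms;
  struct { long coef; const char *var; } term[MAX_TERMS];
};

/* Alternative base: the same combination with the constant part removed.  */
struct aff alt_base (struct aff a) {
  a.offset = 0;
  return a;
}

/* Structural equality of two combinations (terms assumed in same order).  */
int aff_equal (const struct aff *a, const struct aff *b) {
  if (a->offset != b->offset || a->nterms != b->nterms)
    return 0;
  for (int i = 0; i < a->nterms; i++)
    if (a->term[i].coef != b->term[i].coef
        || strcmp (a->term[i].var, b->term[i].var) != 0)
      return 0;
  return 1;
}
```

In this picture, get_inner_reference only peels the outer component reference, while the affine expansion also follows SSA-name definitions into the pointer arithmetic, which is where the extra cost and the extra matches both come from.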
Indeed.  Doing this still feels backward and odd - the pass should be
able to determine this globally instead of repeating local analysis
(even with a cache).

> As a middle ground, may I suggest that we only do the extra tree-affine
> expansion at -O2 and above?  Any extra compile time should be a blip at
> those levels.  At -O1 there could be legitimate issues, though.

You should check flag_expensive_optimizations here, I think.

Richard.

> Bill
>
>> Bill
>>
>> > I'll bet we come back with compile-time issues after this patch
>> > went in.  I'll count on you two to fix them then.
>> >
>> > Richard.
>> >
>> > >> Part of the issue here is that reference candidates are the basis
>> > >> for a more specific optimization than the mult and add candidates.
>> > >> The latter have a more general framework for building up a record of
>> > >> simple affine expressions that can be strength-reduced.  Ultimately
>> > >> we ought to be able to do something similar for reference candidates,
>> > >> building up simple affine expressions from base expressions, so that
>> > >> everything is done in a forward order and the tree-affine interfaces
>> > >> aren't needed.  But that will take some more fundamental design
>> > >> changes, and since this provides some good improvements for important
>> > >> cases, I feel it's reasonable to get this into the release.
>> > >
>> > > But I fail to see what is special about doing the dance to affine and
>> > > then back to trees just to drop the constant offset, which would be
>> > > done by get_inner_reference as well, and more cheaply, if you just
>> > > ignore bitpos.
>> > >
>> > > ?!
>> > >
>> > > Richard.
>> > >
>> > >> Thanks,
>> > >> Bill
>> > >>
>> > >>> Richard.