On Tue, Jun 14, 2011 at 4:18 PM, William J. Schmidt <wschm...@linux.vnet.ibm.com> wrote: > On Tue, 2011-06-14 at 15:39 +0200, Richard Guenther wrote: >> On Fri, Jun 10, 2011 at 5:11 PM, William J. Schmidt >> <wschm...@linux.vnet.ibm.com> wrote: >> > On Tue, 2011-06-07 at 16:49 +0200, Richard Guenther wrote: >> >> On Tue, Jun 7, 2011 at 4:14 PM, William J. Schmidt >> >> <wschm...@linux.vnet.ibm.com> wrote: >> > >> > <snip> >> > >> >> >> > Loss of aliasing information >> >> >> > ============================ >> >> >> > The most serious problem I've run into is degraded performance due >> >> >> > to poorer >> >> >> > instruction scheduling choices. I tracked this down to >> >> >> > alias.c:nonoverlapping_component_refs_p. >> >> >> > >> >> >> > This code proves that two memory accesses don't overlap by >> >> >> > attempting to prove >> >> >> > that they access different fields of the same structure. This is >> >> >> > done using >> >> >> > the MEM_EXPRs of the two rtx's, which record the expression trees >> >> >> > that were >> >> >> > translated into the rtx's during expand. When address lowering is >> >> >> > not >> >> >> > present, a simple COMPONENT_REF will appear in the MEM_EXPR: x.a, >> >> >> > for >> >> >> > example. However, address lowering changes the simple COMPONENT_REF >> >> >> > into a >> >> >> > [TARGET_]MEM_REF that is no longer necessarily identifiable as a >> >> >> > field >> >> >> > reference. Thus the aliasing machinery can no longer prove that two >> >> >> > such >> >> >> > field references are disjoint. >> >> >> > >> >> >> > This has severe consequences for performance, and has to be dealt >> >> >> > with if >> >> >> > address lowering is to be successful. >> >> >> > >> >> >> > I've worked around this with an admittedly fragile solution; I'll >> >> >> > discuss the >> >> >> > drawbacks below. The idea is to construct a mapping from >> >> >> > replacement mem_refs >> >> >> > to the original expressions that they replaced. When a MEM_EXPR is >> >> >> > being set >> >> >> > during expand, we first look up the mem_ref in the mapping. If >> >> >> > present, the >> >> >> > MEM_EXPR is set to the original expression, rather than to the >> >> >> > mem_ref. This >> >> >> > essentially duplicates the behavior in the absence of address >> >> >> > lowering. >> >> >> >> >> >> Ick. We had this in the past via TMR_ORIGINAL which caused all sorts >> >> >> of problems. Removing it didn't cause much degradation because we now >> >> >> preserve points-to information. >> >> >> >> >> >> Originally I played with lowering all memory accesses to MEM_REFs >> >> >> (see the old mem-ref branch), and the loss of type-based alias >> >> >> disambiguation was indeed an issue. >> >> >> >> >> >> But - I definitely do not like the idea of preserving something similar >> >> >> to TMR_ORIGINAL. Instead we can try preserving some information >> >> >> we derive from it. We keep the original access type that we can use >> >> >> for TBAA but do not retain knowledge on whether the type of the >> >> >> MEM_REF is valid for TBAA or if it is view-converted. >> >> > >> >> > Yes, I really don't like what I have at the moment, either. I put it in >> >> > place as a stopgap to let me proceed to look for other performance >> >> > problems. >> >> > >> >> > The question is how we can infer useful information for TBAA from the >> >> > MEM_REFs and TMRs. I poked at trying to identify types and offsets from >> >> > the MEM_EXPRs, but this ended up being useless; I had to constrain too >> >> > many cases to maintain correctness, and couldn't prove the type >> >> > information for the important cases in SPEC I was trying to address. >> >> > >> >> > Unfortunately, the whole design goes down the drain if we can't find a >> >> > way to solve the TBAA issue. The performance degradations are too >> >> > costly. >> >> >> >> If you look at what basic TBAA the alias oracle performs then it boils >> >> down to the fact that get_alias_set for a.b.c might end up using the >> >> alias-set of the type of C but for MEM[&a + 4] it will use the alias set >> >> of the type of a. The tree alias-oracle extracts both alias sets, that >> >> of the outermost valid type and that of the innermost as both are >> >> equally useful. But the MEM_REF (or TARGET_MEM_REF) tree >> >> only have storage for one such alias-set. Thus my idea at some point >> >> was to store the other one as well in some form. It will not be >> >> the full information (after all, the complete access path does provide >> >> some extra information - see aliasing_component_refs_p). >> > >> > This is what concerns me. TBAA information for the outer and inner >> > components doesn't seem sufficient to provide what >> > nonoverlapping_component_refs_p is currently able to prove. The latter >> > searches for a common RECORD_TYPE somewhere along the two access paths, >> > and then disambiguates if the two associated referenced fields differ. >> > For a simple case like "struct x { int a; int b; };", a and b have the >> > same type and alias-set, so the alias-set information doesn't add >> > anything. It isn't sufficient alone for the disambiguation of x1.a = >> > MEM_REF[&x1, 0] and x2.b = MEM_REF[&x2, 4]. >> > >> > Obviously the offset is sufficient to disambiguate for this simple case >> > with a common base type, but when the shared record types aren't at the >> > outermost level, we can't detect whether it is. >> > >> > At the moment I don't see how we can avoid degradation unless we keep >> > the full access path around somewhere, for [TARGET_]MEM_REFs built from >> > COMPONENT_REFs. I hope I'm wrong. >> >> You are not wrong. But the question is, does it make a difference? >> >> Richard. > > Yes, it does. This scenario occurs in 188.ammp, among others, and leads > to a large degradation without the change. The performance-critical > loop in mm_fv_update_nonbon makes heavy use of indirect references to > the ATOM structure that contains numerous float variables. When the > COMPONENT_REFs have been converted to MEM_REFs, the alias machinery can > no longer disambiguate these, which constrains the scheduler. The > result of the poor scheduling (on PowerPC, at least) is a large increase > in floating-point spill code in the critical loop.
As they appear in loops I wonder why IVOPTs doesn't already expose this problem? In general the answer to missed TBAA optimizations is of course "make sure that PTA works", which usually means using LTO ... I really really don't like preserving TBAA related information as trees. Instead if we really think preserving access paths is a must (I have already significantly reduced the number of preserved paths with introducing MEM_REFs!) then we have to find a different representation. I suppose you can turn the AMMP case into a (small) testcase that illustrates the issue and allows easier analysis and test of fixes? Richard.