Thanks, Richard. Yes, without that restriction the number of partitions in the partition map will increase somewhat, but I don't think it will increase much, for two reasons:

1. The coalesce list is usually not very big, and only the vars in that list are added to the conflict graph. That already keeps the conflict graph bitmaps much smaller.

2. An SSA var may appear in multiple PHI stmts. Suppose that in phi-stmt1 its base name differs from that of the other PHI arg, while in phi-stmt2 it has the same base name as the other PHI arg. In that case the SSA var is added to the conflict graph anyway because of phi-stmt2, but with the restriction it is not added to the coalesce list for phi-stmt1. So the restriction blocks the coalescing opportunity in phi-stmt1 without saving any memory.
I hacked the out-of-ssa phase to also add vars with different base names in the same PHI to the coalesce list. I tried SPEC2000 INT and saw no significant memory increase in the expand phase. (I used your -fmem-report patch to dump the memory usage of each pass. It is useful; I wonder why it didn't go into trunk.)

Thanks,
Wei Mi.

On Fri, Aug 23, 2013 at 5:10 AM, Richard Biener <richard.guent...@gmail.com> wrote:
> Wei Mi <w...@google.com> wrote:
>>For the following case:
>>
>>float total = 0.2;
>>
>>int main() {
>>  int i;
>>
>>  for (i = 0; i < 1000000000; i++) {
>>    total += i;
>>  }
>>
>>  return total == 0.3;
>>}
>>
>>The GCC assembly of its kernel loop is:
>>
>>.L3:
>>  movaps %xmm0, %xmm1
>>.L2:
>>  cvtsi2ss %eax, %xmm0
>>  addl $1, %eax
>>  cmpl $1000000000, %eax
>>  addss %xmm1, %xmm0
>>  jne .L3
>>
>>The movaps is redundant; the loop could be changed to:
>>
>>.L3:
>>  cvtsi2ss %eax, %xmm1
>>  addl $1, %eax
>>  cmpl $1000000000, %eax
>>  addss %xmm1, %xmm0
>>  jne .L3
>>
>>Manually removing the extra movaps improves the run time from 1.26s to 0.95s on Sandy Bridge using trunk (r201859).
>>
>>Load PRE promotes the memory access to total out of the loop, and in doing so it generates a new PHI at the start of the loop body:
>>
>>  <bb 2>:
>>  pretmp_22 = total;
>>  goto <bb 4>;
>>
>>  <bb 3>:
>>
>>  <bb 4>:
>>  # i_15 = PHI <i_8(3), 0(2)>
>>  # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>   ==> PHI generated.
>>  _4 = (float) i_15;
>>  total.0_5 = prephitmp_23;
>>  total.1_6 = _4 + total.0_5;
>>  total = total.1_6;
>>  i_8 = i_15 + 1;
>>  if (i_8 != 1000000000)
>>    goto <bb 3>;
>>  else
>>    goto <bb 5>;
>>
>>The out-of-ssa phase should have coalesced prephitmp_23 and total.1_6(3) into the same temporary, but the existing out-of-ssa has a limitation: it will not coalesce SSA variables with different base variable names, even if they are in the same PHI and their live ranges don't conflict. So out-of-ssa inserts the redundant move pretmp = total.1_6 in bb3:
>>
>>  <bb 2>:
>>  pretmp = total;
>>  goto <bb 4>;
>>
>>  <bb 3>:
>>  pretmp = total.1_6;   ==> inserted by out-of-ssa.
>>
>>  <bb 4>:
>>  _4 = (float) i_15;
>>  total.1_6 = _4 + pretmp;
>>  i_8 = i_15 + 1;
>>  if (i_8 != 1000000000)
>>    goto <bb 3>;
>>  else
>>    goto <bb 5>;
>>
>>IRA has the potential to allocate pretmp and total.1_6 to the same hard register and remove the extra move, but in the case above the regmove phase happens to prevent IRA from doing the cleanup. regmove guesses the register constraints of an insn and tries to change the insn to satisfy them before the IRA phase. Usually this helps IRA make a better decision, but here regmove decides to merge _4 and total.1_6 into total.1_6 in order to satisfy the constraint of the two-operand add on x86 (addss xmm1, xmm2). After _4 and total.1_6 are merged, the live range of total.1_6 conflicts with that of pretmp in IRA, so they cannot be allocated to the same hard register, and the redundant move (pretmp = total.1_6) can't be deleted. However, it is not trivial to make regmove choose to merge total.1_6 and pretmp instead, because that requires regmove to have global live range analysis (the existing regmove only does a simple correctness check within a single bb).
>>
>>If we use -mtune=corei7-avx, the redundant move disappears. That is because with AVX support, regmove knows that AVX provides a three-operand add (vaddss xmm1, xmm2, xmm3/m32), so it does not merge total.1_6 and _4, and IRA can then allocate total.1_6 and pretmp to the same hard register.
>>
>>If we change the type of total from float to int, the redundant move also disappears, for a similar reason: x86 provides the LEA instruction, which can be used as a three-operand add, so regmove chooses not to merge total.1_6 and _4.
>>
>>My question is: why can't out-of-ssa do this cleanup by coalescing all the non-conflicting vars in the same PHI stmt, instead of only coalescing the vars with the same base name?
>
> The restriction exists to keep conflict bitmaps small. Otherwise you'll have quadratic memory usage for them.
>
> Richard.
>
>>Thanks,
>>Wei Mi.