Wei Mi <w...@google.com> wrote:
>For the following case:
>
>float total = 0.2;
>
>int main() {
>  int i;
>
>  for (i = 0; i < 1000000000; i++) {
>    total += i;
>  }
>
>  return total == 0.3;
>}
>
>The gcc assembly of its kernel loop is:
>
>.L3:
>        movaps  %xmm0, %xmm1
>.L2:
>        cvtsi2ss        %eax, %xmm0
>        addl    $1, %eax
>        cmpl    $1000000000, %eax
>        addss   %xmm1, %xmm0
>        jne     .L3
>
>The movaps is redundant; the loop could be changed to:
>
>.L3:
>        cvtsi2ss        %eax, %xmm1
>        addl    $1, %eax
>        cmpl    $1000000000, %eax
>        addss   %xmm1, %xmm0
>        jne     .L3
>
>Manually removing the extra movaps improves performance from 1.26s to
>0.95s on sandybridge using trunk (r201859).
>
>load PRE tries to promote the MEM op of total out of the loop; it
>generates a new PHI at the start of the loop body:
>
>  <bb 2>:
>  pretmp_22 = total;
>  goto <bb 4>;
>
>  <bb 3>:
>
>  <bb 4>:
>  # i_15 = PHI <i_8(3), 0(2)>
>  # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>   ==> PHI generated.
>  _4 = (float) i_15;
>  total.0_5 = prephitmp_23;
>  total.1_6 = _4 + total.0_5;
>  total = total.1_6;
>  i_8 = i_15 + 1;
>  if (i_8 != 1000000000)
>    goto <bb 3>;
>  else
>    goto <bb 5>;
>
>The out-of-ssa phase should have coalesced prephitmp_23 and total.1_6(3)
>into the same temp var, but the existing out-of-ssa has a limitation: it
>will not coalesce ssa variables with different base var names, even if
>they are in the same phi and their live ranges don't conflict. So
>out-of-ssa inserts the redundant mov pretmp = total.1_6 in bb3.
>
>  <bb 2>:
>  pretmp = total;
>  goto <bb 4>;
>
>  <bb 3>:
>  pretmp = total.1_6;   ==> inserted by out-of-ssa.
>
>  <bb 4>:
>  _4 = (float) i_15;
>  total.1_6 = _4 + pretmp;
>  i_8 = i_15 + 1;
>  if (i_8 != 1000000000)
>    goto <bb 3>;
>  else
>    goto <bb 5>;
>
>The IRA phase has the potential to allocate pretmp and total.1_6 to the
>same hardreg and remove the extra mov, but for the above case the regmove
>phase happens to block IRA from doing the cleanup. regmove guesses the
>register constraint of an insn and tries to change the insn to satisfy
>the constraint before the IRA phase. Usually this helps IRA make a better
>decision, but here regmove decides to merge _4 and total.1_6 into
>total.1_6 in order to satisfy the constraint of the two-operand plus on
>x86 (addss xmm1, xmm2). After _4 and total.1_6 are merged, the live range
>of total.1_6 conflicts with that of pretmp in IRA, so they cannot be
>allocated to the same hardreg, and the redundant mov (pretmp = total.1_6)
>cannot be deleted. However, it is not trivial to make regmove choose to
>merge total.1_6 and pretmp instead, because that requires regmove to have
>global live range analysis (the existing regmove has a simple correctness
>check in a range limited to a single bb).
>
>If we use -mtune=corei7-avx, then the redundant mov disappears. That is
>because with AVX support, regmove knows AVX provides a three-operand
>plus: vaddss xmm1, xmm2, xmm3/m32, so it will not merge total.1_6 and
>_4, and IRA can then allocate total.1_6 and pretmp to the same hardreg.
>
>If we change the type of total from float to int, then the redundant mov
>also disappears, for a similar reason. x86 provides the LEA insn, which
>can be used as a plus op with three operands, so regmove chooses not to
>merge total.1_6 and _4.
>
>My question is, why can't out-of-ssa do the cleanup by coalescing all
>the vars without conflicts in the same phi stmt, instead of only
>coalescing the vars with the same base name?

The restriction exists to keep conflict bitmaps small.  Otherwise you'll
have quadratic memory usage for them.

Richard.

>Thanks,
>Wei Mi.