(Sorry if you received this mail twice. The first one was rejected because it was not in plain text mode.)

For the following case:

    float total = 0.2;

    int main() {
      int i;

      for (i = 0; i < 1000000000; i++) {
        total += i;
      }

      return total == 0.3;
    }

The gcc assembly of its kernel loop is:

    .L3:
            movaps  %xmm0, %xmm1
    .L2:
            cvtsi2ss        %eax, %xmm0
            addl    $1, %eax
            cmpl    $1000000000, %eax
            addss   %xmm1, %xmm0
            jne     .L3

The movaps is redundant; the loop could be changed to:

    .L3:
            cvtsi2ss        %eax, %xmm1
            addl    $1, %eax
            cmpl    $1000000000, %eax
            addss   %xmm1, %xmm0
            jne     .L3

Manually removing the extra movaps improves performance from 1.26s to 0.95s on Sandy Bridge using trunk (r201859).

Load PRE tries to promote the MEM op of total out of the loop, and it generates a new PHI at the start of the loop body:

    <bb 2>:
    pretmp_22 = total;
    goto <bb 4>;

    <bb 3>:

    <bb 4>:
    # i_15 = PHI <i_8(3), 0(2)>
    # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>   ==> PHI generated.
    _4 = (float) i_15;
    total.0_5 = prephitmp_23;
    total.1_6 = _4 + total.0_5;
    total = total.1_6;
    i_8 = i_15 + 1;
    if (i_8 != 1000000000)
      goto <bb 3>;
    else
      goto <bb 5>;

The out-of-ssa phase should have coalesced prephitmp_23 and total.1_6(3) into the same temp var, but the existing out-of-ssa has a limitation: it will not coalesce ssa variables with different base var names, even if they are in the same phi and their live ranges don't conflict. So out-of-ssa inserts the redundant mov pretmp = total.1_6 in bb3:

    <bb 2>:
    pretmp = total;
    goto <bb 4>;

    <bb 3>:
    pretmp = total.1_6;   ==> inserted by out-of-ssa.

    <bb 4>:
    _4 = (float) i_15;
    total.1_6 = _4 + pretmp;
    i_8 = i_15 + 1;
    if (i_8 != 1000000000)
      goto <bb 3>;
    else
      goto <bb 5>;

The IRA phase has the potential to allocate pretmp and total.1_6 to the same hardreg and remove the extra mov, but in the above case the regmove phase happens to block IRA from doing the cleanup. regmove guesses the register constraints of an insn and tries to change the insn to satisfy those constraints before the IRA phase.
Usually this helps IRA make a better decision, but here regmove decides to merge _4 and total.1_6 into total.1_6 in order to satisfy the constraint of the two-operand plus on x86 (addss xmm1, xmm2). After _4 and total.1_6 are merged, the live range of total.1_6 conflicts with that of pretmp in IRA, so they cannot be allocated to the same hardreg, and the redundant mov (pretmp = total.1_6) cannot be deleted.

However, it is not trivial to make regmove choose to merge total.1_6 and pretmp instead, because that requires regmove to have global live range analysis (the existing regmove has only a simple correctness check, limited to a single bb).

If we use -mtune=corei7-avx, the redundant mov disappears. That is because with AVX support, regmove knows that AVX provides a three-operand plus (vaddss xmm1, xmm2, xmm3/m32), so it does not merge total.1_6 and _4, and IRA can then allocate total.1_6 and pretmp to the same hardreg.

If we change the type of total from float to int, the redundant mov also disappears, for a similar reason: x86 provides the LEA insn, which can be used as a three-operand plus, so regmove chooses not to merge total.1_6 and _4.

My question is: why can't out-of-ssa do the cleanup by coalescing all the vars without conflicts in the same phi stmt, instead of coalescing only the vars with the same base name?

Thanks,
Wei Mi.
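
P.S. To make the question concrete, here is a toy C model of the two coalescing policies for the PHI above. It is only a sketch and not GCC's code: the struct, the function names and the program-point numbers are all made up for illustration. I assume pretmp_22 and prephitmp_23 share one base (the out-of-ssa dump names both "pretmp"), and I model each live range as one half-open interval, which ignores conflicts between the PHI arguments themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy model; none of these names are GCC's. */
struct ssa_name {
    const char *base;  /* base variable the SSA name was derived from */
    int version;       /* SSA version suffix, e.g. 6 in total.1_6 */
    int live_start;    /* first program point at which the value is live */
    int live_end;      /* one past the last live point (half-open range) */
};

/* Two half-open ranges conflict when they share at least one point. */
static bool live_ranges_conflict(const struct ssa_name *a,
                                 const struct ssa_name *b)
{
    return a->live_start < b->live_end && b->live_start < a->live_end;
}

/* Policy described in the mail: same base name and no conflict. */
static bool coalesce_same_base(const struct ssa_name *a,
                               const struct ssa_name *b)
{
    return strcmp(a->base, b->base) == 0 && !live_ranges_conflict(a, b);
}

/* Policy asked about: any two names whose ranges do not conflict. */
static bool coalesce_no_conflict(const struct ssa_name *a,
                                 const struct ssa_name *b)
{
    return !live_ranges_conflict(a, b);
}

typedef bool (*coalesce_fn)(const struct ssa_name *, const struct ssa_name *);

/* Each PHI argument that cannot share the result's partition costs a copy. */
static int phi_copies_needed(const struct ssa_name *result,
                             const struct ssa_name *args, int nargs,
                             coalesce_fn can_coalesce)
{
    assert(result != NULL && args != NULL && nargs >= 0);
    int copies = 0;
    for (int i = 0; i < nargs; i++)
        if (!can_coalesce(result, &args[i]))
            copies++;
    return copies;
}

/* prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>, with assumed ranges. */
int demo_copies(bool same_base_only)
{
    struct ssa_name result = { "pretmp", 23, 1, 3 };  /* prephitmp_23 */
    struct ssa_name args[] = {
        { "total.1", 6, 3, 6 },  /* defined where prephitmp_23 dies */
        { "pretmp", 22, 0, 1 },  /* live only in the preheader */
    };
    return phi_copies_needed(&result, args, 2,
                             same_base_only ? coalesce_same_base
                                            : coalesce_no_conflict);
}
```

With these assumed ranges, demo_copies(true) returns 1 (the copy for total.1_6, matching the mov inserted in bb3), and demo_copies(false) returns 0.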