(Sorry if you received the mail twice. The first one was rejected
because it was not plain text mode)

For the following case:

float total = 0.2;

int main() {
 int i;

 for (i = 0; i < 1000000000; i++) {
   total += i;
 }

 return total == 0.3;
}

The gcc assembly of its kernel loop is:

.L3:
       movaps  %xmm0, %xmm1
.L2:
       cvtsi2ss        %eax, %xmm0
       addl    $1, %eax
       cmpl    $1000000000, %eax
       addss   %xmm1, %xmm0
       jne     .L3

The movaps is redundant; the loop could be changed to:

.L3:
       cvtsi2ss        %eax, %xmm1
       addl    $1, %eax
       cmpl    $1000000000, %eax
       addss   %xmm1, %xmm0
       jne     .L3

Manually removing the extra movaps improves performance from 1.26s to
0.95s on sandybridge using trunk (r201859).

Load PRE tries to hoist the load of total out of the loop, so it
generates a new PHI at the start of the loop body:

 <bb 2>:
 pretmp_22 = total;
 goto <bb 4>;

 <bb 3>:

 <bb 4>:
 # i_15 = PHI <i_8(3), 0(2)>
 # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>       ==> PHI generated.
 _4 = (float) i_15;
 total.0_5 = prephitmp_23;
 total.1_6 = _4 + total.0_5;
 total = total.1_6;
 i_8 = i_15 + 1;
 if (i_8 != 1000000000)
   goto <bb 3>;
 else
   goto <bb 5>;

The out-of-ssa phase should have coalesced prephitmp_23 and
total.1_6(3) into the same temporary, but the existing out-of-ssa has a
limitation: it will not coalesce SSA variables with different base
variable names, even if they are in the same PHI and their live ranges
do not conflict. So out-of-ssa inserts the redundant mov
pretmp = total.1_6 in bb3.

 <bb 2>:
 pretmp = total;
 goto <bb 4>;

 <bb 3>:
 pretmp = total.1_6;        ==> inserted by out-of-ssa.

 <bb 4>:
 _4 = (float) i_15;
 total.1_6 = _4 + pretmp;
 i_8 = i_15 + 1;
 if (i_8 != 1000000000)
   goto <bb 3>;
 else
   goto <bb 5>;
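
In source-level terms, the two forms of the loop can be sketched as
below. This is only a hand-written analogue of the dumps above, not
compiler output; the function names and the n parameter are made up for
illustration.

```c
float total = 0.2f;

/* Analogue of the loop after out-of-ssa: the sum lives in its own
   temporary and is copied back into pretmp on the back edge. */
float loop_with_copy(int n)
{
    float pretmp = total;
    for (int i = 0; i < n; i++) {
        float cvt = (float)i;      /* _4 */
        float sum = cvt + pretmp;  /* total.1_6 */
        total = sum;
        pretmp = sum;              /* the copy inserted in bb 3 */
    }
    return total;
}

/* Analogue of the loop if pretmp and total.1_6 were coalesced into one
   variable: no copy is needed on the back edge. */
float loop_coalesced(int n)
{
    float acc = total;
    for (int i = 0; i < n; i++) {
        acc = (float)i + acc;
        total = acc;
    }
    return total;
}
```

Both functions perform the same additions in the same order, so they
should compute the same value; the only difference is the extra copy.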

The IRA phase has the potential to allocate pretmp and total.1_6 to the
same hard register and remove the extra mov, but in the above case the
regmove phase happens to block IRA from doing the cleanup. Before IRA
runs, regmove guesses the register constraints of an insn and tries to
change the insn to satisfy them. Usually this helps IRA make a better
decision, but here regmove decides to merge _4 and total.1_6 into
total.1_6 in order to satisfy the constraint of the two-operand plus
on x86 (addss xmm1, xmm2). After _4 and total.1_6 are merged, the live
range of total.1_6 conflicts with that of pretmp in IRA, so they
cannot be allocated to the same hard register, and the redundant mov
(pretmp = total.1_6) cannot be deleted. However, it is not trivial to
make regmove choose to merge total.1_6 and pretmp instead, because that
requires regmove to have global live range analysis (the existing
regmove only has a simple correctness check limited to a single bb).
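
The conflict can be sketched with a toy liveness model of the loop
body. This is a simplified illustration, not how IRA actually builds
its conflict graph: each bit marks "live just after this insn", and
the enum, variable, and function names are all made up for the
example.

```c
#include <stdbool.h>

/* Program points in the loop: just after each of the four insns. */
enum {
    AFTER_CVT  = 1 << 0,  /* _4 = (float) i_15        */
    AFTER_ADD  = 1 << 1,  /* total.1_6 = _4 + pretmp  */
    AFTER_INC  = 1 << 2,  /* i_8 = i_15 + 1           */
    AFTER_COPY = 1 << 3   /* pretmp = total.1_6       */
};

/* pretmp is defined by the copy and last used by the add. */
static const unsigned live_pretmp = AFTER_COPY | AFTER_CVT;
/* _4 is defined by the conversion and last used by the add. */
static const unsigned live_cvt = AFTER_CVT;
/* total.1_6 is defined by the add and last used by the copy. */
static const unsigned live_sum = AFTER_ADD | AFTER_INC;

/* Two values conflict if they are live at a common point. */
static bool conflicts(unsigned a, unsigned b)
{
    return (a & b) != 0;
}

/* Merging two values into one pseudo unions their live ranges. */
static unsigned merge(unsigned a, unsigned b)
{
    return a | b;
}
```

In this model conflicts(live_pretmp, live_sum) is false, so the two
could share a register, while conflicts(live_pretmp,
merge(live_cvt, live_sum)) is true, which is the situation after
regmove merges _4 into total.1_6.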

If we use -mtune=corei7-avx, the redundant mov disappears. That is
because, with AVX support enabled, regmove knows that AVX provides a
three-operand plus (vaddss xmm1, xmm2, xmm3/m32), so it does not merge
total.1_6 and _4, and IRA can then allocate total.1_6 and pretmp to the
same hard register.

If we change the type of total from float to int, the redundant mov
also disappears, for a similar reason: x86 provides the LEA insn,
which can be used as a three-operand plus, so regmove chooses not to
merge total.1_6 and _4.

My question is, why out-of-ssa cannot do the cleanup by coalescing all
the vars without conflicts in the same phi stmt, instead of only
coalescing the vars with the same base name?

Thanks,
Wei Mi.
