Thanks Richard. Yes, without that restriction the number of
partitions in the partition map will increase somewhat, but I don't
think it will increase much, for two reasons:

1. The coalesce list is usually not very big, and only the vars in
that list are added to the conflict graph. That already shrinks the
conflict graph bitmaps a lot.

2. An ssa var may appear in multiple phi stmts. Suppose that in
phi-stmt1 it has a different base name from the other phi args, while
in phi-stmt2 it has the same base name as the other phi args. In that
case the ssa var is added to the conflict graph anyway because of
phi-stmt2, but with the restriction it is not added to the coalesce
list for phi-stmt1. So the restriction blocks the coalescing
opportunity in phi-stmt1 without saving any memory.

I hacked the out-of-ssa phase so that vars with different base names
in the same phi are added to the coalesce list. I tried SPEC2000 int
and saw no significant memory increase for the expand phase. (I used
your -fmem-report patch to dump the memory usage of each pass. It is
useful; I wonder why it didn't go into trunk.)

Thanks,
Wei Mi.



On Fri, Aug 23, 2013 at 5:10 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> Wei Mi <w...@google.com> wrote:
>>For the following case:
>>
>>float total = 0.2;
>>
>>int main() {
>> int i;
>>
>> for (i = 0; i < 1000000000; i++) {
>>   total += i;
>> }
>>
>> return total == 0.3;
>>}
>>
>>The gcc assembly of its kernel loop is:
>>
>>.L3:
>>       movaps  %xmm0, %xmm1
>>.L2:
>>       cvtsi2ss        %eax, %xmm0
>>       addl    $1, %eax
>>       cmpl    $1000000000, %eax
>>       addss   %xmm1, %xmm0
>>       jne     .L3
>>
>>The movaps is redundant; the loop could be changed to:
>>
>>.L3:
>>       cvtsi2ss        %eax, %xmm1
>>       addl    $1, %eax
>>       cmpl    $1000000000, %eax
>>       addss   %xmm1, %xmm0
>>       jne     .L3
>>
>>Manually removing the extra movaps improves performance from 1.26s to
>>0.95s on sandybridge using trunk (r201859).
>>
>>Load PRE tries to promote the MEM op of total out of the loop; it
>>generates a new PHI at the start of the loop body:
>>
>> <bb 2>:
>> pretmp_22 = total;
>> goto <bb 4>;
>>
>> <bb 3>:
>>
>> <bb 4>:
>> # i_15 = PHI <i_8(3), 0(2)>
>> # prephitmp_23 = PHI <total.1_6(3), pretmp_22(2)>       ==> PHI generated.
>> _4 = (float) i_15;
>> total.0_5 = prephitmp_23;
>> total.1_6 = _4 + total.0_5;
>> total = total.1_6;
>> i_8 = i_15 + 1;
>> if (i_8 != 1000000000)
>>   goto <bb 3>;
>> else
>>   goto <bb 5>;
>>
>>The out-of-ssa phase should have coalesced prephitmp_23 and
>>total.1_6(3) into the same temp var, but the existing out-of-ssa has
>>a limitation: it will not coalesce ssa variables with different base
>>var names, even if they are in the same phi and their live ranges
>>don't conflict. So out-of-ssa inserts the redundant mov
>>pretmp = total.1_6 in bb3.
>>
>> <bb 2>:
>> pretmp = total;
>> goto <bb 4>;
>>
>> <bb 3>:
>> pretmp = total.1_6;        ==> inserted by out-of-ssa.
>>
>> <bb 4>:
>> _4 = (float) i_15;
>> total.1_6 = _4 + pretmp;
>> i_8 = i_15 + 1;
>> if (i_8 != 1000000000)
>>   goto <bb 3>;
>> else
>>   goto <bb 5>;
>>
>>The IRA phase has the potential to allocate pretmp and total.1_6 to
>>the same hardreg and remove the extra mov, but in the above case the
>>regmove phase happens to block IRA from doing the cleanup. regmove
>>guesses the register constraints of an insn and tries to change the
>>insn to satisfy them before the IRA phase. Usually this helps IRA
>>make a better decision, but here regmove decides to merge _4 and
>>total.1_6 into total.1_6 in order to satisfy the constraint of the
>>two-operand plus on x86 (addss xmm1, xmm2). After _4 and total.1_6
>>are merged, the live range of total.1_6 conflicts with that of
>>pretmp in IRA, so they cannot be allocated to the same hardreg and
>>the redundant mov (pretmp = total.1_6) cannot be deleted. However, it
>>is not trivial to make regmove choose to merge total.1_6 and pretmp
>>instead, because that requires regmove to have global live range
>>analysis (the existing regmove only has a simple correctness check
>>in a range limited to a single bb).
>>
>>If we use -mtune=corei7-avx, the redundant mov disappears. That is
>>because with avx support regmove knows that avx provides a
>>three-operand plus (vaddss xmm1, xmm2, xmm3/m32), so it does not
>>merge total.1_6 and _4, and IRA can then allocate total.1_6 and
>>pretmp to the same hardreg.
>>
>>If we change the type of total from float to int, the redundant mov
>>also disappears, for a similar reason: x86 provides the LEA insn,
>>which can be used as a three-operand plus, so regmove chooses not to
>>merge total.1_6 and _4.
>>
>>My question is: why can't out-of-ssa do the cleanup by coalescing
>>all the vars without conflicts in the same phi stmt, instead of only
>>coalescing the vars with the same base name?
>
> The restriction exists to keep conflict bitmaps small. Otherwise you'll have 
> quadratic memory usage for them.
>
> Richard.
>
>>Thanks,
>>Wei Mi.
>
>
