On Fri, Dec 4, 2015 at 11:00 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 4, 2015 at 10:48 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Dec 2, 2015 at 5:11 AM, Steve Ellcey <sell...@imgtec.com> wrote:
>>>
>>> I have a question involving ivopts and PR 48814, which was a fix for
>>> the post increment operation.  Prior to the fix for PR 48814, MIPS
>>> would generate this loop for strcmp (C code from glibc):
>>>
>>> $L4:
>>>         lbu     $3,0($4)
>>>         lbu     $2,0($5)
>>>         addiu   $4,$4,1
>>>         beq     $3,$0,$L7
>>>         addiu   $5,$5,1    # This is a branch delay slot
>>>         beq     $3,$2,$L4
>>>         subu    $2,$3,$2   # This is a branch delay slot (only used after 
>>> loop)
>>>
>>>
>>> With the current top-of-tree we now generate:
>>>
>>>         addiu   $4,$4,1
>>> $L8:
>>>         lbu     $3,-1($4)
>>>         addiu   $5,$5,1
>>>         beq     $3,$0,$L7
>>>         lbu     $2,-1($5)  # This is a branch delay slot
>>>         beq     $3,$2,$L8
>>>         addiu   $4,$4,1    # This is a branch delay slot
>>>
>>>         subu    $2,$3,$2   # Done only once now after exiting loop.
>>>
>>> The main problem with the new loop is that the beq comparing $2 and $3
>>> is right before the load of $2 so there can be a delay due to the time
>>> that the load takes.  The ideal code would probably be:
>>>
>>>         addiu   $4,$4,1
>>> $L8:
>>>         lbu     $3,-1($4)
>>>         lbu     $2,0($5)  # This is a branch delay slot
>>>         beq     $3,$0,$L7
>>>         addiu   $5,$5,1
>>>         beq     $3,$2,$L8
>>>         addiu   $4,$4,1    # This is a branch delay slot
>>>
>>>         subu    $2,$3,$2   # Done only once now after exiting loop.
>>>
>>> Where we load $2 earlier (using a 0 offset instead of a -1 offset) and
>>> then do the increment of $5 after using it in the load.  The problem
>>> is that this isn't something that can just be done in the instruction
>>> scheduler because we are changing one of the instructions (to modify the
>>> offset) in addition to rearranging them and I don't think the instruction
>>> scheduler supports that.
>> Hmm, I think Bernd introduced sched_flag !DONT_BREAK_DEPENDENCIES to
>> resolve dependence by modifying address expression.  I think this is
>> the same problem, what's needed is to model dependence using that
>> framework.  Maybe delay slot is special here?
>>
>>>
>>> It looks like is the ivopts code that decided to increment the registers
>>> first and use the -1 offsets in the loads after instead of using 0 offsets
>>> and then incrementing the offsets after the loads but I can't figure out
>>> how or why ivopts made that decision.
>>>
>>> Does anyone have any ideas on how I could 'fix' GCC to make it generate
>>> the ideal code?  Is there some way to do it in the instruction scheduler?
>>> Is there some way to modify ivopts to fix this by modifying the cost
>> It's likely IVO just peaks the first candidate when it runs into a
>> tie.  Could you please post preprocessed source code so that I can
>> have a look?  I am not familiar with glibc.  Thanks.
> Oh, I saw the example in another thread of yours.
>
Dump before IVO is as below:

  <bb 3>:
  # s1_1 = PHI <p1_4(D)(2), s1_6(6)>
  # s2_2 = PHI <p2_5(D)(2), s2_9(6)>
  s1_6 = s1_1 + 1;
  c1_8 = *s1_1;
  s2_9 = s2_2 + 1;
  c2_10 = *s2_2;
  if (c1_8 == 0)
    goto <bb 4>;
  else
    goto <bb 5>;

And the iv candidates are as:
candidate 1 (important)
  var_before ivtmp.6
  var_after ivtmp.6
  incremented before exit test
  type unsigned int
  base (unsigned int) p1_4(D)
  step 1
  base object (void *) p1_4(D)
candidate 2 (important)
  original biv
  type const unsigned char *
  base (const unsigned char *) p1_4(D)
  step 1
  base object (void *) p1_4(D)
candidate 3 (important)
  var_before ivtmp.7
  var_after ivtmp.7
  incremented before exit test
  type unsigned int
  base (unsigned int) p2_5(D)
  step 1
  base object (void *) p2_5(D)
candidate 4 (important)
  original biv
  type const unsigned char *
  base (const unsigned char *) p2_5(D)
  step 1
  base object (void *) p2_5(D)

Generally GCC would choose normal candidates {1, 3} and insert
increment before exit condition.  This is expected in this case.  But
when there is applicable original candidates {2, 4}, GCC would prefer
these in order to achieve better debugging.  Also as I suspected,
[reg] and [reg-1] have same address cost on mips, that's why GCC makes
current decision.

Thanks,
bin

Reply via email to