Re: Delay slot filling - what still matters, and what doesn't matter so much anymore?

Steven Bosscher Fri, 19 Apr 2013 14:54:09 -0700

On Thu, Apr 18, 2013 at 6:22 AM, Jeff Law wrote:
> On 04/17/2013 03:52 PM, Steven Bosscher wrote:
>>
>> First of all: What is still important to handle?
>>
>> It's clear that the expectations in reorg.c are "anything goes" but
>> modern RISCs (everything since the PA-8000, say) probably have some
>> limitations on what is helpful to have, or not have, in a delay slot.
>> According to the comments in pa.h about MASK_JUMP_IN_DELAY, having
>> jumps in delay slots of other jumps is one such thing: They don't
>> bring benefit to the PA-8000 and they don't work with DWARF2 CFI. As
>> far as I know, SPARC and MIPS don't allow jumps in delay slots, SH
>> looks like it doesn't allow it either, and CRIS can do it for short
>> branches but doesn't, because the trade-off between benefit and
>> machine description complexity comes out negative.
>
> Note that sparc and/or mips might use the adjust-the-return-pointer trick.
> I know it wasn't my idea when I added it to the PA.


If they use that trick, how it is supposed to work isn't documented.
It doesn't look like they do, though.

Test case:
---- 8< ----
void f1 (void);
void f2 (void);
void f3 (void);

void foo (long a)
{
  if (a != 0)
    {
      f1 ();
      goto skip_some;
    }
  else
    f2 ();
skip_some:
  f3 ();
}
---- 8< ----

sparc64 assembly (with -O2 -fno-reorder-blocks):
---- 8< ----
foo:
        save    %sp, -176, %sp
        brz,pt  %i0, .L2
         nop
        call    f1, 0
         nop
        ba,pt   %xcc, .L3
         nop
.L2:
        call    f2, 0
         nop
.L3:
        call    f3, 0
         restore
---- 8< ----
sparc32 is identical except for the frame size.

mipsisa64 assembly (also with -O2 -fno-reorder-blocks):
---- 8< ----
foo:
        .frame  $sp,8,$31               # vars= 0, regs= 1/0, args= 0, gp= 0
        .mask   0x80000000,0
        .fmask  0x00000000,0
        .set    noreorder
        .set    nomacro
        daddiu  $sp,$sp,-8
        beq     $4,$0,$L2
        sd      $31,0($sp)

        jal     f1
        nop

        j       $L6
        ld      $31,0($sp)

        .align  3
$L2:
        jal     f2
        nop

$L3 = .
        ld      $31,0($sp)
$L6:
        j       f3
        daddiu  $sp,$sp,8
---- 8< ----


>>  On the scheduler
>> implementation side: Branches as delayed insns in delay slots of other
>> branches is impossible to express in the CFG (at least in GCC, but I
>> think in general it can't be done cleanly). Therefore I want to drop
>> support for branches in delay slots. What do you think about this?
>
> Certainly no need to support it in the generic case.  The only question is
> whether or not it's worth supporting the adjust-the-return-pointer-in-the-
> delay-slot stuff.  Given a target without call/ret predictor stack, it can
> be a significant advantage.  Such things might exist in the embedded space.

This shouldn't be very difficult to support if the target models this
as a jump in the delay slot of calls only. I can let the delay slot
filler allow jumps in delay slots of calls but not in delay slots of
other jumps. But for the moment I'm going to ignore this case unless
someone knows a target in the FSF tree that would benefit from it.
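
The rule described above (jumps allowed in delay slots of calls, but
not in delay slots of other jumps) can be sketched as a small
predicate. This is only an illustration: the enum and the function
name are hypothetical, reorg.c works on RTL and has no such interface,
and the choice to reject calls as delay-slot candidates is an
assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical insn classification, for illustration only.  */
enum insn_kind { INSN_PLAIN, INSN_JUMP, INSN_CALL };

/* Return true if an insn of kind CANDIDATE may go into a delay slot
   of an insn of kind OWNER under the rule sketched in the text.  */
bool
delay_slot_candidate_ok (enum insn_kind owner, enum insn_kind candidate)
{
  /* In this model only jumps and calls have delay slots.  */
  if (owner == INSN_PLAIN)
    return false;

  switch (candidate)
    {
    case INSN_PLAIN:
      return true;
    case INSN_JUMP:
      /* A jump is acceptable only behind a call (the
         return-pointer trick), never behind another jump.  */
      return owner == INSN_CALL;
    case INSN_CALL:
      /* Assumption: calls are never delay-slot candidates.  */
      return false;
    }
  return false;
}
```

With this, delay_slot_candidate_ok (INSN_CALL, INSN_JUMP) is true and
delay_slot_candidate_ok (INSN_JUMP, INSN_JUMP) is false.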


>> What about multiple delay slots? It looks like reorg.c has code to
>> handle insns with multiple delay slots, but there currently are no GCC
>> targets in the FSF tree that have insns with multiple delay slots and
>> that use define_delay.
>
> Ping Hans, I think he was the last person who tried to deal with reorg and
> multiple delay slots (c4x?).  I certainly wouldn't lose any sleep if we
> killed the limited support for multiple delay slots.

Right, c4x has 3 delay slots. There are also out-of-tree ports for
targets like SHARC. But most such DSP-like targets have some form of
support for predication, so Bernd's c6x scheme would be a better fit.
(And c4x is too old to care about anyway :-)


>> Another thing I noticed about targets with delay slots that can be
>> nullified, is that at least some of the ifcvt.c transformations could
>> be applied to fill more delay slots (obviously if_case_1 and
>> if_case_2). In reorg.c, optimize_skip does some kind of if-conversion.
>> Has anyone looked at whether optimize_skip still does something, and
>> derived a test case for that?
>
> I doubt anyone has looked at it recently.  It pre-dates our if-conversion
> code by a decade or more.

So I collected some stats myself, for a small number (31) of files of
gcc itself, mostly from libcpp and various generator files, compiled
at -O2 for sparc64:

        ------- pass 1 -------  ------- pass 2 -------
        simple  eager   skip    simple  eager   skip
insns   9743    3488    22      1297    525     0
filled  5918    2980    22      21      0       0
hit%    61%     31%     0%      0%      0%      0%

total   pass 1  pass 2
insns   9743    1297
filled  8920    21
hit%    92%     2%

So the first fill_simple_delay_slots pass fills ~60% of the slots, and
the first fill_eager_delay_slots fills another ~30%. The second pass
is not very effective.
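
For reference, the hit% figures can be reproduced from the insns and
filled counts. The helper below is a minimal sketch, not how the stats
were actually collected; the name hit_percent and the
round-to-nearest convention are assumptions.

```c
#include <assert.h>

/* Percentage of SLOTS that were FILLED, rounded to the nearest
   integer.  Hypothetical helper, for checking the table only.  */
unsigned
hit_percent (unsigned filled, unsigned slots)
{
  assert (slots > 0);
  return (filled * 100u + slots / 2u) / slots;
}
```

Here hit_percent (5918, 9743) gives 61 and hit_percent (2980, 9743)
gives 31 for pass 1 simple and eager, and hit_percent (8920, 9743)
gives 92 for pass 1 as a whole.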

The "skip" column is for a separate counter for optimize_skip. It
triggers only 22 times in the first pass. If I disable if-conversion
(-fno-if-conversion{,2}) this number goes up to 31, and with
if-conversion *and* bb-reorder disabled (-fno-reorder-blocks) it hits
63 times. Even at -Os, without if-conversion, without bb-reorder,
optimize_skip only hits 77 times (compared to filled simple:7216;
filled eager:3563).
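
A candidate shape for an optimize_skip test case, going by the comment
above that function in reorg.c (a conditional branch that skips over
only one insn, which can then go into an annulled delay slot), would be
something like the following. This is a sketch only: the function name
is made up, and whether optimize_skip actually fires on it depends on
the target, the options, and what the earlier passes leave behind.

```c
/* Hypothetical candidate: when A is zero, the branch skips over a
   single store, which could go into an annulled delay slot.  */
void
clear_if_set (long a, long *p)
{
  if (a != 0)
    *p = 0;
}
```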

The effect of if-conversion is interesting on its own. With
if-conversion enabled there are 9743 insns needing delay slots.
Without if-conversion there are 12565 insns needing delay slots, but
they are apparently only a little more difficult to fill (89% vs. 92%
in pass 1), and the second pass fills more slots:

total   pass 1  pass 2                          
insns   12565   1434
filled  11185   66
hit%    89%     5%                              

In pass 1, simple fills 7420 slots (59%) and eager fills 3765 slots
(30%), which appears to be the typical ratio for sparc, but also for
the mips targets I've played with.

Ciao!
Steven
