Jeff Law wrote:
    
> Examples please?  We should be probing the outgoing args at the probe
> interval  once the total static frame is greater than 3k.  The dynamic
> space should be probed by generic code.

OK, here are a few simple examples that successfully jump the stack guard
despite -fstack-clash-protection:

int t1(int x)
{
  char arr[3000];
  return arr[x];
}

int t2(int x)
{
  char *p = __builtin_alloca (4050);
  x = t1 (x);
  return p[x];
}

#define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
#define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
void out1(ARG192(__int128));

int t3(int x)
{
  if (x < 1000)
    return t1 (x) + 1;

  out1 (ARG192(0));
  return 0;
}


This currently generates:

t1:
        sub     sp, sp, #3008
        add     x1, sp, 8
        ldrb    w0, [x1, w0, sxtw]
        add     sp, sp, 3008
        ret

t2:
        stp     x29, x30, [sp, -16]!
        mov     x1, 4072
        add     x29, sp, 0
        sub     sp, sp, #4080
        mov     x2, sp
        str     xzr, [sp, x1]
        bl      t1
        ldrb    w0, [x2, w0, sxtw]
        add     sp, x29, 0
        ldp     x29, x30, [sp], 16
        ret

t3:
        stp     x29, x30, [sp, -16]!
        cmp     w0, 999
        add     x29, sp, 0
        sub     sp, sp, #3008
        bgt     .L15
        bl      t1
        add     w0, w0, 1
.L14:
        add     sp, x29, 0
        ldp     x29, x30, [sp], 16
        ret

As you can see, t2 allocates 4080 bytes on the stack but probes only the top,
leaving 4072 bytes unprobed. When it calls t1, t1 drops the stack by another
3008 bytes without a probe (as it is allowed up to 3KB), so now we've got a
distance of almost 7KB between 2 probes...

Similarly, functions with large outgoing arguments are not correctly probed.
t3 creates 3008 bytes of outgoing area without a probe and then calls t1,
which will decrement the stack by another 3008 bytes without a probe.

Both t2 and t3 must probe SP+1024 to ensure the callee can adjust SP by up to
3KB before emitting a probe.
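
To make the arithmetic above concrete, here is a minimal sketch in C. The
helper names (unprobed_gap, gap_is_safe, caller_probe_ok) are made up for
illustration and are not part of GCC; the 4KB guard and the 3KB callee
allowance are the values assumed in this thread.

```c
#include <assert.h>

/* Assumed values from this discussion: a 4KB guard, and a callee that may
   drop SP by up to 3KB before its first probe.  */
enum { GUARD_BYTES = 4096, CALLEE_UNPROBED_MAX = 3072 };

/* Hypothetical helper: bytes between the caller's last probe and the lowest
   SP the callee reaches before it probes.  probe_above_sp is how far above
   the caller's final SP the caller's last probe (or store) landed.  */
static long unprobed_gap(long probe_above_sp, long callee_drop)
{
  assert(probe_above_sp >= 0 && callee_drop >= 0);
  return probe_above_sp + callee_drop;
}

/* Nonzero if the gap for this particular caller/callee pair fits within
   one guard.  */
static int gap_is_safe(long probe_above_sp, long callee_drop)
{
  return unprobed_gap(probe_above_sp, callee_drop) <= GUARD_BYTES;
}

/* Nonzero if the caller's last probe is close enough to SP that any callee
   using its full unprobed allowance still stays within one guard.  */
static int caller_probe_ok(long probe_above_sp)
{
  return gap_is_safe(probe_above_sp, CALLEE_UNPROBED_MAX);
}
```

For t2 this gives 4072 + 3008 = 7080 bytes and for t3 3008 + 3008 = 6016
bytes, both larger than a 4KB guard, whereas a probe at SP+1024 leaves at
most 1024 + 3072 = 4096.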


> What we should be doing, per your request is emit an initial probe if we
> know the function is going to require probing of any form.  Then we emit
> probes at 4k intervals.  At least that's how I understood your
> simplification.  So for a 7k stack that's two probes -- one at *sp at
> the start of the prologue then the second after the first 4k is allocated.

There is no benefit in doing an initial probe that way. We don't have to probe
the first 3KB even if the function has a larger stack. So if we have a 6KB
stack, we can drop the stack by 3KB, probe, drop it by another 3KB, then push
the callee-saves. If the stack is larger than 7KB we probe at 7KB, then 11KB
etc.
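
A rough sketch of that schedule in C, in case it helps. The function name
frame_probe_offsets and the constants are hypothetical and only model the
scheme described above (first probe after 3KB, then one every 4KB); this is
not how the GCC prologue code is actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: no probe needed for the first 3KB, then one per 4KB.  */
enum { FIRST_PROBE = 3072, PROBE_INTERVAL = 4096 };

/* Hypothetical helper: stores the offsets below the incoming SP at which an
   explicit probe is emitted for a frame of frame_size bytes.  At most max
   offsets are written to out; the return value is the total number of
   probes the frame needs.  The callee-saves pushed at the end of the frame
   act as an implicit probe and are not counted.  */
static size_t frame_probe_offsets(long frame_size, long *out, size_t max)
{
  size_t n = 0;

  assert(frame_size >= 0);
  /* A probe at offset off is only needed if the frame extends below it.  */
  for (long off = FIRST_PROBE; off < frame_size; off += PROBE_INTERVAL)
    {
      if (out != NULL && n < max)
        out[n] = off;
      n++;
    }
  return n;
}
```

With these assumptions a 3KB frame needs no probe, a 6KB frame needs one
(at 3KB), and a 12KB frame needs three (at 3KB, 7KB and 11KB).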

> > I don't understand what last_probe_offset is for, it should always be zero
> > after saving the callee-saves (though currently it is not set correctly).
> > And it has to be set to a fixed value to limit the maximum outgoing args
> > when doing final_adjust.
> It's supposed to catch the case where neither the initial adjustment nor
> final adjustment in and of themselves require probes, but the
> combination would.

That's not possible - the 2 cases are completely independent given that the
callee-saves always provide an implicit probe (you can't have outgoing arguments
if there is no call, and you can't have a call without callee-saving LR).

> My understanding was that you didn't want that level of tracking around
> the callee saves.  It's also not clear if that will work in the presence
> of separate shrink wrapping.

Shrinkwrapping will be safe once I ensure LR is at the bottom of the
callee-saves. With this fix we know for sure that if there is a call, we've
saved LR at SP+0 before adjusting SP by final_adjust. This means we don't need
to try to figure out where the last probe was - it's always SP+0 if
final_adjust != 0, simplifying things.

> I have no idea what you mean by setting to a fixed value to limit the
> maximum outgoing args.  We don't limit the maximum outgoing args -- we
> probe that space just like any other as we cross 4k boundaries.

No, that's not correct - remember we want to reserve 1024 bytes for the
outgoing area. So we must probe if the outgoing area is > 1024, not 4KB (and
we need to probe again at 5KB, 9KB etc).
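
As a small illustration in C of the rule I mean (the name
outgoing_probe_count is invented for this example, and it assumes the last
probe before final_adjust is the LR save at SP+0):

```c
#include <assert.h>

/* Assumed values: 1024 bytes reserved for outgoing args, 4KB between
   probes after that.  */
enum { OUTGOING_RESERVE = 1024, PROBE_INTERVAL = 4096 };

/* Hypothetical helper: number of probes needed while allocating an outgoing
   argument area of final_adjust bytes.  Nothing is needed up to 1024 bytes;
   beyond that, one probe once the area exceeds 1KB, another once it exceeds
   5KB, another once it exceeds 9KB, and so on.  */
static long outgoing_probe_count(long final_adjust)
{
  assert(final_adjust >= 0);
  if (final_adjust <= OUTGOING_RESERVE)
    return 0;
  return 1 + (final_adjust - OUTGOING_RESERVE - 1) / PROBE_INTERVAL;
}
```

So the 3008-byte outgoing area in t3 would need one probe under this rule,
while an area of exactly 1024 bytes would need none.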

> We're clearly not communicating well.  Example would probably help.

My t3 example above shows how you can easily jump the stack guard using lots of
outgoing args (but only if you call a different function with no outgoing args
first!).

Wilco