Jeff Law wrote:
> Examples please?  We should be probing the outgoing args at the probe
> interval once the total static frame is greater than 3k.  The dynamic
> space should be probed by generic code.

OK, here are a few simple examples that enable a successful jump of the
stack guard despite -fstack-clash-protection:

    int t1(int x)
    {
      char arr[3000];
      return arr[x];
    }

    int t2(int x)
    {
      char *p = __builtin_alloca (4050);
      x = t1 (x);
      return p[x];
    }

    #define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
    #define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
    void out1(ARG192(__int128));

    int t3(int x)
    {
      if (x < 1000)
        return t1 (x) + 1;

      out1 (ARG192(0));
      return 0;
    }

This currently generates:

    t1:
        sub     sp, sp, #3008
        add     x1, sp, 8
        ldrb    w0, [x1, w0, sxtw]
        add     sp, sp, 3008
        ret

    t2:
        stp     x29, x30, [sp, -16]!
        mov     x1, 4072
        add     x29, sp, 0
        sub     sp, sp, #4080
        mov     x2, sp
        str     xzr, [sp, x1]
        bl      t1
        ldrb    w0, [x2, w0, sxtw]
        add     sp, x29, 0
        ldp     x29, x30, [sp], 16
        ret

    t3:
        stp     x29, x30, [sp, -16]!
        cmp     w0, 999
        add     x29, sp, 0
        sub     sp, sp, #3008
        bgt     .L15
        bl      t1
        add     w0, w0, 1
    .L14:
        add     sp, x29, 0
        ldp     x29, x30, [sp], 16
        ret

As you can see t2 allocates 4080 bytes on the stack but probes only the
top, leaving 4072 bytes unprobed. When it calls t1, it drops the stack by
another 3008 bytes without a probe (as it is allowed up to 3KB), so now
we've got a distance of almost 7KB between 2 probes...

Similarly functions with large outgoing arguments are not correctly
probed. t3 creates 3008 bytes of outgoing area without a probe and then
calls t1 which will decrement the stack by another 3008 bytes without a
probe.

Both t2 and t3 must probe SP+1024 to ensure the callee can adjust SP by up
to 3KB before emitting a probe.

> What we should be doing, per your request is emit an initial probe if we
> know the function is going to require probing of any form.  Then we emit
> probes at 4k intervals.  At least that's how I understood your
> simplification.  So for a 7k stack that's two probes -- one at *sp at
> the start of the prologue then the second after the first 4k is allocated.

There is no benefit in doing an initial probe that way.
We don't have to probe the first 3KB even if the function has a larger
stack. So if we have a 6KB stack, we can drop the stack by 3KB, probe,
drop it by another 3KB, then push the callee-saves. If the stack is larger
than 7KB we probe at 7KB, then 11KB etc.

> > I don't understand what last_probe_offset is for, it should always be zero
> > after saving the callee-saves (though currently it is not set correctly).
> > And it has to be set to a fixed value to limit the maximum outgoing args
> > when doing final_adjust.
> It's supposed to catch the case where neither the initial adjustment nor
> final adjustment in and of themselves require probes, but the
> combination would.

That's not possible - the 2 cases are completely independent given that
the callee-saves always provide an implicit probe (you can't have outgoing
arguments if there is no call, and you can't have a call without
callee-saving LR).

> My understanding was that you didn't want that level of tracking around
> the callee saves.  It's also not clear if that will work in the presence
> of separate shrink wrapping.

Shrinkwrapping will be safe once I ensure LR is at the bottom of the
callee-saves. With this fix we know for sure that if there is a call,
we've saved LR at SP+0 before adjusting SP by final_adjust. This means we
don't need to try to figure out where the last probe was - it's always
SP+0 if final_adjust != 0, simplifying things.

> I have no idea what you mean by setting to a fixed value to limit the
> maximum outgoing args.  We don't limit the maximum outgoing args -- we
> probe that space just like any other as we cross 4k boundaries.

No, that's not correct - remember we want to reserve 1024 bytes for the
outgoing area. So we must probe if the outgoing area is > 1024, not 4KB
(and we need to probe again at 5KB, 9KB etc).

> We're clearly not communicating well.  Example would probably help.

My t3 example above shows how you can easily jump the stack guard using
lots of outgoing args (but only if you call a different function with no
outgoing args first!).

Wilco
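
For reference, the probe offsets discussed above (3KB, 7KB, 11KB for the frame; 1KB, 5KB, 9KB for the outgoing area) can be written down as a small C sketch. This is only an illustration of the arithmetic, not the GCC implementation: the names probe_offsets, GUARD_SIZE, FRAME_FIRST_PROBE and OUTGOING_FIRST_PROBE are hypothetical, and it assumes a 4KB probe interval and that an allocation of exactly the threshold size needs no probe.

```c
#include <stddef.h>

/* Hypothetical constants for this sketch (not GCC identifiers). */
enum {
  GUARD_SIZE           = 4096, /* assumed probe interval */
  FRAME_FIRST_PROBE    = 3072, /* frame: first 3KB may stay unprobed */
  OUTGOING_FIRST_PROBE = 1024  /* outgoing args: 1KB reserved */
};

/* Store in offsets[] the distances (in bytes from the start of an
   allocation of `size` bytes) at which a probe would be emitted:
   first, first + 4KB, first + 8KB, ... for every such distance that
   is strictly smaller than size.  At most `max` entries are written.
   Returns the number of entries written.  */
static size_t
probe_offsets (size_t size, size_t first, size_t *offsets, size_t max)
{
  size_t n = 0;

  for (size_t off = first; off < size && n < max; off += GUARD_SIZE)
    offsets[n++] = off;

  return n;
}
```

With FRAME_FIRST_PROBE, the 3008-byte frame of t1 gives no probes and a 6KB frame gives a single probe at 3KB. With OUTGOING_FIRST_PROBE, the 3008-byte outgoing area of t3 gives one probe at 1024 bytes, which is the probe missing from the generated code above.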