On 07/21/2017 12:17 PM, Wilco Dijkstra wrote: > Jeff Law wrote: > >> Examples please? We should be probing the outgoing args at the probe >> interval once the total static frame is greater than 3k. The dynamic >> space should be probed by generic code. > > OK, here are a few simple examples that enable a successful jump of the stack > guard despite -fstack-clash-protection: > > int t1(int x) > { > char arr[3000]; > return arr[x]; > } > > int t2(int x) > { > char *p = __builtin_alloca (4050); > x = t1 (x); > return p[x]; > } > > #define ARG32(X) > X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X > #define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X) > void out1(ARG192(__int128)); > > int t3(int x) > { > if (x < 1000) > return t1 (x) + 1; > > out1 (ARG192(0)); > return 0; > } [ ... ] Thanks. Very helpful.
At the root, the t2->t1 case is a result of the compromise we made for aarch64 prologue generation. Namely that it doesn't make conservatively correct assumptions. Instead it makes an optimistic assumption about the most recent probe. As a result for aarch64, we have to probe the alloca residual. Essentially we changed a fundamental aspect of the probing model and there's fallout. And note that in general, if we're presented with a request for dynamic stack space that is not a compile-time constant, then we have to guard that additional probe so that it's never executed if the total size of the requested space was 0 bytes. Ugh. But manageable. The t3->t2 case is just an mis-understanding + oversight on my part. It also comes into play because of the optimistic assumptions we've agreed to make in aarch64 prologue generation. I'll take care of them. But it does show the danger of making those optimistic assumptions in the prologue code. > > >> What we should be doing, per your request is emit an initial probe if we >> know the function is going to require probing of any form. Then we emit >> probes at 4k intervals. At least that's how I understood your >> simplification. So for a 7k stack that's two probes -- one at *sp at >> the start of the prologue then the second after the first 4k is allocated. > > There is no benefit in doing an initial probe that way. We don't have to probe > the first 3KB even if the function has a larger stack. So if we have a 6KB > stack, > we can drop the stack by 3KB, probe, drop it by another 3KB, then push the > callee-saves. If the stack is larger than 7KB we probe at 7KB, then 11KB etc. It simplifies some stuff and I thought that's why you wanted it. I don't mind taking it out and making the code a bit smarter. >>> I don't understand what last_probe_offset is for, it should always be zero >>> after saving the callee-saves (though currently it is not set correctly). >>> And it >>> has to be set to a fixed value to limit the maximum outgoing args when doing >>> final_adjust. >> It's supposed to catch the case where neither the initial adjustment nor >> final adjustment in and of themselves require probes, but the >> combination would. > > That's not possible - the 2 cases are completely independent given that the > callee-saves always provide an implicit probe (you can't have outgoing > arguments > if there is no call, and you can't have a call without callee-saving LR). Thanks. That does help. But it also highlights my point above. THe port maintainers are in a better position to refine the basic implementation. What is very obvious to you because you're familiar with the port wasn't obvious to me. We've seen this play out elsewhere with the callee saves area allocation as well. > >> My understanding was that you didn't want that level of trackign around >> the callee saves. It's also not clear if that will work in the presence >> of separate shrink wrapping. > > Shrinkwrapping will be safe once I ensure LR is at the bottom of the > callee-saves. > With this fix we know for sure that if there is a call, we've saved LR at > SP+0 before > adjusting SP by final_adjust. This means we don't need to try to figure out > where > the last probe was - it's always SP+0 if final_adjust != 0, simplifying > things. Is that the case right now? If not I'd prefer to not add back tracking callee saves. You can add that tracking when you change the location of LR. > >> I have no idea what you mean by setting to a fixed value for limit the >> maximum outgoing args. We don't limit the maximum outgoing args -- we >> probe that space just like any other as we cross 4k boundaries. > > No that's not correct - remember we want to reserve 1024 bytes for the > outgoing > area. So we must probe if the outgoing area is > 1024, not 4KB (and we need to > probe again at 5KB, 9KB etc). I think this is just a communication issue -- sorry. Yes we need to make sure that in the case where we have more than 1k of outgoing arg space that we probe it. My plan is to probe at the given probe interval, then probe the lowest address if there's more than 1k of outgoing arg space. > >> We're clearly not communicating well. Example would probably help. > > My t3 example above shows how you can easily jump the stack guard using lots > of > outgoing args (but only if you call a different function with no outgoing > args first!). Right. I just missed the special case probe of the lowest addr in the outgoing space if it's > 1k. Jeff