Hi Jeff,

There is an issue with your AArch64 patch: it fails to apply properly with 'patch', and does so silently. I also noticed some odd control characters in the other patches, but they didn't appear to cause failures (or at least everything builds).
Anyway, with -Ofast -static the overall code size increase is ~0.7% on SPEC2017, with several benchmarks above 1% and one over 3%. So -O2 and dynamic linking should get you well above 1% on average. This is a bit too much static overhead to enable by default. The current version ICEs with a 64KB probe size, so I can't see whether that has lower overhead.

I briefly looked at the generated sequences. The biggest issue is that they don't work together and thus do not protect against jumping the stack guard. For example, both alloca and outgoing arguments can allocate 4KB without probing, and then call a function which allocates another 3KB on top of that (i.e. the maximum probe distance is 7KB...).

There are also too many probes emitted. For example, a function with a 7KB frame emits 2 explicit probes when it should emit just 1: with a 4KB probe size we can adjust the stack by 3KB, then probe, then adjust by another 4KB, then save the callee-saves and adjust by up to 1KB for outgoing arguments without inserting more probes.

I don't understand what last_probe_offset is for: it should always be zero after saving the callee-saves (though currently it is not set correctly). And it has to be set to a fixed value to limit the maximum outgoing args when doing final_adjust.

Finally, emitting inline loops generates a large amount of code. Although we can easily reduce the overhead, especially for alloca, I think using helper functions would be best.

Wilco
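
P.S. To make the 7KB example concrete, here is a rough sketch of the probe arithmetic as I read it. The function name and parameters are purely illustrative and not taken from the patch. It assumes that probes may be at most probe_size bytes apart, that the caller may have left up to max_outgoing bytes unprobed, that the callee-save stores act as an implicit probe at the bottom of the frame, and that frame_size excludes the callee's own outgoing arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch): number of explicit probes
   needed before the callee-saves are stored, under these assumptions:
   - consecutive probes are at most probe_size bytes apart;
   - on entry, up to max_outgoing bytes below the caller's last probe
     may already be unprobed;
   - storing the callee-saves at the bottom of the frame counts as an
     implicit probe, so it is not included in the result.  */
static size_t explicit_probes(size_t frame_size, size_t probe_size,
                              size_t max_outgoing)
{
    assert(probe_size > max_outgoing);

    /* Distance we can move the stack pointer before the first probe.  */
    size_t first = probe_size - max_outgoing;
    if (frame_size <= first)
        return 0;

    /* One explicit probe per probe_size chunk of the remainder, rounded
       up; the final chunk is covered by the callee-save stores.  */
    return (frame_size - first + probe_size - 1) / probe_size;
}
```

With a 4KB probe size and a 1KB outgoing-args limit, explicit_probes(7 * 1024, 4096, 1024) evaluates to 1, which matches the single probe described above; a frame of 3KB or less needs none.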