https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831
--- Comment #6 from Petr Skocik <pskocik at gmail dot com> ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Petr Skocik from comment #1)
> > Sidenote regarding the stack-allocating code for cases when the size is
> > not known to be less than pagesize: the code generated for those cases is
> > quite large. It could be replaced (at least under -Os) with a call to a
> > special assembly function that'd pop the return address (assuming the
> > target machine pushes return addresses to the stack), adjust and allocate
> > the stack size in a piecemeal fashion so as to not skip guard pages, then
> > repush the return address and return to the caller with the stack size
> > expanded.
>
> You certainly don't want to kill the return stack the CPU has, even if it
> results in a few saved bytes for -Os.

That's a very interesting point, because I have written x86_64 assembly
"functions" that popped the return address, pushed something to the stack,
and then repushed the return address and returned. In a loop, this doesn't
seem to perform badly compared to inline code, so I figure it shouldn't be
messing with the return stack buffer. After all, even though the return
happens from a different place in the stack, it's still returning to the
original caller.

The one time I absolutely must have accidentally messed with the return
stack buffer was when I wrote a context-switching routine and originally
tried to "ret" into the new context. That turned out to be very measurably
many times slower than `pop %rcx; jmp *%rcx;` (also measured in a loop).
That's why I think popping a return address, allocating on the stack, and
then repushing and returning is not really a performance killer (on my
Intel CPU, anyway). If it were messing with the return stack buffer, I
think I would be getting slowdowns similar to what I got with the
context-switching code trying to `ret`.
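
Roughly the kind of helper I have in mind (just an untested sketch, assuming
SysV x86-64, the requested size passed in %rdi, 4096-byte pages, and a
made-up symbol name):

    # Pop our own return address, grow the stack one page at a time,
    # touching each page so the guard page can do its job, then repush
    # the return address and return with the caller's %rsp lowered by
    # the requested amount.
    __stack_alloc_probe:            # hypothetical name
            pop     %r11            # save return address
    .Lpage:
            cmp     $4096, %rdi
            jb      .Ltail
            sub     $4096, %rsp
            orl     $0, (%rsp)      # probe the newly exposed page
            sub     $4096, %rdi
            jmp     .Lpage
    .Ltail:
            sub     %rdi, %rsp
            orl     $0, (%rsp)
            push    %r11            # repush return address
            ret                     # caller sees %rsp adjusted by the full size

Alignment of the final %rsp and interaction with the red zone would of course
need handling in a real version; this is only meant to show the
pop/allocate/repush/ret shape.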