Re: stack arenas using alloca

Michael Clark via Gcc Thu, 22 Aug 2024 20:24:50 -0700

On 8/15/24 06:24, Michael Clark wrote:

Hi Folks,
*sending again with Thunderbird because Apple Mail munged the message*.
I wanted to share a seed of an idea I have been ruminating on for awhile, and that is being able to return alloca memory from a function.
I think it’s trivially possible by hacking the epilogue to unlink theframe pointer but not pop the stack. Stack traces should work fine andit just looks like the parent function called an alloca which covers thefixed child frame and the alloca that it returned.
I tend to use alloca quite a lot in my programming style because I liketo avoid the heap. I have a style where I run a loop with a size pass,then call alloca, then run the loop again without suppressing writes. Iwould like to be able to return, initially, one alloca to a parentfunction, but more than one should also be possible.
I haven't thought a lot about the mechanics of how one might signal tothe compiler to suppress popping the stack, but I imagined one couldreturn a pointer or structure with _Stack qualifiers to signal thatstack memory from the child frame is being returned.
This is just a quick note as there is a deeper conversation aboutcompacting stack arenas that would be possible if one had a runtime*and* compile time reflection API for frame metadata, where the frame isexpressed as an anonymous C structure perhaps accessible via a functionon a reference to a function. If you had frame metadata, you could emitone or many memmove calls. trivially one just needs the fixed framelength to recover the fixed frame stack memory and with full framereflection, it would be possible to defer to a runtime routine in thecase two or more stack pointers are returned, assuming the compilertracks sizes of alloca calls. Ideally, the reflection API would also beconstexpr-able meaning the runtime compaction could be inlined. Thiswon't work for types that have pointers unless one has a full single-thread compacting GC that uses type metadata, but I tend to use indices.But unlike heap compaction, it doesn't have to worry about races.
The beauty of stack arenas, and compacting stack arenas, as a subtype,are many. Firstly the memory has no thread aliasing issues unlike heapmemory, meaning analysis is much simpler and there are no pause issueslike there are for GC. Secondly, it would be quite a neat way toconstexpr variable-length arrays or a _List type because, unlike malloc,it’s much easier to reason about.
I have a relatively complete C reflection API that could be used as astarting point for C reflection. It currently works as an LLVM plugin. Iam not sure how I would do this in GCC. A Python parser came to mind butit would be better if it somehow ran inside the compiler. It would alsobe great to have access to the frame of a function as an anonymousstructure. It’s only runtime at present and it could benefit from a forcomprehension because looping on variable length lists is a littleclumsy. If I could return alloca memory in a structure that wouldlargely solve this problem.
- https://github.com/michaeljclark/crefl


It seems to work. I am hacking on LLVM. here are some notes.

- the calling convention could return with the new stack pointercovering the dynamic area or it could return with the dynamic delta in aspare register for the caller to issue an alloca. I am preferring thatthe stack pointer covers the dynamic region on return because it makesthe boundary clear on functions returning alloca memory. the stackpointer always covers the extent of stack allocated memory.

- the epilogue code that subtracts the dynamic stack offset insteadneeds to not do that, which slightly changes epilogue register restorecode, if it uses pop instructions, or it needs the delta for framepointer relative moves. in my experiment I stashed the stack pointer in%rdi, let the x86 epilogue pop callee save restores, then I fetch thereturn address, restore the stack pointer, and push the return address.

- as mentioned, the epilogue needs to push the return address to the endof the stack so that the return instruction works, or it could move itinto a register and make an indirect branch. similar. this is unlessfolks went with an alternative convention that returns the delta andissues an alloca in the parent frame after the call. I don't like that.

- either way, metadata needs to make the call look pretty much like itis also a dynamic alloca instruction, so that there is a boundary andmetadata like there is around dynamic alloca so that stack pointercalculations after the function are gonna work properly.

- emitting a write with the stack pointer in alloca instructions is aninteresting consideration. it wouldn't be depended on so it wouldn't addmuch latency, just a store slot and tiny bit of store bandwidth. it'salready right there. this is for the debugger. the neat thing is thislooks like a frame pointer to the debugger. you could make an implicitclosures that stops frame relative moves from moving past the closure.then allocas are boundaries for implicit closures. instructions withdependencies on the stack pointer can't move past these or they need acopy of the version of the stack pointer in the parent frame.

I also dropped a mind bomb in LLVM discourse. it seems it doesn't needan actual "calling convention" in the ABI sense per se as all ABIs usethe same stack and frame pointer, but there needs to be a common returndiscipline and metadata. extern call that looks like a dynamic alloca.


the succinct translation of this thread is, could this work?

  extern void* alloca(size_t n) __attribute__((allocareturn));

if you can make alloca work as an extern function, it will work.

the main points are:

- best that function exit stack pointer covers the dynamic area becausethe stack pointer currently conveys cover, just it's being restored.

- experiment has __attribute__((allocareturn)) kinda like noreturn.
- found out, but of course, I must push return address in the epilogue.
- can I return "struct { int n; int arr[n]; }" too please?

there are a lot more points. I can now return a VLA but bounds aren'tcarried because that information while available is currently lost, butdoesn't have to be, and there is talk about static GC for a compactingversion that issues memmove instructions. the offsets are dynamic butthe block count is static and the scanning is done at compile time.


here is a reference to crazy talk on the LLVM discourse:

- https://discourse.llvm.org/t/rfc-stack-arenas-using-alloca/80716

like I said this is crazy talk as alloca isn't even in the C standard.but VLAs are, and the current implementation of VLAs depends on alloca.


Michael.

Re: stack arenas using alloca

Reply via email to