On 8/15/24 06:24, Michael Clark wrote:
Hi Folks,
*sending again with Thunderbird because Apple Mail munged the message*.
I wanted to share a seed of an idea I have been ruminating on for a
while, and that is being able to return alloca memory from a function.
I think it’s trivially possible by hacking the epilogue to unlink the
frame pointer but not pop the stack. Stack traces should work fine and
it just looks like the parent function called an alloca which covers the
fixed child frame and the alloca that it returned.
I tend to use alloca quite a lot in my programming style because I like
to avoid the heap. I have a style where I run a loop with a size pass,
then call alloca, then run the loop again without suppressing writes. I
would like to be able to return, initially, one alloca to a parent
function, but more than one should also be possible.
I haven't thought a lot about the mechanics of how one might signal to
the compiler to suppress popping the stack, but I imagined one could
return a pointer or structure with _Stack qualifiers to signal that
stack memory from the child frame is being returned.
This is just a quick note as there is a deeper conversation about
compacting stack arenas that would be possible if one had a runtime
*and* compile time reflection API for frame metadata, where the frame is
expressed as an anonymous C structure perhaps accessible via a function
on a reference to a function. If you had frame metadata, you could emit
one or many memmove calls. trivially one just needs the fixed frame
length to recover the fixed frame stack memory and with full frame
reflection, it would be possible to defer to a runtime routine in the
case two or more stack pointers are returned, assuming the compiler
tracks sizes of alloca calls. Ideally, the reflection API would also be
constexpr-able meaning the runtime compaction could be inlined. This
won't work for types that have pointers unless one has a full single-
thread compacting GC that uses type metadata, but I tend to use indices.
But unlike heap compaction, it doesn't have to worry about races.
The beauty of stack arenas, and compacting stack arenas, as a subtype,
are many. Firstly the memory has no thread aliasing issues unlike heap
memory, meaning analysis is much simpler and there are no pause issues
like there are for GC. Secondly, it would be quite a neat way to
constexpr variable-length arrays or a _List type because, unlike malloc,
it’s much easier to reason about.
I have a relatively complete C reflection API that could be used as a
starting point for C reflection. It currently works as an LLVM plugin. I
am not sure how I would do this in GCC. A Python parser came to mind but
it would be better if it somehow ran inside the compiler. It would also
be great to have access to the frame of a function as an anonymous
structure. It’s only runtime at present and it could benefit from a for
comprehension because looping on variable length lists is a little
clumsy. If I could return alloca memory in a structure that would
largely solve this problem.
- https://github.com/michaeljclark/crefl
It seems to work. I am hacking on LLVM. here are some notes.
- the calling convention could return with the new stack pointer
covering the dynamic area or it could return with the dynamic delta in a
spare register for the caller to issue an alloca. I am preferring that
the stack pointer covers the dynamic region on return because it makes
the boundary clear on functions returning alloca memory. the stack
pointer always covers the extent of stack allocated memory.
- the epilogue code that subtracts the dynamic stack offset instead
needs to not do that, which slightly changes epilogue register restore
code, if it uses pop instructions, or it needs the delta for frame
pointer relative moves. in my experiment I stashed the stack pointer in
%rdi, let the x86 epilogue pop callee save restores, then I fetch the
return address, restore the stack pointer, and push the return address.
- as mentioned, the epilogue needs to push the return address to the end
of the stack so that the return instruction works, or it could move it
into a register and make an indirect branch. similar. this is unless
folks went with an alternative convention that returns the delta and
issues an alloca in the parent frame after the call. I don't like that.
- either way, metadata needs to make the call look pretty much like it
is also a dynamic alloca instruction, so that there is a boundary and
metadata like there is around dynamic alloca so that stack pointer
calculations after the function are gonna work properly.
- emitting a write with the stack pointer in alloca instructions is an
interesting consideration. it wouldn't be depended on so it wouldn't add
much latency, just a store slot and tiny bit of store bandwidth. it's
already right there. this is for the debugger. the neat thing is this
looks like a frame pointer to the debugger. you could make an implicit
closures that stops frame relative moves from moving past the closure.
then allocas are boundaries for implicit closures. instructions with
dependencies on the stack pointer can't move past these or they need a
copy of the version of the stack pointer in the parent frame.
I also dropped a mind bomb in LLVM discourse. it seems it doesn't need
an actual "calling convention" in the ABI sense per se as all ABIs use
the same stack and frame pointer, but there needs to be a common return
discipline and metadata. extern call that looks like a dynamic alloca.
the succinct translation of this thread is, could this work?
extern void* alloca(size_t n) __attribute__((allocareturn));
if you can make alloca work as an extern function, it will work.
the main points are:
- best that function exit stack pointer covers the dynamic area because
the stack pointer currently conveys cover, just it's being restored.
- experiment has __attribute__((allocareturn)) kinda like noreturn.
- found out, but of course, I must push return address in the epilogue.
- can I return "struct { int n; int arr[n]; }" too please?
there are a lot more points. I can now return a VLA but bounds aren't
carried because that information while available is currently lost, but
doesn't have to be, and there is talk about static GC for a compacting
version that issues memmove instructions. the offsets are dynamic but
the block count is static and the scanning is done at compile time.
here is a reference to crazy talk on the LLVM discourse:
- https://discourse.llvm.org/t/rfc-stack-arenas-using-alloca/80716
like I said this is crazy talk as alloca isn't even in the C standard.
but VLAs are, and the current implementation of VLAs depends on alloca.
Michael.