(Adding gcc ML as this needs a larger community wide expertise and discussion. All, please also see thread and a few follow ups
https://inbox.sourceware.org/binutils/caonfrcv0v1catcpyxpplglzwkssws+wsfd12cg71rsi8ic0...@mail.gmail.com/)

Thanks for starting the discussion Mehrdad.

Stackless coroutines are one aspect, but as we discuss the general intent/support in the formats for stacktracing through them, the problem of stackful coroutines may also be good to keep in the context.

Some comments/curiosities below.

On 1/22/26 9:09 AM, Mehrdad Niknami wrote:
Hi,

(Apologies in advance for the long email. This is also my first time using the mailing list, so sorry in advance for any missteps.)

Briefly: we have been working on incorporating C++ coroutine frames into stack traces. The particular issue we have been trying to solve is that many common tools that obtain stack traces (e.g., perf) do not appear to have a mechanism for discovering or traversing C++ coroutine frames. It was suggested to us that SFrames could perhaps facilitate this across many tools, and so we're reaching out here to discuss whether/how that may be achievable, as we are not too familiar with SFrame details.

As C++ coroutines have a very high amount of complexity, I have attempted to distill down the relevant details of our current design below, along with a quick primer on how C++ coroutines work from the user side.

Overall, we're wondering if SFrames (either as they exist now, or via a practical extension) would be suitable for this "stitching" or "interleaving" of coroutine stack frames, and if so, what we could plan for and/or expect on that front.

Thank you.

------------------------------------------------------------------------
A toy coroutine may look like this:

    Task<int> factorial(int n) {

       printf("n = %d\n", n);

       if (n == 0) {
          co_return 1;
       }

       int subresult = co_await factorial(n - 1);

       co_return n * subresult;
    }

The Task type internally holds a std::coroutine_handle<> object, which is basically an opaque void* pointing to the various bookkeeping data of the coroutine. The Task type is specially marked to signify to the compiler that this function is a coroutine, and defines various behaviors, such as those of the keywords co_await and co_return (e.g., by storing the provided values in heap memory somewhere for later retrieval).

Ordinarily, if we have

    int main() {

       factorial(10).Resume();  // Run the coroutine

    }

and if we then stop the program to obtain a stack trace arbitrarily, we may obtain a typical CPU stack trace such as this:

    printf()
    factorial(n=3)

        /(Note that I'm including the n= values for illustrative
        purposes; they of course would not be apparent solely from a
        list of instruction addresses.)/

    std::coroutine_handle<>::resume()

        /(Note that the std::coroutine_handle<>::resume() <https://
        en.cppreference.com/w/cpp/coroutine/coroutine_handle/
        resume.html> function uses a compiler intrinsic such as
        __builtin_coro_resume() <https://clang.llvm.org/docs/
        LanguageExtensions.html#c-coroutines-support-builtins> to resume
        the coroutine underneath.)/

    Task::Resume()
    main()


However, what we /wish/ to obtain in such a situation is the following:

    printf()
    factorial(n=3)
    factorial(n=2) (.resume)
    factorial(n=1) (.resume)
    std::coroutine_handle<>::resume()

    Task<>::Resume()
    main()


The information required for observing factorial(2) and factorial(1) does exist in memory, but it is not part of the ordinary thread call stack.
In our implementation, it is represented via a singly-linked list:

 1. The currently-running coroutine (in this case, factorial(3)), which
    is the head of the list, is maintained in a known *thread-local*
    variable.
 2. Each coroutine also remembers the address of its caller (a.k.a.
    waiter, a.k.a. continuation).


Curious, when you say "our implementation", do you mean a coroutine library or clang implementation ? Either way, IIUC, how the caller is stored in the Promise object is not ABI/language defined. So, stacktracing through these dynamically allocated Promise objects on the heap is not easily doable via a third party unwinding application routine oblivious of the running application itself.

On that note, just for the sake of a thought experiment: Is (Was) it possible (for stack tracing purposes) to assign some skeleton for the "linked list data structure" of these dynamically allocated Promise object on the heap ? Something like having a Promise object header on the heap which specifies the size of Promise obj, location of the next one on heap, and offset to the caller IP within the object.


Typically, the address of the calling coroutine is stored inside a so- called "Promise" object on the heap associated with the Task, whose address is obtained via invoking std::coroutine_handle<T>::promise() on the coroutine handle embedded inside the Task. Each coroutine saves/restores the handle to its caller at various points via hooks provided through the language & compiler. When the compiler asks the coroutine "/Now that you are suspended, what should I resume next?/" (via complicated machinery <https:// en.cppreference.com/w/cpp/language/ coroutines.html#:~:text=if%20await_suspend%20returns%20a%20coroutine%20handle%20for%20some%20other%20coroutine%2C%20that%20handle%20is%20resumed>), the coroutine responds with the handle of its caller (the continuation).



If the coroutine frames are marked as such in the stacktrace format (as Jens suggested on the binutils thread reply), the stacktacer knows where to stitch the heap frames without guessing or doing linear memory scans.

On that note, so looks like there can there be a stacktrace of multiple interleaved subsets of "normal" and coroutine functions ? I.e.,

normal_B ()
coroutine_caller_B ()
coroutine_init_B ()
normal_server_start ()
normal_server_init ()
coroutine_server_caller ()
std::coroutine_handle::resume()
normal_event_handler ()
main ()

If yes, how is this handled in your approach below...

The missing piece of the puzzle is: how would the interleaving process figure out *where* to insert the coroutine frames into the ordinary stack trace? Conceptually, we need some way to identify Task::Resume() in the stack trace, and insert everything beneath the std::coroutine_handle<>::resume() that it is eventually calling downstream.

Right now we accomplish this via some dynamic probing at program initialization:

 1. We block the inlining of Task::Resume()
 2. We force the inlining of std::coroutine_handle<>::resume() (which
    just reduces to an intrinsic, __builtin_coro_resume())
 3. We call a dummy coroutine whose sole job is to return its own
    __builtin_return_address(0).

This return address is in the body of Task<>::Resume, and this forces it to be the unique address in the entire program that all Task<T> coroutines eventually return to. Therefore, we perform a linear search for it starting from the leaf frame on the stack, and we insert all the asynchronous frames after that.

Reply via email to