Re: SFrame for C++ coroutine stack traces

Indu Bhagat via Gcc Mon, 02 Mar 2026 23:14:59 -0800

(Adding gcc ML as this needs a larger community wide expertise anddiscussion. All, please also see thread and a few follow ups

https://inbox.sourceware.org/binutils/caonfrcv0v1catcpyxpplglzwkssws+wsfd12cg71rsi8ic0...@mail.gmail.com/)


Thanks for starting the discussion Mehrdad.

Stackless coroutines are one aspect, but as we discuss the generalintent/support in the formats for stacktracing through them, the problemof stackful coroutines may also be good to keep in the context.


Some comments/curiosities below.

On 1/22/26 9:09 AM, Mehrdad Niknami wrote:

Hi,
(Apologies in advance for the long email. This is also my first timeusing the mailing list, so sorry in advance for any missteps.)
Briefly: we have been working on incorporating C++ coroutine frames intostack traces.The particular issue we have been trying to solve is that many commontools that obtain stack traces (e.g., perf) do not appear to have amechanism for discovering or traversing C++ coroutine frames.It was suggested to us that SFrames could perhaps facilitate this acrossmany tools, and so we're reaching out here to discuss whether/how thatmay be achievable, as we are not too familiar with SFrame details.
As C++ coroutines have a very high amount of complexity, I haveattempted to distill down the relevant details of our currentdesign below, along with a quick primer on how C++ coroutines work fromthe user side.
Overall, we're wondering if SFrames (either as they exist now, or via apractical extension) would be suitable for this "stitching" or"interleaving" of coroutine stack frames, and if so, what we could planfor and/or expect on that front.
Thank you.

------------------------------------------------------------------------
A toy coroutine may look like this:

    Task<int> factorial(int n) {

       printf("n = %d\n", n);

       if (n == 0) {
          co_return 1;
       }

       int subresult = co_await factorial(n - 1);

       co_return n * subresult;
    }
The Task type internally holds a std::coroutine_handle<> object, whichis basically an opaque void* pointing to the various bookkeeping data ofthe coroutine.The Task type is specially marked to signify to the compiler that thisfunction is a coroutine, and defines various behaviors, such as those ofthe keywords co_await and co_return (e.g., by storing the providedvalues in heap memory somewhere for later retrieval).
Ordinarily, if we have

    int main() {

       factorial(10).Resume();  // Run the coroutine

    }
and if we then stop the program to obtain a stack trace arbitrarily, wemay obtain a typical CPU stack trace such as this:
    printf()
    factorial(n=3)

        /(Note that I'm including the n= values for illustrative
        purposes; they of course would not be apparent solely from a
        list of instruction addresses.)/

    std::coroutine_handle<>::resume()

        /(Note that the std::coroutine_handle<>::resume() <https://
        en.cppreference.com/w/cpp/coroutine/coroutine_handle/
        resume.html> function uses a compiler intrinsic such as
        __builtin_coro_resume() <https://clang.llvm.org/docs/
        LanguageExtensions.html#c-coroutines-support-builtins> to resume
        the coroutine underneath.)/

    Task::Resume()
    main()


However, what we /wish/ to obtain in such a situation is the following:

    printf()
    factorial(n=3)
    factorial(n=2) (.resume)
    factorial(n=1) (.resume)
    std::coroutine_handle<>::resume()

    Task<>::Resume()
    main()
The information required for observing factorial(2) andfactorial(1) does exist in memory, but it is not part of the ordinarythread call stack.
In our implementation, it is represented via a singly-linked list:

 1. The currently-running coroutine (in this case, factorial(3)), which
    is the head of the list, is maintained in a known *thread-local*
    variable.
 2. Each coroutine also remembers the address of its caller (a.k.a.
    waiter, a.k.a. continuation).

Curious, when you say "our implementation", do you mean a coroutinelibrary or clang implementation ? Either way, IIUC, how the caller isstored in the Promise object is not ABI/language defined. So,stacktracing through these dynamically allocated Promise objects on theheap is not easily doable via a third party unwinding applicationroutine oblivious of the running application itself.

On that note, just for the sake of a thought experiment: Is (Was) itpossible (for stack tracing purposes) to assign some skeleton for the"linked list data structure" of these dynamically allocated Promiseobject on the heap ? Something like having a Promise object header onthe heap which specifies the size of Promise obj, location of the nextone on heap, and offset to the caller IP within the object.

Typically, the address of the calling coroutine is stored inside a so-called "Promise" object on the heap associated with the Task, whoseaddress is obtained via invoking std::coroutine_handle<T>::promise() onthe coroutine handle embedded inside the Task.Each coroutine saves/restores the handle to its caller at various pointsvia hooks provided through the language & compiler.When the compiler asks the coroutine "/Now that you are suspended, whatshould I resume next?/" (via complicated machinery <https://en.cppreference.com/w/cpp/language/coroutines.html#:~:text=if%20await_suspend%20returns%20a%20coroutine%20handle%20for%20some%20other%20coroutine%2C%20that%20handle%20is%20resumed>), the coroutine responds with the handle of its caller (the continuation).

If the coroutine frames are marked as such in the stacktrace format (asJens suggested on the binutils thread reply), the stacktacer knows whereto stitch the heap frames without guessing or doing linear memory scans.

On that note, so looks like there can there be a stacktrace of multipleinterleaved subsets of "normal" and coroutine functions ? I.e.,


normal_B ()
coroutine_caller_B ()
coroutine_init_B ()
normal_server_start ()
normal_server_init ()
coroutine_server_caller ()
std::coroutine_handle::resume()
normal_event_handler ()
main ()

If yes, how is this handled in your approach below...

The missing piece of the puzzle is: how would the interleaving processfigure out *where* to insert the coroutine frames into the ordinarystack trace?Conceptually, we need some way to identify Task::Resume() in the stacktrace, and insert everything beneath thestd::coroutine_handle<>::resume() that it is eventually calling downstream.
Right now we accomplish this via some dynamic probing at programinitialization:
 1. We block the inlining of Task::Resume()
 2. We force the inlining of std::coroutine_handle<>::resume() (which
    just reduces to an intrinsic, __builtin_coro_resume())
 3. We call a dummy coroutine whose sole job is to return its own
    __builtin_return_address(0).
This return address is in the body of Task<>::Resume, and this forces itto be the unique address in the entire program thatall Task<T> coroutines eventually return to.Therefore, we perform a linear search for it starting from the leafframe on the stack, and we insert all the asynchronous frames after that.

Re: SFrame for C++ coroutine stack traces

Reply via email to