Hi there,

Following up on discussions that took place between various folks from Oracle, Amazon, SAP and Datadog, I would like to propose two proofs of concept for a JVMTI async stack-walker API. They are by no means complete implementations, but are meant to serve as a basis for further discussion.

The goal of the proposed APIs is to replace the inherently unsafe, complex and unsupported AsyncGetCallTrace (ASGCT) with a safe API that serves the same purpose. The requirements are:

1. The returned stack-traces are not safepoint-biased (there are already JVMTI APIs that return biased stack-traces).
2. The API is signal-safe (e.g. non-blocking, no syscalls, no allocations, etc.).
3. It does not expose or make assumptions about JVM internals.
4. The API surface is kept as minimal as possible.

The first concept has been proposed by Oracle. It is a simple API to request a stack-trace through JVMTI. The stack-trace would be reported (asynchronously) through a JFR event. The API allows passing a (jlong) 'user_data', which is then sent along with the JFR event. That way an agent can associate the stack-trace with whatever else it needs to do.

https://github.com/openjdk/jdk/pull/29038

The PR also links to example code that shows how the API would be used in a simple HelloWorld profiling agent.

The implementation uses the JFR CPU time sampler (https://openjdk.org/jeps/509) infrastructure for building and sending the stack-traces. It basically only extends the entry point (from a signal handler) so that it can also accept a request from JVMTI, and the event-sending code so that it can also send the new AsyncStackTrace event.

Personally, I don't like that POC too much. I find it weird how it crosses subsystem boundaries (call JVMTI, get an event from JFR), and it is cumbersome to handle in a profiler agent (how to associate the JFR events with whatever the profiler agent wants to do). That is why I also made a second POC.

The second POC proposes a JVMTI-only API. As in the first POC, an agent can request a stack-trace through a signal-safe API. However, instead of getting a JFR event, it would get the stack-trace via callback functions. This seems significantly cleaner to me, and seems to be in the spirit of various other JVMTI APIs (e.g. heap walking). It also has the advantage that it does not expose any JVM internals, makes no assumptions about data structures, etc.

https://github.com/openjdk/jdk/pull/29067

The POC implementation still uses the same JFR CPU time sampler, but that doesn't have to be so. In fact, I think if we agree that this is the way to go, we would come up with a much cleaner design that implements a similar stack-walker infrastructure fully within JVMTI, or even lets JFR and JVMTI share the same code for the async stack-walking, and cleans up some weird and unnecessary dependencies along the way (e.g. with JFR thread sampling code).

Please let me know what you think!

Cheers,
Roman
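
As a rough illustration of the callback shape described for the second POC, here is a minimal C sketch. It is not taken from either PR: every name in it (sketch_frame, sketch_frame_cb, sketch_walk, sketch_sample, sketch_record_frame, sketch_demo) is hypothetical, the frame layout is an assumption, and the real signal-safe walker is replaced by a stub that replays a fixed list of frames. It only shows how an agent could receive frames one at a time together with its own 'user_data', without allocating or blocking in the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record; a stand-in for whatever the real API would pass. */
typedef struct {
    const char *method;  /* stand-in for a jmethodID */
    int32_t bci;         /* bytecode index within the method */
} sketch_frame;

/* Hypothetical per-frame callback: return nonzero to continue, 0 to stop. */
typedef int (*sketch_frame_cb)(const sketch_frame *frame,
                               int64_t user_data, void *agent_state);

/* Stub walker (assumption): replays a fixed frame array instead of walking
 * a real thread stack. Returns the number of frames handed to the callback. */
static int sketch_walk(const sketch_frame *frames, int count,
                       sketch_frame_cb cb, int64_t user_data,
                       void *agent_state) {
    assert(cb != NULL);
    int delivered = 0;
    for (int i = 0; i < count; i++) {
        delivered++;
        if (cb(&frames[i], user_data, agent_state) == 0) {
            break;  /* the agent asked to stop early */
        }
    }
    return delivered;
}

/* Agent-side state, preallocated by the agent so the callback never allocates. */
typedef struct {
    int64_t sample_id;       /* copied from user_data on the first frame */
    int depth;               /* frames recorded so far */
    int max_depth;           /* stop once this many frames are recorded */
    const char *top_method;  /* method of the innermost frame */
} sketch_sample;

/* Callback: plain stores only, no locks, no allocation, no syscalls. */
static int sketch_record_frame(const sketch_frame *frame,
                               int64_t user_data, void *agent_state) {
    sketch_sample *s = (sketch_sample *)agent_state;
    if (s->depth == 0) {
        s->sample_id = user_data;
        s->top_method = frame->method;
    }
    s->depth++;
    return s->depth < s->max_depth;
}

/* Walks three stub frames and returns how many the agent recorded. */
int sketch_demo(int max_depth) {
    static const sketch_frame frames[] = {
        { "Hello.work", 12 },
        { "Hello.run",   5 },
        { "Hello.main",  2 },
    };
    sketch_sample s = { 0, 0, max_depth, NULL };
    sketch_walk(frames, 3, sketch_record_frame, 42, &s);
    return s.depth;
}
```

In this sketch, sketch_demo(8) records all three stub frames, while sketch_demo(2) stops after two because the callback returns 0 once the depth limit is reached. Passing 'user_data' straight through to the callback is what lets the agent tie a trace back to its own sample without a separate event-matching step.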
