Supportive of inclusion as well. General preference for invoking as a library rather than forking processes.
Jon, thanks for the tips on off-CPU profiling - added to my personal cheat sheet. I have seen cases where specific async-profiler/JVM/Cassandra version combos (JDK11/4.1-derived source tree) will immediately crash the JVM on profile - especially successive profile invocations on the same process - but have not observed this on JDK21 or trunk-derived source trees. If we have user reports of that happening, we’ll need to figure out how to reproduce and get to the bottom of it.

– Scott

> On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
>
> Thanks for bringing this discussion Doug. I didn't realize that
> async-profiler allows you to bring it as a dependency. It looks pretty
> neat from what I could tell. I also think bringing this to Cassandra as a
> dependency is a reasonable approach. We need to come up with
> a solid way to expose this via JMX / vtable.
>
> Best,
> - Francisco
>
>> On 2025/06/13 21:08:28 Doug Rohrer wrote:
>> The nice thing from what I can tell about using the Java API per [6] below
>> is that you can literally just get an instance of the profiler and pass it
>> some commands in the `execute` method… just need to be careful how much of
>> that surface area we expose. Jon (and others obviously) I’d love to get your
>> take on how we could make a useful interface to the async-profiler, maybe
>> exposed via JMX, that doesn’t require someone to read the entirety of the
>> async-profiler docs and provides some useful profiles without the rough
>> edges (things like managing temp files so users don’t have to know the
>> layout of the filesystem C* is running on, for example, since at least in
>> the Sidecar we’d be executing this on behalf of a remote user, with all of
>> the constraints that implies).
>>
>> We can always be more protective in the Sidecar than we are server-side as
>> well, but it seems like helping operators not do bad things is a good thing.
>>
>> Obviously we’d want the ability Cassandra-side to disable this functionality
>> altogether however we implement it.
>>
>> Doug
>>
>>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> I'd be very happy to see async-profiler included with C*. I've made
>>> extensive use of it in my performance evaluations [1][2], and even posted a
>>> video about it [3] for general Java perf analysis (among others). It's
>>> part of easy-cass-lab and is easily the most informative tool I've found
>>> for getting to the bottom of anything performance related.
>>>
>>> There's probably a good case to be made for including it with the C*
>>> artifact as well as having it be something you can drop in. I lean towards
>>> including it all the time, but I haven't run it this way myself yet, so
>>> there might be some downside I'm unaware of.
>>>
>>> When you call the asprof executable, it attaches the async-profiler to the
>>> running JVM using jattach [4]. We could do this as well, if we wanted to
>>> avoid including it with the release, but I don't know how much we really
>>> benefit from that. I've run into issues with it when it's unable to
>>> detach correctly, then you're unable to reattach it until after the server
>>> is restarted. On the flip side, I don't know if you're able to set up all
>>> the same options for arbitrary profiling when it's loaded as an agent and
>>> turned on/off dynamically. I think we can, based on the integration page
>>> [6], but I haven't tried it yet. It would be a bummer if we only had a
>>> single mode of profiling available.
>>>
>>> The default mode, CPU profiling, is fantastic, but I've also made extensive
>>> use of allocation profiling [5] to identify perf issues as well so having
>>> that available is a must, imo. Wall clock / off cpu profiling is great for
>>> identifying when IO is the root cause, which isn't clearly revealed by
>>> on-cpu profiling due to the way threads are scheduled.
>>> When I look at a system I typically do CPU / Wall / Alloc / Off-CPU to be
>>> thorough, and the last thing you want to do is have to restart between
>>> each one. You can also specify specific Java methods, include or exclude
>>> frames matching specific regex, and a whole slew of other options. The
>>> latest version even supports continuous profiling with heatmaps although
>>> I haven't tried it yet.
>>>
>>> So hopefully the option we go with allows all of that, otherwise the limits
>>> would impose more of a headache to me as I'd need to remove it and continue
>>> to bring my own.
>>>
>>> Under the hood, the async-profiler uses Linux perf events + asynchronous
>>> polling of the Java stack to match them up and generate its reports. As a
>>> result, it requires certain permissions to run and get all the details I
>>> like. Specifically these kernel parameters:
>>>
>>> sudo sysctl kernel.perf_event_paranoid=1
>>> sudo sysctl kernel.kptr_restrict=0
>>>
>>> You also need to enable some capabilities for off-cpu profiling:
>>>
>>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>>>
>>> Then you can do off-cpu with this wild cryptic version (shout out to Andrei
>>> Pangin for helping me with this [7]):
>>>
>>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
>>>
>>> There are also some subtle issues when it's run in a container, since by
>>> default you don't have access to the perf_event_open syscall. Just
>>> something to keep in mind. This is one of my main grievances with
>>> container deployments.
>>>
>>> Indeed Patrick, I am very happy to see this discussion! Thanks Doug for
>>> starting the thread.
>>>
>>> Jon
>>>
>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
>>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
>>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
>>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
>>> [7] https://github.com/async-profiler/async-profiler/issues/907
>>>
>>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>>>
>>>> After reading more about the project, the possibilities are pretty
>>>> interesting. I suspect we'll see this in a Haddad talk soon.
>>>>
>>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive
>>>>> health check on a repo to assist in considering taking it as a
>>>>> dependency. The results can be found here:
>>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>>>>
>>>>> Apparently it can, and can do it quite well. This was a useful time saver
>>>>> (and honestly did a better job than I usually can in > 10x the time)
>>>>>
>>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of
>>>>> the ecosystem can consume it (more easily if we move to a
>>>>> cassandra-shared regime shared library build as well), and it opens up
>>>>> some interesting opportunities for us in both how we test core C* proper
>>>>> and what we expose in tooling.
>>>>>
>>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>>>>> I'd prefer to avoid calling an external process and use the library if
>>>>>> possible.
>>>>>> Not sure about including it in the project by default, but
>>>>>> also not against.
>>>>>>
>>>>>> If there's contention about including it, I wonder if it would make
>>>>>> sense to explore Java's optional module extension[1] to make this
>>>>>> available optionally? I can see this being useful for other extensions
>>>>>> if we haven't explored that option.
>>>>>>
>>>>>> Then we could have another project cassandra-sidecar-extensions (or
>>>>>> similar) that would be linked by sidecar/advanced operators to enable
>>>>>> an extended feature set in the main process.
>>>>>>
>>>>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>>>>>
>>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
>>>>>> Hey folks!
>>>>>>
>>>>>> We're looking into enabling the sidecar to collect async profiles from
>>>>>> Cassandra and, digging through the async-profiler code and usage, it
>>>>>> seems like there may be a few different ways to do it. I’m curious if
>>>>>> other folks have already done this beyond just “run asprof with the pid
>>>>>> of the Cassandra process”, as I’m a bit hesitant to depend on executing
>>>>>> an external process from the Sidecar to gather the actual profile if we
>>>>>> can avoid it.
>>>>>>
>>>>>> There seem to be some opportunities to integrate the profiler into
>>>>>> another project (see
>>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
>>>>>> but it seems this would end up having to be part of Cassandra, and
>>>>>> somehow callable via the sidecar (JMX? Some virtual table interface
>>>>>> where you insert a row to start a profile with the profiler options, and
>>>>>> it kicks off the profile, dumping the results into the table when it’s
>>>>>> done?).
>>>>>>
>>>>>> The benefit in putting this functionality into Cassandra would be that
>>>>>> other consumers (in-jvm dtests, python dtests, other monitoring systems
>>>>>> where Sidecar isn’t available, easy-cass-lab) would be able to leverage
>>>>>> the same interface rather than having to re-invent the wheel each time.
>>>>>>
>>>>>> Drawback is it’s another library, and one with native library
>>>>>> dependencies, added to the class path and loaded at runtime.
>>>>>>
>>>>>> Thoughts? Previous experiences (good or bad)?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Doug
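
For anyone skimming the thread later: Doug's point about being careful how much of the `execute` surface area gets exposed could be sketched roughly as below. This is only an illustration under assumptions, not a proposed Cassandra API. `ProfileCommands`, its `Event` enum, and `newOutputFile` are hypothetical names, and the comma-separated command strings assume async-profiler's agent-argument style (`start,event=...,file=...`). The block only builds and validates strings; actually handing them to the profiler's `execute` method would need the async-profiler library on the class path, which is why that call appears only in a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

/**
 * Hypothetical helper: builds async-profiler command strings from a small,
 * fixed set of options instead of accepting free-form commands from callers.
 */
final class ProfileCommands {

    /** Events this sketch allows; the ids are assumed to match async-profiler event names. */
    enum Event {
        CPU("cpu"), WALL("wall"), ALLOC("alloc");

        final String id;

        Event(String id) { this.id = id; }
    }

    private ProfileCommands() {}

    /** Creates the output file server-side so a remote caller never picks a path. */
    static Path newOutputFile() throws IOException {
        return Files.createTempFile("cassandra-profile-", ".html");
    }

    /** Builds a "start" command for one allowed event. */
    static String start(Event event, Path output) {
        Objects.requireNonNull(event, "event");
        return "start,event=" + event.id + ",file=" + checked(output);
    }

    /** Builds the matching "stop" command that writes the profile to the output file. */
    static String stop(Path output) {
        return "stop,file=" + checked(output);
    }

    /** Rejects paths containing a comma, since commas separate options in the command string. */
    private static String checked(Path output) {
        String s = Objects.requireNonNull(output, "output").toString();
        if (s.indexOf(',') >= 0) {
            throw new IllegalArgumentException("comma not allowed in output path: " + s);
        }
        return s;
    }

    // With the async-profiler library available, the strings would be passed on
    // along these lines (assumption, not compiled here):
    //   AsyncProfiler.getInstance().execute(ProfileCommands.start(Event.CPU, out));
}
```

A JMX operation or vtable write would then take only an `Event` (and perhaps a duration) from the caller, which keeps the rest of the profiler's options out of reach unless we decide to expose them deliberately.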