> I have seen cases where specific async-profiler/JVM/Cassandra version combos
> (JDK11/4.1-derived source tree) will immediately crash the JVM on profile -
> especially successive profile invocations on the same process

This would be a great candidate for testing to ensure that, at least for provided profiles, this doesn't happen.

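To make "provided profiles" a bit more concrete, here is a rough Java sketch of a fixed menu of profiles that maps to async-profiler command strings. The class, enum, and method names and the output file naming are all hypothetical; nothing like this exists in Cassandra or the Sidecar today. The only parts taken from async-profiler are the `start`/`stop` commands and the `event=` and `file=` arguments, and I'm assuming the `.html` extension selects flame graph output.

```java
import java.util.Locale;

/**
 * Hypothetical sketch: a fixed menu of profiles mapped to async-profiler
 * command strings. All names here are assumptions for illustration only.
 */
final class ProvidedProfiles {

    /** The only profiles a caller may request; anything else is rejected. */
    enum Profile {
        CPU("cpu"),
        WALL("wall"),
        ALLOC("alloc");

        final String event;

        Profile(String event) {
            this.event = event;
        }

        /** Maps caller input to a known profile, never to raw profiler options. */
        static Profile parse(String name) {
            try {
                return valueOf(name.trim().toUpperCase(Locale.ROOT));
            } catch (IllegalArgumentException | NullPointerException e) {
                throw new IllegalArgumentException("unsupported profile: " + name);
            }
        }
    }

    private ProvidedProfiles() {
    }

    /** Command string that starts the given profile. */
    static String startCommand(Profile profile) {
        return "start,event=" + profile.event;
    }

    /**
     * Command string that stops profiling and writes the report. The directory
     * is chosen server-side, so a remote caller never supplies a path.
     */
    static String stopCommand(Profile profile, String outputDir, long sessionId) {
        if (outputDir == null || outputDir.isEmpty() || outputDir.contains(",")) {
            throw new IllegalArgumentException("invalid output directory");
        }
        return "stop,file=" + outputDir + "/" + profile.event + "-" + sessionId + ".html";
    }
}
```

A test for the crash Scott describes could then loop over `Profile.values()` a few times on each supported JDK and pass every start/stop pair to `AsyncProfiler.getInstance().execute(...)`, which covers the successive-invocation case.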
On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
> Supportive of inclusion as well. General preference for invoking as a library
> rather than forking processes.
>
> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat
> sheet.
>
> I have seen cases where specific async-profiler/JVM/Cassandra version combos
> (JDK11/4.1-derived source tree) will immediately crash the JVM on profile -
> especially successive profile invocations on the same process - but have not
> observed this on JDK21 or trunk-derived source trees. If we have user reports
> of that happening, we’ll need to figure out how to reproduce and get to the
> bottom of it.
>
> – Scott
>
>
> On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
> >
> > Thanks for bringing up this discussion, Doug. I didn't realize that
> > async-profiler allows you to bring it in as a dependency. It looks pretty
> > neat from what I could tell. I also think bringing this to Cassandra as a
> > dependency is a reasonable approach. We need to come up with a solid way
> > to expose this via JMX / vtable.
> >
> > Best,
> > - Francisco
> >
> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
> >> The nice thing from what I can tell about using the Java API per [6] below
> >> is that you can literally just get an instance of the profiler and pass it
> >> some commands in the `execute` method… just need to be careful how much of
> >> that surface area we expose.
> >> Jon (and others, obviously), I’d love to get
> >> your take on how we could make a useful interface to the async-profiler,
> >> maybe exposed via JMX, that doesn’t require someone to read the entirety
> >> of the async-profiler docs and provides some useful profiles without the
> >> rough edges (things like managing temp files so users don’t have to know
> >> the layout of the filesystem C* is running on, for example, since at least
> >> in the Sidecar we’d be executing this on behalf of a remote user, with all
> >> of the constraints that implies).
> >>
> >> We can always be more protective in the Sidecar than we are server-side as
> >> well, but it seems like helping operators not do bad things is a good
> >> thing.
> >>
> >> Obviously we’d want the ability Cassandra-side to disable this
> >> functionality altogether, however we implement it.
> >>
> >> Doug
> >>
> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >>>
> >>> I'd be very happy to see async-profiler included with C*. I've made
> >>> extensive use of it in my performance evaluations [1][2], and even posted
> >>> a video about it [3] for general Java perf analysis (among others). It's
> >>> part of easy-cass-lab and is easily the most informative tool I've found
> >>> for getting to the bottom of anything performance related.
> >>>
> >>> There's probably a good case to be made for including it with the C*
> >>> artifact as well as having it be something you can drop in. I lean
> >>> towards including it all the time, but I haven't run it this way myself
> >>> yet, so there might be some downside I'm unaware of.
> >>>
> >>> When you call the asprof executable, it attaches the async-profiler to
> >>> the running JVM using jattach [4]. We could do this as well, if we
> >>> wanted to avoid including it with the release, but I don't know how much
> >>> we really benefit from that.
> >>> I've run into issues with it when it's
> >>> unable to detach correctly; then you're unable to reattach it until
> >>> after the server is restarted. On the flip side, I don't know if you're
> >>> able to set up all the same options for arbitrary profiling when it's
> >>> loaded as an agent and turned on/off dynamically. I think we can, based
> >>> on the integration page [6], but I haven't tried it yet. It would be a
> >>> bummer if we only had a single mode of profiling available.
> >>>
> >>> The default mode, CPU profiling, is fantastic, but I've also made
> >>> extensive use of allocation profiling [5] to identify perf issues as well,
> >>> so having that available is a must, imo. Wall clock / off-CPU profiling
> >>> is great for identifying when IO is the root cause, which isn't clearly
> >>> revealed by on-CPU profiling due to the way threads are scheduled. When
> >>> I look at a system I typically do CPU / Wall / Alloc / Off-CPU to be
> >>> thorough, and the last thing you want to do is have to restart between
> >>> each one. You can also specify specific Java methods, include or exclude
> >>> frames matching a specific regex, and a whole slew of other options. The
> >>> latest version even supports continuous profiling with heatmaps, although
> >>> I haven't tried it yet.
> >>>
> >>> So hopefully the option we go with allows all of that; otherwise the
> >>> limits would be more of a headache for me, as I'd need to remove it and
> >>> continue to bring my own.
> >>>
> >>> Under the hood, the async-profiler uses Linux perf events +
> >>> asynchronous polling of the Java stack to match them up and generate its
> >>> reports. As a result, it requires certain permissions to run and get all
> >>> the details I like.
> >>> Specifically these kernel parameters:
> >>>
> >>> sudo sysctl kernel.perf_event_paranoid=1
> >>> sudo sysctl kernel.kptr_restrict=0
> >>>
> >>> You also need to enable some capabilities for off-CPU profiling:
> >>>
> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
> >>>
> >>> Then you can do off-CPU with this wild cryptic version (shout out to
> >>> Andrei Pangin for helping me with this [7]):
> >>>
> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
> >>>
> >>> There are also some subtle issues when it's run in a container, since by
> >>> default you don't have access to the perf_event_open syscall. Just
> >>> something to keep in mind. This is one of my main grievances with
> >>> container deployments.
> >>>
> >>> Indeed Patrick, I am very happy to see this discussion! Thanks Doug for
> >>> starting the thread.
> >>>
> >>> Jon
> >>>
> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> >>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> >>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> >>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
> >>>
> >>>
> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
> >>>>
> >>>> After reading more about the project, the possibilities are pretty
> >>>> interesting. I suspect we'll see this in a Haddad talk soon.
> >>>>
> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive
> >>>>> health check on a repo to assist in considering taking it as a
> >>>>> dependency. The results can be found here:
> >>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
> >>>>>
> >>>>> Apparently it can, and can do it quite well. This was a useful time
> >>>>> saver (and honestly did a better job than I usually can in >10x the
> >>>>> time).
> >>>>>
> >>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest
> >>>>> of the ecosystem can consume it (more easily if we move to a
> >>>>> cassandra-shared regime shared library build as well), and it opens up
> >>>>> some interesting opportunities for us in both how we test core C*
> >>>>> proper and what we expose in tooling.
> >>>>>
> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
> >>>>>> I'd prefer to avoid calling an external process and use the library if
> >>>>>> possible. Not sure about including it in the project by default, but
> >>>>>> also not against.
> >>>>>>
> >>>>>> If there's contention about including it, I wonder if it would make
> >>>>>> sense to explore Java's optional module extension [1] to make this
> >>>>>> available optionally? I can see this being useful for other
> >>>>>> extensions if we haven't explored that option.
> >>>>>>
> >>>>>> Then we could have another project cassandra-sidecar-extensions (or
> >>>>>> similar) that would be linked by sidecar/advanced operators to enable
> >>>>>> an extended feature set in the main process.
> >>>>>>
> >>>>>>
> >>>>>> [1] https://openjdk.org/projects/jigsaw/doc/topics/optional.html
> >>>>>>
> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
> >>>>>> Hey folks!
> >>>>>>
> >>>>>> We're looking into enabling the sidecar to collect async profiles from
> >>>>>> Cassandra and, digging through the async-profiler code and usage, it
> >>>>>> seems like there may be a few different ways to do it. I’m curious if
> >>>>>> other folks have already done this beyond just “run asprof with the
> >>>>>> pid of the Cassandra process”, as I’m a bit hesitant to depend on
> >>>>>> executing an external process from the Sidecar to gather the actual
> >>>>>> profile if we can avoid it.
> >>>>>>
> >>>>>> There seem to be some opportunities to integrate the profiler into
> >>>>>> another project (see
> >>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
> >>>>>> but it seems this would end up having to be part of Cassandra, and
> >>>>>> somehow callable via the sidecar (JMX? Some virtual table interface
> >>>>>> where you insert a row to start a profile with the profiler options,
> >>>>>> and it kicks off the profile, dumping the results into the table when
> >>>>>> it’s done?).
> >>>>>>
> >>>>>> The benefit in putting this functionality into Cassandra would be that
> >>>>>> other consumers (in-jvm dtests, python dtests, other monitoring
> >>>>>> systems where Sidecar isn’t available, easy-cass-lab) would be able to
> >>>>>> leverage the same interface rather than having to re-invent the wheel
> >>>>>> each time.
> >>>>>>
> >>>>>> Drawback is it’s another library, and one with native library
> >>>>>> dependencies, added to the class path and loaded at runtime.
> >>>>>>
> >>>>>> Thoughts? Previous experiences (good or bad)?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Doug