> I have seen cases where specific async-profiler/JVM/Cassandra version combos 
> (JDK11/4.1-derived source tree) will immediately crash the JVM on profile - 
> especially successive profile invocations on the same process
This would be a great candidate for testing to ensure that, at least for 
provided profiles, this doesn't happen.
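
A rough sketch of how such a test could be structured, assuming we fork a child JVM for the profiling run, since a hard JVM crash cannot be caught from inside the process that crashes. `ChildJvmCheck` and its method names are made up for illustration, and nothing here calls async-profiler itself:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChildJvmCheck {
    private ChildJvmCheck() {}

    /** Builds a command line for a child JVM that reuses this JVM's binary and class path. */
    static List<String> childJvmCommand(String... args) {
        String classPath = System.getProperty("java.class.path");
        if (classPath == null || classPath.isEmpty()) {
            classPath = ".";
        }
        List<String> cmd = new ArrayList<>();
        cmd.add(Paths.get(System.getProperty("java.home"), "bin", "java").toString());
        cmd.add("-cp");
        cmd.add(classPath);
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    /** Runs the command to completion and returns its exit code, discarding its output. */
    static int exitCode(List<String> cmd) {
        try {
            Process child = new ProcessBuilder(cmd)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // A JVM that crashed (for example on SIGSEGV) reports a non-zero exit code here.
            return child.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child JVM", e);
        }
    }
}
```

A test would then pass the name of a (hypothetical) main class that starts and stops each provided profile several times in a row, and assert that `exitCode(childJvmCommand(...))` is 0.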

On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
> Supportive of inclusion as well. General preference for invoking as a library 
> rather than forking processes.
> 
> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat 
> sheet.
> 
> I have seen cases where specific async-profiler/JVM/Cassandra version combos 
> (JDK11/4.1-derived source tree) will immediately crash the JVM on profile - 
> especially successive profile invocations on the same process - but have not 
> observed this on JDK21 or trunk-derived source trees. If we have user reports 
> of that happening, we’ll need to figure out how to reproduce and get to the 
> bottom of it.
> 
> – Scott
> 
> > On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
> > 
> > Thanks for bringing up this discussion, Doug. I didn't realize that 
> > async-profiler allows you to
> > bring it as a dependency. It looks pretty neat from what I could tell. I 
> > also think bringing
> > this to Cassandra as a dependency is a reasonable approach. We need to come 
> > up with
> > a solid way to expose this via JMX / vtable.
> > 
> > Best,
> > - Francisco
> > 
> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
> >> The nice thing from what I can tell about using the Java API per [6] below 
> >> is that you can literally just get an instance of the profiler and pass it 
> >> some commands in the `execute` method… just need to be careful how much of 
> >> that surface area we expose. Jon (and others obviously) I’d love to get 
> >> your take on how we could make a useful interface to the async-profiler, 
> >> maybe exposed via JMX, that doesn’t require someone to read the entirety 
> >> of the async-profiler docs and provides some useful profiles without the 
> >> rough edges (things like managing temp files so users don’t have to know 
> >> the layout of the filesystem C* is running on, for example, since at least 
> >> in the Sidecar we’d be executing this on behalf of a remote user, with all 
> >> of the constraints that implies).
> >> 
> >> We can always be more protective in the Sidecar than we are server-side as 
> >> well, but it seems like helping operators not do bad things is a good 
> >> thing.
> >> 
> >> Obviously we’d want the ability Cassandra-side to disable this 
> >> functionality altogether, however we implement it.
> >> 
> >> Doug
> >> 
> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >>> 
> >>> I'd be very happy to see async-profiler included with C*. I've made 
> >>> extensive use of it in my performance evaluations [1][2], and even posted 
> >>> a video about it [3] for general Java perf analysis (among others).  It's 
> >>> part of easy-cass-lab and is easily the most informative tool I've found 
> >>> for getting to the bottom of anything performance related.
> >>> 
> >>> There's probably a good case to be made for including it with the C* 
> >>> artifact as well as having it be something you can drop in. I lean 
> >>> towards including it all the time, but I haven't run it this way myself 
> >>> yet, so there might be some downside I'm unaware of.
> >>> 
> >>> When you call the asprof executable, it attaches the async-profiler to 
> >>> the running jvm using jattach [4].  We could do this as well, if we 
> >>> wanted to avoid including it with the release, but I don't know how much 
> >>> we really benefit from that.  I've run into issues with it when it's 
> >>> unable to detach correctly, and then you're unable to reattach it until 
> >>> after the server is restarted.  On the flip side, I don't know if you're 
> >>> able to set up all the same options for arbitrary profiling when it's 
> >>> loaded as an agent and turned on/off dynamically.  I think we can, based 
> >>> on the integration page [6], but I haven't tried it yet.  It would be a 
> >>> bummer if we only had a single mode of profiling available.  
> >>> 
> >>> The default mode, CPU profiling, is fantastic, but I've also made 
> >>> extensive use of allocation profiling [5] to identify perf issues as well 
> >>> so having that available is a must, imo. Wall clock / off cpu profiling 
> >>> is great for identifying when IO is the root cause, which isn't clearly 
> >>> revealed by on-cpu profiling due to the way threads are scheduled.  When 
> >>> I look at a system I typically do CPU / Wall / Alloc / Off-CPU to be 
> >>> thorough, and the last thing you want to do is have to restart between 
> >>> each one.  You can also specify specific Java methods, include or exclude 
> >>> frames matching specific regex, and a whole slew of other options.  The 
> >>> latest version even supports continuous profiling with heatmaps although 
> >>> I haven't tried it yet.  
> >>> 
> >>> So hopefully the option we go with allows all of that, otherwise the 
> >>> limits would impose more of a headache to me as I'd need to remove it and 
> >>> continue to bring my own.
> >>> 
> >>> Under the hood, the async-profiler uses Linux perf events + 
> >>> asynchronous polling of the Java stack to match them up and generate its 
> >>> reports.  As a result, it requires certain permissions to run and get all 
> >>> the details I like.  Specifically these kernel parameters:
> >>> 
> >>> sudo sysctl kernel.perf_event_paranoid=1
> >>> sudo sysctl kernel.kptr_restrict=0
> >>> 
> >>> You also need to enable some capabilities for off-cpu profiling:
> >>> 
> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
> >>> 
> >>> Then you can do off-cpu with this wild cryptic version (shout out to 
> >>> Andrei Pangin for helping me with this [7]):
> >>> 
> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
> >>> 
> >>> There are also some subtle issues when it's run in a container, since by 
> >>> default you don't have access to the perf_event_open syscall.  Just 
> >>> something to keep in mind.  This is one of my main grievances with 
> >>> container deployments.
> >>> 
> >>> Indeed Patrick, I am very happy to see this discussion!  Thanks Doug for 
> >>> starting the thread.
> >>> 
> >>> Jon
> >>> 
> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> >>> [3] 
> >>> https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> >>> [4] 
> >>> https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> >>> [6] 
> >>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
> >>> 
> >>> 
> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
> >>>> 
> >>>> After reading more about the project, the possibilities are pretty 
> >>>> interesting. I suspect we'll see this in a Haddad talk soon.
> >>>> 
> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive 
> >>>>> health check on a repo to assist in considering taking it as a 
> >>>>> dependency. The results can be found here: 
> >>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
> >>>>> 
> >>>>> Apparently it can, and can do it quite well. This was a useful time 
> >>>>> saver (and honestly did a better job than I usually can in > 10x the 
> >>>>> time)
> >>>>> 
> >>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest 
> >>>>> of the ecosystem can consume it (more easily if we move to a 
> >>>>> cassandra-shared regime shared library build as well), and it opens up 
> >>>>> some interesting opportunities for us in both how we test core C* 
> >>>>> proper and what we expose in tooling.
> >>>>> 
> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
> >>>>>> I'd prefer to avoid calling an external process and use the library if 
> >>>>>> possible. Not sure about including it in the project by default, but 
> >>>>>> also not against.
> >>>>>> 
> >>>>>> If there's contention about including it, I wonder if it would make 
> >>>>>> sense to explore Java's optional module extension [1] to make this 
> >>>>>> available optionally? I can see this being useful for other 
> >>>>>> extensions if we haven't explored that option.
> >>>>>> 
> >>>>>> Then we could have another project cassandra-sidecar-extensions (or 
> >>>>>> similar) that would be linked by sidecar/advanced operators to enable 
> >>>>>> an extended feature set in the main process.
> >>>>>> 
> >>>>>> 
> >>>>>> [1] -
> >>>>>> https://openjdk.org/projects/jigsaw/doc/topics/optional.html
> >>>>>> 
> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
> >>>>>> Hey folks!
> >>>>>> 
> >>>>>> We're looking into enabling the sidecar to collect async profiles from 
> >>>>>> Cassandra and, digging through the async-profiler code and usage, it 
> >>>>>> seems like there may be a few different ways to do it. I’m curious if 
> >>>>>> other folks have already done this beyond just “run asprof with the 
> >>>>>> pid of the Cassandra process”, as I’m a bit hesitant to depend on 
> >>>>>> executing an external process from the Sidecar to gather the actual 
> >>>>>> profile if we can avoid it.
> >>>>>> 
> >>>>>> There seem to be some opportunities to integrate the profiler into 
> >>>>>> another project (see 
> >>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
> >>>>>>  but it seems this would end up having to be part of Cassandra, and 
> >>>>>> somehow callable via the sidecar (JMX? Some virtual table interface 
> >>>>>> where you insert a row to start a profile with the profiler options, 
> >>>>>> and it kicks off the profile, dumping the results into the table when 
> >>>>>> it’s done?).
> >>>>>> 
> >>>>>> The benefit in putting this functionality into Cassandra would be that 
> >>>>>> other consumers (in-jvm dtests, python dtests, other monitoring 
> >>>>>> systems where Sidecar isn’t available, easy-cass-lab) would be able to 
> >>>>>> leverage the same interface rather than having to re-invent the wheel 
> >>>>>> each time.
> >>>>>> 
> >>>>>> Drawback is it’s another library, and one with native library 
> >>>>>> dependencies, added to the class path and loaded at runtime.
> >>>>>> 
> >>>>>> Thoughts? Previous experiences (good or bad)?
> >>>>>> 
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Doug
> >>>>> 
> >> 
> >> 
> 
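
Doug's point above about getting a profiler instance and passing commands to `execute`, while limiting the exposed surface and keeping users away from filesystem paths, could be sketched roughly as follows. `ProfileCommands` and its methods are hypothetical names. The `start,event=...` and `stop,file=...` strings assume the comma-separated argument syntax described on the integration page [6]. The sketch only builds the strings and deliberately does not depend on the async-profiler jar:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ProfileCommands {
    private ProfileCommands() {}

    /** A small, fixed set of modes, so callers cannot pass arbitrary profiler options. */
    enum Mode {
        CPU("cpu"), WALL("wall"), ALLOC("alloc");

        final String event;

        Mode(String event) {
            this.event = event;
        }
    }

    /** Command string to begin profiling in the given mode. */
    static String start(Mode mode) {
        return "start,event=" + mode.event;
    }

    /** Command string to stop profiling and write the result to the given file. */
    static String stop(Path output) {
        // Assumption: the path contains no commas, which would break the argument syntax.
        return "stop,file=" + output.toAbsolutePath();
    }

    /** Creates the output file server-side so the caller never has to choose a path. */
    static Path newOutputFile() {
        try {
            return Files.createTempFile("cassandra-profile-", ".html");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the library on the class path, the returned strings would presumably be handed to `AsyncProfiler.getInstance().execute(...)`; that call is left out here so the sketch stays self-contained.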

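For the JMX route Francisco and Doug mention, together with the Cassandra-side off switch, a minimal standard MBean might look something like the sketch below. Every name in it (`ProfilerJmxSketch`, `ProfilerControlMBean`, the `org.example.sketch` domain, the `started:` return value) is invented for illustration, and the `start` operation only validates and echoes the request rather than starting a real profile:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class ProfilerJmxSketch {
    private ProfilerJmxSketch() {}

    /** Hypothetical management interface; standard MBean interfaces must be public. */
    public interface ProfilerControlMBean {
        boolean isEnabled();

        String start(String mode);
    }

    /** Hypothetical implementation that validates the request but starts nothing. */
    public static final class ProfilerControl implements ProfilerControlMBean {
        private final boolean enabled;

        public ProfilerControl(boolean enabled) {
            this.enabled = enabled;
        }

        @Override
        public boolean isEnabled() {
            return enabled;
        }

        @Override
        public String start(String mode) {
            if (!enabled) {
                throw new IllegalStateException("profiling is disabled on this node");
            }
            if (!"cpu".equals(mode) && !"wall".equals(mode) && !"alloc".equals(mode)) {
                throw new IllegalArgumentException("unsupported mode: " + mode);
            }
            // A real implementation would start async-profiler here.
            return "started:" + mode;
        }
    }

    /** Registers the bean on the platform MBeanServer, replacing any earlier registration. */
    static ObjectName register(ProfilerControl control) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("org.example.sketch:type=ProfilerControl");
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            server.registerMBean(control, name);
            return name;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Invokes the start operation through the MBeanServer, as a JMX client would. */
    static Object invokeStart(ObjectName name, String mode) {
        try {
            return ManagementFactory.getPlatformMBeanServer()
                    .invoke(name, "start", new Object[] { mode }, new String[] { String.class.getName() });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Restricting `mode` to a short allow-list keeps the exposed surface small, and constructing the bean with `enabled` set to false gives operators a way to turn the feature off entirely.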