Thanks for starting this discussion, Doug. I didn't realize that async-profiler
can be pulled in as a dependency. It looks pretty neat from what I could tell,
and I think bringing it into Cassandra as a dependency is a reasonable approach.
We still need to come up with a solid way to expose it via JMX / vtable.
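
To make the JMX / vtable idea a bit more concrete, here is a rough sketch of what a restricted command surface might look like. Everything in it is hypothetical: `ProfilerCommands`, the list of allowed events, and the temp-file naming are made up for illustration, and it does not call async-profiler at all. It only builds the agent-style argument string (as I understand the format from the integration docs) that would then be passed to the profiler's `execute` method, so the string format should be verified against those docs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Hypothetical helper; not part of Cassandra or async-profiler.
final class ProfilerCommands {
    // Events a remote caller may request; anything else is rejected.
    // The list itself is an assumption made for this sketch.
    private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "alloc", "wall", "lock");

    private ProfilerCommands() {}

    // Builds a "start" argument string for a whitelisted event.
    // The output file is chosen by the server, never by the caller.
    static String start(String event, Path outputFile) {
        if (event == null || !ALLOWED_EVENTS.contains(event)) {
            throw new IllegalArgumentException("Unsupported event: " + event);
        }
        return "start,event=" + event + ",file=" + outputFile.toAbsolutePath();
    }

    // Builds the matching "stop" argument string that dumps to the same file.
    static String stop(Path outputFile) {
        return "stop,file=" + outputFile.toAbsolutePath();
    }

    // Creates the output file in the JVM's temp directory; the name prefix
    // and the .html extension are picked only for illustration.
    static Path newOutputFile() throws IOException {
        return Files.createTempFile("cassandra-profile-", ".html");
    }

    public static void main(String[] args) throws IOException {
        Path out = newOutputFile();
        // In a real integration these strings would be handed to the
        // profiler's execute method; here they are only printed.
        System.out.println(start("wall", out));
        System.out.println(stop(out));
    }
}
```

The idea is that a remote caller only picks an event from a short list, and the server decides where the output goes, so nobody has to know the filesystem layout of the node. A JMX operation or a vtable write could sit on top of something like this.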

Best,
- Francisco

On 2025/06/13 21:08:28 Doug Rohrer wrote:
> The nice thing from what I can tell about using the Java API per [6] below is 
> that you can literally just get an instance of the profiler and pass it some 
> commands in the `execute` method… just need to be careful how much of that 
> surface area we expose. Jon (and others obviously) I’d love to get your take 
> on how we could make a useful interface to the async-profiler, maybe exposed 
> via JMX, that doesn’t require someone to read the entirety of the 
> async-profiler docs and provides some useful profiles without the rough edges 
> (things like managing temp files so users don’t have to know the layout of 
> the filesystem C* is running on, for example, since at least in the Sidecar 
> we’d be executing this on behalf of a remote user, with all of the 
> constraints that implies).
> 
> We can always be more protective in the Sidecar than we are server-side as 
> well, but it seems like helping operators not do bad things is a good thing.
> 
> Obviously we’d want the ability Cassandra-side to disable this functionality 
> altogether, however we implement it.
> 
> Doug
> 
> > On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> > 
> > I'd be very happy to see async-profiler included with C*.  I've made 
> > extensive use of it in my performance evaluations [1][2], and even posted a 
> > video about it [3] for general Java perf analysis (among others).  It's 
> > part of easy-cass-lab and is easily the most informative tool I've found 
> > for getting to the bottom of anything performance related.
> > 
> > There's probably a good case to be made for including it with the C* 
> > artifact as well as having it be something you can drop in. I lean towards 
> > including it all the time, but I haven't run it this way myself yet, so 
> > there might be some downside I'm unaware of.
> > 
> > When you call the asprof executable, it attaches the async-profiler to the 
> > running jvm using jattach [4].  We could do this as well, if we wanted to 
> > avoid including it with the release, but I don't know how much we really 
> > benefit from that.  I've run into issues with it when it's unable to 
> > detach correctly, then you're unable to reattach it until after the server 
> > is restarted.  On the flip side, I don't know if you're able to set up all 
> > the same options for arbitrary profiling when it's loaded as an agent and 
> > turned on/off dynamically.  I think we can, based on the integration page 
> > [6], but I haven't tried it yet.  It would be a bummer if we only had a 
> > single mode of profiling available.  
> > 
> > The default mode, CPU profiling, is fantastic, but I've also made extensive 
> > use of allocation profiling [5] to identify perf issues as well so having 
> > that available is a must, imo. Wall clock / off cpu profiling is great for 
> > identifying when IO is the root cause, which isn't clearly revealed by 
> > on-cpu profiling due to the way threads are scheduled.  When I look at a 
> > system I typically do CPU / Wall / Alloc / Off-CPU to be thorough, and the 
> > last thing you want to do is have to restart between each one.  You can 
> > also specify specific Java methods, include or exclude frames matching 
> > specific regex, and a whole slew of other options.  The latest version even 
> > supports continuous profiling with heatmaps although I haven't tried it 
> > yet.  
> > 
> > So hopefully the option we go with allows all of that; otherwise the limits 
> > would be more of a headache for me, as I'd need to remove it and continue 
> > to bring my own.
> > 
> > Under the hood, the async-profiler uses Linux perf events + asynchronous 
> > polling of the Java stack to match them up and generate its reports.  As a 
> > result, it requires certain permissions to run and get all the details I 
> > like.  Specifically these kernel parameters:
> > 
> > sudo sysctl kernel.perf_event_paranoid=1
> > sudo sysctl kernel.kptr_restrict=0
> > 
> > You also need to enable some capabilities for off-cpu profiling:
> > 
> > sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
> > 
> > Then you can do off-cpu with this wild cryptic version (shout out to Andrei 
> > Pangin for helping me with this [7]):
> > 
> > asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
> > 
> > There are also some subtle issues when it's run in a container, since by 
> > default you don't have access to the perf_event_open syscall.  Just 
> > something to keep in mind.  This is one of my main grievances with 
> > container deployments.
> > 
> > Indeed Patrick, I am very happy to see this discussion!  Thanks Doug for 
> > starting the thread.
> > 
> > Jon
> > 
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> > [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> > [3] 
> > https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> > [4] 
> > https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> > [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> > [6] 
> > https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> > [7] https://github.com/async-profiler/async-profiler/issues/907
> > 
> > 
> > On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
> >> The fact o3 used "Bus-factor" as a dimension is just amazing. 
> >> 
> >> After reading more about the project, the possibilities are pretty 
> >> interesting. I suspect we'll see this in a Haddad talk soon. 
> >> 
> >> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >>> I was curious if o3 (model from OpenAI) would be able to do a deep dive 
> >>> health check on a repo to assist in considering taking it as a 
> >>> dependency. The results can be found here: 
> >>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
> >>> 
> >>> Apparently it can, and can do it quite well. This was a useful time saver 
> >>> (and honestly did a better job than I usually can in > 10x the time).
> >>> 
> >>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of 
> >>> the ecosystem can consume it (more easily if we move to a 
> >>> cassandra-shared regime shared library build as well), and it opens up 
> >>> some interesting opportunities for us in both how we test core C* proper 
> >>> and what we expose in tooling.
> >>> 
> >>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
> >>>> I'd prefer to avoid calling an external process and use the library if 
> >>>> possible. Not sure about including it in the project by default, but 
> >>>> also not against.
> >>>> 
> >>>> If there's contention about including it, I wonder if it would make 
> >>>> sense to explore Java's optional module extension [1] to make this 
> >>>> available optionally? I can see this being useful for other extensions 
> >>>> if we haven't explored that option.
> >>>> 
> >>>> Then we could have another project, cassandra-sidecar-extensions (or 
> >>>> similar), that would be linked by sidecar/advanced operators to enable 
> >>>> an extended feature set in the main process.
> >>>> 
> >>>> 
> >>>> [1] - 
> >>>> https://openjdk.org/projects/jigsaw/doc/topics/optional.html
> >>>> 
> >>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
> >>>> Hey folks!
> >>>> 
> >>>> We're looking into enabling the sidecar to collect async profiles from 
> >>>> Cassandra and, digging through the async-profiler code and usage, it 
> >>>> seems like there may be a few different ways to do it. I’m curious if 
> >>>> other folks have already done this beyond just “run asprof with the pid 
> >>>> of the Cassandra process”, as I’m a bit hesitant to depend on executing 
> >>>> an external process from the Sidecar to gather the actual profile if we 
> >>>> can avoid it.
> >>>> 
> >>>> There seem to be some opportunities to integrate the profiler into 
> >>>> another project (see 
> >>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
> >>>>  but it seems this would end up having to be part of Cassandra, and 
> >>>> somehow callable via the sidecar (JMX? Some virtual table interface 
> >>>> where you insert a row to start a profile with the profiler options, and 
> >>>> it kicks off the profile, dumping the results into the table when it’s 
> >>>> done?).
> >>>> 
> >>>> The benefit in putting this functionality into Cassandra would be that 
> >>>> other consumers (in-jvm dtests, python dtests, other monitoring systems 
> >>>> where Sidecar isn’t available, easy-cass-lab) would be able to leverage 
> >>>> the same interface rather than having to re-invent the wheel each time.
> >>>> 
> >>>> The drawback is that it’s another library, and one with native library 
> >>>> dependencies, added to the class path and loaded at runtime.
> >>>> 
> >>>> Thoughts? Previous experiences (good or bad)?
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> Doug
> >>> 
> 
> 
