Supportive of inclusion as well. General preference for invoking as a library 
rather than forking processes.
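
A minimal sketch of what a constrained library-side entry point might look like, assuming we expose only a handful of events and manage the output file ourselves. `ProfileCommandBuilder`, `startCommand` and `newOutputFile` are made-up names, not anything that exists in Cassandra or async-profiler. The comma-separated `start,event=...,file=...` string follows the async-profiler agent argument format as I understand it and should be checked against the docs for whatever version we'd ship.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

/** Hypothetical helper: builds a restricted async-profiler command string. */
final class ProfileCommandBuilder {
    // Assumption: these are the only events we choose to expose.
    private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "alloc", "wall", "lock");

    private ProfileCommandBuilder() {}

    /** Builds a start command, rejecting any event outside the allow-list. */
    static String startCommand(String event, Path output) {
        if (!ALLOWED_EVENTS.contains(event)) {
            throw new IllegalArgumentException("unsupported event: " + event);
        }
        return "start,event=" + event + ",file=" + output;
    }

    /** Creates the output file server-side so callers never supply a path. */
    static Path newOutputFile() throws IOException {
        return Files.createTempFile("profile-", ".jfr");
    }

    public static void main(String[] args) throws IOException {
        Path out = newOutputFile();
        // Only prints the command; nothing is attached or profiled here.
        System.out.println(startCommand("cpu", out));
        Files.deleteIfExists(out);
    }
}
```

With the async-profiler jar on the classpath, the resulting string is what would be handed to `AsyncProfiler.getInstance().execute(...)`; that call is left out so the sketch runs without the native library.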

Jon, thanks for the tips on off-CPU profiling - added to my personal cheat 
sheet.

I have seen cases where specific async-profiler/JVM/Cassandra version combos 
(JDK11/4.1-derived source tree) will immediately crash the JVM on profile - 
especially successive profile invocations on the same process - but have not 
observed this on JDK21 or trunk-derived source trees. If we have user reports 
of that happening, we’ll need to figure out how to reproduce and get to the 
bottom of it.
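
Independent of the root cause, one guard that might be worth having is to refuse a new profile while another is active and to let operators switch the feature off entirely. A rough sketch under that assumption follows; `ProfilerGate` and its methods are hypothetical names, and this does not address the crash itself.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard: one profile at a time, with an operator kill switch. */
final class ProfilerGate {
    private final AtomicBoolean enabled;
    private final AtomicBoolean running = new AtomicBoolean(false);

    ProfilerGate(boolean enabled) {
        this.enabled = new AtomicBoolean(enabled);
    }

    /** Operator switch, e.g. wired to a config property or JMX attribute. */
    void setEnabled(boolean value) {
        enabled.set(value);
    }

    /** Returns true if the caller may start a profile; false if disabled or busy. */
    boolean tryAcquire() {
        return enabled.get() && running.compareAndSet(false, true);
    }

    /** Must be called when the profile stops, whether it succeeded or not. */
    void release() {
        running.set(false);
    }

    public static void main(String[] args) {
        ProfilerGate gate = new ProfilerGate(true);
        System.out.println(gate.tryAcquire()); // true: first caller gets the slot
        System.out.println(gate.tryAcquire()); // false: a profile is already running
        gate.release();
        System.out.println(gate.tryAcquire()); // true: slot is free again
    }
}
```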

– Scott

> On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
> 
> Thanks for bringing up this discussion, Doug. I didn't realize that
> async-profiler allows you to bring it in as a dependency. It looks pretty
> neat from what I could tell. I also think bringing this to Cassandra as a
> dependency is a reasonable approach. We need to come up with a solid way to
> expose this via JMX / vtable.
> 
> Best,
> - Francisco
> 
>> On 2025/06/13 21:08:28 Doug Rohrer wrote:
>> The nice thing from what I can tell about using the Java API per [6] below 
>> is that you can literally just get an instance of the profiler and pass it 
>> some commands in the `execute` method… just need to be careful how much of 
>> that surface area we expose. Jon (and others obviously) I’d love to get your 
>> take on how we could make a useful interface to the async-profiler, maybe 
>> exposed via JMX, that doesn’t require someone to read the entirety of the 
>> async-profiler docs and provides some useful profiles without the rough 
>> edges (things like managing temp files so users don’t have to know the 
>> layout of the filesystem C* is running on, for example, since at least in 
>> the Sidecar we’d be executing this on behalf of a remote user, with all of 
>> the constraints that implies).
>> 
>> We can always be more protective in the Sidecar than we are server-side as 
>> well, but it seems like helping operators not do bad things is a good thing.
>> 
>> Obviously we’d want the ability Cassandra-side to disable this functionality 
>> altogether, however we implement it.
>> 
>> Doug
>> 
>>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> 
>>> I'd be very happy to see async-profiler included with C*.  I've made 
>>> extensive use of it in my performance evaluations [1][2], and even posted a 
>>> video about it [3] for general Java perf analysis (among others).  It's 
>>> part of easy-cass-lab and is easily the most informative tool I've found 
>>> for getting to the bottom of anything performance related.
>>> 
>>> There's probably a good case to be made for including it with the C* 
>>> artifact as well as having it be something you can drop in. I lean towards 
>>> including it all the time, but I haven't run it this way myself yet, so 
>>> there might be some downside I'm unaware of.
>>> 
>>> When you call the asprof executable, it attaches the async-profiler to the 
>>> running jvm using jattach [4].  We could do this as well, if we wanted to 
>>> avoid including it with the release, but I don't know how much we really 
>>> benefit from that.  I've run into issues with it when it's unable to 
>>> detach correctly, then you're unable to reattach it until after the server 
>>> is restarted.  On the flip side, I don't know if you're able to set up all 
>>> the same options for arbitrary profiling when it's loaded as an agent and 
>>> turned on/off dynamically.  I think we can, based on the integration page 
>>> [6], but I haven't tried it yet.  It would be a bummer if we only had a 
>>> single mode of profiling available.  
>>> 
>>> The default mode, CPU profiling, is fantastic, but I've also made extensive 
>>> use of allocation profiling [5] to identify perf issues as well so having 
>>> that available is a must, imo. Wall clock / off cpu profiling is great for 
>>> identifying when IO is the root cause, which isn't clearly revealed by 
>>> on-cpu profiling due to the way threads are scheduled.  When I look at a 
>>> system I typically do CPU / Wall / Alloc / Off-CPU to be thorough, and the 
>>> last thing you want to do is have to restart between each one.  You can 
>>> also specify specific Java methods, include or exclude frames matching 
>>> specific regex, and a whole slew of other options.  The latest version even 
>>> supports continuous profiling with heatmaps although I haven't tried it 
>>> yet.  
>>> 
>>> So hopefully the option we go with allows all of that; otherwise the limits 
>>> would be more of a headache for me, as I'd need to remove it and continue 
>>> to bring my own.
>>> 
>>> Under the hood, the async-profiler uses Linux perf events + asynchronous 
>>> polling of the Java stack to match them up and generate its reports.  As a 
>>> result, it requires certain permissions to run and get all the details I 
>>> like.  Specifically these kernel parameters:
>>> 
>>> sudo sysctl kernel.perf_event_paranoid=1
>>> sudo sysctl kernel.kptr_restrict=0
>>> 
>>> You also need to enable some capabilities for off-cpu profiling:
>>> 
>>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>>> 
>>> Then you can do off-cpu with this wild cryptic version (shout out to Andrei 
>>> Pangin for helping me with this [7]):
>>> 
>>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
>>> 
>>> There are also some subtle issues when it's run in a container, since by 
>>> default you don't have access to the perf_event_open syscall.  Just 
>>> something to keep in mind.  This is one of my main grievances with 
>>> container deployments.
>>> 
>>> Indeed Patrick, I am very happy to see this discussion!  Thanks Doug for 
>>> starting the thread.
>>> 
>>> Jon
>>> 
>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
>>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
>>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
>>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
>>> [7] https://github.com/async-profiler/async-profiler/issues/907
>>> 
>>> 
>>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>>> 
>>>> After reading more about the project, the possibilities are pretty 
>>>> interesting. I suspect we'll see this in a Haddad talk soon.
>>>> 
>>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive 
>>>>> health check on a repo to assist in considering taking it as a 
>>>>> dependency. The results can be found here: 
>>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>>>> 
>>>>> Apparently it can, and can do it quite well. This was a useful time saver 
>>>>> (and honestly did a better job than I usually can in > 10x the time)
>>>>> 
>>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of 
>>>>> the ecosystem can consume it (more easily if we move to a 
>>>>> cassandra-shared regime shared library build as well), and it opens up 
>>>>> some interesting opportunities for us in both how we test core C* proper 
>>>>> and what we expose in tooling.
>>>>> 
>>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>>>>> I'd prefer to avoid calling an external process and use the library if 
>>>>>> possible. Not sure about including it in the project by default, but 
>>>>>> also not against.
>>>>>> 
>>>>>> If there's contention about including it, I wonder if it would make 
>>>>>> sense to explore Java's optional module extension [1] to make this 
>>>>>> available optionally? I can see this being useful for other extensions 
>>>>>> if we haven't explored that option.
>>>>>> 
>>>>>> Then we could have another project cassandra-sidecar-extensions (or 
>>>>>> similar) that would be linked by sidecar/advanced operators to enable 
>>>>>> an extended feature set in the main process.
>>>>>> 
>>>>>> 
>>>>>> [1] https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>>>>> 
>>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
>>>>>> Hey folks!
>>>>>> 
>>>>>> We're looking into enabling the sidecar to collect async profiles from 
>>>>>> Cassandra and, digging through the async-profiler code and usage, it 
>>>>>> seems like there may be a few different ways to do it. I’m curious if 
>>>>>> other folks have already done this beyond just “run asprof with the pid 
>>>>>> of the Cassandra process”, as I’m a bit hesitant to depend on executing 
>>>>>> an external process from the Sidecar to gather the actual profile if we 
>>>>>> can avoid it.
>>>>>> 
>>>>>> There seem to be some opportunities to integrate the profiler into 
>>>>>> another project (see 
>>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api),
>>>>>> but it seems this would end up having to be part of Cassandra, and 
>>>>>> somehow callable via the sidecar (JMX? Some virtual table interface 
>>>>>> where you insert a row to start a profile with the profiler options, and 
>>>>>> it kicks off the profile, dumping the results into the table when it’s 
>>>>>> done?).
>>>>>> 
>>>>>> The benefit in putting this functionality into Cassandra would be that 
>>>>>> other consumers (in-jvm dtests, python dtests, other monitoring systems 
>>>>>> where Sidecar isn’t available, easy-cass-lab) would be able to leverage 
>>>>>> the same interface rather than having to re-invent the wheel each time.
>>>>>> 
>>>>>> The drawback is that it’s another library, and one with native library 
>>>>>> dependencies, added to the class path and loaded at runtime.
>>>>>> 
>>>>>> Thoughts? Previous experiences (good or bad)?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Doug
>>>>> 
>> 
>> 
