Thanks for starting this discussion, Doug. I didn't realize that async-profiler can be pulled in as a dependency. It looks pretty neat from what I can tell. I also think adding it to Cassandra as a dependency is a reasonable approach. We need to come up with a solid way to expose it via JMX or a virtual table.
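
To make the JMX idea a bit more concrete, here is a rough sketch of a restricted management surface, not a proposal for the final interface. Everything named in it is hypothetical (`ProfilerJmxSketch`, `ProfilerControlMBean`, the `org.example.sketch` object name, the allowed event list). The backend is a plain `Function<String, String>` standing in for the profiler's `execute` call, so the sketch runs without the native library. The `start,event=...` and `stop,file=...` command strings follow the agent-argument format as I understand it from the integration docs and should be checked against them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.function.Function;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class ProfilerJmxSketch {

    /** Hypothetical management interface: fixed modes only, no raw command strings. */
    public interface ProfilerControlMBean {
        String start(String event);
        String stop();
    }

    public static final class ProfilerControl implements ProfilerControlMBean {
        // Assumed allow-list; the real set of exposed modes is still to be decided.
        private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "alloc", "wall", "lock");

        private final Function<String, String> backend; // stand-in for the profiler's execute call
        private final boolean enabled;                  // lets operators switch the feature off
        private String runningEvent;                    // non-null while a profile is running

        public ProfilerControl(Function<String, String> backend, boolean enabled) {
            this.backend = backend;
            this.enabled = enabled;
        }

        @Override
        public synchronized String start(String event) {
            if (!enabled) {
                throw new IllegalStateException("profiling is disabled");
            }
            if (!ALLOWED_EVENTS.contains(event)) {
                throw new IllegalArgumentException("unsupported event: " + event);
            }
            if (runningEvent != null) {
                throw new IllegalStateException("a profile is already running");
            }
            String result = backend.apply("start,event=" + event);
            runningEvent = event;
            return result;
        }

        @Override
        public synchronized String stop() {
            if (runningEvent == null) {
                throw new IllegalStateException("no profile is running");
            }
            try {
                // The server picks the output location, so callers never need
                // to know the filesystem layout of the node.
                Path file = Files.createTempFile("profile-" + runningEvent + "-", ".html");
                backend.apply("stop,file=" + file);
                runningEvent = null;
                return file.toString();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Echo backend for demonstration only; nothing is actually profiled.
        ProfilerControl control = new ProfilerControl(cmd -> "executed: " + cmd, true);
        ObjectName name = new ObjectName("org.example.sketch:type=ProfilerControl");
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(new StandardMBean(control, ProfilerControlMBean.class), name);
        System.out.println(control.start("wall"));
        System.out.println("profile written to " + control.stop());
    }
}
```

The point of the allow-list and the server-managed temp file is to keep the exposed surface small; anything beyond that (regex frame filters, kprobe events, output formats) would need its own decision about whether to expose it.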

Best,
- Francisco

On 2025/06/13 21:08:28 Doug Rohrer wrote:
> The nice thing from what I can tell about using the Java API per [6] below is
> that you can literally just get an instance of the profiler and pass it some
> commands in the `execute` method… just need to be careful how much of that
> surface area we expose. Jon (and others, obviously), I’d love to get your take
> on how we could make a useful interface to the async-profiler, maybe exposed
> via JMX, that doesn’t require someone to read the entirety of the
> async-profiler docs and provides some useful profiles without the rough edges
> (things like managing temp files so users don’t have to know the layout of
> the filesystem C* is running on, for example, since at least in the Sidecar
> we’d be executing this on behalf of a remote user, with all of the
> constraints that implies).
>
> We can always be more protective in the Sidecar than we are server-side as
> well, but it seems like helping operators not do bad things is a good thing.
>
> Obviously we’d want the ability Cassandra-side to disable this functionality
> altogether, however we implement it.
>
> Doug
>
>
> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > I'd be very happy to see async-profiler included with C*. I've made
> > extensive use of it in my performance evaluations [1][2], and even posted a
> > video about it [3] for general Java perf analysis (among others). It's
> > part of easy-cass-lab and is easily the most informative tool I've found
> > for getting to the bottom of anything performance related.
> >
> > There's probably a good case to be made for including it with the C*
> > artifact as well as having it be something you can drop in. I lean towards
> > including it all the time, but I haven't run it this way myself yet, so
> > there might be some downside I'm unaware of.
> >
> > When you call the asprof executable, it attaches the async-profiler to the
> > running JVM using jattach [4].
> > We could do this as well, if we wanted to
> > avoid including it with the release, but I don't know how much we really
> > benefit from that. I've run into issues with it when it's unable to
> > detach correctly; then you're unable to reattach it until after the server
> > is restarted. On the flip side, I don't know if you're able to set up all
> > the same options for arbitrary profiling when it's loaded as an agent and
> > turned on/off dynamically. I think we can, based on the integration page
> > [6], but I haven't tried it yet. It would be a bummer if we only had a
> > single mode of profiling available.
> >
> > The default mode, CPU profiling, is fantastic, but I've also made extensive
> > use of allocation profiling [5] to identify perf issues as well, so having
> > that available is a must, imo. Wall clock / off-CPU profiling is great for
> > identifying when IO is the root cause, which isn't clearly revealed by
> > on-CPU profiling due to the way threads are scheduled. When I look at a
> > system I typically do CPU / Wall / Alloc / Off-CPU to be thorough, and the
> > last thing you want to do is have to restart between each one. You can
> > also specify specific Java methods, include or exclude frames matching
> > specific regex, and a whole slew of other options. The latest version even
> > supports continuous profiling with heatmaps, although I haven't tried it
> > yet.
> >
> > So hopefully the option we go with allows all of that; otherwise the limits
> > would be more of a headache for me, as I'd need to remove it and continue
> > to bring my own.
> >
> > Under the hood, the async-profiler uses Linux perf events + asynchronous
> > polling of the Java stack to match them up and generate its reports. As a
> > result, it requires certain permissions to run and get all the details I
> > like.
> > Specifically, these kernel parameters:
> >
> >   sudo sysctl kernel.perf_event_paranoid=1
> >   sudo sysctl kernel.kptr_restrict=0
> >
> > You also need to enable some capabilities for off-CPU profiling:
> >
> >   sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
> >
> > Then you can do off-CPU with this wild cryptic version (shout out to Andrei
> > Pangin for helping me with this [7]):
> >
> >   asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
> >
> > There are also some subtle issues when it's run in a container, since by
> > default you don't have access to the perf_event_open syscall. Just
> > something to keep in mind. This is one of my main grievances with
> > container deployments.
> >
> > Indeed Patrick, I am very happy to see this discussion! Thanks Doug for
> > starting the thread.
> >
> > Jon
> >
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> > [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> > [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> > [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> > [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> > [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> > [7] https://github.com/async-profiler/async-profiler/issues/907
> >
> >
> > On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
> >> The fact o3 used "Bus-factor" as a dimension is just amazing.
> >>
> >> After reading more about the project, the possibilities are pretty
> >> interesting. I suspect we'll see this in a Haddad talk soon.
> >>
> >> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >>> I was curious if o3 (model from OpenAI) would be able to do a deep dive
> >>> health check on a repo to assist in considering taking it as a
> >>> dependency. The results can be found here:
> >>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
> >>>
> >>> Apparently it can, and can do it quite well. This was a useful time saver
> >>> (and honestly did a better job than I usually can in >10x the time).
> >>>
> >>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of
> >>> the ecosystem can consume it (more easily if we move to a
> >>> cassandra-shared regime shared library build as well), and it opens up
> >>> some interesting opportunities for us in both how we test core C* proper
> >>> and what we expose in tooling.
> >>>
> >>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
> >>>> I'd prefer to avoid calling an external process and use the library if
> >>>> possible. Not sure about including it in the project by default, but
> >>>> also not against.
> >>>>
> >>>> If there's contention about including it, I wonder if it would make
> >>>> sense to explore Java's optional module extension [1] to make this
> >>>> available optionally? I can see this being useful for other extensions
> >>>> if we haven't explored that option.
> >>>>
> >>>> Then we could have another project, cassandra-sidecar-extensions (or
> >>>> similar), that would be linked by sidecar/advanced operators to enable
> >>>> an extended feature set in the main process.
> >>>>
> >>>>
> >>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
> >>>>
> >>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
> >>>> Hey folks!
> >>>>
> >>>> We're looking into enabling the sidecar to collect async profiles from
> >>>> Cassandra and, digging through the async-profiler code and usage, it
> >>>> seems like there may be a few different ways to do it. I’m curious if
> >>>> other folks have already done this beyond just “run asprof with the pid
> >>>> of the Cassandra process”, as I’m a bit hesitant to depend on executing
> >>>> an external process from the Sidecar to gather the actual profile if we
> >>>> can avoid it.
> >>>>
> >>>> There seem to be some opportunities to integrate the profiler into
> >>>> another project (see
> >>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
> >>>> but it seems this would end up having to be part of Cassandra, and
> >>>> somehow callable via the sidecar (JMX? Some virtual table interface
> >>>> where you insert a row to start a profile with the profiler options, and
> >>>> it kicks off the profile, dumping the results into the table when it’s
> >>>> done?).
> >>>>
> >>>> The benefit of putting this functionality into Cassandra would be that
> >>>> other consumers (in-jvm dtests, python dtests, other monitoring systems
> >>>> where Sidecar isn’t available, easy-cass-lab) would be able to leverage
> >>>> the same interface rather than having to re-invent the wheel each time.
> >>>>
> >>>> The drawback is that it’s another library, and one with native library
> >>>> dependencies, added to the class path and loaded at runtime.
> >>>>
> >>>> Thoughts? Previous experiences (good or bad)?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Doug