> The fact o3 used "Bus-factor" as a dimension is just amazing.

Yeah - that got me too.

On Fri, Jun 13, 2025, at 2:38 PM, Jon Haddad wrote:
> I'd be very happy to see async-profiler included with C*. I've made extensive
> use of it in my performance evaluations [1][2], and even posted a video about
> it [3] for general Java perf analysis (among others). It's part of
> easy-cass-lab and is easily the most informative tool I've found for
> getting to the bottom of anything performance related.
>
> There's probably a good case to be made for including it with the C* artifact
> as well as having it be something you can drop in. I lean towards including
> it all the time, but I haven't run it this way myself yet, so there might be
> some downside I'm unaware of.
>
> When you call the asprof executable, it attaches the async-profiler to the
> running JVM using jattach [4]. We could do this as well, if we wanted to
> avoid including it with the release, but I don't know how much we really
> benefit from that. I've run into issues with it when it's unable to detach
> correctly, and then you're unable to reattach it until after the server is
> restarted. On the flip side, I don't know if you're able to set up all the
> same options for arbitrary profiling when it's loaded as an agent and turned
> on/off dynamically. I think we can, based on the integration page [6], but I
> haven't tried it yet. It would be a bummer if we only had a single mode of
> profiling available.
>
> The default mode, CPU profiling, is fantastic, but I've also made extensive
> use of allocation profiling [5] to identify perf issues as well, so having
> that available is a must, imo. Wall clock / off-cpu profiling is great for
> identifying when IO is the root cause, which isn't clearly revealed by on-cpu
> profiling due to the way threads are scheduled. When I look at a system I
> typically do CPU / Wall / Alloc / Off-CPU to be thorough, and the last thing
> you want to do is have to restart between each one.
> You can also target specific Java methods, include or exclude frames matching
> a regex, and use a whole slew of other options. The latest version even
> supports continuous profiling with heatmaps, although I haven't tried it yet.
>
> So hopefully the option we go with allows all of that, otherwise the limits
> would impose more of a headache on me, as I'd need to remove it and continue
> to bring my own.
>
> Under the hood, the async-profiler uses Linux perf events + asynchronous
> polling of the Java stack to match them up and generate its reports. As a
> result, it requires certain permissions to run and get all the details I
> like. Specifically these kernel parameters:
>
> sudo sysctl kernel.perf_event_paranoid=1
> sudo sysctl kernel.kptr_restrict=0
>
> You also need to enable some capabilities for off-cpu profiling:
>
> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>
> Then you can do off-cpu with this wild cryptic version (shout out to Andrei
> Pangin for helping me with this [7]):
>
> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
>
> There are also some subtle issues when it's run in a container, since by
> default you don't have access to the perf_event_open syscall. Just something
> to keep in mind. This is one of my main grievances with container
> deployments.
>
> Indeed Patrick, I am very happy to see this discussion! Thanks Doug for
> starting the thread.
>
> Jon
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> [7] https://github.com/async-profiler/async-profiler/issues/907
>
>
> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>
>> After reading more about the project, the possibilities are pretty
>> interesting. I suspect we'll see this in a Haddad talk soon.
>>
>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive
>>> health check on a repo to assist in considering taking it as a dependency.
>>> The results can be found here:
>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>>
>>> Apparently it can, and can do it quite well. This was a useful time saver
>>> (and honestly did a better job than I usually can in > 10x the time).
>>>
>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of
>>> the ecosystem can consume it (more easily if we move to a cassandra-shared
>>> regime shared library build as well), and it opens up some interesting
>>> opportunities for us in both how we test core C* proper and what we expose
>>> in tooling.
>>>
>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>>> I'd prefer to avoid calling an external process and use the library if
>>>> possible. Not sure about including it in the project by default, but also
>>>> not against.
>>>>
>>>> If there's contention about including it, I wonder if it would make sense
>>>> to explore Java's optional module extension [1] to make this available
>>>> optionally? I can see this being useful for other extensions if we
>>>> haven't explored that option.
>>>>
>>>> Then we could have another project cassandra-sidecar-extensions (or
>>>> similar) that would be linked by sidecar/advanced operators to enable
>>>> an extended feature set in the main process.
>>>>
>>>>
>>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>>>
>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
>>>>> Hey folks!
>>>>>
>>>>> We're looking into enabling the sidecar to collect async profiles from
>>>>> Cassandra and, digging through the async-profiler code and usage, it
>>>>> seems like there may be a few different ways to do it. I’m curious if
>>>>> other folks have already done this beyond just “run asprof with the pid
>>>>> of the Cassandra process”, as I’m a bit hesitant to depend on executing
>>>>> an external process from the Sidecar to gather the actual profile if we
>>>>> can avoid it.
>>>>>
>>>>> There seem to be some opportunities to integrate the profiler into
>>>>> another project (see
>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
>>>>> but it seems this would end up having to be part of Cassandra, and
>>>>> somehow callable via the sidecar (JMX? Some virtual table interface where
>>>>> you insert a row to start a profile with the profiler options, and it
>>>>> kicks off the profile, dumping the results into the table when it’s
>>>>> done?).
>>>>>
>>>>> The benefit in putting this functionality into Cassandra would be that
>>>>> other consumers (in-jvm dtests, python dtests, other monitoring systems
>>>>> where Sidecar isn’t available, easy-cass-lab) would be able to leverage
>>>>> the same interface rather than having to re-invent the wheel each time.
>>>>>
>>>>> The drawback is it’s another library, and one with native library
>>>>> dependencies, added to the class path and loaded at runtime.
>>>>>
>>>>> Thoughts? Previous experiences (good or bad)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Doug
>>>
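
For anyone following along, the "run asprof with the pid of the Cassandra process" baseline Doug mentions, combined with the CPU / Wall / Alloc pass Jon describes, could look roughly like the dry-run sketch below. It only prints the commands; nothing is attached to a JVM. The function name `profile_pass_commands`, the 30 second default duration, the `/tmp` default output directory and the `.html` file naming are all assumptions made for illustration and do not come from the thread. Off-CPU profiling is left out because it needs the extra capabilities and the kprobe invocation quoted above.

```shell
#!/bin/sh
# Dry-run sketch (hypothetical helper, not from the thread): print one asprof
# invocation per profiling mode so the pass can be reviewed before running it.
#
# usage: profile_pass_commands PID [DURATION_SECONDS] [OUTPUT_DIR]
profile_pass_commands() {
    # Reject anything that is not a plain numeric PID.
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: profile_pass_commands PID [DURATION_SECONDS] [OUTPUT_DIR]" >&2
            return 1
            ;;
    esac

    # cpu, wall and alloc are async-profiler event names.
    # The duration (30) and output directory (/tmp) defaults are assumptions.
    for event in cpu wall alloc; do
        printf 'asprof -e %s -d %s -f %s/%s-%s.html %s\n' \
            "$event" "${2:-30}" "${3:-/tmp}" "$event" "$1" "$1"
    done
}
```

Calling `profile_pass_commands 1234 10 /tmp/prof` prints three lines, one each for cpu, wall and alloc, which can be pasted into a shell once the kernel settings above are in place. Keeping it as a dry run sidesteps the detach/reattach problem Jon describes until the commands have been reviewed.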