>
> I tried to debug more and could see using top that the Command is
> MutationStage in the top output. Any clue we can get from this?
>
That just means there are lots of writes hitting your cluster. Without a
thread dump, it's difficult to tell whether the threads are actually blocked
in futex_wait or something else.
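A quick sanity check without a full dump is to look at each thread's kernel
wait channel; the PID below is a placeholder for your Cassandra process:

    # count which kernel functions the JVM threads are currently waiting in
    ps -p <cassandra_pid> -L -o wchan= | sort | uniq -c | sort -rn

If most of them show something like futex_wait_queue_me, that at least
confirms where they're parked, though you still need stacks to see why.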
I have limited options for JDK-based tools because in our environment we are
running a JRE only.
I tried to debug more and could see using top that the Command is
MutationStage in the top output. Any clue we can get from this?
top - 16:30:47 up 94 days, 5:33, 1 user, load average: 134.83, 142.48, 144.75
Ta
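One way to get more out of that top output is to run top in thread mode and
map the hottest native threads back to Java stacks; the PID, TID and dump
filename below are placeholders:

    # per-thread CPU for the Cassandra process
    top -H -p <cassandra_pid>
    # convert the busiest TID from top into hex
    printf '0x%x\n' <tid>
    # find that thread in a thread dump by its nid
    grep -A 20 'nid=0x<hex_tid>' threaddump.txt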
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler)
flamegraphs can also be a really good way to figure out the exact call graph
that's leading to the futex_wait, both in and out of the JVM.
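If you can drop the async-profiler binaries onto the box, something along
these lines should produce a CPU flamegraph (the exact flags depend on the
async-profiler version; the PID and output path are placeholders):

    # sample CPU for 60 seconds and write a flamegraph
    ./profiler.sh -d 60 -e cpu -f /tmp/cassandra-cpu.svg <cassandra_pid>

Since it samples via perf_events, kernel frames such as the futex path show
up in the same graph as the Java frames.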
Sure Eric...
I tried strace as well...
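If you go the strace route, it's worth limiting it to futex calls and just
counting them, because strace can slow a busy JVM down considerably; for
example (the PID is a placeholder):

    # count futex syscalls across all threads, Ctrl-C for the summary
    strace -f -e trace=futex -c -p <cassandra_pid>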
Surbhi, just a *friendly* reminder that it's customary to reply to the
mailing list instead of emailing me directly, so that everyone else on the
list can participate. ☺
> I tried taking a thread dump using kill -3 but it just came back and
> no file was generated.
> How do you take the thread dump?
The bug is in the kernel - it'd be worth looking at your specific kernel
via `uname -a` just to confirm you're not somehow running an old kernel. If
you're sure you're on a good kernel, then yea, thread inspection is your
next step.
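On the thread dump itself: kill -3 doesn't write a file, it makes the JVM
print the dump to its own standard output, so it ends up wherever Cassandra's
stdout is redirected (often something like /var/log/cassandra/output.log, but
the exact location depends on how the node was started):

    # request a dump and then look in the stdout log; paths are examples only
    kill -3 $(pgrep -f CassandraDaemon)
    tail -n 200 /var/log/cassandra/output.log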
https://github.com/aragozin/jvm-tools/blob/master/sjk-core/docs/TT
I wrote that article 5 years ago but I didn't think it would still be
relevant today. 😁
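From memory, the usage is roughly the following; the PID is a placeholder and
the linked docs have the exact options:

    # live per-thread CPU usage with Java thread names, top 20 by CPU
    java -jar sjk.jar ttop -p <cassandra_pid> -o CPU -n 20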
Have you tried taking a thread dump to see which threads are the most
dominant? That's the most effective way of troubleshooting high-CPU
situations; even a rough count of the thread names in the dump (see below)
usually shows which pool dominates. Cheers!
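Something like this is usually enough to spot the dominant pool (the dump
filename is a placeholder):

    # count thread-pool names in a dump, stripping the per-thread index
    grep '^"' threaddump.txt | cut -d'"' -f2 | sed 's/[-:][0-9]*$//' | sort | uniq -c | sort -rn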
>
Hi,
We have noticed that in a Cassandra cluster one of the nodes has 100% CPU
utilization; using top we can see that the Cassandra process is showing
futex_wait.
We are on CentOS release 6.10 (Final). As per the document below, the futex
bug was on CentOS 6.6:
https://support.datastax.com/hc/en-us/articles
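The quickest check against what that article describes is to compare the
running kernel with what is installed (if I remember right, CentOS 6.6
shipped a 2.6.32-504.x kernel and 6.10 is on 2.6.32-754.x, so an up-to-date
and rebooted 6.10 node should be well past the affected builds):

    # running kernel vs. installed kernel packages
    uname -r
    rpm -q kernel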
Hi,
We have an 11/11 node cluster running Cassandra version 2.1.15.
We are observing that 3 nodes from each data center are becoming unresponsive
for short periods of time.
This behavior is happening only on 6 nodes (i.e. 3 from each data center),
and we are seeing that GossipStage has a lot of pending tasks.
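For anyone following along, the usual way to watch those pending counts is
nodetool tpstats, where GossipStage shows up in the thread pool table:

    # per-node thread pool stats; watch the Pending column for GossipStage
    nodetool tpstats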