+1 to a heap dump, but with some caveats.  Heap dumps are great if you know
what you're looking for and are familiar with the tooling around them, but
I've run into large dumps in the past that were effectively unusable because
every tool I tried would either lock up or crash.

These days I usually reach for async-profiler.  If you want to know what's
being allocated in a given window of time, use the `-e alloc` mode and you
can find out pretty quickly where your allocations are coming from.
CASSANDRA-20428 [1] is a good example, where I found a single call in
compaction generating 40% of allocations.
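
For reference, attaching it to a running node looks roughly like this (the
script name is from async-profiler 2.x and the PID is a placeholder; newer
releases ship the same thing as the `asprof` binary):

    # sample allocations for 60 seconds and write a flame graph
    ./profiler.sh -e alloc -d 60 -f /tmp/alloc-flamegraph.html <cassandra-pid>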

It wouldn't surprise me if there were a ton of hints sent over (since the
nodes were down for hours), then lots of pressure from unthrottled compaction
and / or a small heap or small new gen caused the old gen to get flooded with
objects.  Just a guess, there's not much to go on in the original question.
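
If you want to sanity-check that theory on an affected node, the usual
nodetool output is a quick way to do it (all standard commands, the
throughput value below is just an example):

    nodetool tpstats                       # pending / blocked thread pools, including hint delivery
    nodetool compactionstats               # pending compactions and what's currently running
    nodetool setcompactionthroughput 16    # throttle compaction (MB/s) if it turns out to be the culprit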

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-20428

On Mon, Dec 1, 2025 at 1:04 PM Elliott Sims via user <
[email protected]> wrote:

> A heap dump is a good start.  You can also turn on detailed GC logging and
> look through that.  I definitely find it useful to check "heap size after
> full GC" (via jconsole, collected metrics, GC logging, or tools like jstat
> or nodetool gcstats) and the heap allocation rate to figure out whether it's
> a problem of "heap too small for the live data set" vs "GC can't keep up".
> "nodetool sjk ttop -o ALLOC" can give you a good idea of both the allocation
> rate and what's doing the allocating.
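>
> As a rough sketch, the jstat and sjk invocations I have in mind look like
> this (the PID and sampling interval are placeholders):
>
>     jstat -gcutil <cassandra-pid> 1000    # GC utilization, sampled every second
>     nodetool gcstats                      # pause stats since the last time it was called
>     nodetool sjk ttop -o ALLOC -n 20      # top 20 threads by allocation rate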
>
> There are lots of commercial tools, but Eclipse MAT's heap analyzer does a
> decent job of finding the major heap space consumers.  It requires jumping
> through some extra hoops for heaps that are large relative to local memory,
> though.
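>
> If it helps, MAT also ships a headless parser, so the heavy lifting can be
> done on a bigger box; a rough sketch (the path is a placeholder, and you may
> need to raise -Xmx in MemoryAnalyzer.ini first):
>
>     ./ParseHeapDump.sh /path/to/heap.hprof org.eclipse.mat.api:suspects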
>
> On Fri, Nov 28, 2025 at 3:01 AM Michalis Kotsiouros (EXT) via user <
> [email protected]> wrote:
>
>> Hello community,
>>
>> I have recently faced the following problem in a Cassandra cluster. There
>> were 2 datacenters with 15 Cassandra nodes each, running version 4.1.x.
>>
>> Some of the Cassandra nodes were gracefully stopped for a couple of hours
>> for administrative purposes.
>>
>> Some time after those Cassandra nodes were started again, other Cassandra
>> nodes started reporting long GC pauses. The situation deteriorated over
>> time, resulting in some of them restarting due to OOM. The rest of the
>> impacted Cassandra nodes, which did not restart due to OOM, were
>> administratively restarted, and the system fully recovered.
>>
>> I suppose that some background operation was keeping the impacted
>> Cassandra nodes busy, and the symptom was intensive use of heap memory and
>> thus the long GC pauses, which caused a major performance hit.
>>
>> My main question is whether you are aware of any ways to identify what a
>> Cassandra node is doing internally, to facilitate troubleshooting of such
>> cases. My ideas so far are to produce and analyze a heap dump of the
>> Cassandra process on a misbehaving node and to collect and analyze the
>> Thread Pool statistics provided by the JMX interface. Do you have similar
>> troubleshooting requirements in your deployments, and if so, what did you
>> do? Are you aware of any article on this specific topic?
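>>
>> Concretely, what I have in mind is something like the following (the PID
>> and file path are placeholders):
>>
>>     jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>
>>     nodetool tpstats    # Thread Pool statistics exposed over JMX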
>>
>> Thank you in advance!
>>
>>
>>
>> BR
>>
>> MK
>>
>
