We are using Cassandra 3.x.

Recently, our production database has been going through some instability. One 
of our nodes keeps going down, anywhere from once every 2 days to a few times a 
day. The node goes down because the JVM runs out of memory. Based on my 
investigation, I suspect this is related to writing and/or compacting the large 
partitions in some of our large data tables. Here is what might have happened:
1. The node went OOM because, under some conditions, it could not deserialize 
or compact some large partitions due to memory constraints.
2. Once we restarted it, which was usually a few hours later, the other nodes 
in the cluster tried to perform hinted handoff to the restarted node to fill in 
the missing data. From then on, the restarted node had to handle the handoff 
plus the normal data load, which made it even busier.
3. The node was not able to complete the handoff and went down again.
4. This repeated again and again.
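
One rough way to check the large-partition suspicion is to look at the 
"Compacted partition maximum bytes" figure that nodetool tablestats reports for 
each table. The sketch below is only an illustration: it parses saved tablestats 
text and lists tables whose largest partition exceeds a threshold. The 
keyspace and table names in SAMPLE are made up, the 100 MB threshold is a rule 
of thumb rather than a Cassandra limit, and the exact output layout may differ 
between versions, so the regular expressions are an assumption.

```python
import re

# Hypothetical threshold: 100 MB is a rule of thumb, not a Cassandra limit.
LARGE_PARTITION_BYTES = 100 * 1024 * 1024

# Made-up excerpt in the general shape of `nodetool tablestats` output.
SAMPLE = """Keyspace : demo_ks
    Read Count: 0
        Table: events
        Compacted partition maximum bytes: 386857368
        Table: users
        Compacted partition maximum bytes: 1916
"""


def find_large_partitions(tablestats_text, threshold=LARGE_PARTITION_BYTES):
    """Return (keyspace, table, max_partition_bytes) tuples over the threshold,
    largest first."""
    results = []
    keyspace = table = None
    for raw in tablestats_text.splitlines():
        line = raw.strip()
        m = re.match(r"Keyspace\s*:\s*(\S+)", line)
        if m:
            keyspace, table = m.group(1), None
            continue
        m = re.match(r"Table(?: \(index\))?\s*:\s*(\S+)", line)
        if m:
            table = m.group(1)
            continue
        m = re.match(r"Compacted partition maximum bytes\s*:\s*(\d+)", line)
        if m and table is not None:
            size = int(m.group(1))
            if size > threshold:
                results.append((keyspace, table, size))
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    for ks, tbl, size in find_large_partitions(SAMPLE):
        print(f"{ks}.{tbl}: largest partition {size / 1024 / 1024:.1f} MB")
```

To use it on a real node, save the output of nodetool tablestats to a file and 
pass the file contents in place of SAMPLE. Comparing the result across nodes 
might show whether the failing node owns an unusually large partition.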

This is not the first time we have seen this issue. Last time, we fixed it by 
manually stopping some of our aggregation jobs for a whole night to allow the 
node to complete the handoff. We are not sure about the root cause yet, and we 
have no explanation for why this happens to only one node. I investigated the 
issue and found two related Cassandra JIRAs:
https://issues.apache.org/jira/browse/CASSANDRA-8269 and
https://issues.apache.org/jira/browse/CASSANDRA-8723

Both JIRAs mention that this might only be the case with Cassandra 2.x.

Thanks,

Harika Vangapelli