> Nodes are going down due to Out of Memory and we are using 31GB heap size
> in DC1 , however in DC2 (Which serves the traffic) has 16GB heap .
> Why we had to increase heap in DC1 is because , DC1 nodes were going down
> due Out of Memory issue but DC2 nodes never went down .
>
It doesn't sound right that the primary DC is DC2 but DC1 is the one under load. You might not be aware of it, but the symptom suggests DC1 is getting hit with a lot of traffic. If you run netstat (or whatever utility/tool you prefer), you should see the established connections to the cluster. That should give you clues as to where the traffic is coming from.

> We also noticed below kind of messages in system.log
> FailureDetector.java:288 - Not marking nodes down due to local pause of
> 9532654114 > 5000000000
>

That's another smoking gun that the nodes are buried in GC. Those values are in nanoseconds, so that is a 9.5-second pause against a 5-second threshold, which is significant.

The slow hinted handoffs are really the least of your problems right now. If nodes weren't going down, there wouldn't be hints to hand off in the first place.

Cheers!
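
P.S. If it helps with the netstat check, here is a rough Python sketch that tallies established client connections by remote address. It is only an illustration under a few assumptions: Linux `netstat -tn` output, and clients connecting on the default CQL native transport port 9042 (adjust the port if your cluster uses a different one). The names `count_clients` and `report` are made up for this example.

```python
import subprocess
from collections import Counter

# Assumption: clients connect on the default native transport port.
CQL_PORT = "9042"


def count_clients(netstat_output, port=CQL_PORT):
    """Count ESTABLISHED TCP connections to the local `port`, per remote host.

    Expects text in the layout printed by Linux `netstat -tn`:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    clients = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Split on the last colon so IPv6-style addresses still parse.
        if local.rsplit(":", 1)[-1] != port:
            continue
        clients[remote.rsplit(":", 1)[0]] += 1
    return clients


def report(top=10):
    """Run netstat on this node and print the busiest remote hosts."""
    out = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    ).stdout
    for host, n in count_clients(out).most_common(top):
        print(f"{n:6d}  {host}")
```

Calling `report()` on one of the DC1 nodes prints the remote hosts with the most open connections. If application servers show up there, something is pointing at DC1 that you thought was only talking to DC2.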