Thanks everyone for the responses. How can I debug the GC issue further? Are there any known GC issues in 3.11.0?
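So far the only lever I know of is GC logging; a minimal sketch of what I plan to enable, assuming the stock 3.11 jvm.options (these Java 8 flags ship there commented out) and default package-install log paths:

    # conf/jvm.options -- uncomment the GC logging block, then restart the node
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintHeapAtGC
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintPromotionFailure
    -Xloggc:/var/log/cassandra/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=10
    -XX:GCLogFileSize=10M

Beyond that, "nodetool gcstats" (pause statistics since it was last run) and grepping system.log for GCInspector lines look like quick checks that need no restart.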
On Thu, Feb 27, 2020 at 8:46 AM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:

> Our experience with G1GC was that 31gb wasn’t optimal (for us) because
> while you have less frequent full GCs, they are bigger when they do
> happen. But even so, not to the point of a 9.5s full collection.
>
> Unless it is a rare event associated with something weird happening
> outside of the JVM (there are some whacky interactions between memory and
> dirty page writing that could cause it, but not typically), then that is
> evidence of a really tough fight to reclaim memory. There are a lot of
> things that can impact garbage collection performance. Something is either
> being pushed very hard, or something is being constrained very tightly
> compared to resource demand.
>
> I’m with Erick, I wouldn’t be putting my attention right now on anything
> but the GC issue. Everything else that happens within the JVM envelope is
> going to be a misread on timing until you have stable garbage collection.
> You might have other issues later, but you aren’t going to know what those
> are yet.
>
> One thing you could at least try to eliminate quickly as a factor: are
> repairs running at the time that things are slow? Prior to 3.11.5 you lack
> one of the tuning knobs for doing a tradeoff on memory vs network
> bandwidth when doing repairs.
>
> I’d also make sure you have tuned C* to migrate whatever you reasonably
> can to be off-heap.
>
> Another thought for surprise demands on memory. I don’t know if this is
> in 3.11.0; you’ll have to check the C* bash scripts for launching the
> service. The number of malloc arenas hasn’t always been curtailed, and
> that could result in an explosion in memory demand. I just don’t recall
> where in C* version history that was addressed.
>
> *From: *Erick Ramirez <erick.rami...@datastax.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Wednesday, February 26, 2020 at 9:55 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject: *Re: Hints replays very slow in one DC
>
>> Nodes are going down due to Out of Memory and we are using 31GB heap
>> size in DC1; however, DC2 (which serves the traffic) has 16GB heap.
>> The reason we had to increase the heap in DC1 is that DC1 nodes were
>> going down due to the Out of Memory issue, but DC2 nodes never went down.
>
> It doesn't sound right that the primary DC is DC2 but DC1 is under load.
> You might not be aware of it, but the symptom suggests DC1 is getting hit
> with lots of traffic. If you run netstat (or whatever utility/tool of your
> choice), you should see established connections to the cluster. That
> should give you clues as to where it's coming from.
>
>> We also noticed the below kind of messages in system.log:
>>
>>     FailureDetector.java:288 - Not marking nodes down due to local pause of 9532654114 > 5000000000
>
> That's another smoking gun that the nodes are buried in GC. A 9.5-second
> pause is significant. The slow hinted handoffs are really the least of
> your problems right now. If nodes weren't going down, there wouldn't be
> hints to hand off in the first place. Cheers!
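On Reid's repair question, this is what I plan to check with standard nodetool commands when the slowness hits (the 3.11.5 knob he alludes to is, I believe, repair_session_space_in_mb from CASSANDRA-14096, though I may be misremembering):

    nodetool compactionstats    # "Validation" tasks here mean a repair is validating data
    nodetool netstats           # shows active streaming sessions, including repair streams
    nodetool tpstats | egrep -i "antientropy|validation"   # repair pools with active/pending work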
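For the off-heap and malloc-arena suggestions, a sketch of where I'd look. The yaml setting exists in stock 3.11; the value shown is illustrative and worth testing before adopting, and the script path varies by install:

    # cassandra.yaml -- keep memtable contents off-heap
    memtable_allocation_type: offheap_objects

    # check whether the launch scripts already cap glibc malloc arenas
    grep MALLOC_ARENA_MAX /etc/cassandra/cassandra-env.sh
    # if absent, it can be exported in the environment before startup, e.g.
    export MALLOC_ARENA_MAX=4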
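And for Erick's traffic check, a one-liner sketch, assuming clients connect on the default native transport port 9042 (adjust for your ports):

    # count ESTABLISHED client connections per remote IP
    netstat -tan | awk '$4 ~ /:9042$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' \
        | sort | uniq -c | sort -rn

Running it on a few DC1 nodes should show which source IPs dominate, i.e. where the unexpected traffic originates.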