Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66 We upgraded our cluster this weekend and have been having issues with dropped mutations since then. Intensely investigating a single node and toying with settings has revealed that GC stalls don’t make up enough time to explain the 10 seconds of apparent stall that would cause a hangup.
tpstats output typically shows active threads in the low single digits and pending similar or 0. Before a failure, pending MutationStage will skyrocket into 5+ digits. System.log regularly shows the gossiper complaining, then slow log complaints, then logs dropped mutations. For the entire minute of 00:55, the gc logging shows no single pause > .14 seconds and most of them much smaller. Abbreviated GC log after switching to g1gc (problem also exhibited before G1GC): 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young) 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) (young) 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) (young) 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) (young) 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) (young) 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) (young) System log snippet of the pattern I’m seeing: WARN [GossipTasks:1] 2015-10-30 00:55:25,129 Gossiper.java:747 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down) INFO [CompactionExecutor:210] 2015-10-30 00:55:26,006 CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')] WARN [GossipTasks:1] 2015-10-30 00:55:26,230 Gossiper.java:747 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down) WARN [GossipTasks:1] 2015-10-30 00:55:27,330 Gossiper.java:747 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down) WARN [GossipTasks:1] 2015-10-30 00:55:28,431 Gossiper.java:747 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down) WARN [GossipTasks:1] 2015-10-30 00:55:29,531 Gossiper.java:747 - Gossip stage has 10 pending tasks; skipping status check (no nodes will be marked down) INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448 CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451 CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453 CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454 CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455 CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457 CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458 CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459 CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in slow log INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461 CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in slow log WARN [GossipTasks:1] 2015-10-30 00:55:32,697 FailureDetector.java:257 - Not marking nodes down due to local pause of 8668629909 > 5000000000 INFO [CompactionExecutor:212] 2015-10-30 00:55:32,864 ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 (0%) on-heap, 0 (0%) off-heap INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:347 - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit) INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, position=8699402) INFO [ScheduledTasks:1] 2015-10-30 00:55:35,346 MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2015-10-30 00:55:35,350 MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout I’m starting to suspect I’ll have to take regular stack dumps and catch a bad moment, but I figured I’d reach out here before going down that road to see if somebody else had ever experienced a similar kind of pain. -Jeff