you didn’t say what you upgraded from, but if it is 2.0.x, then look at CASSANDRA-9504
If so and you use commitlog_sync: batch Then you probably want to set commitlog_sync_batch_window_in_ms: 1 (or 2) Note I’m only slightly convinced this is the cause because of your READ_REPAIR issues (though if you are dropping a lot of MUTATIONS under load and your machines are overloaded, you’d be doing more READ_REPAIR than usual probably) > On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com> wrote: > > Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66 > > We upgraded our cluster this weekend and have been having issues with dropped > mutations since then. Intensely investigating a single node and toying with > settings has revealed that GC stalls don’t make up enough time to explain the > 10 seconds of apparent stall that would cause a hangup. > > tpstats output typically shows active threads in the low single digits and > pending similar or 0. Before a failure, pending MutationStage will skyrocket > into 5+ digits. System.log regularly shows the gossiper complaining, then > slow log complaints, then logs dropped mutations. > > For the entire minute of 00:55, the gc logging shows no single pause > .14 > seconds and most of them much smaller. Abbreviated GC log after switching to > g1gc (problem also exhibited before G1GC): > > 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) > (young) > 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) > (young) > 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) > (young) > 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) > (young) > 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) > (young) > 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) > (young) > > System log snippet of the pattern I’m seeing: > > WARN [GossipTasks:1] 2015-10-30 00:55:25,129 Gossiper.java:747 - Gossip > stage has 1 pending tasks; skipping status check (no nodes will be marked > down) > INFO [CompactionExecutor:210] 2015-10-30 00:55:26,006 > CompactionTask.java:141 - Compacting > [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')] > WARN [GossipTasks:1] 2015-10-30 00:55:26,230 Gossiper.java:747 - Gossip > stage has 3 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2015-10-30 00:55:27,330 Gossiper.java:747 - Gossip > stage has 5 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2015-10-30 00:55:28,431 Gossiper.java:747 - Gossip > stage has 7 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2015-10-30 00:55:29,531 Gossiper.java:747 - Gossip > stage has 10 pending tasks; skipping status check (no nodes will be marked > down) > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in > slow log > INFO [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461 > CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in > slow log > WARN [GossipTasks:1] 2015-10-30 00:55:32,697 FailureDetector.java:257 - Not > marking nodes down due to local pause of 8668629909 > 5000000000 > INFO [CompactionExecutor:212] 2015-10-30 00:55:32,864 > ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 > (0%) on-heap, 0 (0%) off-heap > INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:347 - > Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, > 1 ops, 0%/0% of on/off-heap limit) > INFO [MemtableFlushWriter:846] 2015-10-30 00:55:32,865 Memtable.java:382 - > Completed flushing > /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db > (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, > position=8699402) > INFO [ScheduledTasks:1] 2015-10-30 00:55:35,346 MessagingService.java:930 - > MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2015-10-30 00:55:35,350 MessagingService.java:930 - > READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout > and 0 for cross node timeout > > I’m starting to suspect I’ll have to take regular stack dumps and catch a bad > moment, but I figured I’d reach out here before going down that road to see > if somebody else had ever experienced a similar kind of pain. > > -Jeff >
smime.p7s
Description: S/MIME cryptographic signature