you didn’t say what you upgraded from, but if it is 2.0.x, then look at 
CASSANDRA-9504

If so and you use
commitlog_sync: batch
Then you probably want to set
commitlog_sync_batch_window_in_ms: 1 (or 2)
Note I’m only slightly convinced this is the cause because of your READ_REPAIR 
issues (though if you are dropping a lot of MUTATIONS under load and your 
machines are overloaded, you’d be doing more READ_REPAIR than usual probably)

> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com> wrote:
> 
> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
> 
> We upgraded our cluster this weekend and have been having issues with dropped 
> mutations since then. Intensely investigating a single node and toying with 
> settings has revealed that GC stalls don’t make up enough time to explain the 
> 10 seconds of apparent stall that would cause a hangup.
> 
> tpstats output typically shows active threads in the low single digits and 
> pending similar or 0. Before a failure, pending MutationStage will skyrocket 
> into 5+ digits. System.log regularly shows the gossiper complaining, then 
> slow log complaints, then logs dropped mutations.
> 
> For the entire minute of 00:55, the gc logging shows no single pause > .14 
> seconds and most of them much smaller. Abbreviated GC log after switching to 
> g1gc (problem also exhibited before G1GC):
> 
> 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) 
> (young)
> 
> System log snippet of the pattern I’m seeing:
> 
> WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip 
> stage has 1 pending tasks; skipping status check (no nodes will be marked 
> down)
> INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  
> CompactionTask.java:141 - Compacting 
> [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
> WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip 
> stage has 3 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip 
> stage has 5 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip 
> stage has 7 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip 
> stage has 10 pending tasks; skipping status check (no nodes will be marked 
> down)
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in 
> slow log
> WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 - Not 
> marking nodes down due to local pause of 8668629909 > 5000000000
> INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864  
> ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 
> (0%) on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347 - 
> Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 
> 1 ops, 0%/0% of on/off-heap limit)
> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382 - 
> Completed flushing 
> /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db
>  (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, 
> position=8699402)
> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346  MessagingService.java:930 - 
> MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout 
> and 0 for cross node timeout
> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350  MessagingService.java:930 - 
> READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout 
> and 0 for cross node timeout
> 
> I’m starting to suspect I’ll have to take regular stack dumps and catch a bad 
> moment, but I figured I’d reach out here before going down that road to see 
> if somebody else had ever experienced a similar kind of pain.
> 
> -Jeff
> 

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to