Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66

We upgraded our cluster this weekend and have been seeing dropped mutations
since then. Intensive investigation of a single node, plus experimenting with
settings, has shown that GC stalls don't add up to enough time to explain the
~10 seconds of apparent stall that would cause a hangup.

tpstats output typically shows active threads in the low single digits, with
pending counts similar or zero. Before a failure, pending MutationStage
skyrockets into five or more digits. system.log repeatedly shows the same
sequence: the gossiper complaining, then slow-log entries, then dropped
mutations.
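For correlating the pending spike with the log timestamps, a quick sampling loop along these lines works (the column order name/active/pending is assumed from 2.1's tpstats layout, and the sample line below is a stand-in for real nodetool output):

```shell
# Stand-in for: nodetool tpstats | grep '^MutationStage'
sample_tpstats_line() {
  echo "MutationStage                     4         31402     123456789         0         0"
}

# Pull Active ($2) and Pending ($3) out of the MutationStage row.
set -- $(sample_tpstats_line)
echo "active=$2 pending=$3"

# Real use, logged once a second with a UTC timestamp:
#   while sleep 1; do
#     echo "$(date -u +%T) $(nodetool tpstats | grep '^MutationStage')"
#   done >> /tmp/mutation_pending.log
```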

For the entire minute of 00:55, the GC log shows no single pause over 0.14
seconds, and most are much smaller. Abbreviated GC log after switching to
G1GC (the problem also occurred before the switch):

2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) (young)
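To rule GC out more rigorously than eyeballing, the pauses can be totaled per minute from the full (untruncated) log lines. This assumes the usual G1 format where each pause line ends in `, N.NNNNNNN secs]`; the heredoc sample (with made-up durations) stands in for the real gc.log:

```shell
# Sample of full G1 pause lines (durations invented for illustration).
cat > /tmp/gc_sample.log <<'EOF'
2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young), 0.1380000 secs]
2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young), 0.0520000 secs]
2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young), 0.0910000 secs]
EOF

# Group by HH:MM from the ISO timestamp and total the pause durations.
summary=$(awk -F'[T:]' '/GC pause/ {
  minute = $2 ":" $3
  match($0, /, [0-9.]+ secs\]/)
  secs = substr($0, RSTART + 2, RLENGTH - 8)
  total[minute] += secs
}
END { for (m in total) printf "%s -> %.3f s total pause\n", m, total[m] }' /tmp/gc_sample.log)
echo "$summary"
```

If the per-minute totals stay well under the ~10 s gap, GC alone can't be the stall.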

System log snippet of the pattern I’m seeing:

WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  CompactionTask.java:141 - Compacting
[SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
 SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
 SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
 SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
 SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip stage has 10 pending tasks; skipping status check (no nodes will be marked down)
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457  CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458  CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459  CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461  CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in slow log
WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 - Not marking nodes down due to local pause of 8668629909 > 5000000000
INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864  ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347 - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, position=8699402)
INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346  MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout
INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350  MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout
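One detail from that snippet: the FailureDetector values appear to be nanoseconds (5000000000 would be the default 5 s local-pause threshold), so it noticed a pause of roughly 8.7 s — consistent with the stalls above and not accounted for by any single GC pause:

```shell
# 8668629909 ns observed vs the 5000000000 ns (5 s) threshold.
pause=$(awk 'BEGIN { printf "%.3f", 8668629909 / 1e9 }')
echo "local pause = ${pause} s"
```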

I'm starting to suspect I'll have to take regular stack dumps and catch a bad
moment in the act, but I figured I'd ask here before going down that road, in
case somebody else has run into a similar kind of pain.
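For the stack-dump route, the plan would be something like the sketch below. The helper name is made up, and `jstack -l <pid>` is an assumed dump command (any command that writes a thread dump to stdout would do):

```shell
# Take $2 thread dumps, $3 seconds apart, using whatever dump command is
# passed as $1 (e.g. "jstack -l <cassandra pid>").
take_dumps() {
  i=0
  while [ "$i" -lt "$2" ]; do
    $1 > "/tmp/stack.$i.txt"
    i=$((i + 1))
    sleep "$3"
  done
}

# Real use (assuming Cassandra's pid is in $CASS_PID and jstack comes
# from the same JDK 8 install):
#   take_dumps "jstack -l $CASS_PID" 120 5
```

With a 5 s interval, a 10 s stall should be caught mid-flight by at least one dump.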

-Jeff
