Hi all, I am looking into an issue we ran into last night with a single node in our three-node 2.0.6 cluster. The top-level symptoms were timed-out writes and high read and write latency.
Looking into it more, the node experienced all of the following during a two-hour window, from which it eventually recovered on its own.

** "Gossip stage" pending tasks **

WARN [GossipTasks:1] 2014-04-23 18:51:36,231 Gossiper.java (line 612) Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2014-04-23 18:52:36,910 Gossiper.java (line 612) Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2014-04-23 18:52:47,886 Gossiper.java (line 612) Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2014-04-23 18:53:15,094 Gossiper.java (line 612) Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)

The strange thing here is that it never showed as pending in the tpstats logged by StatusLogger:

INFO [ScheduledTasks:1] 2014-04-23 18:56:06,581 StatusLogger.java (line 70) GossipStage 0 0 9065668 0 0

** High CPU - ~50%-60% on these dual hexa-core boxes is pretty crazy. Normal is barely moving the needle at 3%.

** High level of ParNew collections - Likely the cause of the CPU usage, considering it was running these ParNew collections every couple hundred ms.
CMS gen seemed OK at 4GB of 6GB, with not much remaining after the ParNew collection:

"Heap after GC invocations=151586 (full 137): par new generation total 1887488K, used 147K"

** Backed up mutations in the MutationStage of tpstats, and dropped messages:

2014-04-23 18:56:06,579 MessagingService.java (line 841) 210 MUTATION messages dropped in last 5000ms
2014-04-23 18:56:06,579 MessagingService.java (line 841) 12 READ_REPAIR messages dropped in last 5000ms
2014-04-23 18:56:06,579 Pool Name               Active   Pending   Completed   Blocked   All Time Blocked
2014-04-23 18:56:06,580 ReadStage                    4        10   398908067         0                  0
2014-04-23 18:56:06,580 RequestResponseStage         0         0   178297428         0                  0
2014-04-23 18:56:06,581 ReadRepairStage              0         0    33509717         0                  0
2014-04-23 18:56:06,581 MutationStage               96     12708   107009834         0                  0
2014-04-23 18:56:06,581 ReplicateOnWriteStage        0         0           0         0                  0
2014-04-23 18:56:06,581 GossipStage                  0         0     9065668         0                  0
2014-04-23 18:56:06,582 AntiEntropyStage             0         0     1413264         0                  0
2014-04-23 18:56:06,582 MigrationStage               0         0          37         0                  0
2014-04-23 18:56:06,582 MemtablePostFlusher          0         0      546841         0                  0
2014-04-23 18:56:06,582 MemoryMeter                  0         0         234         0                  0
2014-04-23 18:56:06,583 FlushWriter                  0         0      165232         0                 12
2014-04-23 18:56:06,583 MiscStage                    0         0      360672         0                  0
2014-04-23 18:56:06,583 PendingRangeCalculator       0         0           5         0                  0
2014-04-23 18:56:06,583 commitlog_archiver           0         0           0         0                  0
2014-04-23 18:56:06,584 InternalResponseStage        0         0      358384         0                  0
2014-04-23 18:56:06,584 AntiEntropySessions          0         0       78366         0                  0
2014-04-23 18:56:06,584 HintedHandoff                0         0          28         0                  0
2014-04-23 18:56:06,585 CompactionManager            0         0
2014-04-23 18:56:06,585 Commitlog                  n/a         0
2014-04-23 18:56:06,585 MessagingService           n/a       0/0

Any ideas, anyone? Could it all have been caused by the backed-up gossip tasks? Would that somehow also cause the MutationStage backup? I find it really strange that the GossipTasks logger kept saying gossip tasks were pending, but they never showed up in the tpstats output from StatusLogger.

Thanks in advance for any insight,
Thunder
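
For anyone wanting to check their own system.log for the same pattern, here is a rough sketch of how StatusLogger rows like the ones above could be scanned for stages with a pending backlog. The function name backed_up_pools, the regex, and the threshold of 100 pending tasks are illustrative assumptions, not part of Cassandra; rows that lack all five numeric columns (CompactionManager, Commitlog, MessagingService) are simply skipped.

```python
import re

# Matches a StatusLogger thread-pool row that has all five numeric columns:
# pool name, active, pending, completed, blocked, all-time blocked.
# (Hypothetical pattern written against the rows quoted in this post.)
ROW = re.compile(
    r"(?P<pool>[A-Za-z_]+)\s+(?P<active>\d+)\s+(?P<pending>\d+)\s+"
    r"(?P<completed>\d+)\s+(?P<blocked>\d+)\s+(?P<all_time_blocked>\d+)\s*$"
)

def backed_up_pools(lines, pending_threshold=100):
    """Return (pool, active, pending) for rows whose pending count exceeds the threshold."""
    flagged = []
    for line in lines:
        match = ROW.search(line)
        if match is None:
            continue  # header, n/a rows, or rows with fewer columns
        pending = int(match.group("pending"))
        if pending > pending_threshold:
            flagged.append((match.group("pool"), int(match.group("active")), pending))
    return flagged

# Rows copied from the log excerpt above.
sample = [
    "2014-04-23 18:56:06,580 ReadStage 4 10 398908067 0 0",
    "2014-04-23 18:56:06,581 MutationStage 96 12708 107009834 0 0",
    "2014-04-23 18:56:06,581 GossipStage 0 0 9065668 0 0",
    "2014-04-23 18:56:06,585 CompactionManager 0 0",
]
print(backed_up_pools(sample))
```

On the excerpt above, only MutationStage crosses that (arbitrary) threshold; GossipStage shows zero pending in this snapshot, which matches the discrepancy described in the post.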