Hi,
We have an 11/11 node cluster (11 nodes in each of two data centers) running Cassandra 2.1.15.
We are observing that 3 nodes from each data center become unresponsive for short periods of time.
This behavior is limited to these 6 nodes (3 per data center), and on them we see a lot of "Gossip stage has pending tasks" warnings as well as periodic commit log syncer warnings. This results in a huge number of dropped mutations, and these nodes also show high CPU usage compared to the other nodes.
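(For reference, the pending GossipStage tasks and the dropped mutation counters can also be seen with nodetool; a minimal check, assuming nodetool is on the PATH and with the output layout recalled from 2.1:)
*-bash-4.1$ nodetool tpstats | grep -E 'GossipStage|MUTATION'*
GossipStage shows the active/pending task counts, and the MUTATION row in the dropped-message section at the end shows the drop counter.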
Log entries from system.log:
WARN [GossipTasks:1] 2017-01-18 10:15:51,328 Gossiper.java:748 - Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:52,428 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:53,529 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:54,629 Gossiper.java:748 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:55,730 Gossiper.java:748 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:56,830 Gossiper.java:748 - Gossip stage has 9 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:57,930 Gossiper.java:748 - Gossip stage has 12 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:59,031 Gossiper.java:748 - Gossip stage has 15 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:16:00,131 Gossiper.java:748 - Gossip stage has 16 pending tasks; skipping status check (no nodes will be marked down)
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2017-01-18 06:35:44,186 AbstractCommitLogService.java:105 - Out of 33 commit log syncs over the past 309s with average duration of 9506.03ms, 13 have exceeded the configured commit interval by an average of 8676.77ms
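(For context, we believe this warning relates to the periodic commit log sync settings in cassandra.yaml; the path below is an assumption for our install, and the 2.1 defaults are commitlog_sync: periodic with commitlog_sync_period_in_ms: 10000:)
*-bash-4.1$ grep -E '^commitlog_sync|^commitlog_directory' /etc/cassandra/conf/cassandra.yaml*
commitlog_directory is included to confirm which disk the commit log is writing to, since the sync durations above look IO-bound.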
*Observation:*
The 6 nodes having issues are running Red Hat 6.8, while all other nodes in the cluster are on 6.7. The nodes on 6.7 are not showing any gossip issues, mutation drops, high CPU, or IO problems.
Also, the IO stats on these nodes are bad compared to the other nodes, and we see threads blocked in HintedHandoff.
*-bash-4.1$ sar 2 7*
10:29:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:29:53 AM     all     36.83      4.86      5.51      4.65      0.00     48.15
10:29:55 AM     all     41.09      4.98      6.68      2.28      0.00     44.97
10:29:57 AM     all     37.53      4.71      5.52      2.04      0.00     50.21
10:29:59 AM     all     35.82      4.76      4.97      2.14      0.00     52.31
10:30:01 AM     all     33.90      3.53      3.82      2.62      0.00     56.13
10:30:03 AM     all     31.78      2.64      3.53      4.65      0.00     57.39
10:30:05 AM     all     33.27      3.90      3.95      2.76      0.00     56.12
Average:        all     35.74      4.20      4.85      3.02      0.00     52.19
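(Since the IO on these nodes looks worse than on the others, per-device stats can be captured alongside the CPU sample above; a sketch, assuming the sysstat tools are installed, and device names will differ per host:)
*-bash-4.1$ iostat -x 2 7*
*-bash-4.1$ sar -d -p 2 7*
iostat -x reports extended per-device stats (await, %util), and sar -d -p shows the same interval per block device with readable names.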
*Threads are also blocked in HintedHandoff:*
"HintedHandoff:1" daemon prio=1 os_prio=4 tid=0x440 nid=0x44 [ JVM locked
by VM (w/poll advisory bit) acquiring VM lock
'org.apache.cassandra.db.HintedHandOffManager', polling bits: safep rstak
gpgc_clean_new ]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:269)
at
org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:563)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
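(Given that the blocked thread is in HintedHandOffManager.compact, the size and compaction activity of the system.hints table may be relevant; it can be checked with the standard nodetool commands below, output shape recalled from 2.1:)
*-bash-4.1$ nodetool cfstats system.hints*
*-bash-4.1$ nodetool compactionstats*
cfstats shows the on-disk size and SSTable count for the hints table, and compactionstats shows whether hint compactions are piling up.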
We are working with our system team to see whether there is an underlying IO issue and whether this is associated with the known futex_wait kernel problem.
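(For reference, the OS and kernel split between the good and bad nodes can be confirmed per node with:)
*-bash-4.1$ cat /etc/redhat-release; uname -r*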
Any other pointers to fix this issue would be great.
Thanks,
Sermandurai.