Hi,
We have an 11/11 node cluster (11 nodes in each of two data centers) running Cassandra 2.1.15.
We are observing that 3 nodes from each data center become unresponsive for short periods of time.
This behavior is limited to these 6 nodes (3 per data center), and on them we see a lot of "Gossip stage has pending tasks" warnings as well as periodic commit log syncer warnings. This results in a huge number of dropped mutations, and these nodes also show high CPU usage compared to the other nodes.
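(For reference, the pending GossipStage tasks and the dropped mutation counters can also be seen with nodetool; a minimal check, assuming nodetool is on the PATH and with the output layout recalled from 2.1:)
*-bash-4.1$ nodetool tpstats | grep -E 'GossipStage|MUTATION'*
GossipStage shows the active/pending task counts, and the MUTATION row in the dropped-message section at the end shows the drop counter.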
Log entries from system.log:
WARN [GossipTasks:1] 2017-01-18 10:15:51,328 Gossiper.java:748 - Gossip stage has 2 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:52,428 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:53,529 Gossiper.java:748 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:54,629 Gossiper.java:748 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:55,730 Gossiper.java:748 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:56,830 Gossiper.java:748 - Gossip stage has 9 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:57,930 Gossiper.java:748 - Gossip stage has 12 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:15:59,031 Gossiper.java:748 - Gossip stage has 15 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2017-01-18 10:16:00,131 Gossiper.java:748 - Gossip stage has 16 pending tasks; skipping status check (no nodes will be marked down)
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2017-01-18 06:35:44,186 AbstractCommitLogService.java:105 - Out of 33 commit log syncs over the past 309s with average duration of 9506.03ms, 13 have exceeded the configured commit interval by an average of 8676.77ms
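(For context, we believe this warning relates to the periodic commit log sync settings in cassandra.yaml; the path below is an assumption for our install, and the 2.1 defaults are commitlog_sync: periodic with commitlog_sync_period_in_ms: 10000:)
*-bash-4.1$ grep -E '^commitlog_sync|^commitlog_directory' /etc/cassandra/conf/cassandra.yaml*
commitlog_directory is included to confirm which disk the commit log is writing to, since the sync durations above look IO-bound.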
*Observation:*
The 6 nodes having issues are running Red Hat 6.8, while all other nodes in the cluster are on 6.7. The nodes on 6.7 are not showing any gossip issues, mutation drops, high CPU, or IO problems.
Also, the IO stats on these nodes are bad compared to the other nodes, and we see threads blocked in HintedHandoff.
*-bash-4.1$ sar 2 7*
10:29:51 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:29:53 AM     all     36.83      4.86      5.51      4.65      0.00     48.15
10:29:55 AM     all     41.09      4.98      6.68      2.28      0.00     44.97
10:29:57 AM     all     37.53      4.71      5.52      2.04      0.00     50.21
10:29:59 AM     all     35.82      4.76      4.97      2.14      0.00     52.31
10:30:01 AM     all     33.90      3.53      3.82      2.62      0.00     56.13
10:30:03 AM     all     31.78      2.64      3.53      4.65      0.00     57.39
10:30:05 AM     all     33.27      3.90      3.95      2.76      0.00     56.12
Average:        all     35.74      4.20      4.85      3.02      0.00     52.19
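(Since the IO on these nodes looks worse than on the others, per-device stats can be captured alongside the CPU sample above; a sketch, assuming the sysstat tools are installed, and device names will differ per host:)
*-bash-4.1$ iostat -x 2 7*
*-bash-4.1$ sar -d -p 2 7*
iostat -x reports extended per-device stats (await, %util), and sar -d -p shows the same interval per block device with readable names.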
*Threads are also blocked in HintedHandoff:*
"HintedHandoff:1" daemon prio=1 os_prio=4 tid=0x440 nid=0x44 [ JVM locked
by VM (w/poll advisory bit) acquiring VM lock
'org.apache.cassandra.db.HintedHandOffManager', polling bits: safep rstak
gpgc_clean_new ]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:269)
at
org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:563)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
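(Given that the blocked thread is in HintedHandOffManager.compact, the size and compaction activity of the system.hints table may be relevant; it can be checked with the standard nodetool commands below, output shape recalled from 2.1:)
*-bash-4.1$ nodetool cfstats system.hints*
*-bash-4.1$ nodetool compactionstats*
cfstats shows the on-disk size and SSTable count for the hints table, and compactionstats shows whether hint compactions are piling up.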
We are working with our system team to see whether there is an underlying IO issue and whether this is associated with the known futex_wait kernel problem.
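(For reference, the OS and kernel split between the good and bad nodes can be confirmed per node with:)
*-bash-4.1$ cat /etc/redhat-release; uname -r*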
Any other pointers to fix this issue would be great.
Thanks,
Sermandurai.