Hey
We have a big cluster (>500 nodes, on-prem, multiple datacenters, most nodes
with vnodes=32, but some with 128) that was recently upgraded from 3.11.9 to
4.0.4. Servers are all CentOS 7.

We have been dealing with a few issues related to gossip since:
1 - The moment the last node in the cluster came up on 4.0.4, and all
nodes were on the same version, gossip pending tasks started to climb to
very high numbers (>1M) on all nodes in the cluster, and the cluster was
quickly practically down. It took us a few hours of stopping/starting
nodes, and adding more nodes to the seed list, to finally get the cluster
back up.
2 - We notice pending gossip tasks going up to very high numbers (50k)
on random nodes in the cluster, without any meaningful triggering event,
and it doesn't look like they will go down on their own. After a few
hours we restart those nodes and the count goes back to 0.
3 - Doing a rolling restart of a list of servers is now an issue: more
often than not, one of the nodes we restart comes up with gossip issues,
and we need a 2nd restart to get the gossip pending tasks to 0.

Is there a known issue related to gossip in big clusters, in recent
versions?
Is there any tuning that can be done?

Just to give a sense of how big the gossip information in this cluster
is: the "nodetool gossipinfo" output size is ~300 KB.
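
In case it helps anyone reproduce the numbers above, here is a rough sketch of one way to read the pending gossip count on a node. It assumes the count is taken from the GossipStage row of "nodetool tpstats" output, with the usual column order (Pool Name, Active, Pending, ...). The function names and the 50k threshold (taken from issue 2) are just illustrative, not part of any tool.

```python
from typing import Optional


def gossip_pending(tpstats_output: str) -> Optional[int]:
    """Return the Pending value of the GossipStage row, or None if not found.

    Assumes each thread-pool row looks like:
        <pool name> <active> <pending> <completed> ...
    """
    for line in tpstats_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "GossipStage":
            try:
                return int(fields[2])  # third column is assumed to be Pending
            except ValueError:
                return None
    return None


def is_stuck(pending: Optional[int], threshold: int = 50_000) -> bool:
    """Flag a node whose pending gossip count is at or above the threshold."""
    return pending is not None and pending >= threshold
```

The text passed in would be whatever "nodetool tpstats" prints on the node, for example captured with subprocess.run(["nodetool", "tpstats"], capture_output=True, text=True).stdout.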

gil
