Hey,

We have a big cluster (>500 nodes, on-prem, multiple datacenters, most nodes with vnodes=32 but some with 128) that was recently upgraded from 3.11.9 to 4.0.4. The servers are all CentOS 7.
We have been dealing with a few gossip-related issues since the upgrade:

1 - The moment the last node in the cluster came up on 4.0.4 and all nodes were on the same version, gossip pending tasks started climbing to very high numbers (>1M) on all nodes, and the cluster was quickly practically down. It took us a few hours of stopping and starting nodes, and adding more nodes to the seed list, to finally get the cluster back up.

2 - Pending gossip tasks go up to very high numbers (50k) on random nodes in the cluster, without any meaningful event having happened, and they do not look like they will go down on their own. After a few hours we restart those nodes and the count goes back to 0.

3 - Doing a rolling restart of a list of servers is now an issue. More often than not, one of the restarted nodes comes up with gossip issues, and we need a second restart to get its gossip pending tasks back to 0.

Is there a known issue related to gossip in big clusters in recent versions? Is there any tuning that can be done?

Just to give a sense of how big the gossip information in this cluster is, the "nodetool gossipinfo" output is ~300 KB.

gil