Hi Gil, thanks for reaching out.

Can you check Cassandra's logs to see if any uncaught exceptions are being thrown? What you
described suggests the possibility of an uncaught exception being thrown in the Gossiper thread,
preventing further tasks from making progress; however, I'm not aware of any open issues in 4.0.4
that would result in this.

Would be eager to investigate immediately if so.

– Scott

On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:

Hey

We have a big cluster (>500
nodes, on-prem, multiple datacenters, most with vnodes=32, but some with 128) that was recently
upgraded from 3.11.9 to 4.0.4. Servers are all CentOS 7. We have been dealing with a few issues
related to gossip since:

1 - The moment the last node in the cluster was up with 4.0.4, and all nodes were on the same
version, gossip pending tasks started to climb to very high numbers (>1M) on all nodes in the
cluster, and quickly the cluster was practically down. It took us a few hours of stopping/starting
nodes, and adding more nodes to the seed list, to finally get the cluster back up.

2 - We notice that pending gossip tasks go up to very high numbers (50k) on random nodes in the
cluster, without any meaningful event having happened, and it doesn't look like the count will go
down on its own. After a few hours we restart those nodes and it goes back to 0.

3 - Doing a rolling restart of a list of servers is now an issue: more often than not, one of the
nodes we restart comes up with gossip issues, and we need a 2nd restart to get the gossip pending
tasks to 0.

Is there a known issue related to gossip in big clusters in recent versions?
Is there any tuning that can be done?

Just to give a sense of how big the gossip information in this cluster is, "nodetool gossipinfo"
output size is ~300 KB.

gil
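
For anyone wanting to track the pending gossip task counts described in this thread, here is a minimal sketch. It assumes the numbers quoted are the Pending column of the GossipStage pool in "nodetool tpstats" output, which is not stated in the messages. The function names and the sample text are hypothetical, and the sample numbers are made up for illustration rather than taken from the cluster.

```python
import subprocess
from typing import Optional


def gossip_pending(tpstats_output: str, pool: str = "GossipStage") -> Optional[int]:
    """Return the Pending count for `pool`, or None if the pool is not listed."""
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Thread pool rows look like: <pool> <active> <pending> <completed> ...
        if len(fields) >= 3 and fields[0] == pool and fields[2].isdigit():
            return int(fields[2])
    return None


def read_tpstats() -> str:
    """Run `nodetool tpstats` on the local node (nodetool must be on PATH)."""
    result = subprocess.run(
        ["nodetool", "tpstats"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative sample only; these numbers are invented, not real cluster data.
SAMPLE = """\
Pool Name                      Active   Pending      Completed   Blocked  All time blocked
ReadStage                           0         0          10234         0                 0
GossipStage                         1     50000         987654         0                 0
"""

if __name__ == "__main__":
    print(gossip_pending(SAMPLE))
```

On a live node, gossip_pending(read_tpstats()) could be polled periodically and logged, so that a climbing count is noticed before a restart becomes necessary.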