Hi Gil, thanks for reaching out.

Can you check Cassandra's logs to see if any uncaught exceptions are being thrown? What you
described suggests the possibility of an uncaught exception being thrown in the Gossiper thread,
preventing further tasks from making progress; however, I'm not aware of any open issues in 4.0.4
that would result in this.

Would be eager to investigate immediately if so.

– Scott

On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:

Hey

We have a big cluster (>500
nodes, on-prem, multiple datacenters, most with vnodes=32, but some with 128), that was recently
upgraded from 3.11.9 to 4.0.4. Servers are all CentOS 7. We have been dealing with a few issues
related to gossip since:

1 - The moment the last node in the cluster was up with 4.0.4, and all nodes were on the same
version, gossip pending tasks started to climb to very high numbers (>1M) on all nodes in the
cluster, and quickly the cluster was practically down. It took us a few hours of stopping/starting
nodes, and adding more nodes to the seed list, to finally get the cluster back up.

2 - We notice that pending gossip tasks go up to very high numbers (50k) on random nodes in the
cluster, without any meaningful event having happened, and it doesn't look like it will go down on
its own. After a few hours we restart those nodes and it goes back to 0.

3 - Doing a rolling restart of a list of servers is now an issue. More often than not, one of the
nodes we restart comes up with gossip issues, and we need a 2nd restart to get the gossip pending
tasks to 0.

Is there a known issue related to gossip in big clusters, in recent versions?
Is there any tuning that can be done?

Just to give a sense of how big the gossip information in this cluster is, "nodetool gossipinfo"
output size is ~300 KB.

gil
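
For anyone following the thread, here is a rough sketch of the log check suggested above: filter a Cassandra log for ERROR lines emitted from a gossip thread. It assumes log lines start with "LEVEL [thread-name]" and that gossip threads are named with a "GossipStage" prefix; both are assumptions about the logging layout and should be adjusted to match the local logback configuration. The function names are made up for illustration.

```python
import re
from typing import Iterable, List

# Assumed line layout: "LEVEL [thread-name] rest of line".
# Adjust the pattern if the local logback pattern differs.
LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<rest>.*)$")


def find_gossip_errors(lines: Iterable[str]) -> List[str]:
    """Return ERROR lines whose thread name starts with 'GossipStage' (assumed prefix)."""
    hits = []
    for line in lines:
        text = line.rstrip("\n")
        m = LINE_RE.match(text)
        if m is None:
            continue  # stack-trace continuation lines and the like
        if m.group("level") == "ERROR" and m.group("thread").startswith("GossipStage"):
            hits.append(text)
    return hits


def scan_file(path: str) -> List[str]:
    """Scan one log file; the path is whatever the node's log location is."""
    with open(path, errors="replace") as f:
        return find_gossip_errors(f)
```

Calling scan_file() on each node's system log and printing the result gives a quick first pass; any hit is worth reading in full together with the stack trace that follows it in the log.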
