You should drain nodes before stopping the daemon whenever possible. This avoids commitlog replay on startup, which can take a while. But according to your description, commitlog replay does not seem to be the cause here.
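If you want to verify whether replay is what is slowing things down, something like the following works (the log path assumes a default package install, adjust for your layout):

    nodetool drain
    service cassandra stop
    service cassandra start
    # then check the startup log for commitlog replay activity:
    grep -i "replay" /var/log/cassandra/system.log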
I once had a similar effect: some nodes appeared down to some nodes and up to others. At that time the cluster had overall stability problems due to some bugs. After those bugs were fixed, I haven't seen this effect again. If it happens to you again, you could check your logs or "nodetool tpstats" for dropped messages (a quick sketch of the commands is at the end of this mail), watch out for suspicious network-related log entries, and keep an eye on the load of your nodes in general.

2017-03-01 17:36 GMT+01:00 Ben Dalling <b.dall...@locp.co.uk>:

> Hi Andrew,
>
> We were having problems with gossip TCP connections being held open and
> changed our SOP for stopping cassandra to:
>
> nodetool disablegossip
> nodetool drain
> service cassandra stop
>
> This seemed to close down gossip cleanly (the nodetool drain is advised
> as well) and meant that the node rejoined the cluster fine after issuing
> "service cassandra start".
>
> *Ben*
>
> On 1 March 2017 at 16:29, Andrew Jorgensen <and...@andrewjorgensen.com> wrote:
>
>> Hello,
>>
>> I have a cassandra cluster running on cassandra 3.0.3 and am seeing some
>> strange behavior that I cannot explain when restarting cassandra nodes.
>> The cluster is currently set up in a single datacenter and consists of 55
>> nodes. I am currently in the process of restarting nodes in the cluster,
>> but have noticed that after restarting the cassandra process with
>> `service cassandra stop; service cassandra start`, when the node comes
>> back and I run `nodetool status`, there is usually a non-zero number of
>> nodes in the rest of the cluster that are marked as DN. If I go to
>> another node in the cluster, from its perspective all nodes, including
>> the restarted one, are marked as UN. It seems to take ~15 to 20 minutes
>> before the restarted node is updated to show all nodes as UN. During
>> those 15 minutes, writes and reads to the cluster appear to be degraded
>> and do not recover unless I stop the cassandra process again or wait for
>> all nodes to be marked as UN. The cluster also has 3 seed nodes, which
>> are up and available the whole time during this process.
>>
>> I have also tried running `nodetool gossipinfo` on the restarted node,
>> and according to the output all nodes have a status of NORMAL. Has anyone
>> seen this before, and is there anything I can do to fix/reduce the impact
>> of restarting a cassandra node?
>>
>> Thanks,
>> Andrew Jorgensen
>> @ajorgensen
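P.S. The checks I mentioned above, roughly (again assuming the default package log location):

    nodetool tpstats   # look at the "Dropped" message counts at the bottom of the output
    grep -iE "dropped|timeout|gossip" /var/log/cassandra/system.log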