Drain takes care of stopping gossip, and it does a few other tasks before stopping gossip (it stops the batchlog, hints, auth, cache saver, and a few other things). I'm not sure why this causes a side effect when you restart the node, but there should be no need to issue a disablegossip anyway - just leave that to the drain. As Jeff said, we need to fix drain, because this chain of commands should be unnecessary.
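For what it's worth, something like `nodetool disablebinary && nodetool disablethrift && nodetool drain && sudo service cassandra restart` should be enough on its own (the last command obviously depends on how Cassandra is managed on your hosts), with drain handling gossip shutdown itself.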
On 12 February 2018 at 18:36, Mike Torra <mto...@salesforce.com> wrote:

> Interestingly, it seems that changing the order of steps I take during the node restart resolves the problem. Instead of:
>
> `nodetool disablebinary && nodetool disablethrift && *nodetool disablegossip* && nodetool drain && sudo service cassandra restart`,
>
> if I do:
>
> `nodetool disablebinary && nodetool disablethrift && nodetool drain && *nodetool disablegossip* && sudo service cassandra restart`,
>
> I see no application errors, no latency, and no nodes marked as Down/Normal on the restarted node. Note the only thing I changed is that I moved `nodetool disablegossip` to after `nodetool drain`. This is pretty anecdotal, but is there any explanation for why this might happen? I'll be monitoring my cluster closely to see if this change does indeed fix the problem.
>
> On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra <mto...@salesforce.com> wrote:
>
>> Any other ideas? If I simply stop the node, there is no latency problem, but once I start the node the problem appears. This happens consistently for all nodes in the cluster.
>>
>> On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra <mto...@salesforce.com> wrote:
>>
>>> No, I am not
>>>
>>> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Are you using internode ssl?
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>> On Feb 7, 2018, at 8:24 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>>
>>>> Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy, and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the implementation.
>>>>
>>>> It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that.
>>>>
>>>> One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
>>>>
>>>> As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but nodes being seen as down seems like an issue.
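On the driver side, it might also be worth pinning the load balancing policy explicitly rather than relying on the default, just to rule the client out. A minimal sketch with the Node.js driver 3.x - the contact points, keyspace, and DC name below are placeholders, not your real values:

    const cassandra = require('cassandra-driver');
    const { TokenAwarePolicy, DCAwareRoundRobinPolicy } = cassandra.policies.loadBalancing;

    // Placeholder contact points, keyspace, and DC name - replace with your own.
    const client = new cassandra.Client({
      contactPoints: ['10.0.0.1', '10.0.0.2'],
      keyspace: 'my_keyspace',
      policies: {
        // Token-aware routing on top of DC-aware round robin, pinned to the local DC.
        loadBalancing: new TokenAwarePolicy(new DCAwareRoundRobinPolicy('my_local_dc'))
      }
    });

With the policy set explicitly, you can at least rule out the client falling back to something other than TokenAware over DCAwareRoundRobin while the node is bouncing.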
>>>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> Unless you abbreviated, your data model is questionable (SELECT without any equality in the WHERE clause on the partition key will always cause a range scan, which is super inefficient). Since you're doing LOCAL_ONE and a range scan, timeouts sorta make sense - the owner of at least one range would be down for a bit.
>>>>>
>>>>> If you actually have a partition key in your where clause, then the next most likely guess is your clients aren't smart enough to route around the node as it restarts, or your key cache is getting cold during the bounce. Double check your driver's load balancing policy.
>>>>>
>>>>> It's also likely the case that speculative retry may help other nodes route around the bouncing instance better - if you're not using it, you probably should be (though with CL: LOCAL_ONE, it seems like it'd be less of an issue).
>>>>>
>>>>> We need to make bouncing nodes easier (or rather, we need to make drain do the right thing), but in this case, your data model looks like the biggest culprit (unless it's an incomplete recreation).
>>>>>
>>>>> - Jeff
>>>>>
>>>>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
>>>>>>
>>>>>> I restart a node like this:
>>>>>>
>>>>>> nodetool disablethrift && nodetool disablegossip && nodetool drain
>>>>>> sudo service cassandra restart
>>>>>>
>>>>>> When I do that, I very often get timeouts and errors like this in my nodejs app:
>>>>>>
>>>>>> Error: Cannot achieve consistency level LOCAL_ONE
>>>>>>
>>>>>> My queries are all pretty much the same, things like: "select * from history where ts > {current_time}"
>>>>>>
>>>>>> The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
>>>>>>
>>>>>> I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs.
>>>>>>
>>>>>> What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
>>>>>>
>>>>>> I appreciate any suggestions or advice.
>>>>>>
>>>>>> - Mike
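To illustrate Jeff's earlier point about the data model (Mike has since clarified the real queries do include the partition key): as posted, `select * from history where ts > {current_time}` has no equality on a partition key, so it fans out as a range scan across the cluster. A rough sketch of what a single-partition, prepared read could look like from the nodejs app - `source_id` is purely a hypothetical partition key column here, since the real schema isn't shown:

    // Hypothetical schema: PRIMARY KEY (source_id, ts) - adjust to the real table.
    // Reuses the `client` from the sketch above; the values are placeholders.
    const sourceId = 'some-partition-key';
    const sinceTs = new Date(Date.now() - 60 * 1000); // e.g. rows from the last minute

    // Equality on the partition key makes this a single-partition read instead of a
    // cluster-wide scan, and `prepare: true` gives the driver the routing key so the
    // token-aware policy can pick a replica that owns the partition.
    const query = 'SELECT * FROM history WHERE source_id = ? AND ts > ?';
    client.execute(query, [sourceId, sinceTs], { prepare: true })
      .then(result => console.log('rows:', result.rows.length))
      .catch(err => console.error(err));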