Interestingly, it seems that changing the order of steps I take during the node restart resolves the problem. Instead of:
`nodetool disablebinary && nodetool disablethrift && *nodetool disablegossip* && nodetool drain && sudo service cassandra restart`

if I do:

`nodetool disablebinary && nodetool disablethrift && nodetool drain && *nodetool disablegossip* && sudo service cassandra restart`

I see no application errors, no latency, and no nodes marked as Down/Normal on the restarted node. Note the only thing I changed is that I moved `nodetool disablegossip` to after `nodetool drain`. This is pretty anecdotal, but is there any explanation for why this might happen? I'll be monitoring my cluster closely to see if this change does indeed fix the problem.

On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra <mto...@salesforce.com> wrote:

> Any other ideas? If I simply stop the node, there is no latency problem, but once I start the node the problem appears. This happens consistently for all nodes in the cluster.
>
> On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra <mto...@salesforce.com> wrote:
>
>> No, I am not.
>>
>> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Are you using internode ssl?
>>>
>>> --
>>> Jeff Jirsa
>>>
>>> On Feb 7, 2018, at 8:24 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>
>>> Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy, and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the implementation.
>>>
>>> It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that.
>>>
>>> One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
>>>
>>> As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but nodes being seen as down seems like an issue.
>>>
>>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Unless you abbreviated, your data model is questionable (SELECT without any equality in the WHERE clause on the partition key will always cause a range scan, which is super inefficient). Since you're doing LOCAL_ONE and a range scan, timeouts sorta make sense - the owner of at least one range would be down for a bit.
>>>>
>>>> If you actually have a partition key in your where clause, then the next most likely guess is your clients aren't smart enough to route around the node as it restarts, or your key cache is getting cold during the bounce. Double check your driver's load balancing policy.
>>>>
>>>> It's also likely the case that speculative retry may help other nodes route around the bouncing instance better - if you're not using it, you probably should be (though with CL: LOCAL_ONE, it seems like it'd be less of an issue).
>>>>
>>>> We need to make bouncing nodes easier (or rather, we need to make drain do the right thing), but in this case, your data model looks like the biggest culprit (unless it's an incomplete recreation).
>>>>
>>>> - Jeff
>>>>
>>>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
>>>>>
>>>>> I restart a node like this:
>>>>>
>>>>> nodetool disablethrift && nodetool disablegossip && nodetool drain
>>>>> sudo service cassandra restart
>>>>>
>>>>> When I do that, I very often get timeouts and errors like this in my nodejs app:
>>>>>
>>>>> Error: Cannot achieve consistency level LOCAL_ONE
>>>>>
>>>>> My queries are all pretty much the same, things like: "select * from history where ts > {current_time}"
>>>>>
>>>>> The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
>>>>>
>>>>> I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs.
>>>>>
>>>>> What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
>>>>>
>>>>> I appreciate any suggestions or advice.
>>>>>
>>>>> - Mike
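
For reference, the reordered sequence from above written out as a small sketch, with a comment on what each step does - the `cassandra` service name assumes the stock Ubuntu packaging, and the thrift step only matters if thrift clients are still in use:

    # reordered restart sequence: drain before disabling gossip
    nodetool disablebinary           # stop accepting new native protocol (CQL) client connections
    nodetool disablethrift           # stop the thrift RPC server (only relevant if thrift clients are in use)
    nodetool drain                   # flush memtables to disk and stop listening for writes
    nodetool disablegossip           # mark the node down to the rest of the cluster
    sudo service cassandra restart   # assumes the packaged 'cassandra' service name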