Are you using internode ssl?
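
If you're not sure, it's the server_encryption_options block in cassandra.yaml - anything other than internode_encryption: none means internode ssl is on. Roughly like this (the keystore/truststore paths are just examples):

    server_encryption_options:
        internode_encryption: all    # 'none' disables it; 'dc' and 'rack' encrypt only across DCs / racks
        keystore: conf/.keystore
        truststore: conf/.truststore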
-- Jeff Jirsa

> On Feb 7, 2018, at 8:24 AM, Mike Torra <mto...@salesforce.com> wrote:
>
> Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy), and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child, but I will look more closely at the implementation.
>
> It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that.
>
> One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (i.e. status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could explain the problem - the node comes back online, thinks it is healthy while many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but it persists for quite a while (15-20 minutes) after the node is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
>
> As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but the nodes being seen as down seems like the real issue.
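>
> As I understand it, that default is equivalent to configuring the policy explicitly like this (a rough sketch rather than a copy of our code - the contact point, keyspace, and DC name are placeholders):
>
>     const cassandra = require('cassandra-driver');
>
>     // token-aware routing on top of DC-aware round robin, which is the driver's default
>     const loadBalancing = new cassandra.policies.loadBalancing.TokenAwarePolicy(
>       new cassandra.policies.loadBalancing.DCAwareRoundRobinPolicy('us-east'));
>
>     const client = new cassandra.Client({
>       contactPoints: ['10.0.0.1'],
>       keyspace: 'my_keyspace',
>       policies: { loadBalancing: loadBalancing }
>     });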
>
>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>> Unless you abbreviated, your data model is questionable (a SELECT without any equality on the partition key in the WHERE clause will always cause a range scan, which is super inefficient). Since you're doing LOCAL_ONE and a range scan, timeouts sorta make sense - the owner of at least one range would be down for a bit.
>>
>> If you actually have a partition key in your WHERE clause, then the next most likely guess is that your clients aren't smart enough to route around the node as it restarts, or your key cache is getting cold during the bounce. Double check your driver's load balancing policy.
>>
>> It's also likely that speculative retry would help other nodes route around the bouncing instance better - if you're not using it, you probably should be (though with CL LOCAL_ONE, it seems like it'd be less of an issue).
>>
>> We need to make bouncing nodes easier (or rather, we need to make drain do the right thing), but in this case your data model looks like the biggest culprit (unless it's an incomplete recreation).
>>
>> - Jeff
>>
>>
>>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com> wrote:
>>> Hi -
>>>
>>> I am running a 29 node cluster spread over 4 DCs in EC2, using C* 3.11.1 on Ubuntu. Occasionally I need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
>>>
>>> I restart a node like this:
>>>
>>> nodetool disablethrift && nodetool disablegossip && nodetool drain
>>> sudo service cassandra restart
>>>
>>> When I do that, I very often get timeouts and errors like this in my nodejs app:
>>>
>>> Error: Cannot achieve consistency level LOCAL_ONE
>>>
>>> My queries are all pretty much the same, things like: "select * from history where ts > {current_time}"
>>>
>>> The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
>>>
>>> I've tried waiting between the steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs.
>>>
>>> What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
>>>
>>> I appreciate any suggestions or advice.
>>>
>>> - Mike
>>
>
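
For what it's worth, when you do bounce a node, the full quiesce sequence I'd try looks roughly like this (a sketch, assuming the stock Ubuntu service scripts; disablebinary is the step you said was missing from your original list):

    nodetool disablebinary && \
    nodetool disablethrift && \
    nodetool disablegossip && \
    nodetool drain
    sudo service cassandra restart

Then wait until `nodetool status` on the restarted node and on a peer both show everything UN again before putting traffic back on it.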