Drain takes care of stopping gossip, and it does a few other tasks before stopping gossip (it stops the batchlog, hints, auth, cache saver, and a few other things). I'm not sure why this causes a side effect when you restart the node, but there should be no need to issue a disablegossip anyway - just leave that to the drain. As Jeff said, we need to fix drain, because this chain of commands should be unnecessary.
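For what it's worth, something like `nodetool disablebinary && nodetool disablethrift && nodetool drain && sudo service cassandra restart` should be enough on its own (the last command obviously depends on how Cassandra is managed on your hosts), with drain handling gossip shutdown itself.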
On 12 February 2018 at 18:36, Mike Torra <mto...@salesforce.com> wrote:

> Interestingly, it seems that changing the order of steps I take during the node restart resolves the problem. Instead of:
>
> `nodetool disablebinary && nodetool disablethrift && *nodetool disablegossip* && nodetool drain && sudo service cassandra restart`,
>
> if I do:
>
> `nodetool disablebinary && nodetool disablethrift && nodetool drain && *nodetool disablegossip* && sudo service cassandra restart`,
>
> I see no application errors, no latency, and no nodes marked as Down/Normal on the restarted node. Note the only thing I changed is that I moved `nodetool disablegossip` to after `nodetool drain`. This is pretty anecdotal, but is there any explanation for why this might happen? I'll be monitoring my cluster closely to see if this change does indeed fix the problem.
>
> On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra <mto...@salesforce.com> wrote:
>
>> Any other ideas? If I simply stop the node, there is no latency problem, but once I start the node the problem appears. This happens consistently for all nodes in the cluster.
>>
>> On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra <mto...@salesforce.com> wrote:
>>
>>> No, I am not
>>>
>>> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Are you using internode ssl?
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>> On Feb 7, 2018, at 8:24 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>>
>>>> Thanks for the feedback guys. That example data model was indeed abbreviated - the real queries have the partition key in them. I am using RF 3 on the keyspace, so I don't think a node being down would mean the key I'm looking for would be unavailable. The load balancing policy of the driver seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy, and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the implementation.
>>>>
>>>> It was an oversight of mine to not include `nodetool disablebinary`, but I still experience the same issue with that.
>>>>
>>>> One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted sees many other nodes in the same DC as being down (ie status 'DN'). However, checking `nodetool status` on those other nodes shows all nodes as up/normal. To me this could kind of explain the problem - node comes back online, thinks it is healthy but many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (ie 15-20 mins) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
>>>>
>>>> As for speculative retry, my CF is using the default of '99th percentile'. I could try something different there, but nodes being seen as down seems like an issue.
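On the driver side, it might also be worth pinning the load balancing policy explicitly rather than relying on the default, just to rule the client out. A minimal sketch with the Node.js driver 3.x - the contact points, keyspace, and DC name below are placeholders, not your real values:

    const cassandra = require('cassandra-driver');
    const { TokenAwarePolicy, DCAwareRoundRobinPolicy } = cassandra.policies.loadBalancing;

    // Placeholder contact points, keyspace, and DC name - replace with your own.
    const client = new cassandra.Client({
      contactPoints: ['10.0.0.1', '10.0.0.2'],
      keyspace: 'my_keyspace',
      policies: {
        // Token-aware routing on top of DC-aware round robin, pinned to the local DC.
        loadBalancing: new TokenAwarePolicy(new DCAwareRoundRobinPolicy('my_local_dc'))
      }
    });

With the policy set explicitly, you can at least rule out the client falling back to something other than TokenAware over DCAwareRoundRobin while the node is bouncing.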
>>>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> Unless you abbreviated, your data model is questionable (SELECT without any equality in the WHERE clause on the partition key will always cause a range scan, which is super inefficient). Since you're doing LOCAL_ONE and a range scan, timeouts sorta make sense - the owner of at least one range would be down for a bit.
>>>>>
>>>>> If you actually have a partition key in your where clause, then the next most likely guess is your clients aren't smart enough to route around the node as it restarts, or your key cache is getting cold during the bounce. Double check your driver's load balancing policy.
>>>>>
>>>>> It's also likely the case that speculative retry may help other nodes route around the bouncing instance better - if you're not using it, you probably should be (though with CL: LOCAL_ONE, it seems like it'd be less of an issue).
>>>>>
>>>>> We need to make bouncing nodes easier (or rather, we need to make drain do the right thing), but in this case, your data model looks like the biggest culprit (unless it's an incomplete recreation).
>>>>>
>>>>> - Jeff
>>>>>
>>>>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> I am running a 29 node cluster spread over 4 DC's in EC2, using C* 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
>>>>>>
>>>>>> I restart a node like this:
>>>>>>
>>>>>> nodetool disablethrift && nodetool disablegossip && nodetool drain
>>>>>> sudo service cassandra restart
>>>>>>
>>>>>> When I do that, I very often get timeouts and errors like this in my nodejs app:
>>>>>>
>>>>>> Error: Cannot achieve consistency level LOCAL_ONE
>>>>>>
>>>>>> My queries are all pretty much the same, things like: "select * from history where ts > {current_time}"
>>>>>>
>>>>>> The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
>>>>>>
>>>>>> I've tried waiting between steps of shutting down cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after `nodetool drain`ing the node, there are open connections to other nodes in the cluster (ie looking at the output of netstat) until I stop cassandra. I don't see any errors or warnings in the logs.
>>>>>>
>>>>>> What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
>>>>>>
>>>>>> I appreciate any suggestions or advice.
>>>>>>
>>>>>> - Mike
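To illustrate Jeff's earlier point about the data model (Mike has since clarified the real queries do include the partition key): as posted, `select * from history where ts > {current_time}` has no equality on a partition key, so it fans out as a range scan across the cluster. A rough sketch of what a single-partition, prepared read could look like from the nodejs app - `source_id` is purely a hypothetical partition key column here, since the real schema isn't shown:

    // Hypothetical schema: PRIMARY KEY (source_id, ts) - adjust to the real table.
    // Reuses the `client` from the sketch above; the values are placeholders.
    const sourceId = 'some-partition-key';
    const sinceTs = new Date(Date.now() - 60 * 1000); // e.g. rows from the last minute

    // Equality on the partition key makes this a single-partition read instead of a
    // cluster-wide scan, and `prepare: true` gives the driver the routing key so the
    // token-aware policy can pick a replica that owns the partition.
    const query = 'SELECT * FROM history WHERE source_id = ? AND ts > ?';
    client.execute(query, [sourceId, sinceTs], { prepare: true })
      .then(result => console.log('rows:', result.rows.length))
      .catch(err => console.error(err));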