Unless you abbreviated the query, your data model is questionable: a SELECT
without an equality predicate on the partition key in the WHERE clause always
turns into a range scan, which is very inefficient. Given a range scan at
LOCAL_ONE, the timeouts make sense - the scan has to touch every token range,
and the owner of at least one of those ranges is down for a bit during the
bounce.
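
To make that concrete: if the history table were partitioned on, say, a
source_id column (hypothetical - substitute whatever your table actually
partitions on), the query could be restricted to a single partition, with ts
as a clustering-column filter inside it:

    SELECT * FROM history WHERE source_id = ? AND ts > ?;

With the query you posted, the coordinator instead has to fan out across
every token range, so any node mid-restart sits in the hot path.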

If you actually do have a partition key equality in your WHERE clause, then
the next most likely guess is that your clients aren't smart enough to route
around the node as it restarts, or your key cache is going cold during the
bounce. Double check your driver's load balancing policy.
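
With the DataStax Node.js driver (which I'm assuming you're on), that means
wrapping a DC-aware policy in a token-aware one. A sketch, with placeholder
contact points and DC name - recent driver versions default to roughly this,
so also check that you haven't overridden it:

    const cassandra = require('cassandra-driver');
    const { DCAwareRoundRobinPolicy, TokenAwarePolicy } =
      cassandra.policies.loadBalancing;

    const client = new cassandra.Client({
      contactPoints: ['10.0.0.1'],  // placeholder seed node(s)
      policies: {
        // stay in the local DC, and route each statement to a replica
        // that owns the partition, skipping the node that's mid-restart
        loadBalancing: new TokenAwarePolicy(
          new DCAwareRoundRobinPolicy('us-east'))
      },
      queryOptions: { consistency: cassandra.types.consistencies.localOne }
    });

Token awareness only helps single-partition queries, though - for a range
scan there's no single replica to route to, which circles back to the data
model.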

Speculative retry may also help other coordinators route around the bouncing
instance - if you're not using it, you probably should be (though at CL
LOCAL_ONE it should be less of an issue).
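
It's a per-table setting you can flip with CQL - something like the following
(substituting your keyspace; note 99percentile is also the default in recent
versions, so check what the table currently has first):

    ALTER TABLE mykeyspace.history WITH speculative_retry = '99percentile';

That tells a coordinator to fire the read at another replica if the first one
hasn't answered within the table's p99 read latency.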

We need to make bouncing nodes easier (or rather, we need to make drain do
the right thing), but in this case your data model looks like the biggest
culprit (unless your example query is an incomplete recreation).
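
For what it's worth, until drain does the right thing on its own, a
belt-and-suspenders bounce that some people run looks roughly like this
(disablebinary may matter for you specifically, since the Node.js driver
speaks the native binary protocol, not thrift):

    nodetool disablebinary && nodetool disablethrift && nodetool disablegossip
    nodetool drain
    sleep 10   # give clients a moment to move their connections
    sudo service cassandra restart

No guarantees that alone fixes the errors, but it at least stops new client
connections before the flush.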

- Jeff


On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com> wrote:

> Hi -
>
> I am running a 29-node cluster spread over 4 DCs in EC2, using C* 3.11.1
> on Ubuntu. Occasionally I have the need to restart nodes in the cluster,
> but every time I do, I see errors and application (nodejs) timeouts.
>
> I restart a node like this:
>
> nodetool disablethrift && nodetool disablegossip && nodetool drain
> sudo service cassandra restart
>
> When I do that, I very often get timeouts and errors like this in my
> nodejs app:
>
> Error: Cannot achieve consistency level LOCAL_ONE
>
> My queries are all pretty much the same, things like: "select * from
> history where ts > {current_time}"
>
> The errors and timeouts seem to go away on their own after a while, but it
> is frustrating because I can't track down what I am doing wrong!
>
> I've tried waiting between steps of shutting down cassandra, and I've
> tried stopping, waiting, then starting the node. One thing I've noticed is
> that even after `nodetool drain`ing the node, there are open connections to
> other nodes in the cluster (i.e. looking at the output of netstat) until I
> stop cassandra. I don't see any errors or warnings in the logs.
>
> What can I do to prevent this? Is there something else I should be doing
> to gracefully restart the cluster? It could be something to do with the
> nodejs driver, but I can't find anything there to try.
>
> I appreciate any suggestions or advice.
>
> - Mike
>
