Re: A Single Dropped Node Fails Entire Read Queries

Shalom Sagges Wed, 22 Mar 2017 10:56:40 -0700

Upgrading to 3.0.12 solved the issue.

Thanks a lot for the help Joel!



Shalom Sagges
DBA
T: +972-74-700-4035
<http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
<http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
<https://liveperson.docsend.com/view/8iiswfp>


On Tue, Mar 14, 2017 at 10:44 AM, Shalom Sagges <shal...@liveperson.com>
wrote:

> Thanks a lot Joel!
>
> I'll go ahead and upgrade.
>
> Thanks again!
>
>
>
>
> On Mon, Mar 13, 2017 at 7:27 PM, Joel Knighton
> <joel.knigh...@datastax.com> wrote:
>
>> It's possible that you're hitting
>> https://issues.apache.org/jira/browse/CASSANDRA-13009 .
>>
>> In (simplified) summary, the read query picks the right number of
>> endpoints fairly early in its execution. Because the down node has not been
>> detected as down yet, it may be one of the nodes. When this node doesn't
>> answer, it is likely that speculative retry will kick in after a certain
>> amount of time and query an up node. This feature is present and working in
>> the earlier releases you tested. Unfortunately, percentile-based
>> speculative retry wasn't working as intended in 2.2+ until fixed in
>> CASSANDRA-13009, which went into 2.2.9/3.0.11+.
>>
>> It may be worth evaluating the latest 3.0.x release.
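
The summary above can be pictured with a toy model. This is only a sketch under stated assumptions, not Cassandra's actual read path: the function name, the latency numbers, and the fixed speculative delay are all hypothetical, and a replica that is down but not yet marked down is modeled as a latency of None (it never answers).

```python
from typing import Optional, Sequence, Tuple


def simulate_read(
    latencies_ms: Sequence[Optional[float]],
    required: int,
    timeout_ms: float,
    speculative_delay_ms: Optional[float] = None,
) -> Tuple[bool, float]:
    """Toy coordinator: returns (succeeded, time_ms_when_the_read_finished).

    latencies_ms lists the replicas in the coordinator's preferred order;
    None means the replica never responds (down, but not yet detected).
    """
    # The coordinator picks exactly `required` replicas up front.
    arrivals = [t for t in latencies_ms[:required] if t is not None]
    spares = list(latencies_ms[required:])

    if speculative_delay_ms is not None and spares:
        # If some initial replica has not answered by the threshold,
        # send one extra request to the next replica in line.
        on_time = sum(1 for t in arrivals if t <= speculative_delay_ms)
        if on_time < required:
            extra = spares.pop(0)
            if extra is not None:
                arrivals.append(speculative_delay_ms + extra)

    arrivals.sort()
    if len(arrivals) >= required and arrivals[required - 1] <= timeout_ms:
        return True, arrivals[required - 1]
    return False, timeout_ms


# Second replica is silently down; LOCAL_QUORUM with RF=3 needs 2 answers.
replicas = [5.0, None, 7.0]
print(simulate_read(replicas, required=2, timeout_ms=5000.0))
print(simulate_read(replicas, required=2, timeout_ms=5000.0,
                    speculative_delay_ms=50.0))
```

Without speculation the read waits for the silent replica until the timeout; with it, the third replica is asked after the delay and the read completes. Once the node is marked down, it is no longer picked in the first place, which would match the errors stopping after roughly 10 seconds.
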
>>
>> On Mon, Mar 13, 2017 at 11:48 AM, Shalom Sagges <shal...@liveperson.com>
>> wrote:
>>
>>> Just some more info, I've tried the same scenario on 2.0.14 and 2.1.15
>>> and didn't encounter such errors.
>>> What I did find is that the timeout errors appear only until the node is
>>> discovered as "DN" in nodetool status. Once the node is in DN status, the
>>> errors stop and the data is retrieved.
>>>
>>> Could this be a bug in 3.0.9? Or some sort of misconfiguration I missed?
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Mar 12, 2017 at 10:21 AM, Shalom Sagges <shal...@liveperson.com>
>>> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> If a node suddenly fails, and there are other replicas that can still
>>>> satisfy the consistency level, shouldn't the request succeed regardless of
>>>> the failed node?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler
>>>> <mich...@pbandjelly.org> wrote:
>>>>
>>>>> I may be mistaken on the exact configuration option for the timeout
>>>>> you're hitting, but I believe this may be the general
>>>>> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>>>>>
>>>>> A reasonable timeout for "node down" discovery/processing is needed to
>>>>> prevent random flapping of nodes with a super short timeout interval.
>>>>> Applications should also retry on a host unavailable exception like
>>>>> this, because in the long run, this should be expected from time to time
>>>>> for network partitions, node failure, maintenance cycles, etc.
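
The application-side retry suggested above might look roughly like the following. This is a minimal sketch, not a driver API: `TransientReadError` and `retry_with_backoff` are hypothetical names, and in a real client the retryable exception types would be whatever the driver raises for timeouts and unavailable replicas.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientReadError(Exception):
    """Hypothetical stand-in for a driver's timeout/unavailable error."""


def retry_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TransientReadError,),
    attempts: int = 4,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying retryable errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            # Out of attempts: let the last error reach the caller.
            if attempt == attempts - 1:
                raise
            # Wait 0.1s, 0.2s, 0.4s, ... before the next attempt.
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage: a read that fails twice (node just went down), then succeeds.
calls = {"n": 0}


def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientReadError("timed out")
    return "row"


print(retry_with_backoff(flaky_read, sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the example can run without actually waiting. Retrying reads this way is safe because they are idempotent; writes need more care.
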
>>>>>
>>>>> --
>>>>> Kind regards,
>>>>> Michael
>>>>>
>>>>> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
>>>>> > Hi Daniel,
>>>>> >
>>>>> > I don't think that's a network issue, because ~10 seconds after the node
>>>>> > stopped, the queries were successful again without any timeout issues.
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
>>>>> > <daniel.hoelbling-in...@bitmovin.com
>>>>> > <mailto:daniel.hoelbling-in...@bitmovin.com>> wrote:
>>>>> >
>>>>> >     Could there be network issues in connecting between the nodes? If
>>>>> >     node A gets to be the query coordinator but can't reach B, and C is
>>>>> >     obviously down, it won't get a quorum.
>>>>> >
>>>>> >     Greetings
>>>>> >
>>>>> >     Shalom Sagges <shal...@liveperson.com
>>>>> >     <mailto:shal...@liveperson.com>> wrote on Fri, Mar 10, 2017 at 10:55:
>>>>> >
>>>>> >         @Ryan, my keyspace replication settings are as follows:
>>>>> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>>>> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>>>>> >          AND durable_writes = true;
>>>>> >
>>>>> >         CREATE TABLE mykeyspace.test (
>>>>> >             column1 text,
>>>>> >             column2 text,
>>>>> >             column3 text,
>>>>> >             PRIMARY KEY (column1, column2)
>>>>> >         );
>>>>> >
>>>>> >         The query is: select * from mykeyspace.test where
>>>>> >         column1='xxxxx';
>>>>> >
>>>>> >         @Daniel, the replication factor is 3. That's why I don't
>>>>> >         understand why I get these timeouts when only one node drops.
>>>>> >
>>>>> >         Also, when I enabled tracing, I got the following error:
>>>>> >         Unable to fetch query trace: ('Unable to complete the operation
>>>>> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>>>>> >         from server: code=1000 [Unavailable exception] message="Cannot
>>>>> >         achieve consistency level LOCAL_QUORUM"
>>>>> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>>>>> >         \'consistency\': \'LOCAL_QUORUM\'}',)})
>>>>> >
>>>>> >         But nodetool status shows that only 1 replica was down:
>>>>> >         --  Address    Load       Tokens  Owns  Host ID                               Rack
>>>>> >         DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>>>>> >         UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>>>>> >         UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
>>>>> >
>>>>> >
>>>>> >         I tried to run the same scenario on all 3 nodes, and only the
>>>>> >         3rd node didn't fail the query when I dropped it. The nodes were
>>>>> >         installed and configured with Puppet, so the configuration is the
>>>>> >         same on all 3 nodes.
>>>>> >
>>>>> >
>>>>> >         Thanks!
>>>>> >
>>>>> >
>>>>> >
>>>>> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>>>>> >         <daniel.hoelbling-in...@bitmovin.com
>>>>> >         <mailto:daniel.hoelbling-in...@bitmovin.com>> wrote:
>>>>> >
>>>>> >             LOCAL_QUORUM works on the available replicas in the DC.
>>>>> >             So if your replication factor is 2, a quorum is 2 replicas,
>>>>> >             and even with 10 nodes you can't lose a replica of the data
>>>>> >             being read. With a replication factor of 3 you can lose one
>>>>> >             node and still satisfy the query.
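
The arithmetic behind this can be written down as a short sketch. The function names are made up for illustration; the formula (a quorum is a strict majority of the replicas in the local DC) is the standard definition.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM / LOCAL_QUORUM (strict majority)."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas of a partition that can be down while a quorum still succeeds."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, can lose {tolerated_failures(rf)}")
```

The total node count in the DC does not enter into it; only the replication factor does. With RF=3, as in this thread, a quorum is 2 and one replica can be down.
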
>>>>> >             Ryan Svihla <r...@foundev.pro <mailto:r...@foundev.pro>>
>>>>> >             wrote on Thu, Mar 9, 2017 at 18:09:
>>>>> >
>>>>> >                 What are your keyspace replication settings, and what's
>>>>> >                 your query?
>>>>> >
>>>>> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>>>>> >                 <shal...@liveperson.com <mailto:shal...@liveperson.com>>
>>>>> >                 wrote:
>>>>> >
>>>>> >                     Hi Cassandra Users,
>>>>> >
>>>>> >                     I hope someone could help me understand the
>>>>> >                     following scenario:
>>>>> >
>>>>> >                     Version: 3.0.9
>>>>> >                     3 nodes per DC
>>>>> >                     3 DCs in the cluster.
>>>>> >                     Consistency Local_Quorum.
>>>>> >
>>>>> >                     I did a small resiliency test and dropped a node to
>>>>> >                     check the availability of the data.
>>>>> >                     What I assumed would happen is nothing at all. If a
>>>>> >                     node is down in a 3-node DC, Local_Quorum should
>>>>> >                     still be satisfied.
>>>>> >                     However, during the first ~10 seconds after stopping
>>>>> >                     the service, I got timeout errors (I tried it both
>>>>> >                     from the client and from cqlsh).
>>>>> >
>>>>> >                     This is the error I get:
>>>>> >                     ServerError:
>>>>> >                     com.google.common.util.concurrent.UncheckedExecutionException:
>>>>> >                     com.google.common.util.concurrent.UncheckedExecutionException:
>>>>> >                     java.lang.RuntimeException:
>>>>> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
>>>>> >                     Operation timed out - received only 4 responses.
>>>>> >
>>>>> >
>>>>> >                     After ~10 seconds, the same query is successful with
>>>>> >                     no timeout errors. The dropped node is still down.
>>>>> >
>>>>> >                     Any idea what could cause this and how to fix it?
>>>>> >
>>>>> >                     Thanks!
>>>>> >
>>>>> >
>>>>> >                     This message may contain confidential and/or
>>>>> >                     privileged information.
>>>>> >                     If you are not the addressee or authorized to
>>>>> >                     receive this on behalf of the addressee you must not
>>>>> >                     use, copy, disclose or take action based on this
>>>>> >                     message or any information herein.
>>>>> >                     If you have received this message in error, please
>>>>> >                     advise the sender immediately by reply email and
>>>>> >                     delete this message. Thank you.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >                 --
>>>>> >
>>>>> >                 Thanks,
>>>>> >
>>>>> >                 Ryan Svihla
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

