Re: A Single Dropped Node Fails Entire Read Queries

Michael Shuler Fri, 10 Mar 2017 08:25:39 -0800

I may be mistaken on the exact configuration option for the timeout
you're hitting, but I believe this may be the general
`request_timeout_in_ms: 10000` in conf/cassandra.yaml.


A reasonable timeout for "node down" detection is needed, because a
very short interval would make nodes flap randomly between up and down.
Applications should also retry on a host-unavailable exception like this
one; over the long run it is to be expected from time to time due to
network partitions, node failures, maintenance cycles, etc.
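
A minimal sketch of the retry idea, in Python with only the standard
library. The exception class TransientUnavailable and the helper
run_with_retry are hypothetical names used for illustration; with a real
driver you would catch its own unavailable/timeout exceptions instead
(in the DataStax Python driver these are, I believe, cassandra.Unavailable
and cassandra.ReadTimeout). The attempt count and delays are assumptions,
not recommended values.

```python
import time


class TransientUnavailable(Exception):
    """Hypothetical stand-in for a driver's unavailable/timeout error."""


def run_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientUnavailable, back off and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits add up to 3.5 seconds before the final
attempt, so riding out a ~10 second detection window would need more
attempts or a larger base delay.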

-- 
Kind regards,
Michael
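
For context on the Unavailable error quoted further down in this thread
('required_replicas': 2, 'alive_replicas': 1), here is a small sketch of
the quorum arithmetic. The function names are hypothetical; the formula
(half the replication factor, rounded down, plus one) is the usual
definition of a quorum per data center.

```python
def quorum(replication_factor):
    """Replicas that must respond for QUORUM / LOCAL_QUORUM."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


def can_satisfy(replication_factor, alive_replicas):
    """True if the coordinator sees enough live replicas for a quorum."""
    return alive_replicas >= quorum(replication_factor)
```

With a replication factor of 3 the quorum is 2, so one down replica is
tolerated. The error in the thread reports only 1 alive replica, which
is below that quorum.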

On 03/10/2017 04:07 AM, Shalom Sagges wrote:
> Hi Daniel,
> 
> I don't think that's a network issue, because ~10 seconds after the node
> stopped, the queries were successful again without any timeout issues.
> 
> Thanks!
> 
>  
> Shalom Sagges
> DBA
> 
> On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
> <daniel.hoelbling-in...@bitmovin.com
> <mailto:daniel.hoelbling-in...@bitmovin.com>> wrote:
> 
>     Could there be network issues connecting the nodes? If node A
>     becomes the query coordinator but can't reach B, and C is down,
>     it won't get a quorum.
> 
>     Greetings
> 
>     Shalom Sagges <shal...@liveperson.com
>     <mailto:shal...@liveperson.com>> wrote on Fri, Mar 10, 2017 at 10:55:
> 
>         @Ryan, my keyspace replication settings are as follows:
>         CREATE KEYSPACE mykeyspace WITH replication = {'class':
>         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
>          AND durable_writes = true;
> 
>         CREATE TABLE mykeyspace.test (
>             column1 text,
>             column2 text,
>             column3 text,
>             PRIMARY KEY (column1, column2)
>         );
> 
>         The query is:
>         select * from mykeyspace.test where column1='xxxxx';
> 
>         @Daniel, the replication factor is 3. That's why I don't
>         understand why I get these timeouts when only one node drops. 
> 
>         Also, when I enabled tracing, I got the following error:
>         Unable to fetch query trace: ('Unable to complete the operation
>         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
>         from server: code=1000 [Unavailable exception] message="Cannot
>         achieve consistency level LOCAL_QUORUM"
>         info={\'required_replicas\': 2, \'alive_replicas\': 1,
>         \'consistency\': \'LOCAL_QUORUM\'}',)})
> 
>         But nodetool status shows that only 1 replica was down:
>         --  Address    Load       Tokens  Owns  Host ID                               Rack
>         DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
>         UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
>         UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
> 
> 
>         I tried to run the same scenario on all 3 nodes, and only the
>         3rd node didn't fail the query when I dropped it. The nodes were
>         installed and configured with Puppet so the configuration is the
>         same on all 3 nodes. 
> 
> 
>         Thanks!
> 
>           
> 
>         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
>         <daniel.hoelbling-in...@bitmovin.com
>         <mailto:daniel.hoelbling-in...@bitmovin.com>> wrote:
> 
>             LOCAL_QUORUM is computed from the replication factor in the
>             local DC, not from the number of nodes. With a replication
>             factor of 2, quorum is 2, so even with 10 nodes you cannot
>             lose a replica of the data being read. With a replication
>             factor of 3, quorum is still 2, so you can lose one node
>             and still satisfy the query.
>             Ryan Svihla <r...@foundev.pro <mailto:r...@foundev.pro>> wrote
>             on Thu, Mar 9, 2017 at 18:09:
> 
>                 What are your keyspace replication settings, and what's
>                 your query?
> 
>                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
>                 <shal...@liveperson.com <mailto:shal...@liveperson.com>>
>                 wrote:
> 
>                     Hi Cassandra Users, 
> 
>                     I hope someone could help me understand the
>                     following scenario:
> 
>                     Version: 3.0.9
>                     3 nodes per DC
>                     3 DCs in the cluster. 
>                     Consistency Local_Quorum. 
> 
>                     I did a small resiliency test and dropped a node to
>                     check the availability of the data. 
>                     What I assumed would happen is nothing at all. If a
>                     node is down in a 3-node DC, Local_Quorum should
>                     still be satisfied.
>                     However, during the first ~10 seconds after stopping
>                     the service, I got timeout errors (tried it both
>                     from the client and from cqlsh).
> 
>                     This is the error I get:
>                     ServerError:
>                     com.google.common.util.concurrent.UncheckedExecutionException:
>                     com.google.common.util.concurrent.UncheckedExecutionException:
>                     java.lang.RuntimeException:
>                     org.apache.cassandra.exceptions.ReadTimeoutException:
>                     Operation timed out - received only 4 responses.
> 
> 
>                     After ~10 seconds, the same query is successful with
>                     no timeout errors. The dropped node is still down. 
> 
>                     Any idea what could cause this and how to fix it? 
> 
>                     Thanks!
>                      
> 
>                     This message may contain confidential and/or
>                     privileged information. 
>                     If you are not the addressee or authorized to
>                     receive this on behalf of the addressee you must not
>                     use, copy, disclose or take action based on this
>                     message or any information herein. 
>                     If you have received this message in error, please
>                     advise the sender immediately by reply email and
>                     delete this message. Thank you.
> 
> 
> 
> 
>                 -- 
> 
>                 Thanks,
> 
>                 Ryan Svihla
> 
> 
> 
> 
> 
> 
