Richard,

From looking at the stack trace in your Cassandra logs, you might be hitting a 
variation of one of these bugs:

https://issues.apache.org/jira/browse/CASSANDRA-11353
https://issues.apache.org/jira/browse/CASSANDRA-10944

https://github.com/apache/cassandra/blob/cassandra-3.X/NEWS.txt

I notice you're on 3.3 - although 10944 was marked as fixed in 3.3.0, there seems 
to have been a merge issue. You would probably want to upgrade to > 3.5 and see 
if that resolves it. However, I am not sure it would account for the behaviour 
you're describing - but it would be worth trying.

Also, depending on the Java driver version/Akka Cassandra Persistence, you may 
be encountering some strangeness there - it would be useful to drop the logging 
level on the Java driver down to DEBUG to see if anything apparent shows up. If 
you're not seeing any nodes down via nodetool status and your app still thinks 
no replicas are available, it's a bit strange. Also, have a look at what the 
getHost/getAddress methods on the ReadTimeoutException are returning - they 
should tell you the coordinator that was used to service the request, which 
might help.
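As a minimal sketch of what I mean (assuming the DataStax Java driver 3.x, which matches your stack trace - the contact point and query below are placeholders, and getHost/getAddress are the accessors mentioned above):

```java
import java.net.InetSocketAddress;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

public class CoordinatorProbe {
    public static void main(String[] args) {
        // Placeholder contact point - substitute one of your own nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Placeholder query - use one that reproduces your timeout.
            session.execute("SELECT * FROM my_keyspace.my_table LIMIT 1");
        } catch (ReadTimeoutException e) {
            // getHost()/getAddress() identify the coordinator that serviced
            // the request - useful for spotting a single misbehaving node.
            InetSocketAddress coordinator = e.getHost();
            System.err.println("Read timed out; coordinator was " + coordinator
                    + " (" + e.getAddress() + ")");
        }
    }
}
```

If the same coordinator keeps showing up, that narrows the problem down to one node rather than the whole cluster.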

It would also be worth checking that the application conf for your Akka 
Persistence is set up correctly 
(https://github.com/akka/akka-persistence-cassandra/blob/master/src/main/resources/reference.conf)
 - things like local-datacenter, replication-strategy, write-consistency, 
read-consistency (is there a reason it's ONE and not LOCAL_ONE?), etc.
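For example, something along these lines - the keys are the ones named above, but the exact key names, section name (cassandra-journal here), and defaults vary by plugin version, so verify against the reference.conf linked above; the contact points and datacenter names are placeholders:

```hocon
cassandra-journal {
  # Placeholder contact points - use your own nodes
  contact-points = ["10.0.0.1", "10.0.0.2"]
  # Pin the driver to the app's local DC in a 2-DC deployment
  local-datacenter = "DC1"
  replication-strategy = "NetworkTopologyStrategy"
  data-center-replication-factors = ["DC1:2", "DC2:2"]
  # LOCAL_* levels avoid waiting on cross-DC replicas
  write-consistency = "LOCAL_QUORUM"
  read-consistency = "LOCAL_ONE"
}
```

In a multi-DC setup, plain ONE can be served by a remote-DC replica, which is one way to pick up surprising latencies.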

Regards,

Johnny

-- 

Johnny Miller
Co-Founder & CTO @ digitalis.io | Fully Managed Open 
Source Data Technologies
+44(0)20 8123 4053 | joh...@digitalis.io


> On 2 Jan 2017, at 18:59, Ney, Richard <richard....@aspect.com> wrote:
> 
> Hi Amit,
>  
> I’m seeing “not marking as down” in the logs like this one,
>  
> WARN  [GossipTasks:1] 2016-12-29 08:48:02,665 FailureDetector.java:287 - Not 
> marking nodes down due to local pause of 6641241564 > 5000000000
>  
> Now the ends of the system.log files on all three nodes in one of the data 
> centers are full of NullPointerExceptions and AssertionErrors like those 
> below - would these errors be the cause or a symptom?
>  
>  
> WARN  [SharedPool-Worker-1] 2017-01-02 07:13:56,441 
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.NullPointerException: null
> WARN  [SharedPool-Worker-1] 2017-01-02 07:15:02,865 
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: null
>                 at 
> org.apache.cassandra.db.rows.BufferCell.<init>(BufferCell.java:49) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BufferCell.tombstone(BufferCell.java:88) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BufferCell.tombstone(BufferCell.java:83) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BufferCell.purge(BufferCell.java:175) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.ComplexColumnData.lambda$purge$107(ComplexColumnData.java:165)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree$FiltrationTracker.apply(BTree.java:650)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree.transformAndFilter(BTree.java:693) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree.transformAndFilter(BTree.java:668) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.ComplexColumnData.transformAndFilter(ComplexColumnData.java:170)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.ComplexColumnData.purge(ComplexColumnData.java:165)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.ComplexColumnData.purge(ComplexColumnData.java:43)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BTreeRow.lambda$purge$102(BTreeRow.java:333) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree$FiltrationTracker.apply(BTree.java:650)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree.transformAndFilter(BTree.java:693) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.utils.btree.BTree.transformAndFilter(BTree.java:668) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BTreeRow.transformAndFilter(BTreeRow.java:338) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.BTreeRow.purge(BTreeRow.java:333) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.partitions.PurgeFunction.applyToRow(PurgeFunction.java:88)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:116) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:133)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:89)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:79)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:294)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:134)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:127)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:123)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:65) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:292) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:50)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) 
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_111]
>                 at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
>  [apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
> [apache-cassandra-3.3.0.jar:3.3.0]
>                 at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
> WARN  [SharedPool-Worker-2] 2017-01-02 07:15:03,132 
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-2,5,main]: {}
> java.lang.RuntimeException: java.lang.NullPointerException
>                 at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2461)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_111]
>                 at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
>  ~[apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
>  [apache-cassandra-3.3.0.jar:3.3.0]
>                 at 
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
> [apache-cassandra-3.3.0.jar:3.3.0]
>                 at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
> Caused by: java.lang.NullPointerException: null
>  
>  
> RICHARD NEY
> TECHNICAL DIRECTOR, RESEARCH & DEVELOPMENT
> +1 (978) 848.6640 WORK 
> +1 (916) 846.2353 MOBILE
> UNITED STATES
> richard....@aspect.com
> aspect.com
>  
>  
> From: Amit Singh F <amit.f.si...@ericsson.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, January 2, 2017 at 4:34 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: Trying to find cause of exception
>  
> Hello,
>  
> Few pointers:
>  
> a.)    Can you check system.log for messages like "marking as down" on the 
> node that gives the error? If yes, then please check for GC pauses - heavy 
> load is one of the reasons for this.
> b.)    Can you try connecting cqlsh to that node once you get this kind of 
> message? Are you able to connect?
>  
>  
> Regards
> Amit
>  
> From: Ney, Richard [mailto:richard....@aspect.com] 
> Sent: Monday, January 02, 2017 3:30 PM
> To: user@cassandra.apache.org
> Subject: Trying to find cause of exception
>  
> My development team has been trying to track down the cause of this Read 
> timeout (30 seconds or more at times) exception below. We're running a 2 data 
> center deployment with 3 nodes in each data center. Our tables are set up with 
> replication factor = 2, and we have 16G dedicated to the heap with the G1GC 
> for garbage collection. Our systems are AWS M4.2xlarge with 8 CPUs and 32GB 
> of RAM, and we have 2 general purpose EBS volumes of 500GB on each node. 
> Once we start getting these timeouts the cluster doesn't recover, and we are 
> required to shut all Cassandra nodes down and restart. If anyone has any tips 
> on where to look or what commands to run to help us diagnose this issue, we'd 
> be eternally grateful.
>  
> 2017-01-02 04:33:35.161 [ERROR] 
> [report-compute.ffbec924-ce44-11e6-9e21-0adb9d2dd624] [reportCompute] 
> [ahlworkerslave2.bos.manhattan.aspect-cloud.net:31312] 
> [WorktypeMetrics] Persistence failure when replaying events for persistenceId 
> [/fsms/pens/worktypes/bmwbpy.314]. Last known sequence number [0]
> java.util.concurrent.ExecutionException: 
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
> during read query at consistency ONE (1 responses were required but only 0 
> replica responded)
>     at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>     at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>     at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>     at 
> akka.persistence.cassandra.package$$anon$1$$anonfun$run$1.apply(package.scala:17)
>     at scala.util.Try$.apply(Try.scala:192)
> Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: 
> Cassandra timeout during read query at consistency ONE (1 responses were 
> required but only 0 replica responded)
>     at 
> com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:115)
>     at 
> com.datastax.driver.core.Responses$Error.asException(Responses.java:124)
>     at 
> com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:477)
>     at 
> com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1005)
>     at 
> com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:928)
> Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: 
> Cassandra timeout during read query at consistency ONE (1 responses were 
> required but only 0 replica responded)
>     at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:62)
>     at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
>     at 
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:266)
>     at 
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:246)
>     at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>  
>  
> RICHARD NEY
> TECHNICAL DIRECTOR, RESEARCH & DEVELOPMENT
> +1 (978) 848.6640 WORK 
> +1 (916) 846.2353 MOBILE
> UNITED STATES
> richard....@aspect.com
> aspect.com
>  
> This email (including any attachments) is proprietary to Aspect Software, 
> Inc. and may contain information that is confidential. If you have received 
> this message in error, please do not read, copy or forward this message. 
> Please notify the sender immediately, delete it from your system and destroy 
> any copies. You may not further disclose or distribute this email or its 
> attachments.


-- 

Any views or opinions presented are solely those of the author and do not 
necessarily represent those of the company. digitalis.io is a trading name 
of Digitalis.io Ltd. Company Number: 98499457 Registered in England and 
Wales. Registered Office: Kemp House, 152 City Road, London, EC1V 2NX, 
United Kingdom
