Re: Re: Re: A node down every day in a 6 nodes cluster

Xiangfei Ni Wed, 28 Mar 2018 05:25:22 -0700

Yes, we discussed it and plan to figure out the data model issue and upgrade to 
version 3.11.3.


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

Did you figure out what to do about the data model problem? It could be that 
your data files finally grew to the point that the data model problem caused 
the Java heap space issue, in which case everything is actually working as 
it’s supposed to; you just have to fix the data model.

Kenneth Brotman


From: Kenneth Brotman [mailto:kenbrot...@yahoo.com]
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

Was any change to hardware done around the time the problem started?
Was any change to the client software done around the time the problem started?
Was any change to the database schema done around the time the problem started?

Kenneth Brotman

From: Xiangfei Ni [mailto:xiangfei...@cm-dt.com]
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

Hi Kenneth,
    The cluster has been running for 4 months.
    The problem started last week.

Best Regards,

倪项菲/ David Ni

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

How long has the cluster been operating?
How long has the problem been occurring?

Kenneth Brotman

From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster




java.lang.OutOfMemoryError: Java heap space

You’re OOMing.

--
Jeff Jirsa
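
A heap exhaustion like the one quoted above can be confirmed by searching a node's log for the JVM error. The following is only a minimal Python sketch, assuming the log is plain text; the function name `find_oom_lines` and the sample lines are hypothetical and not taken from the attached log file.

```python
import re
from typing import Iterable, List, Tuple

# Matches the JVM error quoted in this thread, plus other OutOfMemoryError variants.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<detail>.*))?")


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, detail) for every line that mentions an OutOfMemoryError."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((number, (match.group("detail") or "").strip()))
    return hits


# Hypothetical sample; on a real node, pass an open handle to the log file instead.
sample = [
    "WARN  [Native-Transport-Requests-19] CassandraAuthorizer failed to authorize",
    "java.lang.OutOfMemoryError: Java heap space",
]
print(find_oom_lines(sample))
```

Because the function accepts any iterable of strings, an open file object can be passed directly, so a large log does not need to be loaded into memory at once.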


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:
Hi Jeff,
    Today another node shut down. I have attached the exception log file; 
could you please help analyze it? Thanks.

Best Regards,

倪项菲/ David Ni

From: Jeff Jirsa <jji...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xiangfei...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application 
is improperly pooling connections, or you have a hardware problem.

I don't see anything in nodetool that explains it, though you certainly have a 
data model likely to cause problems over time (the cardinality of 
rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions 
and it'll be difficult to read).
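
One way to spot wide partitions is to look at the "Compacted partition maximum bytes" line that nodetool cfstats prints for each table. The following is a minimal Python sketch under stated assumptions: the 100 MiB threshold is a rule-of-thumb value chosen for illustration, the function name `wide_partitions` is hypothetical, and the sample keyspace, table names, and sizes are made up rather than taken from the attached nodetool output.

```python
import re
from typing import Dict, Iterable, Optional

# Header lines differ slightly between versions ("Keyspace:" vs "Keyspace :"),
# so optional whitespace is allowed before the colon.
KEYSPACE_RE = re.compile(r"^\s*Keyspace\s*:\s*(\S+)")
TABLE_RE = re.compile(r"^\s*Table(?: \(index\))?\s*:\s*(\S+)")
MAX_PARTITION_RE = re.compile(r"^\s*Compacted partition maximum bytes\s*:\s*(\d+)")

# Assumed rule-of-thumb threshold (100 MiB); not a value taken from this thread.
WIDE_PARTITION_BYTES = 100 * 1024 * 1024


def wide_partitions(lines: Iterable[str],
                    threshold: int = WIDE_PARTITION_BYTES) -> Dict[str, int]:
    """Map 'keyspace.table' to its largest compacted partition if above threshold."""
    keyspace: Optional[str] = None
    table: Optional[str] = None
    result: Dict[str, int] = {}
    for line in lines:
        m = KEYSPACE_RE.match(line)
        if m:
            keyspace, table = m.group(1), None
            continue
        m = TABLE_RE.match(line)
        if m:
            table = m.group(1)
            continue
        m = MAX_PARTITION_RE.match(line)
        if m and keyspace and table:
            size = int(m.group(1))
            if size > threshold:
                result[f"{keyspace}.{table}"] = size
    return result


# Hypothetical excerpt in the shape of cfstats output.
sample = [
    "Keyspace : example_ks",
    "\t\tTable: wide_table",
    "\t\tCompacted partition maximum bytes: 2874382626",
    "\t\tTable (index): small_index",
    "\t\tCompacted partition maximum bytes: 1024",
]
print(wide_partitions(sample))
```

Tables reported by this check are candidates for a partition key with higher cardinality, which is the kind of data model change discussed in this thread.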





On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:
Hi Jeff,
    I need to restart the node manually every time. Only one node has this 
problem.
    I have attached the nodetool output. Thanks.

Best Regards,

倪项菲/ David Ni

From: Jeff Jirsa <jji...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down.

Cassandra 3.9 has some pretty serious known issues; upgrading to 3.11.3 is 
likely a good idea.

Are the nodes coming up on their own, or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats.
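
Collecting both outputs in one step can be scripted. The following is a minimal Python sketch, assuming nodetool is on the PATH of the node being inspected; the helper names `run_command`, `collect_outputs`, and `format_report` are hypothetical, and the `runner` parameter exists only so the logic can be exercised without a live cluster.

```python
import subprocess
from typing import Callable, Dict, List, Sequence

# The two subcommands requested in this thread.
COMMANDS: List[List[str]] = [["nodetool", "tpstats"], ["nodetool", "cfstats"]]


def run_command(argv: Sequence[str]) -> str:
    """Run one command and return its standard output as text."""
    completed = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return completed.stdout


def collect_outputs(runner: Callable[[Sequence[str]], str] = run_command) -> Dict[str, str]:
    """Return a mapping from each command line (as a string) to its captured output."""
    return {" ".join(argv): runner(argv) for argv in COMMANDS}


def format_report(outputs: Dict[str, str]) -> str:
    """Join the captured outputs into one text block suitable for pasting."""
    sections = [f"$ {command}\n{text.rstrip()}" for command, text in outputs.items()]
    return "\n\n".join(sections) + "\n"
```

On a node, `print(format_report(collect_outputs()))` would print both outputs under a header line naming the command that produced each one.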



--
Jeff Jirsa


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:
Hi Cassandra experts,
  I am facing an issue: a node goes down every day in a 6-node cluster. The 
cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is 
MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, 
and the RF for the business CF is 3. A node goes down once a day, and 
system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 
CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User 
nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 
QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) 
~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) 
~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) 
~[guava-18.0.jar:na]
        at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) 
~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
 [apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
 [apache-cassandra-3.9.jar:3.9]
        at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at 
io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at 
io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_91]
        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
 [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) 
[apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.
        at 
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37)
 ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) 
~[apache-cassandra-3.9.jar:3.9]
        at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
 ~[guava-18.0.jar:na]
        at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) 
~[guava-18.0.jar:na]
        at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
 ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) 
~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation 
timed out - received only 0 responses.
        at 
org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) 
~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) 
~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227)
 ~[apache-cassandra-3.9.jar:3.9]
        at 
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93)
 ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 
CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User 
nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 
QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) 
~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!


Best Regards,

倪项菲/ David Ni


<log.txt>
