RE: Re: Re: A node down every day in a 6 nodes cluster

Kenneth Brotman Wed, 28 Mar 2018 06:00:34 -0700

Properly Sizing Your Heap to Prevent OutOfMemoryErrors

https://support.datastax.com/hc/en-us/articles/204225929-Properly-Sizing-Your-Heap-to-Prevent-OutOfMemoryErrors
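
For comparison with the hand-set values quoted further down the thread (MAX_HEAP_SIZE=8192m, HEAP_NEWSIZE=512m on a 4-core, 16 GB node), here is a minimal sketch of the sizing rule that, as far as I recall, the stock cassandra-env.sh applies when those variables are left unset: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)) for the heap, and min(100 MB per core, 1/4 heap) for the young generation. The function name is made up for illustration, and the 16384 MB input is an assumption (the OS will usually report slightly less).

```python
from typing import Tuple


def default_heap_sizes(system_memory_mb: int, cpu_cores: int) -> Tuple[int, int]:
    """Approximate the auto-sizing rule from cassandra-env.sh (3.x era).

    Returns (max_heap_mb, heap_newsize_mb). This is a sketch of the rule as
    remembered, not a copy of the script; check your own cassandra-env.sh.
    """
    half = min(system_memory_mb // 2, 1024)      # 1/2 RAM, capped at 1024 MB
    quarter = min(system_memory_mb // 4, 8192)   # 1/4 RAM, capped at 8192 MB
    max_heap = max(half, quarter)
    # Young generation: 100 MB per core, but no more than 1/4 of the heap.
    new_size = min(100 * cpu_cores, max_heap // 4)
    return max_heap, new_size


# Assumed input: a node reporting exactly 16384 MB of RAM and 4 cores.
max_heap_mb, new_size_mb = default_heap_sizes(16384, 4)
print(f"MAX_HEAP_SIZE={max_heap_mb}M HEAP_NEWSIZE={new_size_mb}M")
```

Under those assumptions the rule gives a 4096 MB heap, so the 8192m setting in this thread is twice what the script would have picked, leaving correspondingly less RAM for the page cache and off-heap structures.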


 

 

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID] 
Sent: Wednesday, March 28, 2018 5:35 AM
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

If you think that will fix the problem, maybe you could add a little more 
memory to each machine as a short term fix.

 

From: Xiangfei Ni [mailto:xiangfei...@cm-dt.com] 
Sent: Wednesday, March 28, 2018 5:24 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Yes, we discussed it and plan to figure out the data model issue and upgrade to 
version 3.11.3.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID> 
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David, 

 

Did you figure out what to do about the data model problem? It could be that 
your data files finally grew to the point where the data model problem caused 
the Java heap space issue, in which case everything is actually working as 
it's supposed to; you just have to fix the data model.

 

Kenneth Brotman

 

 

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com] 
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

Was any change to hardware done around the time the problem started?

Was any change to the client software done around the time the problem started?

Was any change to the database schema done around the time the problem started?

 

Kenneth Brotman

 

From: Xiangfei Ni [mailto:xiangfei...@cm-dt.com] 
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Hi Kenneth,

    The cluster has been running for 4 months.

    The problem started last week.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID> 
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David,

 

How long has the cluster been operating?

How long has the problem been occurring?

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jji...@gmail.com] 
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

 

java.lang.OutOfMemoryError: Java heap space

 

 

You’re OOMing.

 

-- 

Jeff Jirsa

 


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Jeff,

    Today another node shut down. I have attached the exception log file; 
could you please help analyze it? Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jji...@gmail.com> 
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xiangfei...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

 

Only one node having the problem is suspicious. It may be that your application 
is improperly pooling connections, or you have a hardware problem.

 

I don't see anything in the nodetool output that explains it, though you 
certainly have a data model likely to cause problems over time (the cardinality 
of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide 
partitions, and it'll be difficult to read).
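
One way to spot wide partitions like the one mentioned above is to look at the "Compacted partition maximum bytes" line that nodetool cfstats prints for each table. The following is a rough sketch that scans saved cfstats text and flags tables above a threshold. The function name, the 100 MB threshold (a common rule of thumb, not something stated in this thread), and the sample text are all assumptions for illustration; the sample is made up and is not the output attached to this thread, and the exact line layout may differ between Cassandra versions.

```python
import re

# Line patterns as they appear (to my knowledge) in 3.x cfstats output.
TABLE_RE = re.compile(r"^\s*Table(?: \(index\))?:\s*(\S+)")
MAX_BYTES_RE = re.compile(r"^\s*Compacted partition maximum bytes:\s*(\d+)")


def wide_partition_tables(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return (table, max_partition_bytes) pairs above threshold_bytes."""
    flagged = []
    current_table = None
    for line in cfstats_text.splitlines():
        table_match = TABLE_RE.match(line)
        if table_match:
            current_table = table_match.group(1)
            continue
        size_match = MAX_BYTES_RE.match(line)
        if size_match and current_table is not None:
            size = int(size_match.group(1))
            if size > threshold_bytes:
                flagged.append((current_table, size))
    return flagged


# Made-up sample input; keyspace, table names, and numbers are hypothetical.
SAMPLE = """\
Keyspace : example_ks
\t\tTable: narrow_table
\t\tCompacted partition maximum bytes: 1916
\t\tTable (index): wide_table.idx_example
\t\tCompacted partition maximum bytes: 268650950
"""

for table, size in wide_partition_tables(SAMPLE):
    print(f"{table}: largest partition is {size / 1024 / 1024:.0f} MB")
```

To use it on a real node, redirect the cfstats output to a file and pass the file contents to the function instead of SAMPLE.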
 
 

 

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Jeff,

    I need to restart the node manually every time. Only one node has this 
problem.

    I have attached the nodetool output. Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jji...@gmail.com> 
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

 

That warning isn’t sufficient to understand why the node is going down

 

 

Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is 
likely a good idea

 

Are the nodes coming up on their own? Or are you restarting them?

 

Paste the output of nodetool tpstats and nodetool cfstats

 

 

 

-- 

Jeff Jirsa

 


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: a node goes down every day in a 6-node cluster. The 
cluster is in a single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is 
MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, 
and the RF for the business CF is 3. A node goes down once every day, and 
system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 
CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User 
nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 
QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) 
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) 
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) 
~[guava-18.0.jar:na]

        at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) 
~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
 [apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
 [apache-cassandra-3.9.jar:3.9]

        at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at 
io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at 
io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357)
 [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_91]

        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
 [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) 
[apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.

        at 
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37)
 ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) 
~[apache-cassandra-3.9.jar:3.9]

        at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
 ~[guava-18.0.jar:na]

        at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) 
~[guava-18.0.jar:na]

        at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
 ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) 
~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation 
timed out - received only 0 responses.

        at 
org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) 
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) 
~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227)
 ~[apache-cassandra-3.9.jar:3.9]

        at 
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93)
 ~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 
CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User 
nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 
QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) 
~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read and written normally.

 

Any help would be highly appreciated. Thanks very much!

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

 

<log.txt>
