Properly Sizing Your Heap to Prevent OutOfMemoryErrors https://support.datastax.com/hc/en-us/articles/204225929-Properly-Sizing-Your-Heap-to-Prevent-OutOfMemoryErrors
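For context on the heap figures reported in the original post further down (4-core, 16 GB nodes with MAX_HEAP_SIZE=8192m and HEAP_NEWSIZE=512m), here is a minimal Python sketch of the automatic sizing rule that the stock cassandra-env.sh is generally understood to apply when those two variables are left unset: heap = max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)) and new generation = min(100 MB per core, 1/4 heap). The function name is hypothetical, and treating a 16 GB node as exactly 16384 MB is an assumption. It is a sketch of that rule, not a sizing recommendation.

```python
from typing import Tuple


def default_heap_sizes_mb(system_memory_mb: int, cpu_cores: int) -> Tuple[int, int]:
    """Return (max_heap_mb, heap_newsize_mb) following the rule described above.

    Hypothetical helper: it mirrors the documented formula, it does not read
    any real Cassandra configuration.
    """
    # 1/2 of RAM, capped at 1024 MB
    half_capped = min(system_memory_mb // 2, 1024)
    # 1/4 of RAM, capped at 8192 MB
    quarter_capped = min(system_memory_mb // 4, 8192)
    max_heap_mb = max(half_capped, quarter_capped)

    # Young generation: 100 MB per core, but never more than 1/4 of the heap
    heap_newsize_mb = min(100 * cpu_cores, max_heap_mb // 4)
    return max_heap_mb, heap_newsize_mb


if __name__ == "__main__":
    # Assumption: a "16G" node exposes 16384 MB to the script.
    print(default_heap_sizes_mb(16384, 4))  # (4096, 400)
```

Under this rule a 16384 MB, 4-core node would get a 4096 MB heap and a 400 MB new generation, so the 8192m/512m values in the original post were set explicitly rather than derived by the script.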
From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
Sent: Wednesday, March 28, 2018 5:35 AM
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

If you think that will fix the problem, maybe you could add a little more memory to each machine as a short-term fix.

From: Xiangfei Ni [mailto:xiangfei...@cm-dt.com]
Sent: Wednesday, March 28, 2018 5:24 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

Yes, we discussed it and plan to figure out the data model issue and upgrade to version 3.11.3.

Best Regards,
倪项菲 / David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003, 20F, No.35 Luojia creative city, Luoyu Road, Wuhan, HuBei
Mob: +86 13797007811 | Tel: +86 27 5024 2516

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

Did you figure out what to do about the data model problem? It could be that your data files finally grew to the point that the data model problem caused the Java heap space issue, in which case everything is actually working as it's supposed to; you just have to fix the data model.

Kenneth Brotman

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com]
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

Was any change to hardware done around the time the problem started?
Was any change to the client software done around the time the problem started?
Was any change to the database schema done around the time the problem started?
Kenneth Brotman

From: Xiangfei Ni [mailto:xiangfei...@cm-dt.com]
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

Hi Kenneth,

The cluster has been running for 4 months. The problem started last week.

Best Regards,
倪项菲 / David Ni

From: Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

How long has the cluster been operating? How long has the problem been occurring?

Kenneth Brotman

From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

java.lang.OutOfMemoryError: Java heap space

You're OOMing.

--
Jeff Jirsa

On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Jeff,

Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

Best Regards,
倪项菲 / David Ni

From: Jeff Jirsa <jji...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xiangfei...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious.
It may be that your application is improperly pooling connections, or you have a hardware problem. I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions and it'll be difficult to read).

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Jeff,

I need to restart the node manually every time; only one node has this problem. I have attached the nodetool output. Thanks.

Best Regards,
倪项菲 / David Ni

From: Jeff Jirsa <jji...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn't sufficient to understand why the node is going down.

Cassandra 3.9 has some pretty serious known issues; upgrading to 3.11.3 is likely a good idea.

Are the nodes coming up on their own? Or are you restarting them?
Paste the output of nodetool tpstats and nodetool cfstats.

--
Jeff Jirsa

On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:

Hi Cassandra experts,

I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC. Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once every day, and system.log shows the following:

WARN [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
    at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
    ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
    ... 32 common frames omitted
WARN [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all rights on the nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written as normal.

I would highly appreciate it if anyone can help. Thanks very much!

Best Regards,
倪项菲 / David Ni

<log.txt>
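Jeff's reply in this thread asks for nodetool cfstats output and attributes the long-term risk to very wide partitions. A rough way to scan that output for large partitions is sketched below in Python. It assumes the 3.x text layout, in which each table section starts with a "Table:" or "Table (index):" line and contains a "Compacted partition maximum bytes:" line. The function name, the sample text, the table names, the sizes, and the 100 MB default threshold are all invented for illustration and are not taken from the attached output.

```python
import re
from typing import List, Optional, Tuple

# Assumed line shapes in `nodetool cfstats` text output (Cassandra 3.x).
TABLE_RE = re.compile(r"^\s*Table(?: \(index\))?:\s*(\S+)")
MAX_PARTITION_RE = re.compile(r"^\s*Compacted partition maximum bytes:\s*(\d+)")


def find_wide_partitions(
    cfstats_text: str, threshold_bytes: int = 100 * 1024 * 1024
) -> List[Tuple[str, int]]:
    """Return (table, max_partition_bytes) pairs at or above the threshold,
    largest first. The threshold is an illustrative rule of thumb."""
    current_table: Optional[str] = None
    hits: List[Tuple[str, int]] = []
    for line in cfstats_text.splitlines():
        table_match = TABLE_RE.match(line)
        if table_match:
            # Remember which table the following statistics belong to.
            current_table = table_match.group(1)
            continue
        size_match = MAX_PARTITION_RE.match(line)
        if size_match and current_table is not None:
            size = int(size_match.group(1))
            if size >= threshold_bytes:
                hits.append((current_table, size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up sample in the assumed layout; not real output from this cluster.
SAMPLE = """\
Keyspace : demo_ks
\t\tTable: demo_table
\t\tCompacted partition maximum bytes: 1916
\t\tTable (index): demo_table.demo_idx
\t\tCompacted partition maximum bytes: 1629722
"""

if __name__ == "__main__":
    for table, size in find_wide_partitions(SAMPLE, threshold_bytes=1_000_000):
        print(table, size)
```

In practice the text would come from something like `nodetool cfstats > cfstats.txt` on the affected node. The per-table maximum only bounds the largest compacted partition, so it is a first filter rather than a diagnosis.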