Okay, sure. Thanks a lot for all the information, it really helped. :)

On Tue, 27 Jul 2021 at 21:05, Bowen Song <bo...@bso.ng> wrote:
Based on the information I have, I'd say that you don't have any specific
issue with the authentication-related tables, but you do have a general
overloading problem during peak load. I think it's fairly likely that your
7-node cluster (6 nodes in one DC) is not able to keep up with the peak
load, and you will need to either scale up for the peak load or tune the
application to avoid the bursty behaviour.

On 27/07/2021 16:27, Chahat Bhatia wrote:

Yes, the application is quite read heavy and the request pattern is bursty
too, hence that many request failures in such a short time. Also, nothing
out of the ordinary in cfstats and proxyhistograms.

But there are Native-Transport-Requests dropped messages (almost identical
stats on all the nodes):

    /opt/evive/pkgs/apache-cassandra-3.0.13/bin/nodetool tpstats

    Pool Name                  Active  Pending  Completed  Blocked  All time blocked
    ...
    Native-Transport-Requests       6        0   27777476        0              8262

    Message type  Dropped
    READ            24260

We have the below values for the following params:

    concurrent_reads: 96
    native_transport_max_threads: 320
    JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"

Do you think it would be advisable to tune the numbers in the above params
to put less load on the nodes?

On Tue, 27 Jul 2021 at 20:13, Bowen Song <bo...@bso.ng> wrote:

Wow, a 15-second timeout? That's pretty long... You may want to check
nodetool tpstats and make sure the NTR (Native-Transport-Requests) thread
pool isn't blocking things.

16k read requests dropped in 5 seconds, or over 3k requests per second on
a single node, is a bit suspicious. Do your read requests tend to be
bursty?

On 27/07/2021 15:32, Chahat Bhatia wrote:

Yes, RF=6 for system_auth. Sorry, my bad.

No, we are not using the "cassandra" user for the application. We have a
custom superuser for our operational and administrative tasks and a
separate role with the needed permissions for the application:

    role           | super | login | options
    ---------------+-------+-------+---------
    EviveCassandra |  True |  True |      {}
    ccusr          | False |  True |      {}

- gc_warn_threshold_in_ms: 1000
- read_request_timeout_in_ms: 15000  (been the same since the start)
- cross_node_timeout: false  (but we have ntp on each server and the clocks are synced)
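As a sanity check on those values, this is roughly how we confirm them on a
node; the conf path is assumed from the standard tarball layout, so treat
this as a sketch rather than our exact commands:

    # Sketch: confirm the effective values on a node (conf path assumed).
    grep -E '^(gc_warn_threshold_in_ms|read_request_timeout_in_ms|cross_node_timeout):' \
        /opt/evive/pkgs/apache-cassandra-3.0.13/conf/cassandra.yaml

    # And a quick look at pool saturation / dropped messages while the issue is happening:
    /opt/evive/pkgs/apache-cassandra-3.0.13/bin/nodetool tpstats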
On checking the logs a little more, I found these:

    INFO [ScheduledTasks:1] 2021-07-26 07:21:33,279 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 289 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-26 07:22:33,894 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 273 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-27 06:12:03,988 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 160 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-27 06:13:09,125 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 223 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-27 06:13:44,212 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 72 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 16036 for internal timeout and 0 for cross node timeout
    INFO [ScheduledTasks:1] 2021-07-27 06:15:59,744 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 287 for internal timeout and 0 for cross node timeout

Also, checking cfstats and proxyhistograms is in progress; I will update in
case anything looks suspicious.

On Tue, 27 Jul 2021 at 18:09, Bowen Song <bo...@bso.ng> wrote:

Hello Chahat,

You haven't replied to the first point: are you using the "cassandra" user?

The schema and your description don't quite match. When you said:

    the system_auth for 2 DCs: us-east with 6 nodes (and RF=3) and ...

I assume you meant to say 6 nodes and RF=6? That sounds totally fine.

As for the GC logs, do you see any WARN level logs? You may also want to
double check that the choice of the gc_warn_threshold_in_ms value is
reasonable. I'm a bit surprised to see the GC logs full of ~1.6ms GC
pauses. AFAIK, by default Cassandra only logs GC pauses longer than 200ms,
so I have to assume that's the result of your custom script. You may want
to double check that the script actually does what it says on the tin,
just in case it misses something.

Then, the other relevant questions:

- Do you use cross node timeout? If so, are the clocks in sync between all
  servers?
- What are the timeout values in the cassandra.yaml file
  (read_request_timeout_in_ms, etc.)? Are they reasonably long enough for
  the corresponding request type to complete?
- Since you've only got 7 nodes, I'd also recommend you check the nodetool
  cfstats and nodetool cfhistograms output for the tables in the
  system_auth keyspace on each node, and see if you can spot anything out
  of the ordinary.

Regards,

Bowen

On 27/07/2021 09:23, Chahat Bhatia wrote:

Thanks for the prompt response.
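To answer the replication question first: here is the
system_schema.keyspaces entry. For reference, it was pulled with roughly
the query below (the exact cqlsh invocation is written from memory, and the
host and credentials are placeholders, so treat it as a sketch):

    # Sketch: dump the replication settings for the two keyspaces.
    cqlsh <node-ip> -u EviveCassandra -p '********' -e \
      "SELECT keyspace_name, durable_writes, replication
       FROM system_schema.keyspaces
       WHERE keyspace_name IN ('system_auth', 'census');"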
The relevant rows:

    system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', 'us-east-backup': '1'}
    census      | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3', 'us-east-backup': '1'}

So, system_auth spans 2 DCs: us-east with 6 nodes (and RF=3) and
us-east-backup with a single node and 100% of the data. Keyspace census
shows RF=3 for the main DC and RF=1 for the backup site.

And for running the application, we have a user created specifically for
that application, and that user has the appropriate permissions in
Cassandra to select, modify and delete from the concerned tables. It uses
LOCAL_QUORUM for querying the data, and the local DC is set to 'us-east'.

Also, there is no excessive GC on any of the nodes. We run a custom script
to track the GC stats (from the Cassandra log itself) and output them.
Below is the output from the currently running script, and it's similar on
all the servers:

    2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time 0.001656 seconds.
    2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time 0.001669 seconds.
    2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time 0.001601 seconds.
    2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time 0.001713 seconds.
    2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time 0.001586 seconds.
    2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time 0.001671 seconds.
    2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time 0.001691 seconds.
    2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time 0.001860 seconds.
    2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time 0.001630 seconds.
    2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time 0.001515 seconds.
    2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time 0.001533 seconds.
    2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time 0.001630 seconds.
    2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time 0.001577 seconds.
    2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time 0.001538 seconds.
    2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time 0.001615 seconds.
    2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time 0.001584 seconds.
    2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time 0.001558 seconds.
    2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time 0.001696 seconds.
    2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time 0.001658 seconds.
    2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time 0.001592 seconds.
    2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time 0.001618 seconds.
    2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time 0.001624 seconds.
    2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time 0.297773 seconds.
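(For context, the script is essentially a thin wrapper around the JVM's
application-stopped-time logging: it extracts each stop time from the GC
log and reports it. Conceptually it is something like the sketch below,
with an assumed GC log path; this is an approximation of the idea, not the
exact script we run.)

    # Sketch: with -XX:+PrintGCApplicationStoppedTime enabled, the GC log contains
    # lines like "Total time for which application threads were stopped: 0.0016560 seconds, ..."
    # Extract the stop time and flag anything above a threshold.
    GC_LOG=/var/log/cassandra/gc.log   # assumed path
    THRESHOLD=0.2                      # seconds

    awk -v t="$THRESHOLD" '
        /Total time for which application threads were stopped/ {
            for (i = 1; i < NF; i++)
                if ($(i + 1) ~ /^seconds/) { stop = $i; break }
            printf "Application Thread stop time %s seconds.%s\n",
                   stop, (stop + 0 > t ? "  <-- long pause" : "")
        }' "$GC_LOG"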
On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates <infra-upda...@goevive.com> wrote:

Hello Chahat,

First, can you please make sure the Cassandra user used by the application
is not "cassandra"? Because the "cassandra" user uses QUORUM consistency
level to read the auth tables.

Then, can you please make sure the replication strategy is set correctly
for the system_auth keyspace? I.e. ensure the old DC is not present, and
the new DC has a sufficient number of replicas for fault tolerance.

Finally, can you please check the GC logs, and make sure there aren't JVM
GC issues, especially long STW pauses?

Regards,

Bowen

On 27/07/2021 08:34, Chahat Bhatia wrote:

Hi Community,

Context: We are running a cluster of 6 nodes in production with RF=3 in
AWS. We recently moved from physical servers to the cloud by adding a new
DC and then removing the old one. Everything is working fine in all the
other applications except this one.

We recently started experiencing read timeouts in one of our production
applications, where the client threw:

    Error An unexpected error occurred server side on ip-IP.ec2.internal:
    com.google.common.util.concurrent.UncheckedExecutionException:
    com.google.common.util.concurrent.UncheckedExecutionException:
    java.lang.RuntimeException:
    org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

    com.datastax.driver.core.exceptions.ServerError: An unexpected error occurred server side:
    com.google.common.util.concurrent.UncheckedExecutionException:
    com.google.common.util.concurrent.UncheckedExecutionException:
    java.lang.RuntimeException:
    org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
        at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
    ............ cntd

    com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)

And around the same time, these were the errors on the server side (from
the Cassandra logs):

    ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094 CassandraDaemon.java:207 - Exception in thread Thread[RolesCacheRefresh:1,5,main]
    java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
    Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]

    ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804 CassandraDaemon.java:207 - Exception in thread Thread[PermissionsCacheRefresh:1,5,main]
    java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
    Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]

These are the values of these params in my configuration file:

    permissions_validity_in_ms: 300000
    permissions_update_interval_in_ms: 20000
    roles_validity_in_ms: 300000
    roles_update_interval_in_ms: 20000
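(For what it's worth, my understanding of these settings, which may be
wrong, is: cached role/permission entries are considered valid for up to 5
minutes, and once an entry is older than 20 seconds the next access
triggers an asynchronous refresh on the RolesCacheRefresh and
PermissionsCacheRefresh threads, which is where the timeouts above are
showing up.) A quick way to compare the values across nodes, as a sketch
with illustrative host names and an assumed conf path:

    # Sketch: dump the auth-cache settings from cassandra.yaml on every node.
    for host in node1 node2 node3 node4 node5 node6 node7; do
        echo "== $host =="
        ssh "$host" "grep -E '^(permissions|roles)_(validity|update_interval)_in_ms:' /opt/evive/pkgs/apache-cassandra-3.0.13/conf/cassandra.yaml"
    done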
This was not the case earlier, and since this comes from a single app
alone, we are not sure if this is actually the issue. Can anyone please
point out whether these values are misconfigured and hence causing the
issue, or whether it is somewhere else we should be looking?

Any help would be appreciated.

Thanks & Regards,
Chahat.

--
Thanks & Regards,
Chahat Bhatia
Systems Engineer
Evive
+91 7087629779