Thanks for the prompt response. *Here is the system_schema.keyspaces entry:*
system_auth | True | {'class': > 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', > 'us-east-backup': '1'} > census | True | {'class': > 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3', > 'us-east-backup': '1'} So, the system_auth for 2 DCs : *us-east with 6 nodes (and RF=3) and us-east-backup with a single node and 100% of the data.* *Keyspace census* showing RF=3 for the main DC and RF=1 for the backup site. And for running the application, we have a user specifically created for > that application itself and that user also has its appropriate permissions > in cassandra to select, modify and delete from concerned tables. And it > uses LOCAL_QUORUM for querying the data and the local-dc is set to > 'us-east'. Also, there is no excessive GC for any of the nodes, we run a custom script to trackthe GC stats (from the cassandra log itself) and output it. Below is the output from the current running script and its similar for all the servers: 2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time > 0.001656 seconds. > 2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time > 0.001669 seconds. > 2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time > 0.001601 seconds. > 2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time > 0.001713 seconds. > 2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time > 0.001586 seconds. > 2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time > 0.001671 seconds. > 2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time > 0.001691 seconds. > 2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time > 0.001860 seconds. > 2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time > 0.001630 seconds. > 2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time > 0.001515 seconds. > 2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time > 0.001533 seconds. > 2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time > 0.001630 seconds. > 2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time > 0.001577 seconds. > 2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time > 0.001538 seconds. > 2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time > 0.001615 seconds. > 2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time > 0.001584 seconds. > 2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time > 0.001558 seconds. > 2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time > 0.001696 seconds. > 2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time > 0.001658 seconds. > 2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time > 0.001592 seconds. > 2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time > 0.001618 seconds. > 2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time > 0.001624 seconds. > 2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time > 0.297773 seconds. > On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates < infra-upda...@goevive.com> wrote: > Hello Chahat, > > > First, can you please make sure the Cassandra user used by the application > is not "cassandra"? Because the "cassandra" user uses QUORUM consistency > level to read the auth tables. > > Then, can you please make sure the replication strategy is set correctly > for the system_auth namespace? I.e.: ensure the old DC is not present, and > the new DC has sufficient number of replicas for fault tolerance. > > Finally, can you please check the GC logs, and make sure there isn't JVM > GC issues, espicially long STW pauses? > > > Regards, > > Bowen > > > On 27/07/2021 08:34, Chahat Bhatia wrote: > > Hi Community, > > Context: We are running a cluster of 6 nodes in production with a RF=3 in > AWS. > We recently moved from physical servers to cloud by adding a new DC and > then removing the old one. Everything is working fine in all the other > applications except this one. > > > *As we recently started experiencing read timeouts in one of our > production applications where the client threw * > > Error An unexpected error occurred server side on ip-IP.ec2.internal: >> com.google.common.util.concurrent.*UncheckedExecutionException*: >> *com.google.common.util.concurrent.UncheckedExecutionException:** >> java.lang.RuntimeException: >> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >> received only 0 responses.* >> com.datastax.driver.core.exceptions.ServerError: An unexpected error >> occurred server side : >> com.google.common.util.concurrent.UncheckedExecutionException: >> com.google.common.util.concurrent.UncheckedExecutionException: >> java.lang.RuntimeException: >> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out >> - received only 0 responses. > > at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java: >> 63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at >> com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25) >> ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at >> com.datastax.driver.core.DriverThrowables.propagateCause( >> DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] >> at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly( >> DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3 >> .3.0-shaded.jar!/:?] at com.datastax.driver.core.AbstractSession.execute( >> AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:? ] > > ............ cntd > > > > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra >> timeout during read query at consistency LOCAL_QUORUM (2 responses were >> required but only 1 replica responded) > > > > > > *And around the same time these were the errors on the server side (from > cassandra logs): * > > >> >> *ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094 >> CassandraDaemon.java:207 - Exception in thread >> Thread[RolesCacheRefresh:1,5,main] java.lang.RuntimeException: >> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >> received only 0 responses. * at >> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> ~[na:1.8.0_131] >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> ~[na:1.8.0_131] >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> [na:1.8.0_131] >> at >> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) >> [apache-cassandra-3.0.13.jar:3.0.13] >> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131] >> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: >> Operation timed out - received only 0 responses. >> at >> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> >> >> >> *ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804 >> CassandraDaemon.java:207 - Exception in thread >> Thread[PermissionsCacheRefresh:1,5,main] java.lang.RuntimeException: >> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >> received only 0 responses. * at >> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> ~[na:1.8.0_131] >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> ~[na:1.8.0_131] >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> [na:1.8.0_131] >> at >> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) >> [apache-cassandra-3.0.13.jar:3.0.13] >> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131] >> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: >> Operation timed out - received only 0 responses. >> at >> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> at >> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) >> ~[apache-cassandra-3.0.13.jar:3.0.13] >> > > > > *These are the values of these params in my configuration file * > > permissions_validity_in_ms: 300000 >> permissions_update_interval_in_ms: 20000 >> roles_validity_in_ms: 300000 >> roles_update_interval_in_ms: 20000 >> > > This was not the case earlier and since this comes from a single app alone > we are not sure if this is actually the issue. Can anyone please point out > if these values are misconfigured and hence causing the issue or is it > somewhere else we should be looking at? > > Any help would be appreciated. > > Thanks & Regards, > Chahat. > > -- Thanks & Regards, Chahat Bhatia Systems Engineer *Evive* +91 7087629779