Yes, RF=6 for system auth. Sorry my bad.
No, we are not using cassandra user for the application. We have a custom super user for our operational and administrative tasks and a separate role with needed perms for the application. > role | super | login | options > ------------------------+-------+-------+--------- > EviveCassandra | True | True | {} ccusr | False | True | {} - gc_warn_threshold_in_ms: 1000 - read_request_timeout_in_ms: 15000 // been the same since the start - cross_node_timeout: false // but we have ntp on each server and the clocks are synced. - On checking the logs a little more I found out these: INFO [ScheduledTasks:1] 2021-07-26 07:21:33,279 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 289 for internal timeout and > 0 for cross node timeout INFO [ScheduledTasks:1] 2021-07-26 07:22:33,894 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 273 for internal timeout and > 0 for cross node timeout INFO [ScheduledTasks:1] 2021-07-27 06:12:03,988 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 160 for internal timeout and > 0 for cross node timeout INFO [ScheduledTasks:1] 2021-07-27 06:13:09,125 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 223 for internal timeout and > 0 for cross node timeout INFO [ScheduledTasks:1] 2021-07-27 06:13:44,212 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 72 for internal timeout and 0 > for cross node timeout INFO [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 16036 for internal timeout > and 0 for cross node timeout INFO [ScheduledTasks:1] 2021-07-27 06:15:59,744 MessagingService.java:1013 > - READ messages were dropped in last 5000 ms: 287 for internal timeout and > 0 for cross node timeout - Also, I am checking cfstats and proxyhistorgrams is in progress, will update incase anythings suspicious. On Tue, 27 Jul 2021 at 18:09, Bowen Song <bo...@bso.ng> wrote: > Hello Chahat, > > > You haven't replied to the first point, are you using the "cassandra" user? > > > The schema and your description don't quite match. When you said: > > *the system_auth for 2 DCs : **us-east* > * with 6 nodes (and RF=3) and ... * > > I assume you meant to say 6 nodes and RF=6? That sounds totally fine. > > > As of the GC logs, do you see any WARN level logs? You may also want to > double check the choice of the gc_warn_threshold_in_ms value is > reasonable. I'm a bit surprised to see the GC logs full of ~1.6ms GC > pauses. AFAIK, by default Cassandra only logs GC pauses longer than 200ms. > I have to assume that's the result of your custom script. You may want to > double check the script actually does what it says on the tin, just in case > it misses something. > > > Then, the other relevant questions. > > Do you use cross node timeout? If so, are the clocks in sync between all > servers? > > What is the timeout values in the cassandra.yaml file? ( > read_request_timeout_in_ms, etc.) Are they reasonably long enough for the > corresponding request type to complete? > > Since you've only got 7 nodes, I'd also recommend you to check the nodetool > cfstats & nodetool cfhistograms output for the tables in the system_auth > keyspace on each node, and see if you can spot anything out of ordinary. > > > Regards, > > Bowen > On 27/07/2021 09:23, Chahat Bhatia wrote: > > Thanks for the prompt response. > > * Here is the system_schema.keyspaces entry:* > > system_auth | True | {'class': >> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', >> 'us-east-backup': '1'} >> census | True | {'class': >> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3', >> 'us-east-backup': '1'} > > > So, the system_auth for 2 DCs : *us-east with 6 nodes (and RF=3) and > us-east-backup with a single node and 100% of the data.* > *Keyspace census* showing RF=3 for the main DC and RF=1 for the backup > site. > > > And for running the application, we have a user specifically created for >> that application itself and that user also has its appropriate permissions >> in cassandra to select, modify and delete from concerned tables. And it >> uses LOCAL_QUORUM for querying the data and the local-dc is set to >> 'us-east'. > > > > Also, there is no excessive GC for any of the nodes, we run a custom > script to trackthe GC stats (from the cassandra log itself) and output it. > Below is the output from the current running script and its similar for all > the servers: > > 2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time >> 0.001656 seconds. >> 2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time >> 0.001669 seconds. >> 2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time >> 0.001601 seconds. >> 2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time >> 0.001713 seconds. >> 2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time >> 0.001586 seconds. >> 2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time >> 0.001671 seconds. >> 2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time >> 0.001691 seconds. >> 2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time >> 0.001860 seconds. >> 2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time >> 0.001630 seconds. >> 2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time >> 0.001515 seconds. >> 2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time >> 0.001533 seconds. >> 2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time >> 0.001630 seconds. >> 2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time >> 0.001577 seconds. >> 2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time >> 0.001538 seconds. >> 2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time >> 0.001615 seconds. >> 2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time >> 0.001584 seconds. >> 2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time >> 0.001558 seconds. >> 2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time >> 0.001696 seconds. >> 2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time >> 0.001658 seconds. >> 2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time >> 0.001592 seconds. >> 2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time >> 0.001618 seconds. >> 2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time >> 0.001624 seconds. >> 2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time >> 0.297773 seconds. >> > > > > > > On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates < > infra-upda...@goevive.com> wrote: > >> Hello Chahat, >> >> >> First, can you please make sure the Cassandra user used by the >> application is not "cassandra"? Because the "cassandra" user uses QUORUM >> consistency level to read the auth tables. >> >> Then, can you please make sure the replication strategy is set correctly >> for the system_auth namespace? I.e.: ensure the old DC is not present, and >> the new DC has sufficient number of replicas for fault tolerance. >> >> Finally, can you please check the GC logs, and make sure there isn't JVM >> GC issues, espicially long STW pauses? >> >> >> Regards, >> >> Bowen >> >> >> On 27/07/2021 08:34, Chahat Bhatia wrote: >> >> Hi Community, >> >> Context: We are running a cluster of 6 nodes in production with a RF=3 in >> AWS. >> We recently moved from physical servers to cloud by adding a new DC and >> then removing the old one. Everything is working fine in all the other >> applications except this one. >> >> >> *As we recently started experiencing read timeouts in one of our >> production applications where the client threw * >> >> Error An unexpected error occurred server side on ip-IP.ec2.internal: >>> com.google.common.util.concurrent.*UncheckedExecutionException*: >>> *com.google.common.util.concurrent.UncheckedExecutionException:** >>> java.lang.RuntimeException: >>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >>> received only 0 responses.* >>> com.datastax.driver.core.exceptions.ServerError: An unexpected error >>> occurred server side : >>> com.google.common.util.concurrent.UncheckedExecutionException: >>> com.google.common.util.concurrent.UncheckedExecutionException: >>> java.lang.RuntimeException: >>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed >>> out - received only 0 responses. >> >> at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java: >>> 63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at >>> com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25 >>> ) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at >>> com.datastax.driver.core.DriverThrowables.propagateCause( >>> DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] >>> at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly( >>> DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3 >>> .3.0-shaded.jar!/:?] at com.datastax.driver.core.AbstractSession.execute >>> (AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:? ] >>> >> >> ............ cntd >> >> >> >> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra >>> timeout during read query at consistency LOCAL_QUORUM (2 responses were >>> required but only 1 replica responded) >> >> >> >> >> >> *And around the same time these were the errors on the server side (from >> cassandra logs): * >> >> >>> >>> *ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094 >>> CassandraDaemon.java:207 - Exception in thread >>> Thread[RolesCacheRefresh:1,5,main] java.lang.RuntimeException: >>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >>> received only 0 responses. * at >>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >>> ~[na:1.8.0_131] >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >>> ~[na:1.8.0_131] >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >>> [na:1.8.0_131] >>> at >>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) >>> [apache-cassandra-3.0.13.jar:3.0.13] >>> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131] >>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: >>> Operation timed out - received only 0 responses. >>> at >>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> >>> >>> >>> *ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804 >>> CassandraDaemon.java:207 - Exception in thread >>> Thread[PermissionsCacheRefresh:1,5,main] java.lang.RuntimeException: >>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - >>> received only 0 responses. * at >>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >>> ~[na:1.8.0_131] >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >>> ~[na:1.8.0_131] >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >>> [na:1.8.0_131] >>> at >>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) >>> [apache-cassandra-3.0.13.jar:3.0.13] >>> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131] >>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: >>> Operation timed out - received only 0 responses. >>> at >>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> at >>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) >>> ~[apache-cassandra-3.0.13.jar:3.0.13] >>> >> >> >> >> *These are the values of these params in my configuration file * >> >> permissions_validity_in_ms: 300000 >>> permissions_update_interval_in_ms: 20000 >>> roles_validity_in_ms: 300000 >>> roles_update_interval_in_ms: 20000 >>> >> >> This was not the case earlier and since this comes from a single app >> alone we are not sure if this is actually the issue. Can anyone please >> point out if these values are misconfigured and hence causing the issue or >> is it somewhere else we should be looking at? >> >> Any help would be appreciated. >> >> Thanks & Regards, >> Chahat. >> >> > > -- > Thanks & Regards, > Chahat Bhatia > Systems Engineer > *Evive* > +91 7087629779 > > -- Thanks & Regards, Chahat Bhatia Systems Engineer *Evive* +91 7087629779