Re: Permission/Role Cache causing timeouts in apps.

Chahat Bhatia Tue, 27 Jul 2021 07:33:05 -0700

Yes, RF=6 for system auth. Sorry my bad.


No, we are not using cassandra user for the application. We have a custom
super user for our operational and administrative tasks and a separate role
with needed perms for the application.

>  role                      | super | login | options
> ------------------------+-------+-------+---------
>  EviveCassandra |  True |  True |        {}

                  ccusr | False |  True |        {}




   -  gc_warn_threshold_in_ms: 1000
   - read_request_timeout_in_ms: 15000  // been the same since the start
   - cross_node_timeout: false           // but we have ntp on each server
   and the clocks are synced.

   - On checking the logs a little more I found out these:

INFO  [ScheduledTasks:1] 2021-07-26 07:21:33,279 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 289 for internal timeout and
> 0 for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-26 07:22:33,894 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 273 for internal timeout and
> 0 for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-27 06:12:03,988 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 160 for internal timeout and
> 0 for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-27 06:13:09,125 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 223 for internal timeout and
> 0 for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-27 06:13:44,212 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 72 for internal timeout and 0
> for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 16036 for internal timeout
> and 0 for cross node timeout

INFO  [ScheduledTasks:1] 2021-07-27 06:15:59,744 MessagingService.java:1013
> - READ messages were dropped in last 5000 ms: 287 for internal timeout and
> 0 for cross node timeout



   - Also, I am checking cfstats and proxyhistorgrams is in progress, will
   update incase anythings suspicious.


On Tue, 27 Jul 2021 at 18:09, Bowen Song <bo...@bso.ng> wrote:

> Hello Chahat,
>
>
> You haven't replied to the first point, are you using the "cassandra" user?
>
>
> The schema and your description don't quite match. When you said:
>
> *the system_auth  for 2 DCs :   **us-east*
> * with 6 nodes (and RF=3) and ... *
>
> I assume you meant to say 6 nodes and RF=6? That sounds totally fine.
>
>
> As of the GC logs, do you see any WARN level logs? You may also want to
> double check the choice of the gc_warn_threshold_in_ms value is
> reasonable. I'm a bit surprised to see the GC logs full of ~1.6ms GC
> pauses. AFAIK, by default Cassandra only logs GC pauses longer than 200ms.
> I have to assume that's the result of your custom script. You may want to
> double check the script actually does what it says on the tin, just in case
> it misses something.
>
>
> Then, the other relevant questions.
>
> Do you use cross node timeout? If so, are the clocks in sync between all
> servers?
>
> What is the timeout values in the cassandra.yaml file? (
> read_request_timeout_in_ms, etc.) Are they reasonably long enough for the
> corresponding request type to complete?
>
> Since you've only got 7 nodes, I'd also recommend you to check the nodetool
> cfstats & nodetool cfhistograms output for the tables in the system_auth
> keyspace on each node, and see if you can spot anything out of ordinary.
>
>
> Regards,
>
> Bowen
> On 27/07/2021 09:23, Chahat Bhatia wrote:
>
> Thanks for the prompt response.
>
> * Here is the system_schema.keyspaces entry:*
>
>                 system_auth |           True | {'class':
>> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6',
>> 'us-east-backup': '1'}
>>                      census |           True | {'class':
>> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3',
>> 'us-east-backup': '1'}
>
>
> So, the system_auth  for 2 DCs :   *us-east with 6 nodes (and RF=3) and
> us-east-backup with a single node and 100% of the data.*
> *Keyspace census* showing RF=3 for the main DC and RF=1 for the backup
> site.
>
>
> And for running the application, we have a user specifically created for
>> that application itself and that user also has its appropriate permissions
>> in cassandra to select, modify and delete from concerned tables. And it
>> uses LOCAL_QUORUM for querying the data and the local-dc is set to
>> 'us-east'.
>
>
>
> Also, there is no excessive GC for any of the nodes, we run a custom
> script to trackthe GC stats (from the cassandra log itself) and output it.
> Below is the output from the current running script and its similar for all
> the servers:
>
> 2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time
>> 0.001656 seconds.
>> 2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time
>> 0.001669 seconds.
>> 2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time
>> 0.001601 seconds.
>> 2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time
>> 0.001713 seconds.
>> 2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time
>> 0.001586 seconds.
>> 2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time
>> 0.001671 seconds.
>> 2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time
>> 0.001691 seconds.
>> 2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time
>> 0.001860 seconds.
>> 2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time
>> 0.001630 seconds.
>> 2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time
>> 0.001515 seconds.
>> 2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time
>> 0.001533 seconds.
>> 2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time
>> 0.001630 seconds.
>> 2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time
>> 0.001577 seconds.
>> 2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time
>> 0.001538 seconds.
>> 2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time
>> 0.001615 seconds.
>> 2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time
>> 0.001584 seconds.
>> 2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time
>> 0.001558 seconds.
>> 2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time
>> 0.001696 seconds.
>> 2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time
>> 0.001658 seconds.
>> 2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time
>> 0.001592 seconds.
>> 2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time
>> 0.001618 seconds.
>> 2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time
>> 0.001624 seconds.
>> 2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time
>> 0.297773 seconds.
>>
>
>
>
>
>
> On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates <
> infra-upda...@goevive.com> wrote:
>
>> Hello Chahat,
>>
>>
>> First, can you please make sure the Cassandra user used by the
>> application is not "cassandra"? Because the "cassandra" user uses QUORUM
>> consistency level to read the auth tables.
>>
>> Then, can you please make sure the replication strategy is set correctly
>> for the system_auth namespace? I.e.: ensure the old DC is not present, and
>> the new DC has sufficient number of replicas for fault tolerance.
>>
>> Finally, can you please check the GC logs, and make sure there isn't JVM
>> GC issues, espicially long STW pauses?
>>
>>
>> Regards,
>>
>> Bowen
>>
>>
>> On 27/07/2021 08:34, Chahat Bhatia wrote:
>>
>> Hi Community,
>>
>> Context: We are running a cluster of 6 nodes in production with a RF=3 in
>> AWS.
>> We recently moved from physical servers to cloud by adding a new DC and
>> then removing the old one. Everything is working fine in all the other
>> applications except this one.
>>
>>
>> *As we recently started experiencing read timeouts in one of our
>> production applications where the client threw  *
>>
>> Error An unexpected error occurred server side on ip-IP.ec2.internal:
>>> com.google.common.util.concurrent.*UncheckedExecutionException*:
>>> *com.google.common.util.concurrent.UncheckedExecutionException:**
>>> java.lang.RuntimeException:
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>> received only 0 responses.*
>>> com.datastax.driver.core.exceptions.ServerError: An unexpected error
>>> occurred server side  :
>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>> java.lang.RuntimeException:
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>>> out - received only 0 responses.
>>
>> at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:
>>> 63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at
>>> com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25
>>> ) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at
>>> com.datastax.driver.core.DriverThrowables.propagateCause(
>>> DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
>>> at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(
>>> DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3
>>> .3.0-shaded.jar!/:?] at com.datastax.driver.core.AbstractSession.execute
>>> (AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:? ]
>>>
>>
>>  ............ cntd
>>
>>
>>
>> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
>>> timeout during read query at consistency LOCAL_QUORUM (2 responses were
>>> required but only 1 replica responded)
>>
>>
>>
>>
>>
>> *And around the same time these were the errors on the server side (from
>> cassandra logs): *
>>
>>
>>>
>>> *ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094
>>> CassandraDaemon.java:207 - Exception in thread
>>> Thread[RolesCacheRefresh:1,5,main] java.lang.RuntimeException:
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>> received only 0 responses. *        at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> ~[na:1.8.0_131]
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> ~[na:1.8.0_131]
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> [na:1.8.0_131]
>>>         at
>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>>> [apache-cassandra-3.0.13.jar:3.0.13]
>>>         at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
>>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
>>> Operation timed out - received only 0 responses.
>>>         at
>>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>
>>>
>>>
>>> *ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804
>>> CassandraDaemon.java:207 - Exception in thread
>>> Thread[PermissionsCacheRefresh:1,5,main] java.lang.RuntimeException:
>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>> received only 0 responses. *        at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> ~[na:1.8.0_131]
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> ~[na:1.8.0_131]
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> [na:1.8.0_131]
>>>         at
>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>>> [apache-cassandra-3.0.13.jar:3.0.13]
>>>         at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
>>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
>>> Operation timed out - received only 0 responses.
>>>         at
>>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>         at
>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502)
>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>
>>
>>
>>
>> *These are the values of these params in my configuration file  *
>>
>> permissions_validity_in_ms: 300000
>>> permissions_update_interval_in_ms: 20000
>>> roles_validity_in_ms: 300000
>>> roles_update_interval_in_ms: 20000
>>>
>>
>> This was not the case earlier and since this comes from a single app
>> alone we are not sure if this is actually the issue. Can anyone please
>> point out if these values are misconfigured and hence causing the issue or
>> is it somewhere else we should be looking at?
>>
>> Any help would be appreciated.
>>
>> Thanks & Regards,
>> Chahat.
>>
>>
>
> --
> Thanks & Regards,
> Chahat Bhatia
> Systems Engineer
> *Evive*
> +91 7087629779
>
>

-- 
Thanks & Regards,
Chahat Bhatia
Systems Engineer
*Evive*
+91 7087629779

Re: Permission/Role Cache causing timeouts in apps.

Reply via email to