Re: Permission/Role Cache causing timeouts in apps.

Chahat Bhatia Tue, 27 Jul 2021 08:28:02 -0700

Yes, the application in quite read heavy and the request pattern is bursty
too. Hence that big a request failure in such less time.
Also, nothing out of the ordinary in cfstats and proxyhistograms.


But there are Native-Transport-Requests dropped messages (Almost similar
stats on all the nodes) :

 /opt/evive/pkgs/apache-cassandra-3.0.13/bin/nodetool tpstats
>
> Pool Name                              Active   Pending      Completed
> Blocked    All time blocked
> ...
> Native-Transport-Requests         6         0                27777476
>     0              8262
>
> Message type           Dropped
> READ                     24260
>


*We have the below values for the following params:*
concurrent_reads: 96
native_transport_max_threads: 320
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"

Do you think it would be advisable to tune the number in the above params
to have lesser load on the node?


On Tue, 27 Jul 2021 at 20:13, Bowen Song <[email protected]> wrote:

> Wow, 15 seconds timeout? That's pretty long... You may want to check the 
> nodetool
> tpstats and make sure the NTP thread pool isn't blocking things.
>
> 16k read requests dropped in 5 seconds, or over 3k requests per second on
> a single node, is a bit suspicious. Does your read requests tend to be
> bursty?
> On 27/07/2021 15:32, Chahat Bhatia wrote:
>
> Yes, RF=6 for system auth. Sorry my bad.
>
>
> No, we are not using cassandra user for the application. We have a custom
> super user for our operational and administrative tasks and a separate role
> with needed perms for the application.
>
>>  role                      | super | login | options
>> ------------------------+-------+-------+---------
>>  EviveCassandra |  True |  True |        {}
>
>                   ccusr | False |  True |        {}
>
>
>
>
>    -  gc_warn_threshold_in_ms: 1000
>    - read_request_timeout_in_ms: 15000  // been the same since the start
>    - cross_node_timeout: false           // but we have ntp on each
>    server and the clocks are synced.
>
>    - On checking the logs a little more I found out these:
>
> INFO  [ScheduledTasks:1] 2021-07-26 07:21:33,279
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 289 for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-26 07:22:33,894
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 273 for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-27 06:12:03,988
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 160 for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-27 06:13:09,125
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 223 for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-27 06:13:44,212
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 72
>> for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-27 06:14:14,496
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 16036 for internal timeout and 0 for cross node timeout
>
> INFO  [ScheduledTasks:1] 2021-07-27 06:15:59,744
>> MessagingService.java:1013 - READ messages were dropped in last 5000 ms:
>> 287 for internal timeout and 0 for cross node timeout
>
>
>
>    - Also, I am checking cfstats and proxyhistorgrams is in progress,
>    will update incase anythings suspicious.
>
>
> On Tue, 27 Jul 2021 at 18:09, Bowen Song <[email protected]> <[email protected]>
> wrote:
>
>> Hello Chahat,
>>
>>
>> You haven't replied to the first point, are you using the "cassandra"
>> user?
>>
>>
>> The schema and your description don't quite match. When you said:
>>
>> *the system_auth  for 2 DCs :   **us-east*
>> * with 6 nodes (and RF=3) and ... *
>>
>> I assume you meant to say 6 nodes and RF=6? That sounds totally fine.
>>
>>
>> As of the GC logs, do you see any WARN level logs? You may also want to
>> double check the choice of the gc_warn_threshold_in_ms value is
>> reasonable. I'm a bit surprised to see the GC logs full of ~1.6ms GC
>> pauses. AFAIK, by default Cassandra only logs GC pauses longer than 200ms.
>> I have to assume that's the result of your custom script. You may want to
>> double check the script actually does what it says on the tin, just in case
>> it misses something.
>>
>>
>> Then, the other relevant questions.
>>
>> Do you use cross node timeout? If so, are the clocks in sync between all
>> servers?
>>
>> What is the timeout values in the cassandra.yaml file? (
>> read_request_timeout_in_ms, etc.) Are they reasonably long enough for
>> the corresponding request type to complete?
>>
>> Since you've only got 7 nodes, I'd also recommend you to check the nodetool
>> cfstats & nodetool cfhistograms output for the tables in the system_auth
>> keyspace on each node, and see if you can spot anything out of ordinary.
>>
>>
>> Regards,
>>
>> Bowen
>> On 27/07/2021 09:23, Chahat Bhatia wrote:
>>
>> Thanks for the prompt response.
>>
>> * Here is the system_schema.keyspaces entry:*
>>
>>                 system_auth |           True | {'class':
>>> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6',
>>> 'us-east-backup': '1'}
>>>                      census |           True | {'class':
>>> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3',
>>> 'us-east-backup': '1'}
>>
>>
>> So, the system_auth  for 2 DCs :   *us-east with 6 nodes (and RF=3) and
>> us-east-backup with a single node and 100% of the data.*
>> *Keyspace census* showing RF=3 for the main DC and RF=1 for the backup
>> site.
>>
>>
>> And for running the application, we have a user specifically created for
>>> that application itself and that user also has its appropriate permissions
>>> in cassandra to select, modify and delete from concerned tables. And it
>>> uses LOCAL_QUORUM for querying the data and the local-dc is set to
>>> 'us-east'.
>>
>>
>>
>> Also, there is no excessive GC for any of the nodes, we run a custom
>> script to trackthe GC stats (from the cassandra log itself) and output it.
>> Below is the output from the current running script and its similar for all
>> the servers:
>>
>> 2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time
>>> 0.001656 seconds.
>>> 2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time
>>> 0.001669 seconds.
>>> 2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time
>>> 0.001601 seconds.
>>> 2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time
>>> 0.001713 seconds.
>>> 2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time
>>> 0.001586 seconds.
>>> 2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time
>>> 0.001671 seconds.
>>> 2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time
>>> 0.001691 seconds.
>>> 2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time
>>> 0.001860 seconds.
>>> 2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time
>>> 0.001630 seconds.
>>> 2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time
>>> 0.001515 seconds.
>>> 2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time
>>> 0.001533 seconds.
>>> 2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time
>>> 0.001630 seconds.
>>> 2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time
>>> 0.001577 seconds.
>>> 2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time
>>> 0.001538 seconds.
>>> 2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time
>>> 0.001615 seconds.
>>> 2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time
>>> 0.001584 seconds.
>>> 2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time
>>> 0.001558 seconds.
>>> 2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time
>>> 0.001696 seconds.
>>> 2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time
>>> 0.001658 seconds.
>>> 2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time
>>> 0.001592 seconds.
>>> 2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time
>>> 0.001618 seconds.
>>> 2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time
>>> 0.001624 seconds.
>>> 2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time
>>> 0.297773 seconds.
>>>
>>
>>
>>
>>
>>
>> On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates <
>> [email protected]> wrote:
>>
>>> Hello Chahat,
>>>
>>>
>>> First, can you please make sure the Cassandra user used by the
>>> application is not "cassandra"? Because the "cassandra" user uses
>>> QUORUM consistency level to read the auth tables.
>>>
>>> Then, can you please make sure the replication strategy is set correctly
>>> for the system_auth namespace? I.e.: ensure the old DC is not present, and
>>> the new DC has sufficient number of replicas for fault tolerance.
>>>
>>> Finally, can you please check the GC logs, and make sure there isn't JVM
>>> GC issues, espicially long STW pauses?
>>>
>>>
>>> Regards,
>>>
>>> Bowen
>>>
>>>
>>> On 27/07/2021 08:34, Chahat Bhatia wrote:
>>>
>>> Hi Community,
>>>
>>> Context: We are running a cluster of 6 nodes in production with a RF=3
>>> in AWS.
>>> We recently moved from physical servers to cloud by adding a new DC and
>>> then removing the old one. Everything is working fine in all the other
>>> applications except this one.
>>>
>>>
>>> *As we recently started experiencing read timeouts in one of our
>>> production applications where the client threw  *
>>>
>>> Error An unexpected error occurred server side on ip-IP.ec2.internal:
>>>> com.google.common.util.concurrent.*UncheckedExecutionException*:
>>>> *com.google.common.util.concurrent.UncheckedExecutionException:**
>>>> java.lang.RuntimeException:
>>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>>> received only 0 responses.*
>>>> com.datastax.driver.core.exceptions.ServerError: An unexpected error
>>>> occurred server side  :
>>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>>> com.google.common.util.concurrent.UncheckedExecutionException:
>>>> java.lang.RuntimeException:
>>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
>>>> out - received only 0 responses.
>>>
>>> at com.datastax.driver.core.exceptions.ServerError.copy(
>>>> ServerError.java:63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at
>>>> com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:
>>>> 25) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?] at
>>>> com.datastax.driver.core.DriverThrowables.propagateCause(
>>>> DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
>>>> at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(
>>>> DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3
>>>> .3.0-shaded.jar!/:?] at
>>>> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:
>>>> 68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:? ]
>>>
>>>  ............ cntd
>>>
>>>
>>>
>>> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
>>>> timeout during read query at consistency LOCAL_QUORUM (2 responses were
>>>> required but only 1 replica responded)
>>>
>>>
>>>
>>>
>>>
>>> *And around the same time these were the errors on the server side (from
>>> cassandra logs): *
>>>
>>>
>>>>
>>>> *ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094
>>>> CassandraDaemon.java:207 - Exception in thread
>>>> Thread[RolesCacheRefresh:1,5,main] java.lang.RuntimeException:
>>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>>> received only 0 responses. *        at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>> ~[na:1.8.0_131]
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> ~[na:1.8.0_131]
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> [na:1.8.0_131]
>>>>         at
>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>>>> [apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
>>>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
>>>> Operation timed out - received only 0 responses.
>>>>         at
>>>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>
>>>>
>>>>
>>>> *ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804
>>>> CassandraDaemon.java:207 - Exception in thread
>>>> Thread[PermissionsCacheRefresh:1,5,main] java.lang.RuntimeException:
>>>> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
>>>> received only 0 responses. *        at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>> ~[na:1.8.0_131]
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> ~[na:1.8.0_131]
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> [na:1.8.0_131]
>>>>         at
>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>>>> [apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
>>>> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
>>>> Operation timed out - received only 0 responses.
>>>>         at
>>>> org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>         at
>>>> org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502)
>>>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>>>>
>>>
>>>
>>>
>>> *These are the values of these params in my configuration file  *
>>>
>>> permissions_validity_in_ms: 300000
>>>> permissions_update_interval_in_ms: 20000
>>>> roles_validity_in_ms: 300000
>>>> roles_update_interval_in_ms: 20000
>>>>
>>>
>>> This was not the case earlier and since this comes from a single app
>>> alone we are not sure if this is actually the issue. Can anyone please
>>> point out if these values are misconfigured and hence causing the issue or
>>> is it somewhere else we should be looking at?
>>>
>>> Any help would be appreciated.
>>>
>>> Thanks & Regards,
>>> Chahat.
>>>
>>>
>>
>> --
>> Thanks & Regards,
>> Chahat Bhatia
>> Systems Engineer
>> *Evive*
>> +91 7087629779
>>
>>
>
> --
> Thanks & Regards,
> Chahat Bhatia
> Systems Engineer
> *Evive*
> +91 7087629779
>
>

-- 
Thanks & Regards,
Chahat Bhatia
Systems Engineer
*Evive*
+91 7087629779

Re: Permission/Role Cache causing timeouts in apps.

Reply via email to