Hi,
This was a one-time issue for which we are looking for the RCA. Generally,
the P99 latencies of all tables are below 12 ms. At the time of the incident,
the P99 on one of the nodes jumped by a few ms at the coordinator level. The
CL is LOCAL_QUORUM.
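
As a rough illustration of the aggregation question (max versus average of
the per-node P99), here is a minimal Python sketch. The node names and latency
values are made up for illustration and are not measurements from our
cluster. It only shows how a spike on a single node stays visible in the max
across nodes but is diluted by the mean.

```python
from statistics import mean

# Hypothetical per-node coordinator P99 latencies in ms (illustrative values only,
# not taken from the cluster discussed in this thread).
p99_ms_by_node = {"node-a": 11.0, "node-b": 10.5, "node-c": 19.0}


def summarize(p99_by_node):
    """Return (worst_node, max_p99, mean_of_p99s) for per-node P99 values."""
    worst_node = max(p99_by_node, key=p99_by_node.get)
    return worst_node, p99_by_node[worst_node], mean(p99_by_node.values())


worst_node, max_p99, mean_p99 = summarize(p99_ms_by_node)
print(f"worst node: {worst_node}, max P99: {max_p99:.1f} ms, "
      f"mean of P99s: {mean_p99:.1f} ms")
```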

We also noticed the following warning in the system log of one of the nodes
at the same time:
WARN  [ReadStage-51] 2024-10-20 02:46:41,664 NoSpamLogger.java:108 -
db-us-2a.c.cass-im.internal/x.x.x.10:7000->/x.x.x.11:7000-SMALL_MESSAGES-bb3053fc
dropping message of type MUTATION_REQ whose timeout expired before reaching
the network

We have not yet set up network metrics in Grafana. Any help in this regard
would be appreciated.
We also suspect a network issue between the nodes, even though it is a GCP
setup.
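
To separate the cross-node entries from the local ones in the slow-query log,
something like the following Python sketch could be used. It assumes each
entry has the shape `<query>, time N msec - slow timeout M msec` with an
optional `/cross-node` suffix, as in the debug log quoted below. The function
name is hypothetical, and the sample queries are abbreviated (column lists
replaced by `...`) purely to keep the example short.

```python
import re

# Matches "<query>, time N msec - slow timeout M msec" plus an optional "/cross-node".
ENTRY_RE = re.compile(
    r"<(?P<query>.*?)>,\s*time\s+(?P<time>\d+)\s+msec\s+-\s+slow timeout\s+"
    r"(?P<timeout>\d+)\s+msec(?P<cross>/cross-node)?",
    re.DOTALL,
)


def parse_slow_entries(log_text):
    """Return one dict per slow-query entry found in log_text."""
    # Drop leading "> " email quote markers and re-join wrapped lines.
    lines = (re.sub(r"^>\s?", "", line).strip() for line in log_text.splitlines())
    text = " ".join(lines)
    return [
        {
            "query": m.group("query"),
            "time_ms": int(m.group("time")),
            "timeout_ms": int(m.group("timeout")),
            "cross_node": m.group("cross") is not None,
        }
        for m in ENTRY_RE.finditer(text)
    ]


# Two entries from the quoted log, with the column lists abbreviated.
sample = """\
<SELECT ... FROM product_data.item_table WHERE item_table_display_id = 2854462277448 LIMIT
5000 ALLOW FILTERING>, time 3400 msec - slow timeout 500 msec
<SELECT ... FROM product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000 ALLOW
FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node
"""

entries = parse_slow_entries(sample)
cross = [e for e in entries if e["cross_node"]]
print(len(entries), "entries,", len(cross), "cross-node")
```

Comparing the cross-node entries against the local ones per node may help to
tell whether the delay comes from the replica itself or from message delivery
between nodes.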

Regards,
Ashish

On Thu, Oct 24, 2024 at 6:32 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Can you be more explicit about the "latency metrics from Grafana" you
> looked at? What percentile latencies were you looking at? Any aggregation
> used? You can post the underlying queries used for the dashboard if that's
> easier than explaining it. In general you should only care about the max,
> not average, latencies across Cassandra nodes. You should also look at the
> p99 and p999 latencies, and ignore p50 (i.e. median) latency unless you
> need to meet an SLA or SLO on that.
>
> Also, it's worth checking the network-related metrics and connectivity,
> and making sure there wasn't any severe packet loss, congestion, or
> latency-related issue.
>
>
> On 24/10/2024 05:25, Naman kaushik wrote:
>
> Hello everyone,
>
> We are currently using Cassandra 4.1.3 in a two-data-center cluster.
> Recently, we observed cross-node latency spikes of 3-4 seconds in one of
> our data centers. Below are the relevant logs from all three nodes in this
> DC:
>
> DEBUG [ScheduledTasks:1] 2024-10-20 02:46:43,164 MonitoringTask.java:174 - 
> 413 operations were slow in the last 5001 msecs:
> <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, 
> LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, 
> ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM 
> product_data.item_table WHERE item_table_display_id = 2854462277448 LIMIT 
> 5000 ALLOW FILTERING>, time 3400 msec - slow timeout 500 msec
> <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, 
> MasterSubCategoryData, MasterParentCategoryData FROM 
> product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000 ALLOW 
> FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node
> <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, 
> LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, 
> ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM 
> product_data.item_table WHERE item_table_display_id = 24279823548 LIMIT 5000 
> ALLOW FILTERING>, time 3287 msec - slow timeout 500 msec/cross-node
> <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, 
> LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating,
> ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM 
> product_data.item_table WHERE item_table_display_id = 2854486264330 LIMIT 
> 5000 ALLOW FILTERING>, time 2878 msec - slow timeout 500 msec/cross-node
> <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, 
> MasterSubCategoryData, MasterParentCategoryData FROM 
> product_data.taxonomy_table WHERE master_id = 27245 LIMIT 5000 ALLOW 
> FILTERING>, time 3056 msec - slow timeout 500 msec/cross-node
> <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, 
> MasterSubCategoryData, MasterParentCategoryData FROM 
> product_data.taxonomy_table WHERE master_id = 32856 LIMIT 5000 ALLOW 
> FILTERING>, time 2353 msec - slow timeout 500 msec/cross-node
> <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, 
> MasterSubCategoryData, MasterParentCategoryData FROM 
> product_data.taxonomy_table WHERE master_id = 95589 LIMIT 5000 ALLOW 
> FILTERING>, time 2224 msec - slow timeout 500 msec/cross-node
> <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, 
> LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, 
> ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM 
> product_data.item_table WHERE item_table_display_id = 2854514159012 LIMIT 
> 5000 ALLOW FILTERING>, time 3396 msec - slow timeout 500 msec
>
> Upon investigation, we found no GC pauses at the time of the latency, and
> CPU and memory utilization across all nodes appeared normal. Additionally,
> latency metrics from Grafana also showed standard performance.
>
> Given these observations, we are trying to identify the potential causes
> of this latency. Any insights or suggestions from the community would be
> greatly appreciated!
>
> Thank you!
>
>
