Can you be more explicit about the "latency metrics from Grafana" you
looked at? What percentile latencies were you looking at? Any
aggregation used? You can post the underlying queries used for the
dashboard if that's easier than explaining it. In general, you should
only care about the max, not the average, latency across the Cassandra
nodes. You should also look at the p99 and p999 latencies, and ignore
the p50 (i.e. median) latency unless you need to meet an SLA or SLO on it.
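If posting the dashboard queries isn't convenient, the per-node
histograms from nodetool give a similar picture. A rough sketch
(the table name below is taken from your slow-query log; adjust as
needed, and run it on each node in the affected DC):

    # coordinator-level read/write latency percentiles (p50/p75/p95/p98/p99 and Max)
    nodetool proxyhistograms

    # local read/write latency percentiles for the table from the slow-query log
    nodetool tablehistograms product_data item_table

If the p99/max there looks fine on every node while clients see
multi-second latency, the problem is more likely between the client and
the coordinator, or on the network between the nodes.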
Also, it's worth checking the network-related metrics and connectivity,
and making sure there wasn't any severe packet loss, congestion, or
latency-related issue between the nodes.
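A rough way to check that from the nodes themselves, assuming you have
shell access (<peer_ip> is a placeholder for the internode address of
another node in the DC):

    # packet loss and RTT between nodes in the DC
    ping -c 100 <peer_ip>

    # TCP retransmission counters; a steadily growing count suggests loss or congestion
    netstat -s | grep -i retrans

It's also worth looking at nodetool tpstats for dropped messages around
the time of the spikes.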
On 24/10/2024 05:25, Naman kaushik wrote:
Hello everyone,
We are currently using Cassandra 4.1.3 in a two-data-center cluster.
Recently, we observed cross-node latency spikes of 3-4 seconds in one
of our data centers. Below are the relevant logs from all three nodes
in this DC:
DEBUG [ScheduledTasks:1] 2024-10-20 02:46:43,164 MonitoringTask.java:174 - 413 operations were slow in the last 5001 msecs:

<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854462277448 LIMIT 5000 ALLOW FILTERING>, time 3400 msec - slow timeout 500 msec

<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000 ALLOW FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node

<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 24279823548 LIMIT 5000 ALLOW FILTERING>, time 3287 msec - slow timeout 500 msec/cross-node

<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854486264330 LIMIT 5000 ALLOW FILTERING>, time 2878 msec - slow timeout 500 msec/cross-node

<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 27245 LIMIT 5000 ALLOW FILTERING>, time 3056 msec - slow timeout 500 msec/cross-node

<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 32856 LIMIT 5000 ALLOW FILTERING>, time 2353 msec - slow timeout 500 msec/cross-node

<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 95589 LIMIT 5000 ALLOW FILTERING>, time 2224 msec - slow timeout 500 msec/cross-node

<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854514159012 LIMIT 5000 ALLOW FILTERING>, time 3396 msec - slow timeout 500 msec
Upon investigation, we found no GC pauses at the time of the latency
spikes, and CPU and memory utilization across all nodes appeared
normal. Additionally, the latency metrics from Grafana also showed
normal performance.
Given these observations, we are trying to identify the potential
causes of this latency. Any insights or suggestions from the community
would be greatly appreciated!
Thank you!