Hello everyone,
We are currently using Cassandra 4.1.3 in a two-data-center
cluster. Recently, we observed cross-node latency spikes of 3-4
seconds in one of our data centers. Below are the relevant logs
from all three nodes in this DC:
DEBUG [ScheduledTasks:1] 2024-10-20 02:46:43,164 MonitoringTask.java:174 - 413 operations were slow in the last 5001 msecs:
<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854462277448 LIMIT 5000 ALLOW FILTERING>, time 3400 msec - slow timeout 500 msec
<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000 ALLOW FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node
<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 24279823548 LIMIT 5000 ALLOW FILTERING>, time 3287 msec - slow timeout 500 msec/cross-node
<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854486264330 LIMIT 5000 ALLOW FILTERING>, time 2878 msec - slow timeout 500 msec/cross-node
<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 27245 LIMIT 5000 ALLOW FILTERING>, time 3056 msec - slow timeout 500 msec/cross-node
<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 32856 LIMIT 5000 ALLOW FILTERING>, time 2353 msec - slow timeout 500 msec/cross-node
<SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 95589 LIMIT 5000 ALLOW FILTERING>, time 2224 msec - slow timeout 500 msec/cross-node
<SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854514159012 LIMIT 5000 ALLOW FILTERING>, time 3396 msec - slow timeout 500 msec
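For anyone who wants to tally entries like the ones above, here is a minimal Python sketch that groups them by table and by local versus cross-node. It is not part of Cassandra: the regular expression is only inferred from the entry shape visible in this log (`<query>, time N msec - slow timeout M msec`, optionally followed by `/cross-node`), and the name `summarize_slow_ops` is made up for illustration. The sample text reuses three entries from the log.

```python
import re
from collections import defaultdict

# Assumed entry shape, inferred from the log excerpt in this post:
#   <QUERY>, time N msec - slow timeout M msec[/cross-node]
# Limitation: a query containing a literal '>' would not match this pattern.
ENTRY_RE = re.compile(
    r"<(?P<query>[^>]*)>,\s*time\s+(?P<time>\d+)\s+msec\s+-\s+"
    r"slow\s+timeout\s+(?P<timeout>\d+)\s+msec(?P<cross>/cross-node)?"
)
TABLE_RE = re.compile(r"\bFROM\s+(\S+)", re.IGNORECASE)


def summarize_slow_ops(log_text):
    """Group slow-operation entries by (table, scope) and summarize timings."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    times = defaultdict(list)
    for m in ENTRY_RE.finditer(flat):
        table_match = TABLE_RE.search(m.group("query"))
        table = table_match.group(1) if table_match else "unknown"
        scope = "cross-node" if m.group("cross") else "local"
        times[(table, scope)].append(int(m.group("time")))
    return {
        key: {"count": len(v), "max_ms": max(v), "avg_ms": sum(v) / len(v)}
        for key, v in times.items()
    }


# Three entries copied from the log above (line-wrapped as in the email).
sample = """
<SELECT AlternateMasterCategoryData, MasterCategoryData,
MasterGroupData, MasterSubCategoryData, MasterParentCategoryData
FROM product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000
ALLOW FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node
<SELECT AlternateMasterCategoryData, MasterCategoryData,
MasterGroupData, MasterSubCategoryData, MasterParentCategoryData
FROM product_data.taxonomy_table WHERE master_id = 27245 LIMIT
5000 ALLOW FILTERING>, time 3056 msec - slow timeout 500
msec/cross-node <SELECT ItemDetails, ItemDocuments,
ItemISQDetails, ItemMappings, LastModified, ItemImages,
ItemTitles, ItemCategories, ItemRating, ApprovalStatus,
LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM
product_data.item_table WHERE item_table_display_id =
2854514159012 LIMIT 5000 ALLOW FILTERING>, time 3396 msec - slow
timeout 500 msec
"""

summary = summarize_slow_ops(sample)
for (table, scope), stats in sorted(summary.items()):
    print(table, scope, stats)
```

Running the same function over the full debug.log of each node should make it easier to see whether the slow reads are concentrated on one table, or mostly on the cross-node path.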
Upon investigation, we found no GC pauses at the time of the
spikes, and CPU and memory utilization appeared normal on all
nodes. The latency metrics in Grafana also looked normal.
Given these observations, we are trying to identify the potential
causes of this latency. Any insights or suggestions from the
community would be greatly appreciated!
Thank you!