Even big cloud providers, such as GCP and AWS, can have minor, temporary network issues every now and then. If the latency was caused by a short burst of packet loss, the TCP retransmission counters may be of interest. Have a look at /proc/net/netstat on Linux and you will find the relevant metrics. You can use node_exporter with Prometheus or similar tools to collect the data and then visualise it in Grafana.
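
For a quick one-off check before an exporter is in place, a minimal sketch along these lines (Python, just for illustration) can dump the retransmission-related counters. It assumes a Linux host; the exact counter names in the TcpExt section vary a little between kernel versions, and the classic cumulative Tcp RetransSegs counter lives in the similarly formatted /proc/net/snmp.

    #!/usr/bin/env python3
    # Print TCP retransmission-related counters from /proc/net/netstat.
    # The file is made of pairs of lines: a header line of counter names
    # ("TcpExt: Name1 Name2 ...") followed by a line of their values.

    def read_counters(path="/proc/net/netstat"):
        counters = {}
        with open(path) as f:
            lines = f.read().splitlines()
        for header, values in zip(lines[0::2], lines[1::2]):
            proto = header.split(":")[0]
            names = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            counters.update((f"{proto}.{n}", v) for n, v in zip(names, vals))
        return counters

    if __name__ == "__main__":
        # /proc/net/snmp uses the same layout, so the same parser works there too.
        for name, value in sorted(read_counters().items()):
            if "retrans" in name.lower():
                print(f"{name} = {value}")

Comparing two snapshots taken a few seconds apart is more useful than the absolute values, since these counters are cumulative since boot.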

Monitoring short-lived network latency issues is a bit trickier, as it requires measuring the latency periodically (e.g. ping at a set interval), and depending on how frequently you measure and how long a temporary issue lasts, you may still not catch it.
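
If you just need a quick, throwaway probe, a rough sketch like the one below could do it. The peer address, port, interval and threshold are placeholders; it measures TCP connect time to the internode port instead of using ICMP ping, so it doesn't need raw sockets or root.

    #!/usr/bin/env python3
    # Periodically measure TCP connect latency to a peer node and log the
    # probes that exceed a threshold. All constants below are placeholders.
    import socket
    import time

    TARGET = ("x.x.x.11", 7000)  # placeholder: a peer node's internode address
    INTERVAL = 1.0               # seconds between probes
    THRESHOLD_MS = 50.0          # only report probes slower than this

    while True:
        start = time.monotonic()
        try:
            with socket.create_connection(TARGET, timeout=5):
                elapsed_ms = (time.monotonic() - start) * 1000
        except OSError as exc:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "probe failed:", exc)
        else:
            if elapsed_ms > THRESHOLD_MS:
                print(time.strftime("%Y-%m-%d %H:%M:%S"),
                      f"connect took {elapsed_ms:.1f} ms")
        time.sleep(INTERVAL)

The shorter the interval, the better the chance of catching a brief blip, at the cost of a bit more noise in the logs and on the network.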


On 24/10/2024 16:50, MyWorld wrote:
Hi,
This was a one-time issue for which we are looking for the RCA. Generally, P99 latencies of all the tables are less than 12 ms. There was a jump of a few ms in the coordinator-level P99 on one of the nodes at this time. The CL is Local_Quorum.

Another error we noticed in the system log at the same time on one of the nodes is as below:

WARN  [ReadStage-51] 2024-10-20 02:46:41,664 NoSpamLogger.java:108 - db-us-2a.c.cass-im.internal/x.x.x.10:7000->/x.x.x.11:7000-SMALL_MESSAGES-bb3053fc dropping message of type MUTATION_REQ whose timeout expired before reaching the network

We have not configured network metrics in Grafana yet. Any help in this regard would be appreciated. We also suspect a network issue between the nodes, even though it's a GCP setup.

Regards,
Ashish

On Thu, Oct 24, 2024 at 6:32 PM Bowen Song via user <user@cassandra.apache.org> wrote:

    Can you be more explicit about the "latency metrics from Grafana"
    you looked at? What percentile latencies were you looking at? Any
    aggregation used? You can post the underlying queries used for the
    dashboard if that's easier than explaining it. In general you
    should only care about the max, not average, latencies across
    Cassandra nodes. You should also look at the p99 and p999
    latencies, and ignore p50 (i.e. median) latency unless you need to
    meet an SLA or SLO on that.

    Also, it's worth checking the network-related metrics and
    connectivity, and making sure there wasn't any severe packet loss,
    congestion, or latency-related issue.


    On 24/10/2024 05:25, Naman kaushik wrote:

    Hello everyone,

    We are currently using Cassandra 4.1.3 in a two-data-center
    cluster. Recently, we observed cross-node latency spikes of 3-4
    seconds in one of our data centers. Below are the relevant logs
    from all three nodes in this DC:

    DEBUG [ScheduledTasks:1] 2024-10-20 02:46:43,164 MonitoringTask.java:174 - 413 operations were slow in the last 5001 msecs:
    <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854462277448 LIMIT 5000 ALLOW FILTERING>, time 3400 msec - slow timeout 500 msec
    <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 6402 LIMIT 5000 ALLOW FILTERING>, time 2309 msec - slow timeout 500 msec/cross-node
    <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 24279823548 LIMIT 5000 ALLOW FILTERING>, time 3287 msec - slow timeout 500 msec/cross-node
    <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854486264330 LIMIT 5000 ALLOW FILTERING>, time 2878 msec - slow timeout 500 msec/cross-node
    <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 27245 LIMIT 5000 ALLOW FILTERING>, time 3056 msec - slow timeout 500 msec/cross-node
    <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 32856 LIMIT 5000 ALLOW FILTERING>, time 2353 msec - slow timeout 500 msec/cross-node
    <SELECT AlternateMasterCategoryData, MasterCategoryData, MasterGroupData, MasterSubCategoryData, MasterParentCategoryData FROM product_data.taxonomy_table WHERE master_id = 95589 LIMIT 5000 ALLOW FILTERING>, time 2224 msec - slow timeout 500 msec/cross-node
    <SELECT ItemDetails, ItemDocuments, ItemISQDetails, ItemMappings, LastModified, ItemImages, ItemTitles, ItemCategories, ItemRating, ApprovalStatus, LocalName, UserIdentifier, IsDisplayed, VariantOptions FROM product_data.item_table WHERE item_table_display_id = 2854514159012 LIMIT 5000 ALLOW FILTERING>, time 3396 msec - slow timeout 500 msec

    Upon investigation, we found no GC pauses at the time of the
    latency spike, and CPU and memory utilization across all nodes
    appeared normal. Additionally, the latency metrics from Grafana
    also looked normal.

    Given these observations, we are trying to identify the potential
    causes of this latency. Any insights or suggestions from the
    community would be greatly appreciated!

    Thank you!
