Thank you, Erick. Looking through all the logs on the nodes, I found this:

INFO  [CompactionExecutor:17551] 2021-09-15 15:13:20,524 CompactionTask.java:245 - Compacted (fb0cdca0-1658-11ec-9098-dd70c3a3487a) 4 sstables to [/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96619-big,] to level=0.  9.762MiB to 9.672MiB (~99% of original) in 3,873ms.  Read Throughput = 2.520MiB/s, Write Throughput = 2.497MiB/s, Row Throughput = ~125,729/s.  255,171 total partitions merged to 251,458.  Partition merge counts were {1:247758, 2:3687, 3:13, }
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,524 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96618-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,525 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96575-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,526 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96607-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,532 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96554-big
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,642 InitialConnectionHandler.java:121 - Response to STARTUP sent, configuring pipeline for 5/v5
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,643 InitialConnectionHandler.java:153 - Configured pipeline: DefaultChannelPipeline{(frameDecoder = org.apache.cassandra.net.FrameDecoderCrc), (frameEncoder = org.apache.cassandra.net.FrameEncoderCrc), (cqlProcessor = org.apache.cassandra.transport.CQLMessageHandler), (exceptionHandler = org.apache.cassandra.transport.ExceptionHandlers$PostV5ExceptionHandler)}
INFO  [ScheduledTasks:1] 2021-09-15 15:13:21,976 MessagingMetrics.java:206 - COUNTER_MUTATION_RSP messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 44285 ms
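
For reference, the quickest way I know to confirm drops per node is the dropped-message counters plus a grep of the logs, roughly like this (the /var/log/cassandra path is just the default, adjust to wherever system.log lives here):

nodetool tpstats                                                       # dropped-message counts per type (MUTATION, COUNTER_MUTATION, ...) at the bottom
grep -i "messages were dropped" /var/log/cassandra/system.log | tail   # recent MessagingMetrics drop reports like the one above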

So, yes, the nodes are dropping mutations.  I did find a node where one of the drives was pegged; I fixed that, but mutations are still being dropped.  This started after adding a relatively large node (.44) to the cluster:

nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.251  526.35 GiB  200     35.1%             660f476c-a124-4ca0-b55f-75efe56370da  rack1
UN  172.16.100.252  537.14 GiB  200     34.8%             e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  172.16.100.249  548.82 GiB  200     34.6%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   561.85 GiB  200     35.0%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   547.86 GiB  200     34.2%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  11.52 GiB   4       0.7%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  560.63 GiB  200     35.0%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   432.76 GiB  200     34.7%             b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   331.31 GiB  120     20.5%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.100.250  501.62 GiB  200     35.3%             b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
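
In case the new node (.44) is still catching up, the usual checks there would be something like (iostat assumes sysstat is installed):

nodetool netstats              # any streams still in flight to/from the node
nodetool compactionstats -H    # pending/active compactions
iostat -x 5                    # per-disk utilization, to catch another pegged drive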

At this point I'm not sure what's going on.  Some repairs have failed over the past few days.
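
Pulling the repair failures out of the logs is probably the next step, along the lines of (default log path assumed):

grep -i repair /var/log/cassandra/system.log | grep -iE "fail|error" | tail -n 50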

-Joe

On 9/14/2021 7:23 PM, Erick Ramirez wrote:
The obvious conclusion is to say that the nodes can't keep up so it would be interesting to know how often you're issuing the counter updates. Also, how are the commit log disks performing on the nodes? If you have monitoring in place, check the IO stats/metrics. And finally, review the logs on the nodes to see if they are indeed dropping mutations. Cheers!


