Thank you Erick - looking through all the logs on the nodes I found this:
INFO [CompactionExecutor:17551] 2021-09-15 15:13:20,524
CompactionTask.java:245 - Compacted
(fb0cdca0-1658-11ec-9098-dd70c3a3487a) 4 sstables to
[/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96619-big,]
to level=0. 9.762MiB to 9.672MiB (~99% of original) in 3,873ms. Read
Throughput = 2.520MiB/s, Write Throughput = 2.497MiB/s, Row Throughput =
~125,729/s. 255,171 total partitions merged to 251,458. Partition
merge counts were {1:247758, 2:3687, 3:13, }
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,524 SSTable.java:111 -
Deleting sstable:
/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96618-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,525 SSTable.java:111 -
Deleting sstable:
/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96575-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,526 SSTable.java:111 -
Deleting sstable:
/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96607-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,532 SSTable.java:111 -
Deleting sstable:
/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96554-big
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,642
InitialConnectionHandler.java:121 - Response to STARTUP sent,
configuring pipeline for 5/v5
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,643
InitialConnectionHandler.java:153 - Configured pipeline:
DefaultChannelPipeline{(frameDecoder =
org.apache.cassandra.net.FrameDecoderCrc), (frameEncoder =
org.apache.cassandra.net.FrameEncoderCrc), (cqlProcessor =
org.apache.cassandra.transport.CQLMessageHandler), (exceptionHandler =
org.apache.cassandra.transport.ExceptionHandlers$PostV5ExceptionHandler)}
INFO [ScheduledTasks:1] 2021-09-15 15:13:21,976
MessagingMetrics.java:206 - COUNTER_MUTATION_RSP messages were dropped
in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped
latency: 0 ms and Mean cross-node dropped latency: 44285 ms
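For anyone following along, a quick way to pull these summaries out of a node's system.log is a grep/sed one-liner like the sketch below. The filename system.log and the exact sed pattern are assumptions based on the log format shown in the excerpt above, not an official tool:

```shell
# Hypothetical sketch: extract the dropped message type and the mean
# cross-node dropped latency (ms) from lines like the one quoted above.
grep 'messages were dropped' system.log \
  | sed -E 's/.* ([A-Z_]+) messages were dropped.*cross-node dropped latency: ([0-9]+) ms.*/\1 \2/'
```

Running that against the excerpt above would print "COUNTER_MUTATION_RSP 44285", which makes the 44-second cross-node latency easy to spot across many nodes.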
So - yes, nodes are dropping mutations. I did find a node where one of
the drives was pegged. Fixed that - but it's still happening. This
happened after adding a relatively large node (.44) to the cluster:
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.251  526.35 GiB  200     35.1%             660f476c-a124-4ca0-b55f-75efe56370da  rack1
UN  172.16.100.252  537.14 GiB  200     34.8%             e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  172.16.100.249  548.82 GiB  200     34.6%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   561.85 GiB  200     35.0%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   547.86 GiB  200     34.2%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  11.52 GiB   4       0.7%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  560.63 GiB  200     35.0%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   432.76 GiB  200     34.7%             b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   331.31 GiB  120     20.5%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.100.250  501.62 GiB  200     35.3%             b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
At this point I'm not sure what's going on. Some repairs have failed
over the past few days.
-Joe
On 9/14/2021 7:23 PM, Erick Ramirez wrote:
The obvious conclusion is to say that the nodes can't keep up so it
would be interesting to know how often you're issuing the counter
updates. Also, how are the commit log disks performing on the nodes?
If you have monitoring in place, check the IO stats/metrics. And
finally, review the logs on the nodes to see if they are indeed
dropping mutations. Cheers!