Thank you!
Clocks were out of sync; chronyd wasn't chrony'ding.
Going so much faster now! Cheers.
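In case it helps anyone else, the sync state is easy to confirm from chrony's own CLI (assuming chronyd is the NTP client and a systemd host, as in our case; adjust for your setup):

    systemctl status chronyd   # is the daemon actually running?
    chronyc tracking           # reference source, stratum, and current system clock offset
    chronyc sources -v         # which NTP sources are reachable and which one is selected

A large "System time" offset in the chronyc tracking output is the giveaway that the clock has drifted.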
-Joe
On 9/15/2021 4:07 PM, Bowen Song wrote:
Well, the log says cross-node timeout, with latency a bit over 44 seconds.
Here are a few of the most likely causes (quick shell checks for each are sketched below):
1. The clocks are not in sync - please check the time on each server and ensure an NTP client is running on all Cassandra servers.
2. Long stop-the-world GC pauses - please check the GC logs and make sure this isn't the case.
3. Overload - please monitor the CPU usage and disk IO when the timeouts happen and make sure they are not the bottleneck.
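Rough ways to check each of those from the shell (paths are the usual packaged-install defaults; adjust for your environment):

    # 1. clock sync status
    chronyc tracking                  # or: timedatectl status

    # 2. long stop-the-world pauses - Cassandra's GCInspector logs them in system.log
    grep GCInspector /var/log/cassandra/system.log | tail -20

    # 3. CPU and disk saturation while the timeouts are happening
    iostat -x 5                       # watch await and %util on the data and commit log disks
    top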
On 15/09/2021 20:34, Joe Obernberger wrote:
Thank you Erick - looking through all the logs on the nodes I found this:
INFO [CompactionExecutor:17551] 2021-09-15 15:13:20,524 CompactionTask.java:245 - Compacted (fb0cdca0-1658-11ec-9098-dd70c3a3487a) 4 sstables to [/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96619-big,] to level=0. 9.762MiB to 9.672MiB (~99% of original) in 3,873ms. Read Throughput = 2.520MiB/s, Write Throughput = 2.497MiB/s, Row Throughput = ~125,729/s. 255,171 total partitions merged to 251,458. Partition merge counts were {1:247758, 2:3687, 3:13, }
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,524 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96618-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,525 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96575-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,526 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96607-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,532 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96554-big
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,642 InitialConnectionHandler.java:121 - Response to STARTUP sent, configuring pipeline for 5/v5
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,643 InitialConnectionHandler.java:153 - Configured pipeline: DefaultChannelPipeline{(frameDecoder = org.apache.cassandra.net.FrameDecoderCrc), (frameEncoder = org.apache.cassandra.net.FrameEncoderCrc), (cqlProcessor = org.apache.cassandra.transport.CQLMessageHandler), (exceptionHandler = org.apache.cassandra.transport.ExceptionHandlers$PostV5ExceptionHandler)}
INFO [ScheduledTasks:1] 2021-09-15 15:13:21,976 MessagingMetrics.java:206 - COUNTER_MUTATION_RSP messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 44285 ms
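(As a cross-check that doesn't involve grepping logs: the bottom of nodetool tpstats prints a per-message-type "Dropped" table, and nodetool proxyhistograms shows coordinator-side latency percentiles.)

    nodetool tpstats
    nodetool proxyhistograms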
So - yes, nodes are dropping mutations. I did find a node where one
of the drives was pegged. Fixed that - but it's still happening.
This happened after adding a relatively large node (.44) to the cluster:
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.251  526.35 GiB  200     35.1%             660f476c-a124-4ca0-b55f-75efe56370da  rack1
UN  172.16.100.252  537.14 GiB  200     34.8%             e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  172.16.100.249  548.82 GiB  200     34.6%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   561.85 GiB  200     35.0%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   547.86 GiB  200     34.2%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  11.52 GiB   4       0.7%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  560.63 GiB  200     35.0%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   432.76 GiB  200     34.7%             b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   331.31 GiB  120     20.5%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.100.250  501.62 GiB  200     35.3%             b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
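(Since this started right after bootstrapping .44, it might also be worth checking whether that node is still streaming or sitting on a compaction backlog, for example:

    nodetool netstats          # active streaming sessions, if any
    nodetool compactionstats   # pending and running compactions

both run against the new node.)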
At this point I'm not sure what's going on. Some repairs have failed
over the past few days.
-Joe
On 9/14/2021 7:23 PM, Erick Ramirez wrote:
The obvious conclusion is that the nodes can't keep up, so it would be interesting to know how often you're issuing the counter updates. Also, how are the commit log disks performing on the nodes? If you have monitoring in place, check the IO stats/metrics. And finally, review the logs on the nodes to see if they are indeed dropping mutations. Cheers!
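For reference, those checks could look something like this (assuming a Linux host with sysstat installed and the default /var/log/cassandra log location; adjust for your install):

    # commit log / data disk performance while the counter updates are flowing
    iostat -x 5        # sustained high await or ~100% util on the commit log disk is a red flag

    # did the nodes report dropped mutations?
    grep -i dropped /var/log/cassandra/system.log | tail -20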