Thank you!
Clocks were out of sync; chronyd wasn't chrony'ding.
Going so much faster now! Cheers.
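In case it helps anyone else, the sync state is easy to confirm from chrony's own CLI (assuming chronyd is the NTP client and a systemd host, as in our case; adjust for your setup):

    systemctl status chronyd   # is the daemon actually running?
    chronyc tracking           # reference source, stratum, and current system clock offset
    chronyc sources -v         # which NTP sources are reachable and which one is selected

A large "System time" offset in the chronyc tracking output is the giveaway that the clock has drifted.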
-Joe
On 9/15/2021 4:07 PM, Bowen Song wrote:
Well, the log says cross-node timeout, with latency a bit over 44 seconds.
Here are a few of the most likely causes (quick shell checks for each are sketched below):
1. The clocks are not in sync - please check the time on each server and ensure an NTP client is running on all Cassandra servers.
2. Long stop-the-world GC pauses - please check the GC logs and make sure this isn't the case.
3. Overload - please monitor the CPU usage and disk IO when the timeouts happen and make sure they are not the bottleneck.
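Rough ways to check each of those from the shell (paths are the usual packaged-install defaults; adjust for your environment):

    # 1. clock sync status
    chronyc tracking                  # or: timedatectl status

    # 2. long stop-the-world pauses - Cassandra's GCInspector logs them in system.log
    grep GCInspector /var/log/cassandra/system.log | tail -20

    # 3. CPU and disk saturation while the timeouts are happening
    iostat -x 5                       # watch await and %util on the data and commit log disks
    top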
On 15/09/2021 20:34, Joe Obernberger wrote:
Thank you Erick - looking through all the logs on the nodes I found this:
INFO [CompactionExecutor:17551] 2021-09-15 15:13:20,524 CompactionTask.java:245 - Compacted (fb0cdca0-1658-11ec-9098-dd70c3a3487a) 4 sstables to [/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96619-big,] to level=0. 9.762MiB to 9.672MiB (~99% of original) in 3,873ms. Read Throughput = 2.520MiB/s, Write Throughput = 2.497MiB/s, Row Throughput = ~125,729/s. 255,171 total partitions merged to 251,458. Partition merge counts were {1:247758, 2:3687, 3:13, }
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,524 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96618-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,525 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96575-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,526 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96607-big
INFO [NonPeriodicTasks:1] 2021-09-15 15:13:20,532 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96554-big
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,642 InitialConnectionHandler.java:121 - Response to STARTUP sent, configuring pipeline for 5/v5
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,643 InitialConnectionHandler.java:153 - Configured pipeline: DefaultChannelPipeline{(frameDecoder = org.apache.cassandra.net.FrameDecoderCrc), (frameEncoder = org.apache.cassandra.net.FrameEncoderCrc), (cqlProcessor = org.apache.cassandra.transport.CQLMessageHandler), (exceptionHandler = org.apache.cassandra.transport.ExceptionHandlers$PostV5ExceptionHandler)}
INFO [ScheduledTasks:1] 2021-09-15 15:13:21,976 MessagingMetrics.java:206 - COUNTER_MUTATION_RSP messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 44285 ms
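(As a cross-check that doesn't involve grepping logs: the bottom of nodetool tpstats prints a per-message-type "Dropped" table, and nodetool proxyhistograms shows coordinator-side latency percentiles.)

    nodetool tpstats
    nodetool proxyhistograms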
So - yes, nodes are dropping mutations. I did find a node where one
of the drives was pegged. Fixed that - but it's still happening.
This happened after adding a relatively large node (.44) to the cluster:
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.251  526.35 GiB  200     35.1%             660f476c-a124-4ca0-b55f-75efe56370da  rack1
UN  172.16.100.252  537.14 GiB  200     34.8%             e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  172.16.100.249  548.82 GiB  200     34.6%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   561.85 GiB  200     35.0%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   547.86 GiB  200     34.2%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  11.52 GiB   4       0.7%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  560.63 GiB  200     35.0%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   432.76 GiB  200     34.7%             b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   331.31 GiB  120     20.5%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.100.250  501.62 GiB  200     35.3%             b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
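(Since this started right after bootstrapping .44, it might also be worth checking whether that node is still streaming or sitting on a compaction backlog, for example:

    nodetool netstats          # active streaming sessions, if any
    nodetool compactionstats   # pending and running compactions

both run against the new node.)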
At this point I'm not sure what's going on. Some repairs have failed
over the past few days.
-Joe
On 9/14/2021 7:23 PM, Erick Ramirez wrote:
The obvious conclusion is that the nodes can't keep up, so it would be interesting to know how often you're issuing the counter updates. Also, how are the commit log disks performing on the nodes? If you have monitoring in place, check the IO stats/metrics. And finally, review the logs on the nodes to see if they are indeed dropping mutations. Cheers!
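For reference, those checks could look something like this (assuming a Linux host with sysstat installed and the default /var/log/cassandra log location; adjust for your install):

    # commit log / data disk performance while the counter updates are flowing
    iostat -x 5        # sustained high await or ~100% util on the commit log disk is a red flag

    # did the nodes report dropped mutations?
    grep -i dropped /var/log/cassandra/system.log | tail -20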