Thanks to both Erick and Shaun for your responses. Both of your explanations are plausible in my scenario. This is what I have done subsequently, which seems to have improved the situation:
1. The cluster was very busy trying to run repairs and sync the new replicas (about 350 GB) in the new DC; gossip was temporarily marking the source nodes down at different points in time.
   * Disabled Reaper and stopped all validation/repairs.
2. I removed the new replicas to stop any potential read_repair across the WAN.
   * I will recreate the replicas over the weekend during quiet time and run the repair to sync.
3. The network ping response time was quite high, around 10-15 ms, at error times.
   * This dropped to under 1 ms later in the day, when some jobs were rerun successfully.
4. I will apply some of the recommended TCP_KEEPALIVE settings Shaun pointed me to (see also the driver-side sketch after Erick's reply below).

Last question: In all of your experience, how high can the latency (simple ping response times) go before it becomes a problem? Obviously the lower the better, but is there some sort of cut-off or formula beyond which intermittent problems like these connection resets can be expected?

Kind regards
Arnulf Hanauer

From: Erick Ramirez <erick.rami...@datastax.com>
Sent: Thursday, 13 February 2020 03:10
To: user@cassandra.apache.org
Subject: Re: Connection reset by peer

I generally see these exceptions when the cluster is overloaded. I think what's happening is that when the app/driver sends a read request, the coordinator takes a long time to respond because the nodes are busy serving other requests. The driver gives up (client-side timeout reached) and the socket is closed. Meanwhile, the coordinator eventually gets results from the replicas and tries to send the response back to the app/driver, but it can't because the connection is no longer there. Does this scenario sound plausible for your cluster?

Erick Ramirez | Developer Relations
erick.rami...@datastax.com | datastax.com
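As a concrete illustration of the client-side timeout Erick describes, here is a minimal sketch, assuming the application uses the DataStax Java driver 3.x (a reasonable match for Cassandra 3.11.2); the contact point and timeout values are placeholders, not recommendations. It raises the driver's read timeout above the 3.x default of 12 seconds and enables TCP keepalive on the driver's connections; the kernel tcp_keepalive_* sysctls Shaun pointed to still control when keepalive probes are actually sent.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SocketOptions;

    public class DriverTimeoutSketch {
        public static void main(String[] args) {
            // Illustrative values only: a 20s read timeout vs. the driver 3.x default of 12s,
            // plus SO_KEEPALIVE on the driver's connections to the coordinators.
            SocketOptions socketOptions = new SocketOptions()
                    .setReadTimeoutMillis(20_000)
                    .setConnectTimeoutMillis(10_000)
                    .setKeepAlive(true);

            // "10.132.65.152" is taken from the log further down purely as a placeholder contact point.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.132.65.152")
                    .withSocketOptions(socketOptions)
                    .build();

            Session session = cluster.connect();
            session.execute("SELECT release_version FROM system.local");
            cluster.close(); // closes the session as well
        }
    }

A longer read timeout only gives the coordinator more time to answer; if the nodes stay overloaded, the repair and load changes described above are still the real fix.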
On Wed, 12 Feb 2020 at 21:13, Hanauer, Arnulf, Vodacom South Africa (External) <arnulf.hana...@vcontractor.co.za> wrote:

Hi Cassandra folks,

We are getting a lot of these errors and transactions are timing out, and I was wondering whether this can be caused by Cassandra itself or whether this is a genuine Linux network issue only. The client job reports the Cassandra node as down after this occurs, but I suspect this is due to the connection failure; I need some clarification on where to look for a solution.

INFO [epollEventLoopGroup-2-10] 2020-02-12 11:53:42,748 Message.java:623 - Unexpected exception during request; channel = [id: 0x8a3e6831, L:/10.132.65.152:9042 - R:/10.132.11.15:48020]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
    at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

INFO [epollEventLoopGroup-2-15] 2020-02-12 11:42:46,871 Message.java:623 - Unexpected exception during request; channel = [id: 0xa071f1c8, L:/10.132.65.152:9042 - R:/10.132.11.15:45134]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
    at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

The source and destination IP addresses are in the same DC (LAN). I recycled all the Cassandra services on all the nodes in both clusters, but the problem remains. The only change made recently was the addition of replicas in the second DC for the keyspace that is being written to when these messages occur (we have not had a chance to run a full repair yet to sync the replicas).

FYI:
Cassandra 3.11.2
5-node cluster in each of 2 DCs

Kind regards
Arnulf Hanauer
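On the plan to recreate the second-DC replicas during the quiet window and then run the repair to sync: the sketch below shows roughly what that step could look like, again assuming the DataStax Java driver 3.x. The keyspace name, DC names and replication factors are placeholders (the thread does not name them), and the streaming/repair steps are noted as comments.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class RecreateSecondDcReplicas {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.132.65.152") // placeholder contact point from the log above
                    .build();
            Session session = cluster.connect();

            // Placeholder keyspace, DC names and replication factors.
            session.execute("ALTER KEYSPACE my_keyspace WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");

            cluster.close();

            // After altering replication, stream the existing ~350 GB to the new replicas rather
            // than waiting for read_repair, e.g. run `nodetool rebuild -- DC1` on each node in the
            // new DC, then follow up with the planned full repair once streaming completes.
        }
    }

Streaming with nodetool rebuild from the new DC pulls the existing data in bulk from the source DC, so the new replicas are populated without relying on read_repair across the WAN, which is what point 2 above was trying to avoid.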