Batch Queries Timeout When Private IP of a Node Fails in Multi-DC Cassandra 4.1.4

manish khandelwal Tue, 04 Feb 2025 00:22:11 -0800

Hi All

I have a Cassandra 4.1.4 cluster with two data centers, each with 3
nodes. The configuration is: listen_address = private IP, broadcast_address
= public IP, listen_on_broadcast_address = true, prefer_local = true.


*Issue Observed:*

   - We execute a multi-partition batch query with LOCAL_QUORUM
   consistency, and it succeeds.
   - We bring down the private IP of one node in the DC where queries are
   executed.
   - Batch queries start timing out, but simple INSERT and SELECT queries
   work fine.
   - Stopping the affected node (where IP was down) resolves the issue, and
   batch queries succeed again.

*Analysis So Far:*

   - From the Cassandra source code, it looks like the coordinator picks two
   other nodes (nodes other than the coordinator) for writing batch logs.
   - The failed node (private IP down) gets selected for batch logs, but it
   never responds, causing the timeout.
   - The node is not marked down in nodetool status, but it shows as
   unreachable in nodetool describecluster until the node is brought down.
   After restarting the node, nodetool status also shows the problematic
   node as Down.
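
To make the reasoning above concrete, here is a toy Python model of it. It is
only a sketch of my reading of the behaviour, not Cassandra's actual code: the
node names a, b, c and all function names are hypothetical, and the rule that
every selected batchlog node must acknowledge the write is an assumption.

```python
import random

def pick_batchlog_nodes(coordinator, local_nodes, believed_up):
    """Pick up to two local-DC nodes, other than the coordinator,
    that the coordinator still believes are up."""
    candidates = [n for n in local_nodes
                  if n != coordinator and n in believed_up]
    return random.sample(candidates, min(2, len(candidates)))

def batchlog_write_succeeds(chosen, reachable):
    """Assumed rule: every chosen node must acknowledge the batchlog write."""
    return all(n in reachable for n in chosen)

def local_quorum_write_succeeds(replicas, reachable):
    """A plain write needs a majority of the local replicas to respond."""
    acks = sum(1 for r in replicas if r in reachable)
    return acks >= len(replicas) // 2 + 1

local_nodes = ["a", "b", "c"]   # hypothetical 3-node DC, coordinator is "a"

# Private IP of "c" is down: still believed up, but it never responds.
believed_up = {"a", "b", "c"}
reachable = {"a", "b"}
chosen = pick_batchlog_nodes("a", local_nodes, believed_up)
print(sorted(chosen), batchlog_write_succeeds(chosen, reachable))
print(local_quorum_write_succeeds(local_nodes, reachable))

# After "c" is stopped, it is marked down and no longer selected.
believed_up = {"a", "b"}
chosen = pick_batchlog_nodes("a", local_nodes, believed_up)
print(sorted(chosen), batchlog_write_succeeds(chosen, reachable))
```

With 3 nodes per DC there are only two candidates besides the coordinator, so
in this model the unresponsive node is selected for every batch, while a
simple LOCAL_QUORUM write still gets 2 of 3 acknowledgements.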

Is it expected behaviour that the other nodes in the data center do not
mark the node down as soon as the private interface of one of the nodes goes
down in a multi-DC setup where both private and public interfaces are open?

I am seeing the same behaviour in Cassandra 3.11.2.

Regards

Manish
