Hi all,

I have a Cassandra 4.1.4 cluster with two data centers, each with 3 nodes. The configuration is: listen_address = private IP, broadcast_address = public IP, listen_on_broadcast_address = true, prefer_local = true.
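For reference, a rough sketch of how these settings would look on one node. The IP addresses are placeholders, not the real ones, and the snitch (GossipingPropertyFileSnitch) is an assumption; prefer_local is normally set in cassandra-rackdc.properties rather than cassandra.yaml.

```yaml
# cassandra.yaml (placeholder addresses, for illustration only)
listen_address: 10.0.0.11            # private IP of this node
broadcast_address: 203.0.113.11      # public IP of this node
listen_on_broadcast_address: true    # also listen on the public address

# cassandra-rackdc.properties (assuming GossipingPropertyFileSnitch)
# prefer_local=true                  # nodes in the same DC talk over the private IP
```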
*Issue Observed:*
- We execute a multi-partition batch query with LOCAL_QUORUM consistency, and it succeeds.
- We bring down the private IP of one node in the DC where the queries are executed.
- Batch queries start timing out, but simple INSERT and SELECT queries work fine.
- Stopping the affected node (the one whose private IP was down) resolves the issue, and batch queries succeed again.

*Analysis So Far:*
- From the Cassandra source code, it looks like the coordinator picks two other nodes (nodes other than the coordinator) for writing the batchlog.
- The failed node (private IP down) gets selected for the batchlog, but it never responds, causing the timeout.
- The node is not marked down in nodetool status, but it is reported as unreachable in nodetool describecluster until the node is brought down. After restarting the node, nodetool status also shows the problematic node as Down.

Is it expected behaviour that the other nodes in the data center do not mark a node down as soon as its private interface goes down, in a multi-DC setup where both the private and public interfaces are open? I am seeing the same behaviour in Cassandra 3.11.2.

Regards,
Manish