Dear Cassandra Community,

I recently observed an issue in our multi-DC setup where batch queries
timed out when the private interface of one node went down (CASSANDRA-20291
<https://issues.apache.org/jira/browse/CASSANDRA-20291>). Since the
FailureDetector primarily relies on the public interface, the affected node
still appears up, leading to batch query timeouts.

I would like to seek suggestions on the best approach to filter out such
nodes (i.e., nodes with a down private interface) to prevent query
disruptions. Some possible approaches I am considering are:

   1. Identifying the problematic node by explicitly checking connectivity on
      its private IP. Currently, filtering relies on the FailureDetector, which
      only monitors the public interface.
   2. Enhancing the FailureDetector mechanism to consider both private and
      public interfaces in Cassandra clusters that use multi-interface
      networking.
   3. Any other recommendations or alternative approaches that the community
      has found effective in similar scenarios?
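
To make approach 1 concrete, here is a rough sketch of the kind of check I have in mind. It is not Cassandra code: the function names, the sample addresses, and the use of port 7000 (the default storage_port, which may differ in your cluster) are assumptions for illustration only. It simply attempts a TCP connection to each candidate's private IP and keeps the ones that answer.

```python
import socket


def is_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def filter_reachable(candidates, port=7000, timeout_s=2.0, probe=is_reachable):
    """Keep only the candidate private IPs that pass the connectivity probe.

    `probe` is injectable so the filtering logic can be exercised
    without real network access.
    """
    return [ip for ip in candidates if probe(ip, port, timeout_s)]


if __name__ == "__main__":
    # Hypothetical private IPs of the other nodes in the local DC.
    private_ips = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    print(filter_reachable(private_ips))
```

A probe like this only shows that the storage port accepts connections on the private interface. It says nothing about whether the node is otherwise healthy, so it would complement the FailureDetector rather than replace it.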

I appreciate any insights or suggestions you may have.

Best regards,
Manish

On Tue, Feb 4, 2025 at 1:51 PM manish khandelwal <
manishkhandelwa...@gmail.com> wrote:

> Hi All
>
> I have a Cassandra 4.1.4 cluster with two data centers, each having 3
> nodes. The configuration is: listen_address = private IP, broadcast_address
> = public IP, listen_on_broadcast_address = true, prefer_local = true
>
> *Issue Observed:*
>
>    - We execute a multi-partition batch query with LOCAL_QUORUM
>    consistency, and it succeeds.
>    - We bring down the private IP of one node in the DC where queries are
>    executed.
>    - Batch queries start timing out, but simple INSERT and SELECT queries
>    work fine.
>    - Stopping the affected node (where the private IP was down) resolves
>    the issue, and batch queries succeed again.
>
> *Analysis So Far:*
>
>    - From the Cassandra source code, it looks like the coordinator picks
>    two other nodes (nodes other than the coordinator) for writing batch logs.
>    - The failed node (private IP down) gets selected for batch logs, but it
>    never responds, causing the timeout.
>    - The node is not marked down in nodetool status, but it is shown as
>    unreachable in nodetool describecluster until the node is brought down.
>    After restarting the node, nodetool status also shows the problematic
>    node as Down.
>
> Is it expected behaviour that the other nodes in the datacenter do not
> mark a node down as soon as its private interface goes down, in a
> multi-DC setup where both private and public interfaces are open?
>
> I am seeing the same behaviour in Cassandra 3.11.2
>
> Regards
>
> Manish
>
