[jira] [Updated] (IGNITE-22377) Choose node to fail on a refused handshake

Ivan Bessonov (Jira) Thu, 08 Jan 2026 23:29:19 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Bessonov updated IGNITE-22377:
-----------------------------------
    Description: 
Currently, if during a handshake a node gets refused because it's stale from 
the point of view of the node to which it connects, the refused node notifies 
its FailureHandler to force node restart.

If a network partition happens, this might cause problems when the partition 
heals: nodes from different segments will start sniping each other. In the 
worst case, a single segmented node might make the whole cluster (except 
itself) restart.

It is suggested that the refusing node sends the following information about 
the physical topology as it sees it to the refused node:
 # Number of nodes in the PT
 # Min ID of nodes in the PT

The refused node will only restart if the number of nodes in the PT, as it sees 
it, is less than the number of nodes in the PT of the refusing node; if the 
sizes are equal, then comparing the Min IDs of nodes in the PT allows a 
deterministic decision.
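
For illustration only, the proposed rule could be sketched in Java as below. 
The class and method names, the use of String node IDs, and the tie-break 
direction (the side with the larger Min ID restarts) are assumptions; the 
description does not specify them.

```java
/** Hypothetical sketch of the physical topology (PT) comparison; not Ignite code. */
final class PhysicalTopologyTieBreak {
    private PhysicalTopologyTieBreak() {
    }

    /**
     * Decides whether the refused node should restart.
     *
     * @param ownSize number of nodes in the PT as seen by the refused node
     * @param ownMinId Min ID of nodes in the PT as seen by the refused node
     * @param refusingSize number of nodes in the PT reported by the refusing node
     * @param refusingMinId Min ID of nodes in the PT reported by the refusing node
     */
    static boolean refusedNodeShouldRestart(
            int ownSize, String ownMinId, int refusingSize, String refusingMinId) {
        if (ownSize != refusingSize) {
            // The side that sees the smaller PT restarts.
            return ownSize < refusingSize;
        }

        // Equal sizes: ASSUMPTION, the side with the larger Min ID restarts.
        // The description only says the comparison makes the decision deterministic.
        return ownMinId.compareTo(refusingMinId) > 0;
    }
}
```

With equal sizes and equal Min IDs this sketch returns false, so neither side 
restarts.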

This idea needs to be thought through and improved (or rejected).
h3. Update

The idea is rejected. The main reason is that the proposed behavior is 
completely unpredictable when the cluster consists of two nodes. It makes too 
many "normal" tests fail for various reasons.

The approach is replaced with validating the version of the logical topology. 
This version cannot be increased without a working CMG, which means that only a 
healthy part of the cluster can do that. So, if a node notices that it is 
rejected by another node with a higher logical topology version, it should stop 
itself. If the versions are equal, nothing happens and the nodes will have to 
be stopped manually.
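
A minimal sketch of this check in Java, under stated assumptions: the class, 
enum and method names are hypothetical, versions are modeled as long values, 
and the case where the refusing node reports a lower version (not covered 
above) is treated as "do nothing".

```java
/** Hypothetical sketch of the logical topology version check; not Ignite code. */
final class LogicalTopologyVersionCheck {
    /** What the refused node does after a refused handshake. */
    enum Action { STOP_SELF, DO_NOTHING }

    private LogicalTopologyVersionCheck() {
    }

    /**
     * @param ownVersion logical topology version known to the refused node
     * @param refusingVersion logical topology version reported by the refusing node
     */
    static Action onHandshakeRefused(long ownVersion, long refusingVersion) {
        if (refusingVersion > ownVersion) {
            // Only a part of the cluster with a working CMG can raise the version,
            // so the refusing node belongs to the healthy part.
            return Action.STOP_SELF;
        }

        // Equal versions: nothing happens, nodes have to be stopped manually.
        // Lower refusing version: ASSUMPTION, the refused node also does nothing.
        return Action.DO_NOTHING;
    }
}
```

For example, a refused node at version 3 that is rejected by a node at 
version 5 would get STOP_SELF, while two nodes both at version 5 would get 
DO_NOTHING.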

  was:
Currently, if during a handshake a node gets refused because it's stale from 
the point of view of the node to which it connects, the refused node notifies 
its FailureHandler to force node restart.

If a network partition happens, this might cause problems when the partition 
heals: nodes from different segments will start sniping each other. In the 
worst case, a single segmented node might make the whole cluster (except 
itself) restart.

It is suggested that the refusing node sends the following information about 
the physical topology as it sees it to the refused node:
 # Number of nodes in the PT
 # Min ID of nodes in the PT

The refused node will only restart if the number of nodes in the PT, as it sees 
it, is less than the number of nodes in the PT of the refusing node; if the 
sizes are equal, then comparing the Min IDs of nodes in the PT allows a 
deterministic decision.

This idea needs to be thought through and improved (or rejected).


> Choose node to fail on a refused handshake
> ------------------------------------------
>
>                 Key: IGNITE-22377
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22377
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.2
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, if during a handshake a node gets refused because it's stale from 
> the point of view of the node to which it connects, the refused node notifies 
> its FailureHandler to force node restart.
> If a network partition happens, this might cause problems when the partition 
> heals: nodes from different segments will start sniping each other. In the 
> worst case, a single segmented node might make the whole cluster (except 
> itself) restart.
> It is suggested that the refusing node sends the following information about 
> the physical topology as it sees it to the refused node:
>  # Number of nodes in the PT
>  # Min ID of nodes in the PT
> The refused node will only restart if the number of nodes in the PT, as it 
> sees it, is less than the number of nodes in the PT of the refusing node; if 
> the sizes are equal, then comparing the Min IDs of nodes in the PT allows a 
> deterministic decision.
> This idea needs to be thought through and improved (or rejected).
> h3. Update
> The idea is rejected. The main reason is that the proposed behavior is 
> completely unpredictable when the cluster consists of two nodes. It makes 
> too many "normal" tests fail for various reasons.
> The approach is replaced with validating the version of the logical topology. 
> This version cannot be increased without a working CMG, which means that only 
> a healthy part of the cluster can do that. So, if a node notices that it is 
> rejected by another node with a higher logical topology version, it should 
> stop itself. If the versions are equal, nothing happens and the nodes will 
> have to be stopped manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
