[
https://issues.apache.org/jira/browse/IGNITE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nikita Amelchev reassigned IGNITE-26986:
----------------------------------------
Assignee: Nikita Amelchev
> Multi-datacenter awareness for connection recovery mechanism
> ------------------------------------------------------------
>
> Key: IGNITE-26986
> URL: https://issues.apache.org/jira/browse/IGNITE-26986
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Assignee: Nikita Amelchev
> Priority: Major
> Labels: IEP-140, ise
> Fix For: 2.18
>
>
> The connection recovery mechanism developed in IGNITE-7163 improves topology
> resilience against brief network instability. However, it can bring the
> whole cluster down if a cross-DC network partition occurs in a
> multi-datacenter environment.
> This happens because connection recovery forces a node to segment from the
> topology when it cannot restore a connection to a next node within the
> specified timeout. If a node sits at the edge of its datacenter and several
> of its next nodes are in the remote DC, all attempts of the edge node to find
> an alive next node fail because of the partition. If the connection recovery
> timeout is not long enough, the edge node considers itself segmented and
> stops.
> The previous node of the newly failed one then becomes an edge node, and the
> process repeats.
> In this case the connection recovery mechanism forces the whole cluster to
> shut down instead of improving stability.
> Therefore, it should be aware of multi-datacenter environments and adjust its
> behavior accordingly.
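
The cascade described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not Ignite code: the ring layout, the `Node` class, `make_ring`, `simulate`, and the `max_attempts` parameter are all hypothetical. In particular, `max_attempts` stands in for how many next nodes a node manages to try before its connection recovery timeout expires, and the partition is modeled as "only nodes in the same DC are reachable".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    dc: str
    alive: bool = True


def make_ring():
    # Hypothetical ring: three nodes in DC "A" followed by three in DC "B".
    return [Node("A1", "A"), Node("A2", "A"), Node("A3", "A"),
            Node("B1", "B"), Node("B2", "B"), Node("B3", "B")]


def reachable(src, dst):
    # Cross-DC partition: only alive nodes in the same DC can be reached.
    return dst.alive and src.dc == dst.dc


def simulate(ring, max_attempts):
    """Repeat recovery rounds until nothing changes; return surviving names.

    max_attempts models the recovery timeout: the number of next nodes
    a node can try before it considers itself segmented and stops.
    """
    n = len(ring)
    changed = True
    while changed:
        changed = False
        for i, node in enumerate(ring):
            if not node.alive:
                continue
            # Next nodes in ring order, limited by the timeout budget.
            # Stopped nodes still consume attempts in this simplified model.
            candidates = [ring[(i + k) % n] for k in range(1, n)][:max_attempts]
            if not any(reachable(node, c) for c in candidates):
                node.alive = False  # node segments itself and stops
                changed = True
    return [node.name for node in ring if node.alive]


if __name__ == "__main__":
    print("short timeout:", simulate(make_ring(), max_attempts=3))
    print("long timeout: ", simulate(make_ring(), max_attempts=5))
```

With a budget of three attempts, the edge nodes A3 and B3 stop first, their predecessors become the new edge nodes, and the whole ring eventually stops. With a budget of five, each edge node gets far enough around the ring to reach a node in its own DC, and all nodes survive.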