Hi everyone.

I think we should verify the behavior of failure detection with tests, or find such tests if they are already written. I’ll research this question and raise a ticket if a reproducer appears.



On 08.04.2020 12:19, Stephen Darlington wrote:
Yes. Nodes are always chatting to each other even if there are no requests 
coming in.

Here’s the status message: 
https://github.com/apache/ignite/blob/e9b3c4cebaecbeec9fa51bd6ec32a879fb89948a/modules/core/src/main/java/org/apache/ignite/spi/discovery/tcp/messages/TcpDiscoveryStatusCheckMessage.java

Regards,
Stephen

On 8 Apr 2020, at 10:04, Anton Vinogradov <a...@apache.org> wrote:

It seems you're talking about Failure Detection (Timeouts).
Will it detect a node failure on an idle cluster?

On Wed, Apr 8, 2020 at 11:52 AM Stephen Darlington <
stephen.darling...@gridgain.com> wrote:

The configuration parameters that I’m aware of are here:


https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html
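
For illustration, here’s a minimal sketch of setting the most relevant knob,
the failure detection timeout, programmatically (the timeout value and the
address list are arbitrary example values, not recommendations):

    import java.util.Collections;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class DiscoveryConfigExample {
        public static void main(String[] args) {
            // Static IP finder with an example local address range.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Collections.singletonList("127.0.0.1:47500..47509"));

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder))
                // Upper bound on how long a silent node may stay undetected.
                .setFailureDetectionTimeout(10_000);

            try (Ignite ignite = Ignition.start(cfg)) {
                System.out.println("Topology size: " + ignite.cluster().nodes().size());
            }
        }
    }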

Other people would be better placed to discuss the internals.

Regards,
Stephen

On 8 Apr 2020, at 09:32, Anton Vinogradov <a...@apache.org> wrote:

Stephen,

Nodes check on their neighbours and notify the remaining nodes if one
disappears.
Could you explain how this works in detail?
How can I set/change check frequency?

On Wed, Apr 8, 2020 at 11:13 AM Stephen Darlington <
stephen.darling...@gridgain.com> wrote:

This is one of the functions of the DiscoverySPI. Nodes check on their
neighbours and notify the remaining nodes if one disappears. When the
topology changes, it triggers a rebalance, which relocates primary
partitions to live nodes. This is entirely transparent to clients.

It gets more complex in practice… there’s the partition loss policy, and
rebalancing doesn’t always happen (it’s configurable and depends on
persistence, etc.)… but broadly it works as you expect.
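
To make the configurable part concrete, here’s a hedged sketch (the cache
name and backup count are arbitrary example values) of setting a partition
loss policy on a cache:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class LossPolicyExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                CacheConfiguration<Integer, String> cacheCfg =
                    new CacheConfiguration<Integer, String>("example-cache")
                        // One backup per partition, so a single node failure loses no data.
                        .setBackups(1)
                        // Fail reads and writes to lost partitions instead of serving stale data.
                        .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

                IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cacheCfg);
                cache.put(1, "value");
            }
        }
    }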

Regards,
Stephen

On 8 Apr 2020, at 08:40, Anton Vinogradov <a...@apache.org> wrote:

Igniters,
Do we have a feature that allows checking node aliveness on a regular
basis?
Scenario:
Precondition:
The cluster has no load, but some node's JVM has crashed.

Actual (current behavior):
The user performs an operation (e.g. a cache put) related to this node
(via another node) and waits for some timeout to learn that the node is
dead.
The cluster then starts the switch to relocate primary partitions to alive
nodes.
Only now is the user able to retry the operation.

Desired:
Some watchdog checks node aliveness on a regular basis.
Once a failure is detected, the cluster starts the switch.
Later, the user performs an operation on an already-recovered cluster and
waits for nothing.

It would be good news if the "Desired" case were already the "Actual" one.
Can somebody point me to the feature that performs this check?
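
For concreteness, here's a sketch of the kind of hook I have in mind, built
on discovery events (an assumption on my side: EVT_NODE_FAILED fires when the
built-in detection notices a crash; note that events are disabled by default
and must be enabled explicitly):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.events.DiscoveryEvent;
    import org.apache.ignite.events.EventType;

    public class NodeFailureWatchExample {
        public static void main(String[] args) throws Exception {
            IgniteConfiguration cfg = new IgniteConfiguration()
                // Events are disabled by default; enable the discovery ones we need.
                .setIncludeEventTypes(EventType.EVT_NODE_FAILED, EventType.EVT_NODE_LEFT);

            Ignite ignite = Ignition.start(cfg);

            ignite.events().localListen(evt -> {
                DiscoveryEvent discoEvt = (DiscoveryEvent) evt;
                System.out.println("Node gone from topology: " + discoEvt.eventNode().id());
                return true; // keep the listener registered
            }, EventType.EVT_NODE_FAILED, EventType.EVT_NODE_LEFT);

            Thread.currentThread().join(); // keep the node running to observe events
        }
    }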




