Hello all,

We observed a weird state in our cluster last week. We are on an Ignite 2.8.1 cluster.
We have about 20 clients connected to a 3-node server cluster.

Discovery SPI issues
=================================================================

1. Server and client nodes were restarted one after the other to apply some OS patches.
2. After that, the baseline showed that the servers were all connected fine and there was no cluster segmentation.
3. The clients showed that they were connected fine to the cluster.
4. A couple of days after this, one of the clients was restarted. For some strange reason, there were two client node ids (the node id from before the restart and the node id from after the restart) connected to the server - we confirmed this by looking at DBeaver -> view -> nodes. The old node id was never removed from the cluster's client node list. (A programmatic version of this check is sketched after this list.)
5. Typically, when a client is restarted, the following events are observed and handled by the discovery SPI / discovery manager (see the event-listener sketch after this list):
   a. The client node joins the cluster with a new Ignite node id - a NODE_JOINED event is observed. (The restarted client did join with a new node id, and NODE_JOINED was observed.)
   b. The old node id of the client is removed from the cluster by either a NODE_LEFT or a NODE_FAILED event. (Neither NODE_LEFT nor NODE_FAILED was observed for the old client node id.)
   c. The client failure detection timeout should then have initiated a NODE_FAILED - but even this was not observed. The log message 'Failing client node due to not receiving metrics' never appeared.

Our cluster has clientFailureDetectionTimeout set to 30 sec and failureDetectionTimeout set to 10 sec.
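The stale-id check from point 4 can also be done programmatically instead of via DBeaver. A minimal sketch, assuming access to an Ignite instance (e.g. on a server node); the helper name is ours:

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.ClusterNode;

// Lists every client node the servers still consider part of the
// topology; a stale entry shows up as the pre-restart node id.
static void dumpClientNodes(Ignite ignite) {
    for (ClusterNode n : ignite.cluster().forClients().nodes())
        System.out.println("client id=" + n.id()
            + ", consistentId=" + n.consistentId()
            + ", order=" + n.order()
            + ", addrs=" + n.addresses());
}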
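To see which of the events in point 5 actually fire, the discovery events can be logged directly. A minimal sketch - note that these event types are disabled by default and must be enabled via IgniteConfiguration.setIncludeEventTypes(...):

import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

static void logDiscoveryEvents(Ignite ignite) {
    ignite.events().localListen((IgnitePredicate<DiscoveryEvent>)evt -> {
        // Logs NODE_JOINED / NODE_LEFT / NODE_FAILED with the affected node id.
        System.out.println(evt.name() + ": node=" + evt.eventNode().id());
        return true; // keep the listener registered
    }, EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
}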
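For completeness, those two timeouts are the standard IgniteConfiguration properties; a minimal sketch in Java form with our values:

import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientFailureDetectionTimeout(30_000); // 30 sec, as in our cluster
cfg.setFailureDetectionTimeout(10_000);       // 10 sec, as in our cluster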
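For context on the next section: the failed sends below are GridContinuousMessage notifications, which is consistent with the server pushing cache updates to a continuous-query listener. A minimal sketch of such a subscription (our application's actual listener differs; the key/value types here are illustrative):

import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

static void subscribe(IgniteCache<Integer, String> cache) {
    ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();
    qry.setLocalListener(evts -> {
        for (CacheEntryEvent<? extends Integer, ? extends String> e : evts)
            System.out.println("update: " + e.getKey() + " -> " + e.getValue());
    });
    // The server keeps pushing MSG_EVT_NOTIFICATION messages to this client
    // for as long as the cursor stays open; against a stale node id those
    // pushes fail, as shown in the logs below.
    QueryCursor<?> cur = cache.query(qry);
}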
Impact of this on the communication SPI
=============================

These clients continuously listen to the server for cache updates. Since the old node id of the client was not removed from the server (communication SPI), hours later, when a cache update was made, the server kept sending the update to the old node id as well - and since that node no longer existed, the send failed. The communication SPI eventually gave up sending the message after exhausting its ExponentialBackoffTimeoutStrategy:

2022-06-15T09:50:54,184 WARN o.a.i.s.c.t.TcpCommunicationSpi [callback-#17507]: Handshake timed out (will stop attempts to perform the handshake) [node=81a87a82-411e-44dd-860d-859681b2dd25, connTimeoutStrategy=ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340561005246301, currTimeout=10000], err=Operation timed out [timeoutStrategy= ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340561005246301, currTimeout=10000]], addr=machinename/x.y.z.a:47101, failureDetectionTimeoutEnabled=false, timeout=0]

2022-06-15T09:50:54,188 ERROR o.a.i.s.c.t.TcpCommunicationSpi [callback-#17507]: Failed to send message to remote node [node=TcpDiscoveryNode [id=81a87a82-411e-44dd-860d-859681b2dd25, consistentId=applicationname, addrs=ArrayList [x.y.z.a], sockAddrs=HashSet [machinename/x.y.z.a:0], discPort=0, order=95, intOrder=66, lastExchangeTime=1655106665418, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true], msg=GridIoMessage [plc=2, topic=T4 [topic=TOPIC_CACHE, id1=94370aa1-e970-37ae-9471-fd583d923522, id2=81a87a82-411e-44dd-860d-859681b2dd25, id3=1], topicOrd=-1, ordered=true, timeout=0, skipOnTimeout=true, msg=GridContinuousMessage [type=MSG_EVT_NOTIFICATION, routineId=18bdbd2e-e45a-4bb1-801c-e5598d2945bb, data=null, futId=null]]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=81a87a82-411e-44dd-860d-859681b2dd25, addrs=[machinename/x.y.z.a:47101]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3698) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3458) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3198) ~[ignite-core-2.8.1.jar:2.8.1]
Caused by: org.apache.ignite.spi.IgniteSpiOperationTimeoutException: Operation timed out [timeoutStrategy= ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340640194097875, currTimeout=10000]]
    at org.apache.ignite.spi.ExponentialBackoffTimeoutStrategy.nextTimeout(ExponentialBackoffTimeoutStrategy.java:103) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.TimeoutStrategy.nextTimeout(TimeoutStrategy.java:39) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3582) ~[ignite-core-2.8.1.jar:2.8.1]

We do have these settings:

<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        ...
        <property name="connectTimeout" value="5000"/>
        <property name="maxConnectTimeout" value="10000"/>
        ...
    </bean>
</property>

======================

As a remedial action, we moved the client to a backup cluster and restarted this main cluster - and now restarting the clients behaves correctly. We are not sure what could have caused the above behaviour. Any guidance on how to debug this would be helpful.

regards,
Veena.