Hello all,

We observed a weird state in our cluster last week. We are on an Ignite 2.8.1 cluster.
We have about 20 clients connected to a 3-node server cluster.

Discovery SPI issues
=================================================================

1. Server and client nodes were restarted one after the other to apply some OS patches.
2. After that, the baseline showed that the servers were all connected fine and there was no cluster segmentation.
3. The clients showed that they were connected fine to the cluster.
4. A couple of days after this, one of the clients was restarted. For some strange reason, there were two client node ids (the node id from before the restart and the node id from after the restart) connected to the server - we confirmed this by looking at DBeaver -> view -> nodes. The old node id was never removed from the cluster's client node list. (A programmatic version of this check is sketched after this list.)
5. Typically, when a client is restarted, the following events are observed and handled by the discovery SPI / discovery manager (see the event-listener sketch after this list):
   a. The client node joins the cluster with a new Ignite node id - a NODE_JOINED event is observed. (The restarted client did join with a new node id, and NODE_JOINED was observed.)
   b. The old node id of the client is removed from the cluster by either a NODE_LEFT or a NODE_FAILED event. (Neither NODE_LEFT nor NODE_FAILED was observed for the old client node id.)
   c. The client failure detection timeout should then have initiated a NODE_FAILED - but even this was not observed. The log message 'Failing client node due to not receiving metrics' never appeared.

Our cluster has clientFailureDetectionTimeout set to 30 sec and failureDetectionTimeout set to 10 sec.
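The stale-id check from point 4 can also be done programmatically instead of via DBeaver. A minimal sketch, assuming access to an Ignite instance (e.g. on a server node); the helper name is ours:

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.ClusterNode;

// Lists every client node the servers still consider part of the
// topology; a stale entry shows up as the pre-restart node id.
static void dumpClientNodes(Ignite ignite) {
    for (ClusterNode n : ignite.cluster().forClients().nodes())
        System.out.println("client id=" + n.id()
            + ", consistentId=" + n.consistentId()
            + ", order=" + n.order()
            + ", addrs=" + n.addresses());
}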
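To see which of the events in point 5 actually fire, the discovery events can be logged directly. A minimal sketch - note that these event types are disabled by default and must be enabled via IgniteConfiguration.setIncludeEventTypes(...):

import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

static void logDiscoveryEvents(Ignite ignite) {
    ignite.events().localListen((IgnitePredicate<DiscoveryEvent>)evt -> {
        // Logs NODE_JOINED / NODE_LEFT / NODE_FAILED with the affected node id.
        System.out.println(evt.name() + ": node=" + evt.eventNode().id());
        return true; // keep the listener registered
    }, EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
}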
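For completeness, those two timeouts are the standard IgniteConfiguration properties; a minimal sketch in Java form with our values:

import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientFailureDetectionTimeout(30_000); // 30 sec, as in our cluster
cfg.setFailureDetectionTimeout(10_000);       // 10 sec, as in our cluster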
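For context on the next section: the failed sends below are GridContinuousMessage notifications, which is consistent with the server pushing cache updates to a continuous-query listener. A minimal sketch of such a subscription (our application's actual listener differs; the key/value types here are illustrative):

import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

static void subscribe(IgniteCache<Integer, String> cache) {
    ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();
    qry.setLocalListener(evts -> {
        for (CacheEntryEvent<? extends Integer, ? extends String> e : evts)
            System.out.println("update: " + e.getKey() + " -> " + e.getValue());
    });
    // The server keeps pushing MSG_EVT_NOTIFICATION messages to this client
    // for as long as the cursor stays open; against a stale node id those
    // pushes fail, as shown in the logs below.
    QueryCursor<?> cur = cache.query(qry);
}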
Impact of this on the communication SPI
=============================

These clients continuously listen to the server for cache updates. Since the old node id of the client was not removed from the server (communication SPI), hours later, when a cache update was made, the server kept sending the update to the old node id as well - and since that node no longer existed, the send failed. The communication SPI eventually gave up sending the message after exhausting its ExponentialBackoffTimeoutStrategy:

2022-06-15T09:50:54,184 WARN o.a.i.s.c.t.TcpCommunicationSpi [callback-#17507]: Handshake timed out (will stop attempts to perform the handshake) [node=81a87a82-411e-44dd-860d-859681b2dd25, connTimeoutStrategy=ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340561005246301, currTimeout=10000], err=Operation timed out [timeoutStrategy= ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340561005246301, currTimeout=10000]], addr=machinename/x.y.z.a:47101, failureDetectionTimeoutEnabled=false, timeout=0]

2022-06-15T09:50:54,188 ERROR o.a.i.s.c.t.TcpCommunicationSpi [callback-#17507]: Failed to send message to remote node [node=TcpDiscoveryNode [id=81a87a82-411e-44dd-860d-859681b2dd25, consistentId=applicationname, addrs=ArrayList [x.y.z.a], sockAddrs=HashSet [machinename/x.y.z.a:0], discPort=0, order=95, intOrder=66, lastExchangeTime=1655106665418, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true], msg=GridIoMessage [plc=2, topic=T4 [topic=TOPIC_CACHE, id1=94370aa1-e970-37ae-9471-fd583d923522, id2=81a87a82-411e-44dd-860d-859681b2dd25, id3=1], topicOrd=-1, ordered=true, timeout=0, skipOnTimeout=true, msg=GridContinuousMessage [type=MSG_EVT_NOTIFICATION, routineId=18bdbd2e-e45a-4bb1-801c-e5598d2945bb, data=null, futId=null]]]
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=81a87a82-411e-44dd-860d-859681b2dd25, addrs=[machinename/x.y.z.a:47101]]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3698) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3458) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3198) ~[ignite-core-2.8.1.jar:2.8.1]
Caused by: org.apache.ignite.spi.IgniteSpiOperationTimeoutException: Operation timed out [timeoutStrategy= ExponentialBackoffTimeoutStrategy [maxTimeout=10000, totalTimeout=15000, startNanos=340640194097875, currTimeout=10000]]
    at org.apache.ignite.spi.ExponentialBackoffTimeoutStrategy.nextTimeout(ExponentialBackoffTimeoutStrategy.java:103) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.TimeoutStrategy.nextTimeout(TimeoutStrategy.java:39) ~[ignite-core-2.8.1.jar:2.8.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3582) ~[ignite-core-2.8.1.jar:2.8.1]

We do have these settings:

<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        ...
        <property name="connectTimeout" value="5000"/>
        <property name="maxConnectTimeout" value="10000"/>
        ...
    </bean>
</property>

======================

As a remedial action, we moved the client to a backup cluster and restarted this main cluster - and now restarting the clients behaves correctly. We are not sure what could have caused the above behaviour. Any guidance on how to debug this would be helpful.

regards,
Veena.