Hi,

We had a problem in a cluster running Ignite 2.10: two of the four server nodes left the cluster and then joined again, and the partition map exchange never finished.
In particular, the problem seems to be on one of the remaining nodes, which reports a blocked system-critical thread: the disco-event-worker has been blocked for more than 19 s. I checked the GC logs and there is no relevant pause at the time the block occurs.

Some relevant logs from the remaining node:

>2022-03-22 13:33:42.439 WARN [services-deployment-worker-#95] o.a.i.i.p.s.ServiceDeploymentManager - Failed to wait service deployment process or timeout had been reached, timeout=10000, taskDepId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], reqId=null]
>2022-03-22 13:33:42.687 WARN [exchange-worker-#54] o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await partitions release latch within timeout. Some nodes have not sent acknowledgement for latch completion. It's possible due to unfinished atomic updates, transactions or not released explicit locks on that nodes. Please check logs for errors on nodes with ids reported in latch `pendingAcks` collection [latch=ServerLatch [permits=1, pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb], super=CompletableLatch [id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>2022-03-22 13:33:51.681 ERROR [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.u.typedef.G - Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=disco-event-worker, threadName=disco-event-worker-#50, blockedFor=19s]
>2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.ignite.Ignite - Possible failure suppressed accordingly to a configured handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=null, finished=false, heartbeatTs=1647956012230]]]
org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=null, finished=false, heartbeatTs=1647956012230]
        at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.p.f.FailureProcessor - No deadlocked threads detected.
...
A thread dump of the blocked thread:

Thread [name="disco-event-worker-#50", id=306, state=WAITING, blockCnt=1, waitCnt=543567]
        at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
        at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3397)
        at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3380)
        at o.a.i.i.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1437)
        at o.a.i.i.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:952)
        at es.decide.dwm.service.IgniteManager.removeProcessingRequest(IgniteManager.java:186)
        at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$1(IgniteManager.java:178)
        at es.decide.dwm.service.IgniteManager$$Lambda$8656/0x000000080019d040.accept(Unknown Source)
        at java.base@11.0.9/java.util.ArrayList.forEach(ArrayList.java:1541)
        at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$3c60aaa7$1(IgniteManager.java:176)
        at es.decide.dwm.service.IgniteManager$$Lambda$3251/0x0000000802113c40.apply(Unknown Source)
        at o.a.i.i.GridEventConsumeHandler$2.onEvent(GridEventConsumeHandler.java:192)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1399)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2894)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3110)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2914)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
        at java.base@11.0.9/java.lang.Thread.run(Thread.java:834)
..

>2022-03-22 13:33:52.687 WARN [exchange-worker-#54] o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await partitions release latch within timeout. Some nodes have not sent acknowledgement for latch completion. It's possible due to unfinished atomic updates, transactions or not released explicit locks on that nodes. Please check logs for errors on nodes with ids reported in latch `pendingAcks` collection [latch=ServerLatch [permits=1, pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb], super=CompletableLatch [id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]

The cause seems to be a listener that we register with ignite.events(forLocal).remoteListen() and that deletes/puts some cache entries during NODE_LEFT events. What I don't understand is why it runs on a system-critical Ignite thread.
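For context, here is a simplified sketch of how the listener is registered. This is not our exact code; the cache name, the key handling and the event types are placeholders:

import java.util.UUID;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgniteBiPredicate;

public class NodeLeftCleanupSketch {

    public static void main(String[] args) {
        // Assumes EVT_NODE_LEFT / EVT_NODE_FAILED are enabled via
        // IgniteConfiguration.setIncludeEventTypes(...) on the nodes.
        Ignite ignite = Ignition.start();

        // Placeholder cache holding per-node "processing request" entries.
        IgniteCache<UUID, Object> processingRequests =
            ignite.getOrCreateCache("processingRequests");

        // Local listener: Ignite invokes it on this node for every matching event.
        // The cache.remove() inside is a synchronous distributed operation.
        IgniteBiPredicate<UUID, DiscoveryEvent> locLsnr = (senderNodeId, evt) -> {
            processingRequests.remove(evt.eventNode().id());
            return true; // keep listening
        };

        // Registration roughly as described above: ignite.events(forLocal).remoteListen(...),
        // with no remote filter (null), for node-left/failed discovery events.
        ignite.events(ignite.cluster().forLocal())
            .remoteListen(locLsnr, null, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}

If I read the thread dump correctly, the IgniteManager.lambda$registerIgniteEvents frames and GridEventStorageManager$LocalListenerWrapper.onEvent() correspond to this local listener being called directly from the disco-event-worker thread, and the cache remove() then blocks in GridFutureAdapter.get().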
Do you think this listener is the possible cause of the cluster problem? Besides removing the listener, is there any way to run it in a safer way? And if the listener isn't the problem, any help or clue on where the problem could be would be very welcome.

Thanks a lot in advance,

--
Joan Jesús Pujol Espinar
http://www.joanpujol.cat