Hi,

We had a cluster problem when two of the four server nodes left the
cluster and then rejoined it, on a cluster running Ignite 2.10.
The problem seems to be that the partition map exchange never finished.

In particular, the problem seems to be on one of the remaining nodes,
which reports a blocked system-critical thread error for disco-event-worker,
blocked for more than 19s.
I checked the GC logs and there is no relevant pause at the time the block
occurs.

Some relevant logs:

remaining node:
>2022-03-22 13:33:42.439 WARN  [services-deployment-worker-#95]
o.a.i.i.p.s.ServiceDeploymentManager - Failed to wait service deployment
process or timeout had been reached, timeout=10000,
taskDepId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion
[topVer=9, minorTopVer=0], reqId=null]
>2022-03-22 13:33:42.687 WARN  [exchange-worker-#54]
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
partitions release latch within timeout. Some nodes have not sent
acknowledgement for latch completion. It's possible due to unfinishined
atomic updates, transactions or not released explicit locks on that nodes.
Please check logs for errors on nodes with ids reported in latch
`pendingAcks` collection [latch=ServerLatch [permits=1, pendingAcks=HashSet
[6d63e1bd-9de4-4341-81bb-c3a293ebe2eb], super=CompletableLatch
[id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion
[topVer=9, minorTopVer=0]]]]]
>2022-03-22 13:33:51.681 ERROR [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0
:1%lo:47500]-#2-#45] o.a.i.i.u.typedef.G - Blocked system-critical thread
has been detected. This can lead to cluster-wide undefined behaviour
[workerName=disco-event-worker, threadName=disco-event-worker-#50,
blockedFor=19s]
>2022-03-22 13:33:51.701 WARN  [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0
:1%lo:47500]-#2-#45] o.a.ignite.Ignite - Possible failure suppressed
accordingly to a configured handler [hnd=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=disco-event-worker, igniteInstanceName=null, finished=false,
heartbeatTs=1647956012230]]]
org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker,
igniteInstanceName=null, finished=false, heartbeatTs=1647956012230]
  at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
  at java.base@11.0.9
/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
  at
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
  at
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
  at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>2022-03-22 13:33:51.701 WARN  [tcp-disco-msg-worker-[5bc334df 0:0:0:0:0:0:0
:1%lo:47500]-#2-#45] o.a.i.i.p.f.FailureProcessor - No deadlocked threads
detected.
...
A thread dump of the blocked thread:
Thread [name="disco-event-worker-#50", id=306, state=WAITING, blockCnt=1,
waitCnt=543567]
        at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@11.0.9
/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
        at
o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at
o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
        at
o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3397)
        at
o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3380)
        at
o.a.i.i.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1437)
        at
o.a.i.i.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:952)
        at
es.decide.dwm.service.IgniteManager.removeProcessingRequest(IgniteManager.java:186)
        at
es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$1(IgniteManager.java:178)
        at
es.decide.dwm.service.IgniteManager$$Lambda$8656/0x000000080019d040.accept(Unknown
Source)
        at java.base@11.0.9/java.util.ArrayList.forEach(ArrayList.java:1541)
        at
es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$3c60aaa7$1(IgniteManager.java:176)
        at
es.decide.dwm.service.IgniteManager$$Lambda$3251/0x0000000802113c40.apply(Unknown
Source)
        at
o.a.i.i.GridEventConsumeHandler$2.onEvent(GridEventConsumeHandler.java:192)
        at
o.a.i.i.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1399)
        at
o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894)
        at
o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879)
        at
o.a.i.i.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350)
        at
o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313)
        at
o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2894)
        at
o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3110)
        at
o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2914)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
        at java.base@11.0.9/java.lang.Thread.run(Thread.java:834)

...
>2022-03-22 13:33:52.687 WARN  [exchange-worker-#54]
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
partitions release latch within timeout. Some nodes have not sent
acknowledgement for latch completion. It's possible due to unfinishined
atomic updates, transactions or not released explicit locks on that nodes.
Please check logs for errors on nodes with ids reported in latch
`pendingAcks` collection [latch=ServerLatch [permits=1, pendingAcks=HashSet
[6d63e1bd-9de4-4341-81bb-c3a293ebe2eb], super=CompletableLatch
[id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion
[topVer=9, minorTopVer=0]]]]]


The problem seems to be a listener that we register with
ignite.events(forLocal).remoteListen() and that deletes/puts some cache
entries when NODE_LEFT events fire. What I don't understand is why it runs
inside a critical Ignite thread.
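
For reference, the registration looks roughly like this (simplified; names
such as processingRequests are placeholders, not our exact code):

import java.util.UUID;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

public class NodeLeftCleanup {
    /** Registers a listener that cleans up per-node entries when a node leaves. */
    static void register(Ignite ignite, IgniteCache<UUID, Object> processingRequests) {
        ignite.events(ignite.cluster().forLocal()).remoteListen(
            // Local listener: invoked when a matching event is delivered on this node.
            (UUID nodeId, DiscoveryEvent evt) -> {
                // This synchronous remove is what appears in the blocked stack trace.
                processingRequests.remove(evt.eventNode().id());
                return true; // keep the subscription alive
            },
            null, // no remote filter
            EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}
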
Do you think this listener is the likely cause of the cluster problem?
Besides removing the listener, is there any way to run it more safely, for
example something like the sketch below that offloads the cache operation
from the discovery thread? And if the listener isn't the problem, any help
or clue about where to look would be very welcome.
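
This is the kind of change I have in mind (a rough, untested sketch; the
single-thread executor and processingRequests are placeholders, and
IgniteCache.removeAsync() would be an alternative to the executor), so that
the listener returns immediately and the discovery worker never waits on a
cache future:

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

public class NodeLeftCleanupAsync {
    /** Dedicated pool so the discovery worker only enqueues the cleanup work. */
    private static final ExecutorService CLEANUP = Executors.newSingleThreadExecutor();

    static void register(Ignite ignite, IgniteCache<UUID, Object> processingRequests) {
        ignite.events(ignite.cluster().forLocal()).remoteListen(
            (UUID nodeId, DiscoveryEvent evt) -> {
                UUID leftNodeId = evt.eventNode().id();

                // Return from the listener right away; the cache remove runs on our
                // own thread instead of the discovery worker.
                CLEANUP.submit(() -> processingRequests.remove(leftNodeId));

                return true; // keep the subscription alive
            },
            null, // no remote filter
            EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}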

Many thanks in advance,

-- 
Joan Jesús Pujol Espinar
http://www.joanpujol.cat
