Hello. I checked your logs and did not find anything conclusive; it looks like we need more detailed logs and an inspection of the infrastructure. The quickest option in your case is simply to upgrade to a newer version, since significant bugs have been fixed since 2.8.1. (A rough sketch of the failure-handler wiring is at the bottom of this message, below your quoted mail.)

> Hello,
>
> Background
> We are using Ignite version 2.8.1-1 and running a cluster of 6 Ignite server nodes with the same configuration. Each Ignite server node has 4 cores and 16GB of host memory. The service configuration is attached. We use Java clients to connect to the Ignite server nodes to write and read caches. We do not use any of the SQL functionality.
>
> The Issue
> In the past few months, it has happened multiple times that one of the six nodes got into the SYSTEM_WORKER_BLOCKED state. When this happens, the performance of the other 5 nodes is also impacted. The metrics printed in the log just before the issue occurred suggest the system was not short on resources: CPU usage was low, there was plenty of room on the heap, and the affected thread was not in a deadlock state either. Without an explicit error or warning log about the issue, it is hard to tell what went wrong. Could you please take a look at the configuration and log and give us some hints?
>
> In addition, we have configured the RestartProcessFailureHandler to handle system errors like SYSTEM_WORKER_BLOCKED. The node is supposed to restart itself, but it never did, which is a separate issue. Maybe the RestartProcessFailureHandler is not suitable for handling such a failure?
>
> Logs
>
> [2022-04-05T04:14:34,104] [INFO ] grid-timeout-worker-#23%ignite-jetstream-prd1% [IgniteKernal%ignite-jetstream-prd1]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=d26bf020, name=ignite-jetstream-prd1, uptime=2 days, 08:44:17.623]
>     ^-- H/N/C [hosts=89, nodes=218, CPUs=1352]
>     ^-- CPU [cur=1.83%, avg=14.86%, GC=0%]
>     ^-- PageMemory [pages=1576438]
>     ^-- Heap [used=1015MB, free=57.37%, comm=2382MB]
>     ^-- Off-heap [used=6230MB, free=6.35%, comm=6552MB]
>     ^--   sysMemPlc region [used=0MB, free=99.99%, comm=100MB]
>     ^--   default region [used=6229MB, free=1.93%, comm=6352MB]
>     ^--   metastoreMemPlc region [used=0MB, free=99.62%, comm=0MB]
>     ^--   TxLog region [used=0MB, free=100%, comm=100MB]
>     ^-- Ignite persistence [used=55405MB]
>     ^--   sysMemPlc region [used=0MB]
>     ^--   default region [used=55404MB]
>     ^--   metastoreMemPlc region [used=0MB]
>     ^--   TxLog region [used=0MB]
>     ^-- Outbound messages queue [size=0]
>     ^-- Public thread pool [active=0, idle=2, qSize=0]
>     ^-- System thread pool [active=0, idle=8, qSize=0]
> ...
> [2022-04-05T04:15:13,406] [ERROR] grid-timeout-worker-#23%ignite-jetstream-prd1% [G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-3, threadName=sys-stripe-3-#4%ignite-jetstream-prd1%, blockedFor=12s]
> [2022-04-05T04:15:13,417] [WARN ] grid-timeout-worker-#23%ignite-jetstream-prd1% [G] Thread [name="sys-stripe-3-#4%ignite-jetstream-prd1%", id=20, state=TIMED_WAITING, blockCnt=15385, waitCnt=12226082]
>     Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@70b28434, ownerName=null, ownerId=-1]
> [2022-04-05T04:15:13,419] [ERROR] grid-timeout-worker-#23%ignite-jetstream-prd1% [] Critical system error detected. Will be handled accordingly to configured handler [hnd=RestartProcessFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet []]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-3, igniteInstanceName=ignite-jetstream-prd1, finished=false, heartbeatTs=1649132101219]]]
> org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-3, igniteInstanceName=ignite-jetstream-prd1, finished=false, heartbeatTs=1649132101219]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1810) [ignite-core-2.8.1.jar:2.8.1]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1805) [ignite-core-2.8.1.jar:2.8.1]
>     at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:234) [ignite-core-2.8.1.jar:2.8.1]
>     at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.8.1.jar:2.8.1]
>     at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221) [ignite-core-2.8.1.jar:2.8.1]
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.8.1.jar:2.8.1]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
> ...
> [2022-04-05T04:15:15,970] [WARN ] grid-timeout-worker-#23%ignite-jetstream-prd1% [G] >>> Possible starvation in striped pool.
>     Thread name: sys-stripe-3-#4%ignite-jetstream-prd1%
>     Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridCacheTtlUpdateRequest [keys=ArrayList [KeyCacheObjectImpl [part=917, val=null, hasValBytes=true]], nearKeys=null, ttl=691200000, topVer=AffinityTopologyVersion [topVer=18213, minorTopVer=0], super=GridCacheIdMessage [cacheId=402651327, super=GridCacheMessage [msgId=1015456383, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=17908, minorTopVer=1], err=null, skipPrepare=false]]]]],
>       Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicUpdateRequest [keys=ArrayList [KeyCacheObjectImpl [part=467, val=null, hasValBytes=true], KeyCacheObjectImpl [part=860, val=null, hasValBytes=true]], vals=ArrayList [null, null], prevVals=null, ttls=null, conflictExpireTimes=null, nearTtls=null, nearExpireTimes=null, nearKeys=null, nearVals=null, obsoleteIndexes=null, forceTransformBackups=false, updateCntrs=GridLongList [idx=2, arr=[1536367,1538883]], super=GridDhtAtomicAbstractUpdateRequest [onRes=false, nearNodeId=1dc2c320-2d5f-4c1e-b4ad-a6c92a578164, nearFutId=615246, flags=hasRes]]]],
>       Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicUpdateRequest [keys=ArrayList [KeyCacheObjectImpl [part=363, val=null, hasValBytes=true]], vals=ArrayList [BinaryObjectImpl [arr=true, ctx=false, start=0]], prevVals=null, ttls=GridLongList [idx=1, arr=[691200000]], conflictExpireTimes=null, nearTtls=null, nearExpireTimes=null, nearKeys=null, nearVals=null, obsoleteIndexes=null, forceTransformBackups=false, updateCntrs=GridLongList [idx=1, arr=[535297]], super=GridDhtAtomicAbstractUpdateRequest [onRes=false, nearNodeId=d4e9311d-4d6d-406b-8152-db69ae40e203, nearFutId=416118, flags=hasRes]]]],
>       Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[3630062]]]]],
>       Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[3630062]]]]],
>       o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@6f0944ce,
>       o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@44c595b1,
>       Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[3649221]]]]]]
>     Deadlock: false
>     Completed: 8884416
>     Thread [name="sys-stripe-3-#4%ignite-jetstream-prd1%", id=20, state=TIMED_WAITING, blockCnt=15385, waitCnt=12226082]
>     Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@70b28434, ownerName=null, ownerId=-1]
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>         at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1626)
>         at o.a.i.i.processors.cache.GridCacheMapEntry.updateTtl(GridCacheMapEntry.java:3023)
>         at o.a.i.i.processors.cache.GridCacheMapEntry.updateTtl(GridCacheMapEntry.java:4191)
>         at o.a.i.i.processors.cache.distributed.dht.GridDhtCacheAdapter.updateTtl(GridDhtCacheAdapter.java:1226)
>         at o.a.i.i.processors.cache.distributed.dht.GridDhtCacheAdapter.processTtlUpdateRequest(GridDhtCacheAdapter.java:1189)
>         at o.a.i.i.processors.cache.distributed.dht.GridDhtCacheAdapter.access$300(GridDhtCacheAdapter.java:104)
>         at o.a.i.i.processors.cache.distributed.dht.GridDhtCacheAdapter$3.apply(GridDhtCacheAdapter.java:398)
>         at o.a.i.i.processors.cache.distributed.dht.GridDhtCacheAdapter$3.apply(GridDhtCacheAdapter.java:396)
>         at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
>         at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
>         at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
>         at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
>         at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
>         at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
>         at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1847)
>         at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1472)
>         at o.a.i.i.managers.communication.GridIoManager.access$5200(GridIoManager.java:229)
>         at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1367)
>         at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:565)
>         at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
>         at java.lang.Thread.run(Thread.java:748)
>
> Ray
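Regarding the failure handler you mention: below is a minimal sketch of how RestartProcessFailureHandler is typically wired into an IgniteConfiguration programmatically. This is not your attached configuration (which I have not reproduced here), just an illustration; the empty ignoredFailureTypes set mirrors the "ignoredFailureTypes=UnmodifiableSet []" visible in your log, which is why SYSTEM_WORKER_BLOCKED reaches the handler instead of only being logged.

    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.RestartProcessFailureHandler;

    public class FailureHandlerSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Restart the whole process on a critical failure. The empty
            // ignoredFailureTypes set (as in the posted log) means
            // SYSTEM_WORKER_BLOCKED triggers the handler instead of being
            // ignored.
            RestartProcessFailureHandler hnd = new RestartProcessFailureHandler();
            hnd.setIgnoredFailureTypes(Collections.emptySet());
            cfg.setFailureHandler(hnd);

            // Started inline only to make the sketch self-contained; the node
            // stops when this try-with-resources block exits.
            try (Ignite ignite = Ignition.start(cfg)) {
                System.out.println("Node started: " + ignite.cluster().localNode().id());
            }
        }
    }

As far as I know, the actual process restart relies on the ignite.sh / ignite.bat start scripts to bring the JVM back up; a node embedded in a custom JVM is stopped by the handler but does not come back on its own, which may be worth checking for your separate restart issue.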