Ivan Bessonov created IGNITE-13814:
--------------------------------------

             Summary: Long restorePartitionStates triggers FailureHandler on node startup
                 Key: IGNITE-13814
                 URL: https://issues.apache.org/jira/browse/IGNITE-13814
             Project: Ignite
          Issue Type: Bug
         Environment: {noformat}
Thread [name="sys-stripe-4-#5%EPE_CLUSTER_PERF%", id=24, state=WAITING, blockCnt=4, waitCnt=70836]
    at java.base@11.0.8/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base@11.0.8/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
    at app//o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:186)
    at app//o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:154)
    at app//o.a.i.i.processors.cache.persistence.file.AsyncFileIO.read(AsyncFileIO.java:128)
    at app//o.a.i.i.processors.cache.persistence.file.AbstractFileIO$2.run(AbstractFileIO.java:89)
    at app//o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:52)
    at app//o.a.i.i.processors.cache.persistence.file.AbstractFileIO.readFully(AbstractFileIO.java:87)
    at app//o.a.i.i.processors.cache.persistence.file.FilePageStore.readWithFailover(FilePageStore.java:794)
    at app//o.a.i.i.processors.cache.persistence.file.FilePageStore.read(FilePageStore.java:418)
    at app//o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:519)
    at app//o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:503)
    at app//o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:874)
    at app//o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:700)
    at app//o.a.i.i.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:689)
    at app//o.a.i.i.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:157)
    at app//o.a.i.i.processors.cache.persistence.freelist.PagesList.init(PagesList.java:274)
    at app//o.a.i.i.processors.cache.persistence.freelist.AbstractFreeList.<init>(AbstractFreeList.java:390)
    at app//o.a.i.i.processors.cache.persistence.freelist.CacheFreeList.<init>(CacheFreeList.java:57)
    at app//o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:1806)
    at app//o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1805)
    at app//o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init(GridCacheOffheapManager.java:2130)
    at app//o.a.i.i.processors.cache.persistence.GridCacheOffheapManager.restorePartitionStates(GridCacheOffheapManager.java:544)
    at app//o.a.i.i.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle.lambda$restorePartitionStates$0(GridCacheProcessor.java:5253)
    at app//o.a.i.i.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle$$Lambda$633/0x0000000800717040.run(Unknown Source)
    at app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:559)
    at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:119)
    at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
{noformat}
In this case, warm-up is on, but the client also reports this happening without warm-up. I don't think that restoring partition states should trigger the FailureHandler (FH). It may take a long time with PDS. Also, why do we run it in the striped pool? Imagine two large caches are assigned to the same stripe: the restore time doubles.
            Reporter: Ivan Bessonov
            Assignee: Ivan Bessonov

The following would be printed to the log:
{noformat}
[2020-10-30T17:32:26,190][WARN ][grid-timeout-worker-#22%EPE_CLUSTER_PERF%][] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-4, igniteInstanceName=EPE_CLUSTER_PERF, finished=false, heartbeatTs=1604104192954]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-4, igniteInstanceName=EPE_CLUSTER_PERF, finished=false, heartbeatTs=1604104192954]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1859) [ignite-core-8.7.28.jar:8.7.28]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1854) [ignite-core-8.7.28.jar:8.7.28]
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-8.7.28.jar:8.7.28]
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:296) [ignite-core-8.7.28.jar:8.7.28]
    at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:265) [ignite-core-8.7.28.jar:8.7.28]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119) [ignite-core-8.7.28.jar:8.7.28]
    at java.lang.Thread.run(Thread.java:834) [?:?]
{noformat}
Actual thread dump of the affected thread: see the Environment field.
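
The stripe concern raised in the Environment notes (two large caches on the same stripe doubling the restore time) can be sketched with a small model. This is only an illustration, not Ignite code: the class StripeCollisionSketch, the modulo mapping in stripeFor, the group ids, the stripe count of 8, and the 60-second restore cost are all assumptions chosen for the example. How Ignite actually assigns restore tasks to stripes is not stated in this issue.

```java
import java.util.Arrays;

public class StripeCollisionSketch {
    /** Hypothetical mapping: a task keyed by a cache group id lands on stripe (id mod stripes). */
    static int stripeFor(int grpId, int stripes) {
        return Math.floorMod(grpId, stripes);
    }

    /**
     * Models the time until all restores finish when each stripe is a single
     * thread that runs its tasks one after another: the busiest stripe decides.
     */
    static long makespanMs(int[] grpIds, long[] restoreMs, int stripes) {
        long[] busy = new long[stripes];
        for (int i = 0; i < grpIds.length; i++)
            busy[stripeFor(grpIds[i], stripes)] += restoreMs[i];
        return Arrays.stream(busy).max().orElse(0L);
    }

    public static void main(String[] args) {
        long[] cost = {60_000L, 60_000L}; // two large caches, 60 s each (assumed)

        // Groups 4 and 12 both map to stripe 4 of 8: restores run back to back.
        System.out.println(makespanMs(new int[] {4, 12}, cost, 8)); // prints 120000

        // Groups 4 and 13 map to stripes 4 and 5: restores run in parallel.
        System.out.println(makespanMs(new int[] {4, 13}, cost, 8)); // prints 60000
    }
}
```

In this model the colliding pair keeps a single stripe busy for the sum of both restore times, which is also the interval during which that stripe's worker would appear blocked to a liveness check.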