Iurii Gerzhedovich created IGNITE-24895:
-------------------------------------------

             Summary: AI3. Cluster became broken after a few restarts
                 Key: IGNITE-24895
                 URL: https://issues.apache.org/jira/browse/IGNITE-24895
             Project: Ignite
          Issue Type: Improvement
            Reporter: Iurii Gerzhedovich


The scenario is straightforward, but the number of restarts needed to reproduce it varies. I run org.apache.ignite.internal.benchmark.TpchBenchmark with the TPCH SF 0.1 dataset and a defined working directory, so that persistence is kept between runs. In other words, the scenario is just:
1. Create a 3-node cluster.
2. Load some data.
3. Run a read-only (RO) SQL load.
4. Restart the cluster.
5. Go to step 3.

After an undefined number of restarts the cluster becomes broken and logs tons of errors. Trying to start the cluster again on the same persistence leads to the same issue.

The first exception in the logs:
{code:java}
2025-03-21T10:12:29,195][WARN ][%node_3345%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=true, timeout=60000, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
org.apache.ignite.lang.IgniteException: A critical thread is blocked for 524 ms that is more than the allowed 500 ms, it is "%node_3345%MessagingService-inbound-Default-0-0" prio=10 Id=292 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734 owned by "%node_3345%metastorage-compaction-executor-0" Id=595
    at java.base@11.0.25/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734
    at java.base@11.0.25/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
    at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:1009)
    at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1324)
    at java.base@11.0.25/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    at app//org.apache.ignite.internal.metastorage.server.AbstractKeyValueStorage.getCompactionRevision(AbstractKeyValueStorage.java:258)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.withTrackReadOperationFromLeaderFuture(MetaStorageManagerImpl.java:1260)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.lambda$getAll$49(MetaStorageManagerImpl.java:916)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl$$Lambda$1794/0x0000000800bc2c40.get(Unknown Source)
    at app//org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868)
    at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.getAll(MetaStorageManagerImpl.java:914)
    at app//org.apache.ignite.internal.table.distributed.TableManager.lambda$writeTableAssignmentsToMetastore$51(TableManager.java:1089)
    at app//org.apache.ignite.internal.table.distributed.TableManager$$Lambda$1887/0x0000000800bf7440.apply(Unknown Source)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
    at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$sendWithRetry$49(RaftGroupServiceImpl.java:624)
    at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl$$Lambda$1341/0x0000000800a4f840.accept(Unknown Source)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base@11.0.25/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.onInvokeResponse(DefaultMessagingService.java:584)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.handleInvokeResponse(DefaultMessagingService.java:475)
    at app//org.apache.ignite.internal.network.DefaultMessagingService.lambda$handleMessageFromNetwork$4(DefaultMessagingService.java:409)
    at app//org.apache.ignite.internal.network.DefaultMessagingService$$Lambda$1545/0x0000000800ad9040.run(Unknown Source)
    at java.base@11.0.25/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.25/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.25/java.lang.Thread.run(Thread.java:829)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@71945073
{code}
After that, the logs contain tons of similar exceptions with increasing timeouts, up to half a minute (maybe more, but that would require waiting too long).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
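
For readers less familiar with the lock involved: the first exception in the report shows the inbound messaging thread parked in ReentrantReadWriteLock$ReadLock.lock() while the lock is owned by the metastorage compaction executor thread. The sketch below reproduces only that general pattern with plain JDK classes; it is not Ignite code. The class name, the thread name and the 100 ms timeout are hypothetical, and a bounded tryLock is used instead of the unbounded lock() from the trace only so that the demo terminates deterministically.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration: a reader cannot take the read lock
// while another thread holds the write lock of the same ReentrantReadWriteLock.
final class ReadLockContentionSketch {

    /** Returns {acquiredWhileWriterHeld, acquiredAfterWriterReleased}. */
    static boolean[] demo() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writerHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseWriter = new CountDownLatch(1);

        // Stands in for a long-running task that owns the write lock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            try {
                writerHoldsLock.countDown();
                releaseWriter.await(); // keep the write lock until told to release it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        }, "hypothetical-write-lock-owner");
        writer.start();
        writerHoldsLock.await();

        // The calling thread plays the role of the blocked reader.
        boolean whileHeld = tryRead(lock, 100);

        releaseWriter.countDown();
        writer.join();
        boolean afterRelease = tryRead(lock, 100);

        return new boolean[] {whileHeld, afterRelease};
    }

    /** Tries to take and immediately release the read lock within the timeout. */
    static boolean tryRead(ReentrantReadWriteLock lock, long timeoutMs) throws InterruptedException {
        if (!lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        lock.readLock().unlock();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = demo();
        System.out.println("read lock acquired while writer held it: " + result[0]);   // false
        System.out.println("read lock acquired after writer released: " + result[1]);  // true
    }
}
```

With an unbounded lock() call, as in the trace, the reader would simply stay parked for as long as the write lock is held, which would match the growing blocked times described in the report. Why the write lock is held that long after a restart is not something this sketch explains.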