[ https://issues.apache.org/jira/browse/IGNITE-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirill Tkalenko updated IGNITE-24895:
-------------------------------------
    Fix Version/s: 3.1

> AI3. Cluster became broken after a few restarts
> -----------------------------------------------
>
>                 Key: IGNITE-24895
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24895
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Iurii Gerzhedovich
>            Assignee: Kirill Tkalenko
>            Priority: Blocker
>              Labels: ignite-3
>             Fix For: 3.1
>
>
> The scenario is straightforward, though the number of restarts needed to reproduce it varies. I simply ran org.apache.ignite.internal.benchmark.TpchBenchmark with the TPC-H SF 0.1 dataset and a fixed working directory, so persistence was kept across runs.
> In other words, the scenario is:
> 1. Create a 3-node cluster.
> 2. Load some data.
> 3. Run SQL RO load.
> 4. Restart the cluster.
> 5. Go to 3.
> After an undefined number of restarts the cluster becomes broken, with tons of errors in the logs. Trying to start the cluster again on the same persistence leads to the same issue.
> The first exception in the logs:
> {code:java}
> [2025-03-21T10:12:29,195][WARN ][%node_3345%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=true, timeout=60000, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
> org.apache.ignite.lang.IgniteException: A critical thread is blocked for 524 ms that is more than the allowed 500 ms, it is "%node_3345%MessagingService-inbound-Default-0-0" prio=10 Id=292 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734 owned by "%node_3345%metastorage-compaction-executor-0" Id=595
>     at java.base@11.0.25/jdk.internal.misc.Unsafe.park(Native Method)
>     - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@24577734
>     at java.base@11.0.25/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
>     at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
>     at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:1009)
>     at java.base@11.0.25/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1324)
>     at java.base@11.0.25/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
>     at app//org.apache.ignite.internal.metastorage.server.AbstractKeyValueStorage.getCompactionRevision(AbstractKeyValueStorage.java:258)
>     at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.withTrackReadOperationFromLeaderFuture(MetaStorageManagerImpl.java:1260)
>     at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.lambda$getAll$49(MetaStorageManagerImpl.java:916)
>     at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl$$Lambda$1794/0x0000000800bc2c40.get(Unknown Source)
>     at app//org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868)
>     at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.getAll(MetaStorageManagerImpl.java:914)
>     at app//org.apache.ignite.internal.table.distributed.TableManager.lambda$writeTableAssignmentsToMetastore$51(TableManager.java:1089)
>     at app//org.apache.ignite.internal.table.distributed.TableManager$$Lambda$1887/0x0000000800bf7440.apply(Unknown Source)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
>     at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$sendWithRetry$49(RaftGroupServiceImpl.java:624)
>     at app//org.apache.ignite.internal.raft.RaftGroupServiceImpl$$Lambda$1341/0x0000000800a4f840.accept(Unknown Source)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base@11.0.25/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
>     at app//org.apache.ignite.internal.network.DefaultMessagingService.onInvokeResponse(DefaultMessagingService.java:584)
>     at app//org.apache.ignite.internal.network.DefaultMessagingService.handleInvokeResponse(DefaultMessagingService.java:475)
>     at app//org.apache.ignite.internal.network.DefaultMessagingService.lambda$handleMessageFromNetwork$4(DefaultMessagingService.java:409)
>     at app//org.apache.ignite.internal.network.DefaultMessagingService$$Lambda$1545/0x0000000800ad9040.run(Unknown Source)
>     at java.base@11.0.25/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base@11.0.25/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base@11.0.25/java.lang.Thread.run(Thread.java:829)
> Number of locked synchronizers = 1
> - java.util.concurrent.ThreadPoolExecutor$Worker@71945073
> {code}
> After this, we get tons of the same exceptions with the reported blocking time growing up to half a minute (possibly more, but verifying that would take too long).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
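The stack trace shows the inbound messaging thread parked on the read side of a ReentrantReadWriteLock whose write lock is held by the metastorage compaction executor for longer than the 500 ms critical-thread threshold. The sketch below reproduces that failure mode in isolation; it is illustrative only (the class, method names, and timings are ours, not Ignite code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Models a long-running write-lock holder (the compaction executor)
// starving a thread that only needs the read lock (the messaging thread).
public class ReadWriteLockStall {
    // Holds the write lock for writerHoldMs and measures how long a
    // reader is parked before it can acquire the read lock.
    static long measureReaderBlockMs(long writerHoldMs) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writeHeld = new CountDownLatch(1);

        Thread writer = new Thread(() -> { // analogue: metastorage-compaction-executor-0
            lock.writeLock().lock();
            try {
                writeHeld.countDown();
                Thread.sleep(writerHoldMs);
            } catch (InterruptedException ignored) {
            } finally {
                lock.writeLock().unlock();
            }
        });
        writer.start();
        writeHeld.await(); // make sure the write lock is actually held

        long start = System.nanoTime();
        Thread reader = new Thread(() -> { // analogue: MessagingService-inbound thread
            lock.readLock().lock();       // parks here until the writer releases
            lock.readLock().unlock();
        });
        reader.start();
        reader.join();
        writer.join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        // A writer holding the lock past 500 ms would trip Ignite's
        // critical-thread watchdog, as in the log above.
        System.out.println("reader blocked ~" + measureReaderBlockMs(600) + " ms");
    }
}
```

This does not show the root cause of the compaction executor holding the lock that long, only why the watchdog reports the messaging thread as blocked.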