[ https://issues.apache.org/jira/browse/IGNITE-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor updated IGNITE-24722:
--------------------------
    Description:
*Steps to reproduce:*
1. Start 3 nodes on a single Windows machine (cores=9, memory=32766)

*Expected:*
3 nodes are started and joined into a cluster.

*Actual:*
1 node takes a thread dump and shuts down.

The node has log messages like:
{code:java}
2025-03-05 22:19:32:184 -0600 [WARNING][%BasicAi3Operations3NodesTest_cluster_1%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
org.apache.ignite.lang.IgniteException: IGN-WORKERS-1 TraceId:538a0c73-bc2e-481b-a5df-45ab414c3e15 A critical thread is blocked for 2978 ms that is more than the allowed 500 ms, it is "%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0" prio=10 Id=153 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
    at java.base@11.0.16.1/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
    at java.base@11.0.16.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.base@11.0.16.1/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
    at java.base@11.0.16.1/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:834)
{code}
and
{code:java}
2025-03-05 22:19:32:535 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}], version = 1]
2025-03-05 22:19:32:545 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}, {id=764f1058-8120-43e0-bdc1-e2e49ce31818, name=BasicAi3Operations3NodesTest_cluster_2, address=172.25.1.11:3346}], version = 2]
{code}
Logs are attached. [^cluster logs.zip]

  was:
*Steps to reproduce:*
1. Start 3 nodes on a single Windows machine (cores=9, memory=32766)

*Expected:*
3 nodes are started and joined into a cluster.

*Actual:*
1 node takes a thread dump and shuts down.

The node has log messages like:
{code:java}
2025-03-05 22:19:32:184 -0600 [WARNING][%BasicAi3Operations3NodesTest_cluster_1%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
org.apache.ignite.lang.IgniteException: IGN-WORKERS-1 TraceId:538a0c73-bc2e-481b-a5df-45ab414c3e15 A critical thread is blocked for 2978 ms that is more than the allowed 500 ms, it is "%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0" prio=10 Id=153 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
    at java.base@11.0.16.1/jdk.internal.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
    at java.base@11.0.16.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.base@11.0.16.1/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
    at java.base@11.0.16.1/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
    at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:834)
{code}
and
{code:java}
2025-03-05 22:19:32:535 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}], version = 1]
2025-03-05 22:19:32:545 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}, {id=764f1058-8120-43e0-bdc1-e2e49ce31818, name=BasicAi3Operations3NodesTest_cluster_2, address=172.25.1.11:3346}], version = 2]
{code}
Logs are attached.

> [FLAKY][Windows] 1 node goes down when 3 nodes cluster is started on 9 cores cpu
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-24722
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24722
>             Project: Ignite
>          Issue Type: Bug
>          Components: general, platforms
>   Affects Versions: 3.1
>         Environment: 3 nodes on a single Windows machine (cores=9, memory=32766)
>            Reporter: Igor
>            Priority: Major
>              Labels: ignite-3
>         Attachments: cluster logs.zip
>
>
> *Steps to reproduce:*
> 1. Start 3 nodes on a single Windows machine (cores=9, memory=32766)
> *Expected:*
> 3 nodes are started and joined into a cluster.
> *Actual:*
> 1 node takes a thread dump and shuts down.
> The node has log messages like:
> {code:java}
> 2025-03-05 22:19:32:184 -0600 [WARNING][%BasicAi3Operations3NodesTest_cluster_1%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
> org.apache.ignite.lang.IgniteException: IGN-WORKERS-1 TraceId:538a0c73-bc2e-481b-a5df-45ab414c3e15 A critical thread is blocked for 2978 ms that is more than the allowed 500 ms, it is "%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0" prio=10 Id=153 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
>     at java.base@11.0.16.1/jdk.internal.misc.Unsafe.park(Native Method)
>     - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@608a31a6
>     at java.base@11.0.16.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
>     at java.base@11.0.16.1/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
>     at java.base@11.0.16.1/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
>     at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
>     at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
>     at java.base@11.0.16.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:834)
> {code}
> and
> {code:java}
> 2025-03-05 22:19:32:535 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}], version = 1]
> 2025-03-05 22:19:32:545 -0600 [INFO][%BasicAi3Operations3NodesTest_cluster_1%MessagingService-inbound-Default-0-0][DistributionZoneManager] Failed to update distribution zones' logical topology and version keys [topology = [{id=71f7ef04-da2f-45d2-a1f1-b802e0542f67, name=BasicAi3Operations3NodesTest_cluster_0, address=172.25.1.11:3344}, {id=764f1058-8120-43e0-bdc1-e2e49ce31818, name=BasicAi3Operations3NodesTest_cluster_2, address=172.25.1.11:3346}], version = 2]
> {code}
> Logs are attached. [^cluster logs.zip]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)