[ 
https://issues.apache.org/jira/browse/IGNITE-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Sizov updated IGNITE-28195:
----------------------------------
    Description: 
Found this on TeamCity:

 
{noformat}
java.lang.AssertionError: java.util.concurrent.TimeoutException  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
  at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83)  at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)  at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)  at 
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
  at 
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
  at java.base/java.lang.reflect.Method.invoke(Method.java:568)  at 
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)  at 
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)Caused by: 
java.util.concurrent.TimeoutException  at 
java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
  at 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
  ... 9 more {noformat}
However, the issue might be not related to disaster recovery at all.

 

What we got in logs:
{code:java}
[05:33:55]W:             [:ignite-transactions:integrationTest] 
[2026-03-10T05:33:55,214][WARN 
][%idrrt_tnrof_1%metastorage-watch-executor-0][WatchProcessor] Watch event 
processing timings 
[lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1255/0x000000080112b870,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1256/0x000000080112ba98,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1257/0x000000080112bcc0,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.distributionzones.rebalance.DistributionZoneRebalanceEngine$$Lambda$1249/0x0000000801129f50,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (24811 
ms),stage="Total time" (24811 ms)]] {code}
 

 

 
{noformat}
[05:34:28]W:             [:ignite-transactions:integrationTest] 
[2026-03-10T05:34:28,454][WARN 
][%idrrt_tnrof_2%metastorage-watch-executor-0][WatchProcessor] Watch event 
processing timings 
[lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1063/0x00000008010db970,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1062/0x00000008010db748,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1278/0x000000080113da00,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
ms),stage="Total time" (0 ms)], 
lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1279/0x000000080113dc28,
 stages=[stage="Sync notification" (0 ms),stage="Async notification" (907 
ms),stage="Total time" (907 ms)]]{noformat}
 

 

In the test we got 5 nodes, and awaitPrimaryReplica timeouted after 60 seconds.

The issue was that from the 3 nodes of the majority only node 0 executed all MS 
events, the other two nodes were unable to keep up with node 0.

{{onCreateZone}} was created at MS revision 404, At the same time nodes 1 and 2 
had rev 321 as the last processed.

We need to investigate it further to find out how to speed up processing.

 

*What we currently see:*

5 nodes, every one subscribes in 
ZoneRebalanceUtil.createDistributionZonesDataNodesListener. 
There are 25 partitions in the default zone. 

Each node handles the even from createDistributionZonesDataNodesListener in 
triggerZonePartitionsRebalance and creates a MS event for each partition and 
waits in allOf. Once a single MS event is executed due to the iif condition, 
nevertheless we create 5 nodes * 25 partitions  = 125 events. 

So out of 125 events we process only 25 (the other 100 get rejected as 
outdated), part of those 25 execute heavy things like starting the partitions 
(weakStartReplica). But since we have allOf, we need to wait until all of them 
are finished.

 

  was:
Found this on TeamCity:

 
{noformat}
java.lang.AssertionError: java.util.concurrent.TimeoutException  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
  at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83)  at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)  at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)  at 
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
  at 
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
  at java.base/java.lang.reflect.Method.invoke(Method.java:568)  at 
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)  at 
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)Caused by: 
java.util.concurrent.TimeoutException  at 
java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
  at 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
  at 
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
  ... 9 more {noformat}
However, the issue might be not related to disaster recovery at all


> TimeoutException on 
> ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags
> -------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28195
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28195
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Sizov
>            Priority: Major
>         Attachments: _Integration_Tests_Integration_Transactions_19869.log.zip
>
>
> Found this on TeamCity:
>  
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException  at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83)  at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)  at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)  at 
> org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
>   at 
> org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)  at 
> java.base/java.util.ArrayList.forEach(ArrayList.java:1511)  at 
> java.base/java.util.ArrayList.forEach(ArrayList.java:1511)Caused by: 
> java.util.concurrent.TimeoutException  at 
> java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
>   at 
> java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
>   at 
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
>   ... 9 more {noformat}
> However, the issue might be not related to disaster recovery at all.
>  
> What we got in logs:
> {code:java}
> [05:33:55]W:           [:ignite-transactions:integrationTest] 
> [2026-03-10T05:33:55,214][WARN 
> ][%idrrt_tnrof_1%metastorage-watch-executor-0][WatchProcessor] Watch event 
> processing timings 
> [lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1255/0x000000080112b870,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1256/0x000000080112ba98,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1257/0x000000080112bcc0,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.rebalance.DistributionZoneRebalanceEngine$$Lambda$1249/0x0000000801129f50,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (24811 
> ms),stage="Total time" (24811 ms)]] {code}
>  
>  
>  
> {noformat}
> [05:34:28]W:           [:ignite-transactions:integrationTest] 
> [2026-03-10T05:34:28,454][WARN 
> ][%idrrt_tnrof_2%metastorage-watch-executor-0][WatchProcessor] Watch event 
> processing timings 
> [lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1063/0x00000008010db970,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1062/0x00000008010db748,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1278/0x000000080113da00,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1279/0x000000080113dc28,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (907 
> ms),stage="Total time" (907 ms)]]{noformat}
>  
>  
> In the test we got 5 nodes, and awaitPrimaryReplica timeouted after 60 
> seconds.
> The issue was that from the 3 nodes of the majority only node 0 executed all 
> MS events, the other two nodes were unable to keep up with node 0.
> {{onCreateZone}} was created at MS revision 404, At the same time nodes 1 and 
> 2 had rev 321 as the last processed.
> We need to investigate it further to find out how to speed up processing.
>  
> *What we currently see:*
> 5 nodes, every one subscribes in 
> ZoneRebalanceUtil.createDistributionZonesDataNodesListener. 
> There are 25 partitions in the default zone. 
> Each node handles the even from createDistributionZonesDataNodesListener in 
> triggerZonePartitionsRebalance and creates a MS event for each partition and 
> waits in allOf. Once a single MS event is executed due to the iif condition, 
> nevertheless we create 5 nodes * 25 partitions  = 125 events. 
> So out of 125 events we process only 25 (the other 100 get rejected as 
> outdated), part of those 25 execute heavy things like starting the partitions 
> (weakStartReplica). But since we have allOf, we need to wait until all of 
> them are finished.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to