[ 
https://issues.apache.org/jira/browse/IGNITE-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Sizov updated IGNITE-28195:
----------------------------------
    Labels: ignite-3  (was: )

> TimeoutException on 
> ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags
> -------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28195
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28195
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Sizov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Integration_Transactions_19869.log.zip
>
>
> Found this on TeamCity:
>  
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException
>   at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
>   at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
>   at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
>   at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
>   at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
> Caused by: java.util.concurrent.TimeoutException
>   at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
>   at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
>   at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
>   ... 9 more
> {noformat}
> However, the issue might not be related to disaster recovery at all.
>  
> What we got in logs:
> {code:java}
> [05:33:55]W:           [:ignite-transactions:integrationTest] 
> [2026-03-10T05:33:55,214][WARN 
> ][%idrrt_tnrof_1%metastorage-watch-executor-0][WatchProcessor] Watch event 
> processing timings 
> [lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1255/0x000000080112b870,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1256/0x000000080112ba98,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1257/0x000000080112bcc0,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.distributionzones.rebalance.DistributionZoneRebalanceEngine$$Lambda$1249/0x0000000801129f50,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (24811 
> ms),stage="Total time" (24811 ms)]] {code}
>  
>  
>  
> {noformat}
> [05:34:28]W:           [:ignite-transactions:integrationTest] 
> [2026-03-10T05:34:28,454][WARN 
> ][%idrrt_tnrof_2%metastorage-watch-executor-0][WatchProcessor] Watch event 
> processing timings 
> [lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1063/0x00000008010db970,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1062/0x00000008010db748,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1278/0x000000080113da00,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 
> ms),stage="Total time" (0 ms)], 
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1279/0x000000080113dc28,
>  stages=[stage="Sync notification" (0 ms),stage="Async notification" (907 
> ms),stage="Total time" (907 ms)]]{noformat}
>  
>  
> In the test we have 5 nodes, and awaitPrimaryReplica timed out after 60 
> seconds.
> The issue was that, of the 3 nodes forming the majority, only node 0 had 
> processed all MS (meta storage) events; the other two nodes were unable to 
> keep up with node 0.
> {{onCreateZone}} was created at MS revision 404; at the same time, nodes 1 
> and 2 had revision 321 as their last processed revision.
> We need to investigate this further to find out how to speed up processing.
>  
> *What we currently see:*
> There are 5 nodes, and every one of them subscribes in 
> ZoneRebalanceUtil.createDistributionZonesDataNodesListener. 
> There are 25 partitions in the default zone. 
> Each node handles the event from createDistributionZonesDataNodesListener in 
> triggerZonePartitionsRebalance, creates an MS event for each partition, and 
> waits on them in allOf. Only a single MS event per partition is executed 
> because of the iif condition; nevertheless, we create 
> 5 nodes * 25 partitions = 125 events. 
> So out of 125 events we process only 25 (the other 100 get rejected as 
> outdated), and some of those 25 execute heavy operations such as starting 
> the partitions (weakStartReplica). But since we use allOf, we have to wait 
> until all of them are finished.
>  
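
The fan-out described above can be illustrated with a minimal, self-contained sketch. It is not Ignite code: FakeMetaStorage, conditionalPut, simulate and allOfWaitsForSlowest are hypothetical names, the "iif" is modelled as a plain compare-by-revision write, and reusing revision 404 as the trigger revision is an assumption made only for illustration. It shows that 5 nodes * 25 partitions yield 125 conditional writes of which only 25 are applied, and that allOf completes only once its slowest member completes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class RebalanceFanOutSketch {
    /** Hypothetical stand-in for the meta storage: partition id -> last applied trigger revision. */
    static final class FakeMetaStorage {
        private final ConcurrentHashMap<Integer, Long> appliedRevision = new ConcurrentHashMap<>();
        final AtomicInteger applied = new AtomicInteger();
        final AtomicInteger rejected = new AtomicInteger();

        /** Conditional write (assumed iif semantics): applies only if no equal-or-newer revision is stored. */
        CompletableFuture<Boolean> conditionalPut(int partition, long revision) {
            boolean[] won = {false};
            appliedRevision.compute(partition, (key, prev) -> {
                if (prev == null || prev < revision) {
                    won[0] = true;
                    return revision;
                }
                return prev; // outdated or duplicate request: keep the stored revision
            });
            (won[0] ? applied : rejected).incrementAndGet();
            return CompletableFuture.completedFuture(won[0]);
        }
    }

    /** Every node issues one conditional write per partition and waits for all of them via allOf. */
    static int[] simulate(int nodes, int partitions, long revision) {
        FakeMetaStorage ms = new FakeMetaStorage();
        for (int node = 0; node < nodes; node++) {
            List<CompletableFuture<Boolean>> futures = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                futures.add(ms.conditionalPut(p, revision));
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        }
        return new int[] {ms.applied.get(), ms.rejected.get()};
    }

    /** allOf is not done while any member (e.g. a slow partition start) is still pending. */
    static boolean allOfWaitsForSlowest() {
        CompletableFuture<Void> fast = CompletableFuture.completedFuture(null);
        CompletableFuture<Void> slow = new CompletableFuture<>();
        CompletableFuture<Void> all = CompletableFuture.allOf(fast, slow);
        boolean doneBefore = all.isDone();
        slow.complete(null);
        return !doneBefore && all.isDone();
    }

    public static void main(String[] args) {
        int[] result = simulate(5, 25, 404L);
        System.out.println("applied=" + result[0] + ", rejected=" + result[1]);
        System.out.println("allOf waits for slowest: " + allOfWaitsForSlowest());
    }
}
```

In this model the first node to reach a partition wins and the remaining four requests for that partition are rejected, which matches the 25 applied / 100 rejected split reported above. Whether the real cost lies in issuing the redundant requests or in waiting for the heavy ones inside allOf is exactly what still needs investigation.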



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
