[
https://issues.apache.org/jira/browse/IGNITE-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirill Sizov updated IGNITE-28195:
----------------------------------
Labels: ignite-3 (was: )
> TimeoutException on
> ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags
> -------------------------------------------------------------------------------------
>
> Key: IGNITE-28195
> URL: https://issues.apache.org/jira/browse/IGNITE-28195
> Project: Ignite
> Issue Type: Bug
> Reporter: Kirill Sizov
> Priority: Major
> Labels: ignite-3
> Attachments: _Integration_Tests_Integration_Transactions_19869.log.zip
>
>
> Found this on TeamCity:
>
> {noformat}
> java.lang.AssertionError: java.util.concurrent.TimeoutException at
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
> at
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
> at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83) at
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) at
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at
> org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
> at
> org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568) at
> java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at
> java.base/java.util.ArrayList.forEach(ArrayList.java:1511)Caused by:
> java.util.concurrent.TimeoutException at
> java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
> at
> java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
> at
> org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
> ... 9 more {noformat}
> However, the issue might be not related to disaster recovery at all.
>
> What we got in logs:
> {code:java}
> [05:33:55]W: [:ignite-transactions:integrationTest]
> [2026-03-10T05:33:55,214][WARN
> ][%idrrt_tnrof_1%metastorage-watch-executor-0][WatchProcessor] Watch event
> processing timings
> [lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1255/0x000000080112b870,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1256/0x000000080112ba98,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1257/0x000000080112bcc0,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.distributionzones.rebalance.DistributionZoneRebalanceEngine$$Lambda$1249/0x0000000801129f50,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (24811
> ms),stage="Total time" (24811 ms)]] {code}
>
>
>
> {noformat}
> [05:34:28]W: [:ignite-transactions:integrationTest]
> [2026-03-10T05:34:28,454][WARN
> ][%idrrt_tnrof_2%metastorage-watch-executor-0][WatchProcessor] Watch event
> processing timings
> [lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1063/0x00000008010db970,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1062/0x00000008010db748,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1278/0x000000080113da00,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (0
> ms),stage="Total time" (0 ms)],
> lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1279/0x000000080113dc28,
> stages=[stage="Sync notification" (0 ms),stage="Async notification" (907
> ms),stage="Total time" (907 ms)]]{noformat}
>
>
> In the test we got 5 nodes, and awaitPrimaryReplica timeouted after 60
> seconds.
> The issue was that from the 3 nodes of the majority only node 0 executed all
> MS events, the other two nodes were unable to keep up with node 0.
> {{onCreateZone}} was created at MS revision 404, At the same time nodes 1 and
> 2 had rev 321 as the last processed.
> We need to investigate it further to find out how to speed up processing.
>
> *What we currently see:*
> 5 nodes, every one subscribes in
> ZoneRebalanceUtil.createDistributionZonesDataNodesListener.
> There are 25 partitions in the default zone.
> Each node handles the even from createDistributionZonesDataNodesListener in
> triggerZonePartitionsRebalance and creates a MS event for each partition and
> waits in allOf. Once a single MS event is executed due to the iif condition,
> nevertheless we create 5 nodes * 25 partitions = 125 events.
> So out of 125 events we process only 25 (the other 100 get rejected as
> outdated), part of those 25 execute heavy things like starting the partitions
> (weakStartReplica). But since we have allOf, we need to wait until all of
> them are finished.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)