[
https://issues.apache.org/jira/browse/IGNITE-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirill Sizov updated IGNITE-28195:
----------------------------------
Description:
Found this on TeamCity:
{noformat}
java.lang.AssertionError: java.util.concurrent.TimeoutException
	at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
	at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
	at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
	at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
	at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
	at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
	... 9 more
{noformat}
However, the issue might not be related to disaster recovery at all.
What we got in logs:
{noformat}
[05:33:55]W: [:ignite-transactions:integrationTest] [2026-03-10T05:33:55,214][WARN ][%idrrt_tnrof_1%metastorage-watch-executor-0][WatchProcessor] Watch event processing timings
[lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1255/0x000000080112b870, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1256/0x000000080112ba98, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.distributionzones.DataNodesManager$$Lambda$1257/0x000000080112bcc0, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.distributionzones.rebalance.DistributionZoneRebalanceEngine$$Lambda$1249/0x0000000801129f50, stages=[stage="Sync notification" (0 ms),stage="Async notification" (24811 ms),stage="Total time" (24811 ms)]]
{noformat}
{noformat}
[05:34:28]W: [:ignite-transactions:integrationTest] [2026-03-10T05:34:28,454][WARN ][%idrrt_tnrof_2%metastorage-watch-executor-0][WatchProcessor] Watch event processing timings
[lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1063/0x00000008010db970, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.placementdriver.AssignmentsTracker$$Lambda$1062/0x00000008010db748, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1278/0x000000080113da00, stages=[stage="Sync notification" (0 ms),stage="Async notification" (0 ms),stage="Total time" (0 ms)],
lsnr=org.apache.ignite.internal.partition.replicator.PartitionReplicaLifecycleManager$$Lambda$1279/0x000000080113dc28, stages=[stage="Sync notification" (0 ms),stage="Async notification" (907 ms),stage="Total time" (907 ms)]]
{noformat}
In the test we have 5 nodes, and awaitPrimaryReplica timed out after 60 seconds.
The issue was that, of the 3 nodes of the majority, only node 0 executed all MS
events; the other two nodes were unable to keep up with node 0.
{{onCreateZone}} was triggered at MS revision 404, while at the same time nodes
1 and 2 had revision 321 as the last processed one.
We need to investigate further to find out how to speed up processing.
*What we currently see:*
There are 5 nodes, and every one of them subscribes in
ZoneRebalanceUtil.createDistributionZonesDataNodesListener.
There are 25 partitions in the default zone.
Each node handles the event from createDistributionZonesDataNodesListener in
triggerZonePartitionsRebalance, creates an MS event for each partition, and
waits in allOf. Due to the iif condition only a single MS event per partition
actually takes effect, yet we still create 5 nodes * 25 partitions = 125 events.
So out of 125 events we process only 25 (the other 100 get rejected as
outdated), and some of those 25 execute heavy things like starting the
partitions (weakStartReplica). But since we use allOf, we have to wait until
all of them are finished.
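The fan-out described above can be illustrated with a minimal sketch (the names here are hypothetical stand-ins, not the actual Ignite API): every node issues one conditional invoke per partition, only the first invoke per partition wins, yet each node still waits in allOf on all of its own futures, including the rejected ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the event storm: 5 nodes x 25 partitions = 125 invokes,
// of which only 25 (one per partition) take effect.
public class RebalanceFanOutSketch {
    static final int NODES = 5;
    static final int PARTITIONS = 25;

    /** Returns {accepted, rejected} counts after all nodes finish. */
    static int[] run() {
        AtomicInteger accepted = new AtomicInteger();
        AtomicInteger rejected = new AtomicInteger();
        boolean[] partitionTaken = new boolean[PARTITIONS];

        for (int node = 0; node < NODES; node++) {
            List<CompletableFuture<Void>> invokes = new ArrayList<>();
            for (int part = 0; part < PARTITIONS; part++) {
                invokes.add(invokeWithIif(part, partitionTaken, accepted, rejected));
            }
            // Each node blocks in allOf until ALL of its 25 invokes complete,
            // even those that end up rejected as outdated.
            CompletableFuture.allOf(invokes.toArray(new CompletableFuture[0])).join();
        }
        return new int[] {accepted.get(), rejected.get()};
    }

    /** Stand-in for an MS invoke guarded by an iif condition: first writer wins. */
    static CompletableFuture<Void> invokeWithIif(int part, boolean[] taken,
            AtomicInteger accepted, AtomicInteger rejected) {
        return CompletableFuture.runAsync(() -> {
            synchronized (taken) {
                if (!taken[part]) {
                    taken[part] = true;          // condition holds: heavy work (cf. weakStartReplica)
                    accepted.incrementAndGet();
                } else {
                    rejected.incrementAndGet();  // condition fails: rejected as outdated
                }
            }
        });
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("accepted=" + counts[0] + ", rejected=" + counts[1]);
    }
}
```

The sketch makes the cost visible: the 100 rejected invokes are cheap individually, but every node's allOf still chains onto the slowest accepted invoke, which is where the 24-second "Async notification" stage in the log comes from.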
was:
Found this on TeamCity:
{noformat}
java.lang.AssertionError: java.util.concurrent.TimeoutException at
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:71)
at
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:28)
at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:83) at
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10) at
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.awaitPrimaryReplica(ItDisasterRecoveryReconfigurationTest.java:1933)
at
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags(ItDisasterRecoveryReconfigurationTest.java:684)
at java.base/java.lang.reflect.Method.invoke(Method.java:568) at
java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)Caused by:
java.util.concurrent.TimeoutException at
java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at
org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:67)
... 9 more {noformat}
However, the issue might be not related to disaster recovery at all
> TimeoutException on
> ItDisasterRecoveryReconfigurationTest.testNewResetOverwritesFlags
> -------------------------------------------------------------------------------------
>
> Key: IGNITE-28195
> URL: https://issues.apache.org/jira/browse/IGNITE-28195
> Project: Ignite
> Issue Type: Bug
> Reporter: Kirill Sizov
> Priority: Major
> Attachments: _Integration_Tests_Integration_Transactions_19869.log.zip
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)